AI Infrastructure Category - MarkTechPost

Serverless MCP Brings AI-Assisted Debugging to AWS Workflows Within Modern IDEs

Serverless computing has significantly streamlined how developers build and deploy applications on cloud platforms like AWS. However, debugging and managing complex architectures—comprising services such as Lambda, DynamoDB, API Gateway, and IAM—often requires developers to jump between logs, dashboards, and local tooling. To address these challenges, Serverless Inc. has introduced Serverless MCP (Model Context Protocol), a powerful new protocol that enables seamless, AI-assisted debugging directly inside intelligent IDEs like Cursor.

The Serverless MCP builds upon a foundational idea: developers should be able to query, introspect, and resolve serverless application issues from where they write code—without the overhead of context switching or manually navigating AWS dashboards. This integration makes serverless development more accessible, especially for developers aiming to reduce the operational friction of cloud-native applications.

Solving the Debugging Dilemma in Serverless Architectures

Working with AWS serverless architectures involves interacting with various managed services. A typical application might use Lambda for compute, DynamoDB for storage, API Gateway to expose endpoints, and IAM for permissions. These services produce logs, metrics, and configuration data scattered across multiple consoles.

The debugging experience for developers often includes:

  • Manually finding CloudWatch logs tied to specific Lambda executions.
  • Tracing failed API Gateway requests across multiple services.
  • Tracking down misconfigured IAM roles and permissions.
  • Cross-referencing AWS documentation with real-time code behavior.

This fragmented experience is where Serverless MCP steps in.

What is Serverless MCP?

Serverless MCP (Model Context Protocol) is a developer-facing protocol that allows AI-assisted IDEs to communicate with AWS infrastructure resources via the Serverless Framework. Once installed and configured, MCP unlocks deep telemetry from deployed services and surfaces it directly in tools like Cursor and Windsurf.

The protocol enables these IDEs to:

  • Pull logs and metrics relevant to the current file or function.
  • Highlight failed invocations and error traces contextually.
  • Visualize service relationships (e.g., how a Lambda function connects to an API route or a DynamoDB table).
  • Recommend fixes for common issues like IAM misconfigurations or timeout errors.

The Serverless Framework CLI (v3.38+) now supports serverless dev, which activates the MCP interface. Once enabled, AI coding environments can query your infrastructure and assist in debugging without requiring manual log exploration or infrastructure navigation.
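
To make the interaction pattern concrete, here is a minimal, purely illustrative sketch of the general MCP-style exchange: a client discovers the tools a server exposes and then invokes one with structured arguments. The client class, tool name, and argument fields below are assumptions for illustration only and do not reflect Serverless Inc.'s actual tool schema.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str

class IllustrativeMCPClient:
    """Stand-in for an MCP client session. Tool names and payloads are hypothetical."""

    def list_tools(self):
        # an MCP server advertises the tools it exposes
        return [Tool("get_function_diagnostics", "Recent logs/metrics for one Lambda function")]

    def call_tool(self, name, arguments):
        # a real client would send a JSON-RPC request to the MCP server here;
        # this stub just echoes the request shape back
        return {"tool": name, "arguments": arguments, "logs": [], "errors": []}

client = IllustrativeMCPClient()
diagnostics = client.call_tool(
    "get_function_diagnostics",                           # hypothetical tool name
    {"functionName": "orders-handler", "window": "15m"},  # hypothetical arguments
)
print(diagnostics)
```

In an MCP-aware IDE, the AI agent would fold the returned logs and error traces into its context before suggesting a fix.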

How MCP Works with IDEs like Cursor and Windsurf

In IDEs integrated with MCP, developers can hover over a line of code—say, an AWS Lambda function handler—and see the logs from its last execution, error messages, or even the duration and cold start metrics. This contextual debugging model reduces cognitive load and allows real-time understanding of production behavior.

Cursor, for example, uses AI models that are MCP-aware. When a developer writes or edits code, the AI agent queries the MCP interface to fetch infrastructure state, recent logs, and performance metrics relevant to the code segment. It then suggests improvements, flags misconfigurations, or explains recent failures.

This makes the MCP integration not just a log viewer but a genuine AI-powered debugging assistant.

Security and Operational Considerations

Serverless MCP is designed with least-privilege principles in mind. The setup process involves creating a minimal set of IAM policies required for MCP access. This ensures that IDEs only fetch diagnostic data scoped to the developer’s workflow.

Moreover, since all the debugging insights are surfaced locally in the IDE, there is no need to expose your cloud dashboard or give third-party plugins blanket access to your AWS environment.

Conclusion

With the release of Serverless MCP, the debugging workflow for AWS serverless applications gets a much-needed upgrade. By embedding operational intelligence into AI-driven IDEs, Serverless bridges the gap between code and cloud, offering a smoother and more intuitive development experience.

As serverless architectures grow in complexity, tools like MCP will likely become foundational in modern DevOps pipelines—especially for teams seeking to minimize downtime and maximize iteration speed without diving deep into the AWS console. For developers already using the Serverless Framework, enabling MCP is a simple upgrade that promises significant productivity gains.


Check out the technical details and documentation.

Allen Institute for AI (Ai2) Launches OLMoTrace: Real-Time Tracing of LLM Outputs Back to Training Data

Understanding the Limits of Language Model Transparency

As large language models (LLMs) become central to a growing number of applications—ranging from enterprise decision support to education and scientific research—the need to understand their internal decision-making becomes more pressing. A core challenge remains: how can we determine where a model’s response comes from? Most LLMs are trained on massive datasets consisting of trillions of tokens, yet there has been no practical tool to map model outputs back to the data that shaped them. This opacity complicates efforts to evaluate trustworthiness, trace factual origins, and investigate potential memorization or bias.

OLMoTrace – A Tool for Real-Time Output Tracing

The Allen Institute for AI (Ai2) recently introduced OLMoTrace, a system designed to trace segments of LLM-generated responses back to their training data in real time. The system is built on top of Ai2’s open-source OLMo models and provides an interface for identifying verbatim overlaps between generated text and the documents used during model training. Unlike retrieval-augmented generation (RAG) approaches, which inject external context during inference, OLMoTrace is designed for post-hoc interpretability—it identifies connections between model behavior and prior exposure during training.

OLMoTrace is integrated into the Ai2 Playground, where users can examine specific spans in an LLM output, view matched training documents, and inspect those documents in extended context. The system supports OLMo models including OLMo-2-32B-Instruct and leverages their full training data—over 4.6 trillion tokens across 3.2 billion documents.

Technical Architecture and Design Considerations

At the heart of OLMoTrace is infini-gram, an indexing and search engine built for extreme-scale text corpora. The system uses a suffix array-based structure to efficiently search for exact spans from the model’s outputs in the training data. The core inference pipeline comprises five stages:

  1. Span Identification: Extracts all maximal spans from a model’s output that match verbatim sequences in the training data. The algorithm avoids spans that are incomplete, overly common, or nested.
  2. Span Filtering: Ranks spans by “span unigram probability,” which prioritizes longer and less frequent phrases as a proxy for informativeness (a minimal sketch of this scoring appears after the list).
  3. Document Retrieval: For each span, the system retrieves up to 10 relevant documents containing the phrase, balancing precision and runtime.
  4. Merging: Consolidates overlapping spans and duplicates to reduce redundancy in the user interface.
  5. Relevance Ranking: Applies BM25 scoring to rank the retrieved documents based on their similarity to the original prompt and response.
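
To illustrate the filtering step, the sketch below scores candidate spans under one plausible reading of “span unigram probability”: the product of per-token unigram probabilities estimated from corpus counts, so that longer and rarer spans score lower and are ranked as more informative. The exact scoring used by OLMoTrace may differ; the corpus and spans here are toy data.

```python
from collections import Counter
import math

def span_unigram_logprob(span_tokens, corpus_unigram_counts, corpus_total_tokens):
    # Assumed reading of "span unigram probability": product of per-token unigram
    # probabilities from corpus counts, returned in log space. Longer and rarer
    # spans get lower values and would be kept as more informative.
    logp = 0.0
    for tok in span_tokens:
        p = corpus_unigram_counts.get(tok, 1) / corpus_total_tokens  # crude smoothing for unseen tokens
        logp += math.log(p)
    return logp

# Toy corpus and two candidate spans; real spans come from exact matches found by infini-gram.
corpus = "the cat sat on the mat the dog sat on the log".split()
counts, total = Counter(corpus), len(corpus)
print(span_unigram_logprob(["the", "cat"], counts, total))                        # common, short span
print(span_unigram_logprob(["dog", "sat", "on", "the", "log"], counts, total))    # rarer, longer span
```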

This design ensures that tracing results are not only accurate but also surfaced within an average latency of 4.5 seconds for a 450-token model output. All processing is performed on CPU-based nodes, using SSDs to accommodate the large index files with low-latency access.

Evaluation, Insights, and Use Cases

Ai2 benchmarked OLMoTrace using 98 LLM-generated conversations from internal usage. Document relevance was scored both by human annotators and by a model-based “LLM-as-a-Judge” evaluator (gpt-4o). The top retrieved document received an average relevance score of 1.82 (on a 0–3 scale), and the top-5 documents averaged 1.50—indicating reasonable alignment between model output and retrieved training context.

Three illustrative use cases demonstrate the system’s utility:

  • Fact Verification: Users can determine whether a factual statement was likely memorized from the training data by inspecting its source documents.
  • Creative Expression Analysis: Even seemingly novel or stylized language (e.g., Tolkien-like phrasing) can sometimes be traced back to fan fiction or literary samples in the training corpus.
  • Mathematical Reasoning: OLMoTrace can surface exact matches for symbolic computation steps or structured problem-solving examples, shedding light on how LLMs learn mathematical tasks.

These use cases highlight the practical value of tracing model outputs to training data in understanding memorization, data provenance, and generalization behavior.

Implications for Open Models and Model Auditing

OLMoTrace underscores the importance of transparency in LLM development, particularly for open-source models. While the tool only surfaces lexical matches and not causal relationships, it provides a concrete mechanism to investigate how and when language models reuse training material. This is especially relevant in contexts involving compliance, copyright auditing, or quality assurance.

The system’s open-source foundation, built under the Apache 2.0 license, also invites further exploration. Researchers may extend it to approximate matching or influence-based techniques, while developers can integrate it into broader LLM evaluation pipelines.

In a landscape where model behavior is often opaque, OLMoTrace sets a precedent for inspectable, data-grounded LLMs—raising the bar for transparency in model development and deployment.


Check out the Paper and Playground. All credit for this research goes to the researchers of this project.

LLMs No Longer Require Powerful Servers: Researchers from MIT, KAUST, ISTA, and Yandex Introduce a New AI Approach to Rapidly Compress Large Language Models without a Significant Loss of Quality

  • HIGGS, an innovative method for compressing large language models, was developed in collaboration with teams at Yandex Research, MIT, KAUST, and ISTA.
  • HIGGS makes it possible to compress LLMs without additional data or resource-intensive parameter optimization.
  • Unlike other compression methods, HIGGS does not require specialized hardware or powerful GPUs. Models can be quantized directly on a smartphone or laptop in just a few minutes with no significant quality loss.
  • The method has already been used to quantize popular LLaMA 3.1 and 3.2-family models, as well as DeepSeek and Qwen-family models.
    The Yandex Research team, together with researchers from the Massachusetts Institute of Technology (MIT), the Austrian Institute of Science and Technology (ISTA) and the King Abdullah University of Science and Technology (KAUST), developed a method to rapidly compress large language models without a significant loss of quality.

    Previously, deploying large language models on mobile devices or laptops involved a quantization process that took anywhere from hours to weeks and had to be run on industrial servers to maintain good quality. Now, quantization can be completed in a matter of minutes right on a smartphone or laptop without industry-grade hardware or powerful GPUs.

    HIGGS lowers the barrier to entry for testing and deploying new models on consumer-grade devices, like home PCs and smartphones, by removing the need for industrial computing power.

    The innovative compression method furthers the company’s commitment to making large language models accessible to everyone, from major players, SMBs, and non-profit organizations to individual contributors, developers, and researchers. Last year, Yandex researchers collaborated with major science and technology universities to introduce two novel LLM compression methods: Additive Quantization of Large Language Models (AQLM) and PV-Tuning. Combined, these methods can reduce model size by up to 8 times while maintaining 95% response quality.

    Breaking Down LLM Adoption Barriers

    Large language models require substantial computational resources, which makes them inaccessible and cost-prohibitive for most. This is also the case for open-source models, like the popular DeepSeek R1, which can’t be easily deployed on even the most advanced servers designed for model training and other machine learning tasks.  

    As a result, access to these powerful models has traditionally been limited to a select few organizations with the necessary infrastructure and computing power, despite their public availability. 

    However, HIGGS can pave the way for broader accessibility. Developers can now reduce model size without sacrificing quality and run them on more affordable devices. For example, this method can be used to compress LLMs like DeepSeek R1 with 671B parameters and Llama 4 Maverick with 400B parameters, which previously could only be quantized (compressed) with a significant loss in quality. This quantization technique unlocks new ways to use LLMs across various fields, especially in resource-constrained environments. Now, startups and independent developers can leverage compressed models to build innovative products and services, while cutting costs on expensive equipment. 

    Yandex is already using HIGGS to prototype and accelerate product development and idea testing, as compressed models enable faster testing than their full-scale counterparts.

    About the Method 

    HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS) compresses large language models without requiring additional data or gradient descent methods, making quantization more accessible and efficient for a wide range of applications and devices. This is particularly valuable when there’s a lack of suitable data for calibrating the model. The method offers a balance between model quality, size, and quantization complexity, making it possible to use the models on a wide range of devices like smartphones and consumer laptops.
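
    The sketch below illustrates the general idea under stated assumptions: rotate each group of weights with an orthonormal Hadamard matrix so the entries become roughly Gaussian and incoherent, snap them to a small fixed grid, and rotate back, all without calibration data. The grid, group size, and normalization here are placeholders; HIGGS's actual MSE-optimal grids and implementation details are not reproduced.

```python
import numpy as np
from scipy.linalg import hadamard

def nearest_grid(x, grid):
    # snap each value to the closest grid point (MSE-optimal for a fixed scalar grid)
    return grid[np.abs(x[:, None] - grid[None, :]).argmin(axis=1)]

def higgs_style_quantize(W, group_size=64, bits=4):
    """Hedged sketch only: Hadamard-rotate each weight group, quantize to a small
    placeholder grid, then rotate back. Not the real HIGGS grids or pipeline."""
    assert W.size % group_size == 0, "illustration assumes a divisible weight count"
    H = hadamard(group_size) / np.sqrt(group_size)   # orthonormal Hadamard rotation
    grid = np.linspace(-3.0, 3.0, 2 ** bits)         # placeholder uniform grid over ~3 sigma
    Wq = np.empty_like(W, dtype=np.float64)
    flat, out = W.reshape(-1).astype(np.float64), Wq.reshape(-1)
    for i in range(0, flat.size, group_size):
        g = flat[i:i + group_size]
        scale = g.std() + 1e-12                      # per-group normalization
        rotated = H @ (g / scale)
        out[i:i + group_size] = scale * (H.T @ nearest_grid(rotated, grid))
    return Wq

# Example: quantizing a random weight matrix with no calibration data at all.
W = np.random.randn(256, 64).astype(np.float32)
print(np.abs(W - higgs_style_quantize(W)).mean())    # small data-free reconstruction error
```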

    HIGGS was tested on the LLaMA 3.1 and 3.2-family models, as well as on Qwen-family models. Experiments show that HIGGS outperforms other data-free quantization methods, including NF4 (4-bit NormalFloat) and HQQ (Half-Quadratic Quantization), in terms of quality-to-size ratio.

    Developers and researchers can already access the method on Hugging Face or explore the research paper, which is available on arXiv. At the end of this month, the team will present their paper at NAACL, one of the world’s top conferences on AI. 

    Continuous Commitment to Advancing Science and Optimization

    This is one of several papers Yandex Research presented on large language model quantization. For example, the team presented AQLM and PV-Tuning, two methods of LLM compression that can reduce a company’s computational budget by up to 8 times without significant loss in AI response quality. The team also built a service that lets users run an 8B model on a regular PC or smartphone via a browser-based interface, even without high computing power.

    Beyond LLM quantization, Yandex has open-sourced several tools that optimize resources used in LLM training. For example, the YaFSDP library accelerates LLM training by as much as 25% and reduces GPU resources for training by up to 20%. 


    Earlier this year, Yandex developers open-sourced Perforator, a tool for continuous real-time monitoring and analysis of servers and apps. Perforator highlights code inefficiencies and provides actionable insights, which helps companies reduce infrastructure costs by up to 20%. This could translate to potential savings in millions or even billions of dollars per year, depending on company size.


    Check out the Paper. All credit for this research goes to the researchers of this project. Note: Thanks to the Yandex team for the thought leadership and resources for this article; the Yandex team has financially supported this content.

    Google AI Introduces Ironwood: A Google TPU Purpose-Built for the Age of Inference

    At the 2025 Google Cloud Next event, Google introduced Ironwood, its latest generation of Tensor Processing Units (TPUs), designed specifically for large-scale AI inference workloads. This release marks a strategic shift toward optimizing infrastructure for inference, reflecting the increasing operational focus on deploying AI models rather than training them.

    Ironwood is the seventh generation in Google’s TPU architecture and brings substantial improvements in compute performance, memory capacity, and energy efficiency. Each chip delivers a peak throughput of 4,614 teraflops (TFLOPs) and includes 192 GB of high-bandwidth memory (HBM), supporting bandwidths up to 7.4 terabytes per second (TB/s). Ironwood can be deployed in configurations of 256 or 9,216 chips, with the larger cluster offering up to 42.5 exaflops of compute, making it one of the most powerful AI accelerators in the industry.
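
    As a quick sanity check on those figures, 9,216 chips × 4,614 TFLOPs per chip ≈ 42.5 million TFLOPs, i.e., roughly 42.5 exaflops, which matches the stated full-cluster compute.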

    Unlike previous TPU generations that balanced training and inference workloads, Ironwood is engineered specifically for inference. This reflects a broader industry trend where inference, particularly for large language and generative models, is emerging as the dominant workload in production environments. Low-latency and high-throughput performance are critical in such scenarios, and Ironwood is designed to meet those demands efficiently.

    A key architectural advancement in Ironwood is the enhanced SparseCore, which accelerates sparse operations commonly found in ranking and retrieval-based workloads. This targeted optimization reduces the need for excessive data movement across the chip and improves both latency and power consumption for specific inference-heavy use cases.

    Ironwood also improves energy efficiency significantly, offering more than double the performance-per-watt compared to its predecessor. As AI model deployment scales, energy usage becomes an increasingly important constraint—both economically and environmentally. The improvements in Ironwood contribute toward addressing these challenges in large-scale cloud infrastructure.

    The TPU is integrated into Google’s broader AI Hypercomputer framework, a modular compute platform combining high-speed networking, custom silicon, and distributed storage. This integration simplifies the deployment of resource-intensive models, enabling developers to serve real-time AI applications without extensive configuration or tuning.

    This launch also signals Google’s intent to remain competitive in the AI infrastructure space, where companies such as Amazon and Microsoft are developing their own in-house AI accelerators. While industry leaders have traditionally relied on GPUs, particularly from Nvidia, the emergence of custom silicon solutions is reshaping the AI compute landscape.

    Ironwood’s release reflects the growing maturity of AI infrastructure, where efficiency, reliability, and deployment readiness are now as important as raw compute power. By focusing on inference-first design, Google aims to meet the evolving needs of enterprises running foundation models in production—whether for search, content generation, recommendation systems, or interactive applications.

    In summary, Ironwood represents a targeted evolution in TPU design. It prioritizes the needs of inference-heavy workloads with enhanced compute capabilities, improved efficiency, and tighter integration with Google Cloud’s infrastructure. As AI transitions into an operational phase across industries, hardware purpose-built for inference will become increasingly central to scalable, responsive, and cost-effective AI systems.


    Check out the Technical details. All credit for this research goes to the researchers of this project.

    This AI Paper Introduces a Machine Learning Framework to Estimate the Inference Budget for Self-Consistency and GenRMs (Generative Reward Models)

    Large Language Models (LLMs) have demonstrated significant advancements in reasoning capabilities across diverse domains, including mathematics and science. However, improving these reasoning abilities at test time remains a challenge researchers are actively addressing. The primary focus lies in developing methods to scale test-time compute effectively while maximizing reasoning performance. Current methodologies include generating multiple chain-of-thought (CoT) solutions for problems and implementing voting or selection mechanisms to identify the best solutions. Although these approaches have shown promise, they often require considerable computational resources and may not consistently identify optimal solutions when incorrect reasoning pathways dominate. Finding efficient ways to enhance LLM reasoning while minimizing computational overhead represents a critical challenge for the field’s advancement.

    Previous research has explored various approaches to enhance LLM reasoning capabilities. Generative Reward Models (GenRM) have emerged as a promising technique, framing verification as a next-token prediction task. These models enable test-time scaling by generating multiple verification chains-of-thought and aggregating their verdicts to score solutions. Initial comparisons between GenRM with Best-of-N (BoN) selection and Self-Consistency (SC) showed that GenRM appeared more efficient, achieving comparable performance with fewer solution candidates. However, these evaluations were conducted with fixed numbers of solutions rather than fixed computational budgets. This methodology creates misleading conclusions in practical scenarios where inference compute is limited, as it fails to account for the substantial computational costs associated with generating multiple verifications for each candidate solution. The key limitation of existing approaches is their failure to consider the true computational efficiency when comparing verification-based methods with simpler majority voting techniques.

    The proposed method introduces a comprehensive framework for accurately estimating the inference computational budget required by Self-Consistency and GenRMs. This framework enables a fair, compute-matched analysis that compares these test-time scaling strategies under fixed computational constraints. The approach assumes a single Large Language Model serves dual functions as both the solution generator and generative verifier, with verification capabilities activated either through specialized prompting or task-specific fine-tuning. By establishing this unified framework, researchers can systematically analyze the performance trade-offs between generating more solution candidates for Self-Consistency versus allocating compute resources to verification processes in GenRMs. The comparative analysis focuses on measuring effectiveness based on the total number of solutions and verifications generated by the LLM, providing clear metrics for computational efficiency across different reasoning approaches.

    The methodology employs a compute-matched analysis framework with a detailed architectural design for comparing test-time scaling strategies. For an autoregressive LLM with P parameters performing 2P FLOPs per output token, the total inference compute is calculated using the formula C(S, V) = S(1+λV), where S represents the number of solutions, V the number of verifications, and λ the ratio of tokens per verification to tokens per solution. This framework enables systematic evaluation of both Self-Consistency and Generative Reward Models under equivalent computational constraints. The analysis scales SC across solution counts S ∈ {2^0, 2^1, …, 2^N} and evaluates GenRM across combinations (S, V) drawn from corresponding ranges of solution and verification counts. Also, the research introduces inference scaling laws for GenRM through a six-step methodology that determines optimal allocation between solutions and verifications. This process involves computing success rates across increasing verification counts, plotting results against compute budgets, and fitting power laws to establish relationships between optimal solution counts (S_opt ∝ C^a) and verification counts (V_opt ∝ C^b).
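
    A minimal sketch of the compute-matching idea, assuming a hypothetical token ratio λ = 0.5 purely for illustration, is shown below: it enumerates (S, V) settings whose cost under C(S, V) = S(1 + λV) fits a fixed budget, so Self-Consistency (V = 0) and GenRM variants can be compared at equal cost.

```python
def total_compute(S, V, lam):
    # C(S, V) = S * (1 + lam * V): cost in "solution-equivalents", where lam is the
    # ratio of tokens per verification to tokens per solution (the paper's formula)
    return S * (1 + lam * V)

def compute_matched_configs(budget, lam=0.5, verif_counts=(0, 1, 2, 4, 8, 16, 32)):
    """Hedged sketch: enumerate (S, V) settings that fit a fixed inference budget so
    Self-Consistency (V = 0) and GenRM variants are compared at equal cost.
    The value lam = 0.5 is illustrative, not taken from the paper."""
    configs = []
    S = 1
    while total_compute(S, 0, lam) <= budget:
        for V in verif_counts:
            if total_compute(S, V, lam) <= budget:
                configs.append((S, V))
        S *= 2
    return configs

# With a budget of 64 solution-equivalents and lam = 0.5, SC can afford 64 solutions,
# while a GenRM using V = 2 verifications per solution can only afford 32 solutions.
print(compute_matched_configs(64))
```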

    The results demonstrate a clear pattern when comparing the performance of Generative Reward Models against Self-Consistency across different computational budgets. SC exhibits superior performance in low-compute scenarios, making it the more efficient choice when computational resources are limited. Conversely, GenRM begins to outperform SC only after reaching approximately 8× the computational budget, requiring an additional 128× inference compute to achieve a modest performance improvement of 3.8% over SC. These findings prove robust across diverse experimental conditions, including various model families such as Llama and Qwen, different model sizes ranging from 7B to 70B parameters, specialized thinking models like QwQ-32B, and different reasoning tasks, including mathematics. The performance patterns remain consistent regardless of the specific LLM architecture employed, indicating the broad applicability of these comparative insights across the spectrum of language models and reasoning tasks.

    The study introduces GenRMs as an innovative approach to scaling test-time compute through verification processes. Previous research demonstrated that scaling both solutions and verifications could outperform SC, but often neglected to account for the computational costs of verification. This comprehensive investigation reveals a clear pattern: SC proves more effective at lower computational budgets, while GenRMs deliver superior performance when higher computational resources are available. These findings maintain consistency across multiple model families, including specialized thinking models, various parameter sizes from 7B to 70B, and diverse reasoning tasks. In addition, the research establishes robust inference scaling laws that optimize budget allocation between solution generation and verification processes within GenRM frameworks. These insights provide valuable practical guidance for researchers and practitioners seeking to implement compute-efficient scaling strategies to maximize reasoning performance in large language models.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    This AI Paper from ByteDance Introduces MegaScale-Infer: A Disaggregated Expert Parallelism System for Efficient and Scalable MoE-Based LLM Serving

    Large language models are built on transformer architectures and power applications like chat, code generation, and search, but their growing scale with billions of parameters makes efficient computation increasingly challenging. Scaling such systems while maintaining low latency and high throughput puts pressure on algorithm design and system-level optimization. Effectively serving these models now requires careful orchestration of memory, communication, and compute resources.

    A critical challenge in this space is how sparsity, introduced through Mixture-of-Experts (MoE) models, affects inference performance. These models selectively activate a subset of feed-forward networks per input, reducing computational load. However, this selective activation leads to underutilization of hardware. During inference, attention modules become bottlenecks due to frequent memory access to key-value caches, while the FFN modules become idle because each receives a small fraction of tokens. As a result, GPU utilization drops significantly, especially during decoding, creating inefficiencies and inflating operational costs.

    While some methods like vLLM and TensorRT-LLM have attempted to address inference scaling through parallelism and optimized kernels, these solutions remain constrained. They process the model holistically, meaning they cannot independently adjust scaling for different components. As MoE models grow in size and sparsity, this approach leads to smaller active batches per expert, weakening the benefits of batching for FFNs. Moreover, tensor and pipeline parallelism approaches add communication overhead, especially across nodes, which becomes a limiting factor in multi-GPU environments.

    ByteDance and Peking University researchers have introduced MegaScale-Infer, a system that rethinks the architecture of MoE serving. Instead of serving the model as a monolithic block, the researchers disaggregate the attention and FFN modules, deploying them on separate GPUs. This separation enables customized scaling and parallelism strategies tailored to the specific needs of each module. Attention modules, which are memory-intensive, are replicated to aggregate requests, while FFN modules are scaled using expert parallelism. The system also supports heterogeneous GPU deployment, assigning cost-effective memory-heavy GPUs to attention tasks and compute-optimized GPUs to FFNs. This disaggregation dramatically improves resource usage and flexibility in deployment.

    To further optimize performance, MegaScale-Infer employs a ping-pong pipeline parallelism strategy. The idea is to break down batches of requests into smaller micro-batches that alternate between attention and FFN modules, ensuring that neither component sits idle. The system determines the optimal number of micro-batches required to maintain high utilization, considering compute time, communication latency, and hardware setup. For example, if the communication time is less than half the compute time, at least three micro-batches are used. Further, the system integrates a high-performance M2N communication library that avoids unnecessary GPU-to-CPU data copies, reducing latency and instability. This library replaces the traditional All-to-All routing with a more efficient sender-receiver model designed specifically for MoE’s token dispatch pattern.

    MegaScale-Infer was tested on multiple large-scale MoE models, including Mixtral 8×22B, DBRX, and a scaled custom model with 317 billion parameters. In experiments on homogeneous setups using NVIDIA Ampere GPUs, MegaScale-Infer improved per-GPU decoding throughput by up to 2.56× compared to vLLM and 1.28× over TensorRT-LLM. The scaled model achieved a 7.11× gain over vLLM and a 1.90× gain over TensorRT-LLM. On heterogeneous clusters with H20 GPUs for attention and L40S for FFNs, the system achieved up to 3.24× and 1.86× higher throughput per dollar than the baselines. Its M2N communication library delivered up to 4.2× higher throughput and 68.2% lower latency than NCCL.

    This paper presents a clear problem of underutilized GPUs during MoE inference and offers a practical solution by modularizing the architecture. The proposed disaggregation strategy, combined with micro-batch pipelining and a custom communication protocol, substantially impacts serving efficiency and cost.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    Scalable and Principled Reward Modeling for LLMs: Enhancing Generalist Reward Models RMs with SPCT and Inference-Time Optimization

    Reinforcement Learning (RL) has become a widely used post-training method for LLMs, enhancing capabilities like human alignment, long-term reasoning, and adaptability. A major challenge, however, is generating accurate reward signals in broad, less structured domains, as current high-quality reward models are largely built on rule-based systems or verifiable tasks such as math and coding. In general applications, reward criteria are more diverse and subjective, lacking clear ground truths. To address this, generalist reward models (RMs) are being explored for broader applicability. However, these models must balance input flexibility and scalability during inference, particularly in producing reliable, high-quality rewards across varied tasks and domains.

    Existing reward modeling approaches include scalar, semi-scalar, and generative techniques, each with flexibility and inference-time performance trade-offs. For instance, pairwise models are limited to relative comparisons, while scalar models struggle with producing diverse feedback. Generative reward models (GRMs) offer richer, more flexible outputs, making them more suited for evaluating various responses. Recent work has explored training GRMs through offline RL, integrating tools and external knowledge to improve reward quality. However, few methods directly address how RMs can scale efficiently during inference. This has led to research on methods like sampling-based scaling, chain-of-thought prompting, and reward-guided aggregation, aiming to co-scale policy models and reward models during inference. These developments hold promise for more robust, general-purpose reward systems in LLMs.

    DeepSeek-AI and Tsinghua University researchers explore enhancing reward models (RMs) for general queries by improving inference-time scalability using increased computing and better learning techniques. They employ pointwise GRMs for flexible input handling and propose a learning method—Self-Principled Critique Tuning (SPCT)—which helps GRMs generate adaptive principles and accurate critiques during online reinforcement learning. They apply parallel sampling and introduce a meta RM to scale effectively and refine the voting process. Their DeepSeek-GRM models outperform existing benchmark methods, offering higher reward quality and scalability, with plans for open-sourcing despite challenges in some complex tasks.

    The researchers introduce SPCT, a method designed to enhance pointwise GRMs by enabling them to generate adaptive principles and accurate critiques. SPCT consists of two stages: rejective fine-tuning for initializing principle and critique generation and rule-based RL for refinement. Instead of treating principles as preprocessing, they are generated dynamically during inference. This promotes scalability by improving reward granularity. Additionally, inference-time performance is boosted through parallel sampling and voting, supported by a meta reward model (meta RM) that filters out low-quality outputs. Overall, SPCT improves reward accuracy, robustness, and scalability in GRMs.
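
    The voting-with-a-meta-RM idea can be sketched as follows, with field names, scores, and the keep ratio chosen purely for illustration rather than taken from DeepSeek-GRM: many sampled GRM judgements are filtered by a meta reward model score, and the surviving pointwise rewards are summed per candidate response.

```python
def meta_rm_guided_vote(grm_samples, meta_rm_score, keep_ratio=0.5):
    """Hedged sketch of inference-time scaling with a meta reward model: keep only
    the sampled GRM judgements the meta RM rates highly, then sum their pointwise
    scores per candidate response. Field names and keep_ratio are illustrative."""
    ranked = sorted(grm_samples, key=meta_rm_score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    totals = {}
    for sample in kept:
        for response_id, score in sample["scores"].items():
            totals[response_id] = totals.get(response_id, 0) + score
    return max(totals, key=totals.get)   # response with the highest aggregated reward

# Toy usage: four sampled GRM judgements over two candidate responses "A" and "B".
samples = [
    {"scores": {"A": 8, "B": 5}, "meta": 0.9},
    {"scores": {"A": 3, "B": 9}, "meta": 0.2},   # judged low quality by the meta RM
    {"scores": {"A": 7, "B": 4}, "meta": 0.8},
    {"scores": {"A": 9, "B": 6}, "meta": 0.7},
]
print(meta_rm_guided_vote(samples, meta_rm_score=lambda s: s["meta"]))  # -> "A"
```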

    Using standard metrics, the study evaluates various RM methods across benchmarks like Reward Bench, PPE, RMB, and ReaLMistake. DeepSeek-GRM-27B consistently outperforms baselines and rivals strong public models like GPT-4o. Inference-time scaling, especially with voting and meta reward models, significantly boosts performance—achieving results comparable to much larger models. Ablation studies highlight the importance of components like principle generation and non-hinted sampling. Training-time scaling shows diminishing returns compared to inference-time strategies. Overall, DeepSeek-GRM, enhanced with SPCT and meta RM, offers robust, scalable reward modeling with reduced domain bias and strong generalization.

    In conclusion, the study presents SPCT, a method that improves inference-time scalability for GRMs through rule-based online reinforcement learning. SPCT enables adaptive principle and critique generation, enhancing reward quality across diverse tasks. DeepSeek-GRM models outperform several baselines and strong public models, especially when paired with a meta reward model for inference-time scaling. Using parallel sampling and flexible input handling, these GRMs achieve strong performance without relying on larger model sizes. Future work includes integrating GRMs into RL pipelines, co-scaling with policy models, and serving as reliable offline evaluators.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    UB-Mesh: A Cost-Efficient, Scalable Network Architecture for Large-Scale LLM Training

    As LLMs scale, their computational and bandwidth demands increase significantly, posing challenges for AI training infrastructure. Following scaling laws, LLMs improve comprehension, reasoning, and generation by expanding parameters and datasets, necessitating robust computing systems. Large-scale AI clusters now require tens of thousands of GPUs or NPUs, as seen in LLAMA-3’s 16K GPU training setup, which took 54 days. With AI data centers deploying over 100K GPUs, scalable infrastructure is essential. Additionally, interconnect bandwidth requirements surpass 3.2 Tbps per node, far exceeding traditional CPU-based systems. The rising costs of symmetrical Clos network architectures make cost-effective solutions critical, alongside optimizing operational expenses such as energy and maintenance. Moreover, high availability is a key concern, as massive training clusters experience frequent hardware failures, demanding fault-tolerant network designs.

    Addressing these challenges requires rethinking AI data center architecture. First, network topologies should align with LLM training’s structured traffic patterns, which differ from traditional workloads. Tensor parallelism, responsible for most data transfers, operates within small clusters, while data parallelism involves minimal but long-range communication. Second, computing and networking systems must be co-optimized, ensuring effective parallelism strategies and resource distribution to avoid congestion and underutilization. Lastly, AI clusters must feature self-healing mechanisms for fault tolerance, automatically rerouting traffic or activating backup NPUs when failures occur. These principles—localized network architectures, topology-aware computation, and self-healing systems—are essential for building efficient, resilient AI training infrastructures.

    Huawei researchers introduced UB-Mesh, an AI data center network architecture designed for scalability, efficiency, and reliability. Unlike traditional symmetrical networks, UB-Mesh employs a hierarchically localized nD-FullMesh topology, optimizing short-range interconnects to minimize switch dependency. Based on a 4D-FullMesh design, its UB-Mesh-Pod integrates specialized hardware and a Unified Bus (UB) technique for flexible bandwidth allocation. The All-Path Routing (APR) mechanism enhances data traffic management, while a 64+1 backup system ensures fault tolerance. Compared to Clos networks, UB-Mesh reduces switch usage by 98% and optical module reliance by 93%, achieving 2.04× cost efficiency with minimal performance trade-offs in LLM training.

    UB-Mesh is a high-dimensional full-mesh interconnect architecture designed to enhance efficiency in large-scale AI training. It employs an nD-FullMesh topology, minimizing reliance on costly switches and optical modules by maximizing direct electrical connections. The system is built on modular hardware components linked through a UB interconnect, streamlining communication across CPUs, NPUs, and switches. A 2D full-mesh structure connects 64 NPUs within a rack, extending to a 4D full-mesh at the Pod level. For scalability, a SuperPod structure integrates multiple Pods using a hybrid Clos topology, balancing performance, flexibility, and cost-efficiency in AI data centers.
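
    The topology itself is easy to sketch: in an nD-FullMesh, nodes are coordinates in an n-dimensional grid, and every pair of nodes that differs in exactly one coordinate is directly connected, giving a full mesh along each dimension. The example below builds an 8×8 2D mesh (64 NPUs, matching the rack-level figure above); the grid shape is otherwise illustrative.

```python
from itertools import product

def nd_fullmesh_links(dims):
    """Hedged sketch of an nD-FullMesh topology: connect every pair of nodes that
    differs in exactly one coordinate (a full mesh along each dimension)."""
    nodes = list(product(*[range(d) for d in dims]))
    links = set()
    for a in nodes:
        for axis in range(len(dims)):
            for v in range(dims[axis]):
                if v != a[axis]:
                    b = a[:axis] + (v,) + a[axis + 1:]
                    links.add(tuple(sorted((a, b))))   # undirected link, counted once
    return nodes, links

# An 8x8 2D full mesh: each NPU links to 7 peers per dimension (14 total),
# so there are 64 nodes and 64 * 14 / 2 = 448 direct links.
nodes, links = nd_fullmesh_links((8, 8))
print(len(nodes), len(links))
```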

    To enhance the efficiency of UB-Mesh in large-scale AI training, the researchers employ topology-aware strategies for optimizing collective communication and parallelization. For AllReduce, a Multi-Ring algorithm minimizes congestion by efficiently mapping paths and utilizing idle links to enhance bandwidth. In all-to-all communication, a multi-path approach boosts data transmission rates, while hierarchical methods optimize bandwidth for broadcast and reduce operations. Additionally, the study refines parallelization through a systematic search, prioritizing high-bandwidth configurations. Comparisons with the Clos architecture reveal that UB-Mesh maintains competitive performance while significantly reducing hardware costs, making it a cost-effective alternative for large-scale model training.

    At the hardware level, the UB IO controller incorporates a specialized co-processor, the Collective Communication Unit (CCU), to optimize collective communication tasks. The CCU manages data transfers, inter-NPU transmissions, and in-line data reduction using an on-chip SRAM buffer, minimizing redundant memory copies and reducing HBM bandwidth consumption. It also improves computation-communication overlap. Additionally, UB-Mesh efficiently supports massive-expert MoE models by leveraging hierarchical all-to-all optimization and load/store-based data transfer. In conclusion, the study introduces UB-Mesh, an nD-FullMesh network architecture for LLM training, offering cost-efficient, high-performance networking with 95%+ linearity, 7.2% improved availability, and 2.04× better cost efficiency than Clos networks.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    This AI Paper Unveils a Reverse-Engineered Simulator Model for Modern NVIDIA GPUs: Enhancing Microarchitecture Accuracy and Performance Prediction

    GPUs are widely recognized for their efficiency in handling high-performance computing workloads, such as those found in artificial intelligence and scientific simulations. These processors are designed to execute thousands of threads simultaneously, with hardware support for features like register file access optimization, memory coalescing, and warp-based scheduling. Their structure allows them to support extensive data parallelism and achieve high throughput on complex computational tasks increasingly prevalent across diverse scientific and engineering domains.

    A major challenge in academic research involving GPU microarchitectures is the dependence on outdated architecture models. Many studies still use the Tesla-based pipeline as their baseline, which was released more than fifteen years ago. Since then, GPU architectures have evolved significantly, including introducing sub-core components, new control bits for compiler-hardware coordination, and enhanced cache mechanisms. Continuing to simulate modern workloads on obsolete architectures misguides performance evaluations and hinders innovation in architecture-aware software design.

    Some simulators have tried to keep pace with these architectural changes. Tools like GPGPU-Sim and Accel-sim are commonly used in academia. Still, their updated versions lack fidelity in modeling key aspects of modern architectures such as Ampere or Turing. These tools often fail to accurately represent instruction fetch mechanisms, register file cache behaviors, and the coordination between compiler control bits and hardware components. A simulator that fails to represent such features can result in gross errors in estimated cycle counts and execution bottlenecks.

    Research introduced by a team from the Universitat Politècnica de Catalunya seeks to close this gap by reverse engineering the microarchitecture of modern NVIDIA GPUs. Their work dissects architectural features in detail, including the design of the issue and fetch stages, the behavior of the register file and its cache, and a refined understanding of how warps are scheduled based on readiness and dependencies. They also studied the effect of hardware control bits, revealing how these compiler hints influence hardware behavior and instruction scheduling.

    To build their simulation model, the researchers created microbenchmarks composed of carefully selected SASS instructions. These were executed on actual Ampere GPUs while recording clock counters to determine latency. Experiments used stream buffers to test specific behaviors such as read-after-write hazards, register bank conflicts, and instruction prefetching behavior. They also evaluated the operation of the dependence management mechanism, which uses a scoreboard to track in-flight consumers and prevent write-after-read hazards. This granular measurement enabled them to propose a model that reflects internal execution details far more precisely than existing simulators.
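As a rough illustration of the post-processing behind such measurements (not code from the paper), the sketch below shows how a per-instruction latency estimate can be derived from recorded clock counters: the overhead of timing an empty sequence is subtracted from the time of a dependent instruction chain, and the minimum over repetitions filters scheduling noise. The counter values and the chain length are made-up placeholders.

```python
import numpy as np

# Hypothetical clock-counter readings (one entry per repetition): around an empty
# timing sequence and around a chain of dependent instructions.
empty_start = np.array([1_000, 2_000, 3_000])
empty_end   = np.array([1_020, 2_020, 3_021])
chain_start = np.array([5_000, 6_000, 7_000])
chain_end   = np.array([5_340, 6_342, 7_341])
chain_length = 8  # number of dependent instructions timed back-to-back

# Overhead of the timing harness itself, estimated from the empty sequence.
overhead = np.min(empty_end - empty_start)

# Per-instruction latency: chain time minus harness overhead, divided by chain length;
# taking the minimum across repetitions filters occasional scheduling noise.
chain_cycles = np.min(chain_end - chain_start) - overhead
print(f"estimated latency: {chain_cycles / chain_length:.1f} cycles per instruction")
```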

    In terms of accuracy, the model developed by the researchers significantly outperformed existing tools. Compared with real hardware using the NVIDIA RTX A6000, the model achieved a mean absolute percentage error (MAPE) of 13.98%, which is 18.24% better than Accel-sim. The worst-case error in the proposed model never exceeded 62%, while Accel-sim reached errors up to 543% in some applications. Furthermore, their simulation showed a 90th percentile error of 31.47%, compared to 82.64% for Accel-sim. These results underline the enhanced precision of the proposed simulation framework in predicting GPU performance characteristics. The researchers verified that the model works effectively with other NVIDIA architectures like Turing, proving its portability and adaptability.
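For reference, the error metrics quoted here (MAPE, 90th percentile error, worst case) can be computed as in the generic Python sketch below; the cycle counts are invented placeholders, and this is not the authors' evaluation code.

```python
import numpy as np

# Hypothetical per-benchmark cycle counts: measured on hardware vs. predicted by a simulator.
measured  = np.array([1.00e6, 2.50e6, 4.00e5, 8.00e6, 1.20e6])
predicted = np.array([1.10e6, 2.20e6, 4.60e5, 7.10e6, 1.35e6])

# Absolute percentage error for each benchmark.
ape = np.abs(predicted - measured) / measured * 100.0

print(f"MAPE: {ape.mean():.2f}%")                               # mean absolute percentage error
print(f"90th percentile error: {np.percentile(ape, 90):.2f}%")
print(f"worst-case error: {ape.max():.2f}%")
```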

    The paper highlights a clear mismatch between academic tools and modern GPU hardware and presents a practical solution to bridge that gap. The proposed simulation model improves performance prediction accuracy and helps understand modern GPUs’ detailed design. This contribution can support future innovations in both GPU architecture and software optimization.


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.


    The post This AI Paper Unveils a Reverse-Engineered Simulator Model for Modern NVIDIA GPUs: Enhancing Microarchitecture Accuracy and Performance Prediction appeared first on MarkTechPost.

PilotANN: A Hybrid CPU-GPU System For Graph-based ANNS https://www.marktechpost.com/2025/03/30/pilotann-a-hybrid-cpu-gpu-system-for-graph-based-anns/ Sun, 30 Mar 2025 18:14:24 +0000


Approximate Nearest Neighbor Search (ANNS) is a fundamental vector search technique that efficiently identifies similar items in high-dimensional vector spaces. Traditionally, ANNS has served as the backbone for retrieval engines and recommendation systems; however, it struggles to keep pace with modern Transformer architectures that employ higher-dimensional embeddings and larger datasets. Unlike deep learning systems, which can be horizontally scaled thanks to their stateless nature, ANNS remains centralized, creating a severe single-machine throughput bottleneck. Empirical testing on 100-million-scale datasets reveals that even state-of-the-art CPU implementations of the Hierarchical Navigable Small World (HNSW) algorithm cannot maintain adequate performance as vector dimensions increase.
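For readers unfamiliar with graph-based ANNS, the following sketch shows a typical HNSW workflow using the hnswlib package: build a graph index over a vector corpus, then answer approximate top-k queries. Dataset sizes and parameter values are arbitrary placeholders, and the snippet is independent of PilotANN itself.

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n_items, k = 96, 10_000, 10
data = np.random.rand(n_items, dim).astype(np.float32)
queries = np.random.rand(100, dim).astype(np.float32)

# Build an HNSW graph index (CPU-only); M and ef_construction trade build time for recall.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n_items, ef_construction=200, M=16)
index.add_items(data, np.arange(n_items))

# ef controls the search beam width: larger ef gives higher recall but lower throughput.
index.set_ef(64)
labels, distances = index.knn_query(queries, k=k)
print(labels.shape, distances.shape)  # (100, 10) (100, 10)
```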

Previous research on large-scale ANNS has explored two optimization paths: index structure improvements and hardware acceleration. The Inverted Multi-Index (IMI) enhanced space partitioning through multi-codebook quantization, while PQFastScan improved performance with SIMD and cache-aware optimizations. DiskANN and SPANN introduced disk-based indexing for billion-scale datasets, addressing memory hierarchy challenges through different approaches. SONG and CAGRA achieved impressive speedups through GPU parallelization but remain constrained by GPU memory capacity. BANG handled billion-scale datasets via hybrid CPU-GPU processing but lacked critical CPU baseline comparisons. These methods frequently sacrifice compatibility or accuracy, or require specialized hardware.

Researchers from the Chinese University of Hong Kong, the Centre for Perceptual and Interactive Intelligence, and the Theory Lab of Huawei Technologies have proposed PilotANN, a hybrid CPU-GPU system designed to overcome the limitations of existing ANNS implementations. PilotANN addresses a central tension: CPU-only implementations struggle with the computational demands, while GPU-only solutions are constrained by limited memory capacity. It resolves this by combining the abundant RAM of the CPU host with the parallel processing capabilities of the GPU, organizing search as a three-stage graph traversal process: GPU-accelerated subgraph traversal using dimensionally-reduced vectors, CPU refinement, and a precise final search with complete vectors.

PilotANN fundamentally reimagines the vector search process through a “staged data ready processing” paradigm: it minimizes data movement across processing stages rather than adhering to the traditional “move data for computation” model. The search proceeds in three stages: GPU piloting with a subgraph and dimensionally-reduced vectors, residual refinement using the subgraph with full vectors, and final traversal employing the full graph and complete vectors. The design is cost-effective with only a single commodity GPU while scaling effectively across vector dimensions and graph complexity. Data transfer overhead is minimized to just the initial query vector movement to the GPU and a small candidate set returned to the CPU after GPU piloting.
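To make the staged idea concrete, here is a toy Python sketch, not PilotANN's actual implementation: a coarse first pass over dimensionally-reduced vectors restricted to a subgraph, refinement of the resulting candidates with full vectors, and a final neighborhood expansion and re-ranking with complete vectors. The subgraph sampling, the truncation-based dimensionality reduction, and the toy adjacency are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, dim_reduced, k = 5_000, 128, 32, 10
corpus = rng.standard_normal((n, dim)).astype(np.float32)
query  = rng.standard_normal(dim).astype(np.float32)

# Toy adjacency: each node's "graph neighbours" are just nearby ids, standing in for real graph edges.
def neighbours(ids):
    return np.unique(np.clip(ids[:, None] + np.arange(-2, 3), 0, n - 1))

# Stage 1 -- GPU piloting: cheap scoring of a sampled subgraph with dimensionally-reduced
# vectors (reduction modelled here as truncation); only a small candidate set moves back to the CPU.
subgraph_ids = rng.choice(n, size=1_000, replace=False)
d1 = np.linalg.norm(corpus[subgraph_ids, :dim_reduced] - query[:dim_reduced], axis=1)
candidates = subgraph_ids[np.argsort(d1)[:200]]

# Stage 2 -- residual refinement: re-score those candidates with full-precision vectors.
d2 = np.linalg.norm(corpus[candidates] - query, axis=1)
refined = candidates[np.argsort(d2)[:50]]

# Stage 3 -- final traversal: expand to graph neighbours over the full graph and rank
# with complete vectors to produce the top-k result.
frontier = neighbours(refined)
d3 = np.linalg.norm(corpus[frontier] - query, axis=1)
topk = frontier[np.argsort(d3)[:k]]
print(topk)
```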

    Experimental results show PilotANN’s performance advantages across diverse large-scale datasets. PilotANN achieves a 3.9 times throughput speedup on the 96-dimensional DEEP dataset compared to the HNSW-CPU baseline, with even more impressive gains of 5.1-5.4 times on higher-dimensional datasets. PilotANN delivers significant speedups even on the notoriously challenging T2I dataset despite no specific optimizations for this benchmark. Moreover, it shows remarkable cost-effectiveness despite utilizing more expensive hardware. While the GPU-based platform costs 2.81 USD/hour compared to the CPU-only solution at 1.69 USD/hour, PilotANN achieves 2.3 times cost-effectiveness for DEEP and 3.0-3.2 times for T2I, WIKI, and LAION datasets when measuring throughput per dollar.
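The cost-effectiveness figures follow directly from throughput per dollar; the small calculation below reproduces the DEEP number from the hourly prices and speedup quoted above.

```python
# Hourly prices and throughput speedup quoted for the DEEP dataset.
gpu_platform_cost = 2.81    # USD/hour, CPU plus a single commodity GPU
cpu_platform_cost = 1.69    # USD/hour, CPU-only baseline
throughput_speedup = 3.9    # PilotANN vs HNSW-CPU on DEEP

# Cost-effectiveness = relative throughput per dollar spent.
cost_effectiveness = throughput_speedup * (cpu_platform_cost / gpu_platform_cost)
print(f"{cost_effectiveness:.1f}x throughput per dollar")  # ~2.3x, matching the reported figure
```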

In conclusion, the researchers introduced PilotANN, an advancement in graph-based ANNS that effectively utilizes both CPU and GPU resources for emerging workloads. It delivers substantial gains over existing CPU-only approaches through the intelligent decomposition of top-k search into a multi-stage CPU-GPU pipeline and efficient entry selection. It democratizes high-performance nearest neighbor search by achieving competitive results with a single commodity GPU, making advanced search capabilities accessible to researchers and organizations with limited computing resources. Unlike alternative solutions that require expensive high-end GPUs, PilotANN enables efficient ANNS deployment on common hardware configurations while maintaining search accuracy.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

    The post PilotANN: A Hybrid CPU-GPU System For Graph-based ANNS appeared first on MarkTechPost.
