Sajjad Ansari, Author at MarkTechPost

RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Linear Attention models, State Space Models like Mamba, Linear RNNs like DeltaNet, and RWKV reduce this cost to linear complexity. However, these linear architectures still struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point. Even with continual pretraining on 128K-length data, its long-context limitations persist. This issue extends beyond RWKV to other architectures like Mamba, representing a fundamental challenge for this class of models.

Linear-complexity language models have emerged as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer parallelizability during training with RNN-like recurrent state representation, and has evolved through multiple iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each combine attention with recurrent or state-space components in different ways. Further, Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse-attention approaches include SeerAttention and Mixture of Block Attention (MoBA).

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and zero-initialization mechanism inspired by LLaMA Pro. The training follows a two-stage process:

  • First, the model trains on short 1024-token contexts from the MiniPile dataset while freezing all parameters except the newly added blocks. 
  • The second stage involves long-context continual pretraining using the ProLong-64K dataset and a context length of 64K tokens, processing approximately 1 billion tokens total. During this phase, all parameters are unfrozen and jointly optimized. The training employs Long-context Cross-Entropy (LongCE) loss, which dynamically weights tokens based on their importance.
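
To make the interleaved block expansion and the staged freezing schedule concrete, here is a minimal PyTorch-style sketch. The module names, the dense-attention stand-in for the sparse attention block, and the interleaving period are illustrative assumptions, not the released RWKV-X implementation.

```python
import torch.nn as nn

class SparseAttentionBlock(nn.Module):
    """Stand-in for RWKV-X's sparse attention block (dense attention used here for brevity)."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.proj.weight)  # zero-initialized output projection, so the expanded
        nn.init.zeros_(self.proj.bias)    # model initially behaves like the original (LLaMA Pro style)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + self.proj(out)         # residual branch starts as a no-op

def expand_with_sparse_blocks(rwkv_blocks, d_model, every=4):
    """Interleave new sparse-attention blocks among pretrained RWKV-7 blocks."""
    layers = []
    for i, block in enumerate(rwkv_blocks):
        layers.append(block)
        if (i + 1) % every == 0:
            layers.append(SparseAttentionBlock(d_model))
    return nn.ModuleList(layers)

def set_training_stage(layers, stage):
    """Stage 1: only the newly added blocks train; Stage 2: all parameters are unfrozen."""
    for layer in layers:
        trainable = (stage == 2) or isinstance(layer, SparseAttentionBlock)
        for p in layer.parameters():
            p.requires_grad = trainable
```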

Short-context evaluation reveals that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences: at 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, and this advantage widens as context length increases.

In this paper, researchers introduced RWKV-X, a hybrid language model that successfully combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, employs a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering effort is needed to optimize performance.
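
To illustrate what heuristic top-k chunk selection can look like, here is a small, self-contained sketch of chunk-level sparse attention for a single query token. The scoring rule (dot product between the query and each chunk's mean key) and the shapes are illustrative assumptions rather than RWKV-X's exact mechanism.

```python
import torch

def topk_chunk_attention(q, k, v, chunk_size=64, top_k=4):
    """Toy chunk-sparse attention: score past chunks heuristically, attend only to the top-k.

    q: (d,) query for the current token; k, v: (T, d) cached keys/values, T >= chunk_size.
    """
    T, d = k.shape
    n_chunks = T // chunk_size
    k_chunks = k[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)
    v_chunks = v[: n_chunks * chunk_size].view(n_chunks, chunk_size, d)

    chunk_scores = k_chunks.mean(dim=1) @ q                   # heuristic relevance per chunk
    keep = torch.topk(chunk_scores, k=min(top_k, n_chunks)).indices

    k_sel = k_chunks[keep].reshape(-1, d)                      # only the selected chunks
    v_sel = v_chunks[keep].reshape(-1, d)
    attn = torch.softmax(k_sel @ q / d ** 0.5, dim=0)          # attention restricted to kept tokens
    return attn @ v_sel                                        # (d,) context vector
```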


Check out the Paper.

Oversight at Scale Isn’t Guaranteed: MIT Researchers Quantify the Fragility of Nested AI Supervision with New Elo-Based Framework

Frontier AI companies are advancing toward artificial general intelligence (AGI), creating a need for techniques to ensure these powerful systems remain controllable and beneficial. A major approach to this challenge involves methods like Recursive Reward Modeling, Iterated Amplification, and Scalable Oversight, which aim to enable weaker systems to oversee stronger ones effectively. A key idea is that scalable oversight can be bootstrapped recursively, which is termed Nested Scalable Oversight (NSO). However, while discussions around NSO focus on qualitative guarantees and conceptual frameworks, other high-risk technologies are held to quantitative safety standards: civilian aircraft must maintain fatality rates below 10⁻⁵ per flight hour, and nuclear reactors must keep core damage frequency under 10⁻⁴ per year.

Scalable oversight processes, in which weaker AI systems monitor stronger ones, include iterated amplification, recursive reward modeling, AI safety via debate, market making, consultancy, self-critique, and doubly-efficient debate. Research on forecasting and scaling laws has focused on how model performance improves with size, data, and computational resources; however, these approaches can also apply to behaviors like oversight and deception. Further, long-term oversight roadmaps include OpenAI’s SuperAlignment plan for “human-level automated alignment researchers” and a comprehensive AGI safety framework. Moreover, previous studies have examined language model performance in oversight-related scenarios like social deduction games, AI debate, backdoor code detection, and wargame simulations.

Researchers from MIT have proposed a framework to quantify the probability of successful oversight as a function of overseer and overseen system capabilities. This framework models oversight as a game between capability-mismatched players, where each has oversight-specific and deception-specific Elo scores that follow a piecewise-linear function of their general intelligence. A theoretical study of Nested Scalable Oversight (NSO) is conducted, where trusted models oversee stronger untrusted models. This identifies conditions for NSO success and derives the optimal number of oversight levels to maximize success probability. The examples show NSO success rates below 52% when overseeing systems 400 Elo points stronger than the baseline overseer.
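
The paper's numbers come from fitted Elo curves and an analytic treatment; the sketch below only illustrates the flavor of such a calculation. The standard Elo win probability is exact, but the piecewise-linear parameters, the independence of oversight steps, and the even split of the capability gap across levels are placeholder assumptions for illustration.

```python
def elo_win_prob(guard_elo, houdini_elo):
    """Standard Elo expected score: probability the Guard (overseer) wins a single game."""
    return 1.0 / (1.0 + 10 ** ((houdini_elo - guard_elo) / 400.0))

def domain_elo(general_elo, intercept, slope_low, slope_high, breakpoint):
    """Piecewise-linear map from general intelligence (Elo) to a domain-specific
    oversight or deception Elo, mirroring the framework's parameterization."""
    if general_elo <= breakpoint:
        return intercept + slope_low * general_elo
    return intercept + slope_low * breakpoint + slope_high * (general_elo - breakpoint)

def nested_oversight_success(guard_general, target_general, n_steps,
                             oversight_params, deception_params):
    """Probability that every level of an n-step nested oversight chain succeeds,
    assuming independent steps and an evenly split capability gap (toy assumptions)."""
    step = (target_general - guard_general) / n_steps
    p = 1.0
    for i in range(n_steps):
        g = domain_elo(guard_general + i * step, *oversight_params)
        h = domain_elo(guard_general + (i + 1) * step, *deception_params)
        p *= elo_win_prob(g, h)
    return p
```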

The research examines four oversight games:

  • In the Mafia game, five villagers (Guards) attempt to identify one Mafia player (Houdini), creating a natural environment for testing deception and collaboration skills.
  • In the Debate, two adversarial agents argue opposing viewpoints while a judge (Guard) determines the correct side. Test cases are derived from TruthfulQA, BoolQ, QuaLITY, and PrOntoQA datasets. 
  • The Backdoor Code tests language models’ abilities to insert and detect hidden behaviors in code. A pipeline is built for generating backdoors using 225 Python samples from the APPS dataset at the competition difficulty level.
  • The Wargames simulates a fictional company named OpenMind approaching AGI development. It involves three players: the OpenMind AI (Houdini), the OpenMind Monitor (Guard), and a Narrator.

Mafia performance scales approximately linearly with general intelligence for both Mafia (Houdini) and Villager (Guard) roles, with Guards showing steeper improvement slopes. In Debate evaluations using TruthfulQA, Guard performance scales with general intelligence while Houdini performance remains relatively constant. In Backdoor Code tests, Guard and Houdini Elos scale with general intelligence, but asymmetrically: Guards maintain higher intercepts, indicating the task is inherently more difficult for Houdinis, with even the strongest deceptive models failing to exceed 50% success rates against weaker Guards. In Wargames, systematic analysis of 100 random Narrator summaries by Gemini 2.5 identified eight strategy categories, with most models showing similar strategy profiles regardless of capability level.

In conclusion, researchers introduced a framework to quantify the probability of successful oversight as a function of overseer and overseen system capabilities. While the oversight settings aim for realism, they remain simplified approximations. Each game has a fundamental constraint of simulating only single-step oversight without immediate generalizability to multiple steps. Future research could explore multi-stage oversight chains, similar to strong models that are successively fine-tuned on weaker models’ behaviors. Another significant limitation involves the modeling of deceptive behavior. The current approach prompts models to act deceptively rather than using fine-tuning or reinforcement learning to develop genuinely deceptive models.


Check out the Paper.

Subject-Driven Image Evaluation Gets Simpler: Google Researchers Introduce REFVNLI to Jointly Score Textual Alignment and Subject Consistency Without Costly APIs

Text-to-image (T2I) generation has evolved to include subject-driven approaches, which enhance standard T2I models by incorporating reference images alongside text prompts. This advancement allows for more precise subject representation in generated images. Despite the promising applications, subject-driven T2I generation faces a significant challenge: the lack of reliable automatic evaluation methods. Current metrics focus either on text-prompt alignment or subject consistency, even though both are essential for effective subject-driven generation. Evaluation methods that correlate better with human judgment do exist, but they rely on costly API calls to models like GPT-4, limiting their practicality for extensive research applications.

Evaluation approaches for Visual Language Models (VLMs) include various frameworks, with text-to-image (T2I) assessments focusing on image quality, diversity, and text alignment. Researchers utilize embedding-based metrics like CLIP and DINO for subject-driven generation evaluation to measure subject preservation. Complex metrics such as VIEScore and DreamBench++ utilize GPT-4o to evaluate textual alignment and subject consistency, but at a higher computational cost. Subject-driven T2I methods have developed along two main paths: fine-tuning general models into specialized versions capturing specific subjects and styles, or enabling broader applicability through one-shot examples. These one-shot approaches include adapter-based and adapter-free techniques.

Researchers from Google Research and Ben Gurion University have proposed REFVNLI, a cost-efficient metric that simultaneously evaluates textual alignment and subject preservation in subject-driven T2I generation. It predicts two scores, textual alignment and subject consistency, in a single classification based on a triplet ⟨image_ref, prompt, image_tgt⟩. It is trained on an extensive dataset derived from video-reasoning benchmarks and image perturbations, outperforming or matching existing baselines across multiple benchmarks and subject categories. REFVNLI shows improvements of up to 6.4 points in textual alignment and 8.5 points in subject consistency. It is effective with lesser-known concepts, where it aligns with human preferences at over 87% accuracy.

For training REFVNLI, a large-scale dataset of triplets ⟨image_ref, prompt, image_tgt⟩, labeled with ⟨textual alignment, subject preservation⟩, is curated automatically. REFVNLI is evaluated on multiple human-labeled test sets for subject-driven generation, including DreamBench++, ImagenHub, and KITTEN. The evaluation spans diverse categories such as Humans, Animals, Objects, Landmarks, and multi-subject settings. The training process involves fine-tuning PaliGemma, a 3B Vision-Language Model, focusing on a variant adapted for multi-image inputs. During inference, the model takes two images and a prompt with special markups around the referenced subject, performing sequential binary classifications for textual alignment and subject preservation.
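
A minimal sketch of this inference procedure is shown below. The `vlm.yes_probability` helper, the markup tags, and the question wording are hypothetical stand-ins for the fine-tuned PaliGemma interface, included only to show how two binary scores can be read off one model.

```python
def refvnli_score(vlm, image_ref, image_tgt, prompt, subject_span):
    """Sketch of REFVNLI-style scoring: two binary questions answered by one VLM,
    each scored by the probability mass the model assigns to a 'yes' answer."""
    # Mark the referenced subject inside the prompt, e.g. "<subject>the cat</subject>".
    marked_prompt = prompt.replace(subject_span, f"<subject>{subject_span}</subject>")

    textual_alignment = vlm.yes_probability(
        images=[image_ref, image_tgt],
        question=f"Does the target image match this prompt?\n{marked_prompt}",
    )
    subject_preservation = vlm.yes_probability(
        images=[image_ref, image_tgt],
        question=f"Is the marked subject in the target image the same subject as in the reference image?\n{marked_prompt}",
    )
    return textual_alignment, subject_preservation
```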

For subject consistency, REFVNLI ranks among the top two metrics across all categories and performs best in the Object category, exceeding the GPT-4o-based DreamBench++ by 6.3 points. On ImagenHub, REFVNLI achieves top-two rankings for textual alignment in the Animals category and the highest score for Objects, outperforming the best non-finetuned model by 4 points. It also performs well in multi-subject settings, ranking in the top three. REFVNLI achieves the highest textual alignment score on KITTEN, but has limitations in subject consistency due to its identity-sensitive training, which penalizes even minor mismatches in identity-defining traits. Ablation studies reveal that joint training provides complementary benefits, with single-task training resulting in performance drops.

In this paper, researchers introduced REFVNLI, a reliable, cost-effective metric for subject-driven T2I generation that addresses both textual alignment and subject preservation challenges. Trained on an extensive auto-generated dataset, REFVNLI effectively balances robustness to identity-agnostic variations such as pose, lighting, and background with sensitivity to identity-specific traits, including facial features, object shape, and unique details. Future research directions include enhancing REFVNLI’s evaluation capabilities across artistic styles, handling textual modifications that explicitly alter identity-defining attributes, and improving the processing of multiple reference images for single and distinct subjects.


Check out the Paper.

ThinkPRM: A Generative Process Reward Model for Scalable Reasoning Verification

Reasoning with LLMs can benefit from utilizing more test-time compute, which depends on high-quality process reward models (PRMs) to select promising paths for search or ranking. PRMs score problem-solution pairs to indicate whether the solution is correct, and have typically been implemented as discriminative classifiers. However, these models require extensive resources, including human annotation, gold step-by-step solutions, or computationally intensive rollouts. LLM-as-a-judge approaches offer advantages in data efficiency and interpretability, but they perform poorly compared to specialized reward models for complex reasoning tasks, failing to recognize incorrect reasoning. The challenge is to retain the data-efficiency and interpretability advantages while achieving the superior performance of discriminative PRMs.

Research approaches to solve process verification challenges have followed three main paths. Discriminative PRMs function as classifiers that predict numerical correctness scores for each reasoning step, requiring extensive step-level annotations. Generative PRMs frame verification as a language-generation task, producing correctness decisions as natural language tokens accompanied by verification chain-of-thought (CoT). These models compute correctness scores through conditional token probabilities like P(“correct”), making them inherently interpretable and scalable. Test-time scaling techniques like Best-of-N selection and tree-based search improve reasoning performance using additional inference-time compute. The effectiveness of these approaches depends heavily on verifier quality for scoring solutions.
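
To make the P("correct") scoring concrete, here is a minimal sketch using a Hugging Face-style causal LM and tokenizer. The prompt wording, the "correct"/"incorrect" label tokens, and taking only the first subword of each label are illustrative simplifications, not ThinkPRM's exact verification format.

```python
import torch

def step_correct_probability(model, tokenizer, problem, steps_so_far):
    """Score a partial solution with a generative PRM by comparing the probability
    of 'correct' vs 'incorrect' at the verifier's judgment position."""
    prompt = (
        f"Problem: {problem}\n"
        f"Solution so far:\n{steps_so_far}\n"
        "Is the last step correct or incorrect? Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]               # next-token distribution
    probs = torch.softmax(logits, dim=-1)

    correct_id = tokenizer.encode(" correct", add_special_tokens=False)[0]     # first subword only
    incorrect_id = tokenizer.encode(" incorrect", add_special_tokens=False)[0]
    p_c, p_i = probs[correct_id], probs[incorrect_id]
    return (p_c / (p_c + p_i)).item()                        # normalized correctness score
```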

Researchers from the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign have proposed THINKPRM, a long CoT verifier fine-tuned on significantly fewer process labels than those required by discriminative PRMs. It uses the inherent reasoning abilities of long CoT models to outperform both LLM-as-a-Judge and discriminative verifiers while using only 1% of process labels in PRM800K across several challenging benchmarks. Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a ProcessBench subset, highlighting the value of generative, long CoT PRMs for scaling test-time verification compute with minimal supervision.

THINKPRM is evaluated against DiscPRM, the same base model fine-tuned with binary cross-entropy on the entire PRM800K dataset containing 712K process labels from 98K problem-solution pairs. Additional comparisons include unweighted majority voting and verifier-weighted majority for best-of-N experiments. Results are reported on MATH-500 (100 problems covering all difficulty levels), 2024 American Invitational Mathematics Examination (AIME) problems, and out-of-domain tasks including physics problems from GPQA-Diamond and a 200-problem subset from LiveCodeBench v5. For MATH-500, researchers used THINKPRM-1.5B and THINKPRM-14B with two different generator models.

On best-of-N selection with MATH-500, THINKPRM achieves higher or comparable reasoning accuracy to DiscPRM across all sampling budgets. Under verifier-guided search on MATH-500, THINKPRM-1.5B outperforms DiscPRM by approximately 5 percentage points and surpasses LLM-as-a-judge using the same base model (R1-Qwen-1.5B). THINKPRM-1.5B’s scaling curve exceeds all baselines when compared to strong off-the-shelf PRMs such as RLHFFlow-Deepseek-PRM and MATH-Shepherd-PRM, outperforming RLHFFlow-Deepseek-PRM by over 7% at 16 beams. For out-of-domain evaluation, THINKPRM shows better scaling than DiscPRM on GPQA-physics, outperforming it by 8%, while on LiveCodeBench, THINKPRM surpasses DiscPRM by 4.5%.

In conclusion, researchers introduced THINKPRM, a generative process reward model trained with minimal supervision on synthetic data, allowing efficient and scalable verification of step-by-step reasoning. Researchers show that lightweight fine-tuning of generative PRMs on as few as 8K process labels can improve upon zero-shot LLM-as-a-judge baselines. THINKPRM also surpasses discriminative PRMs trained with orders of magnitude more process labels, highlighting the advantages of utilizing generative language-modeling objectives for interpretability, scalability, and data efficiency. The results underscore the potential of generative PRMs to scale verification compute at test-time effectively, benefiting challenging domains such as mathematical and scientific reasoning.


Check out the Paper.

Researchers from Sea AI Lab, UCAS, NUS, and SJTU Introduce FlowReasoner: a Query-Level Meta-Agent for Personalized System Generation

LLM-based multi-agent systems characterized by planning, reasoning, tool use, and memory capabilities form the foundation of applications like chatbots, code generation, mathematics, and robotics. However, these systems face significant challenges as they are manually designed, leading to high human resource costs and limited scalability. Graph-based methods have attempted to automate workflow designs by formulating workflows as networks, but their structural complexity restricts scalability. State-of-the-art approaches represent multi-agent systems as programming code and use advanced LLMs as meta-agents to optimize workflows, but focus on task-level solutions that generate single task-specific systems. This one-size-fits-all approach lacks the capability for automatic adaptation to individual user queries.

LLM-based multi-agent systems are the foundation for various real-world applications, including code intelligence, computer use, and deep research. These systems feature LLM-based agents equipped with planning capabilities, database access, and tool function invocation that collaborate to achieve promising performance. Early approaches focused on optimizing prompts or hyperparameters through evolutionary algorithms to automate agent profiling. ADAS introduced code representation for agents and workflows with a meta-agent to generate workflows. Moreover, OpenAI has advanced reasoning in LLMs by developing the o1 model, and models like QwQ, QvQ, DeepSeek, and Kimi have followed suit, developing o1-like reasoning architectures. OpenAI’s o3 model achieves promising results on the ARC-AGI benchmark.

Researchers from the Sea AI Lab, Singapore, the University of Chinese Academy of Sciences, the National University of Singapore, and Shanghai Jiao Tong University have proposed FlowReasoner, a query-level meta-agent designed to automate the creation of query-level multi-agent systems, generating one customized system per user query. The researchers distilled DeepSeek R1 to supply FlowReasoner with the fundamental reasoning capabilities needed to create multi-agent systems, and then enhanced it through reinforcement learning with external execution feedback. A multi-purpose reward mechanism is developed to optimize training across three critical dimensions: performance, complexity, and efficiency. This enables FlowReasoner to generate personalized multi-agent systems through deliberative reasoning for each unique user query.
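
The multi-purpose reward is described only at a high level, so the sketch below is a toy interpretation: performance measured by unit-test pass rate, complexity proxied by the number of agents in the generated workflow, and efficiency proxied by token usage, with hypothetical weights.

```python
def multi_purpose_reward(passed_tests, total_tests, n_agents, total_tokens,
                         w_perf=1.0, w_complex=0.1, w_eff=0.05):
    """Toy reward trading off performance, workflow complexity, and execution efficiency.
    The weights and proxies are illustrative assumptions, not FlowReasoner's exact formula."""
    performance = passed_tests / max(total_tests, 1)       # fraction of unit tests passed
    complexity_penalty = w_complex * n_agents              # prefer simpler workflows
    efficiency_penalty = w_eff * (total_tokens / 1000.0)   # prefer cheaper executions
    return w_perf * performance - complexity_penalty - efficiency_penalty
```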

The researchers select three datasets for detailed evaluation across diverse code generation scenarios: BigCodeBench for engineering-oriented tasks, and HumanEval and MBPP for algorithmic challenges. FlowReasoner is evaluated against three categories of baselines:

  • Single-model direct invocation using standalone LLMs
  • Manually designed workflows including Self-Refine, LLM-Debate, and LLM-Blender with human-crafted reasoning strategies
  • Automated workflow optimization methods like Aflow, ADAS, and MaAS that construct workflows through search or optimization. 

Both o1-mini and GPT-4o-mini are used as worker models for manually designed workflows. FlowReasoner is implemented with two variants of DeepSeek-R1-Distill-Qwen (7B and 14B parameters) using o1-mini as the worker model.

FlowReasoner-14B outperforms all competing approaches, achieving an overall improvement of 5 percentage points compared to the strongest baseline, MaAS. It exceeds the performance of its underlying worker model, o1-mini, by a substantial margin of 10%. These results show the effectiveness of the workflow-based reasoning framework in enhancing code generation accuracy. To evaluate generalization capabilities, experiments are conducted replacing the o1-mini worker with models like Qwen2.5-Coder, Claude, and GPT-4o-mini, while keeping the meta-agent fixed as either FlowReasoner-7B or FlowReasoner-14B. FlowReasoner exhibits notable transferability, maintaining consistent performance across different worker models on the same tasks.

In this paper, researchers present FlowReasoner, a query-level meta-agent designed to automate the creation of personalized multi-agent systems for individual user queries. FlowReasoner utilizes external execution feedback and reinforcement learning with multi-purpose rewards focusing on performance, complexity, and efficiency to generate optimized workflows without relying on complex search algorithms or carefully designed search sets. This approach reduces human resource costs while enhancing scalability by enabling more adaptive and efficient multi-agent systems that dynamically optimize their structure based on specific user queries rather than relying on fixed workflows for entire task categories.


Check out the Paper and GitHub Page.

Optimizing Reasoning Performance: A Comprehensive Analysis of Inference-Time Scaling Methods in Language Models

Language models have shown great capabilities across various tasks. However, complex reasoning remains challenging, as it often requires additional computational resources and specialized techniques. This challenge has motivated the development of inference-time compute (ITC) scaling methods, which allocate additional computational resources to enhance model outputs during inference. The landscape of language model reasoning has evolved along two primary dimensions: approaches that boost reasoning capabilities during inference, and a new class of “reasoning models”. However, both introduce significant computational overhead, raising critical questions about efficiency and the optimal trade-off between computational resources and reasoning performance.

Inference-time scaling has emerged as a promising alternative to costly model pretraining. Inference-time architectures that combine techniques such as generation ensembling, sampling, ranking, and fusion can exceed the performance of individual models, as demonstrated by approaches like Mixture-of-Agents, LLM Blender, and orchestration frameworks like DSPy. Techniques like chain-of-thought and branch-solve-merge enhance reasoning capabilities even for single models. To reduce computational cost, methods like Confidence-Informed Self-Consistency (CISC) use confidence-weighted voting, cutting the number of required samples significantly. Another technique, DivSampling, injects prompt perturbations to increase answer diversity, boosting performance across various tasks.

Researchers from Duke University, Together AI, the University of Chicago, and Stanford University have proposed a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models on challenging reasoning tasks. By constructing the Pareto frontier of quality and efficiency, the researchers discovered that non-reasoning models, even with extremely high inference budgets, still fall substantially behind reasoning models. For reasoning models, majority voting is a robust inference strategy, competitive with or outperforming other more complex ITC methods like best-of-N and sequential revisions. The researchers performed in-depth analyses of the association between key response features and response quality.
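
Majority voting is simple enough to state in a few lines; a confidence-weighted variant in the spirit of CISC (mentioned above) is shown alongside it for comparison. How confidences are obtained, whether from self-reported scores or answer likelihoods, is left abstract here.

```python
from collections import defaultdict

def majority_vote(answers):
    """Plain self-consistency: the most frequent final answer wins."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def weighted_vote(answers, confidences):
    """Confidence-weighted voting: each sampled answer contributes its confidence instead of 1."""
    scores = defaultdict(float)
    for a, c in zip(answers, confidences):
        scores[a] += c
    return max(scores, key=scores.get)

# Example: five sampled chains; the confidences can flip the decision.
print(majority_vote(["42", "41", "42", "42", "41"]))                              # -> "42"
print(weighted_vote(["42", "41", "42", "42", "41"], [0.4, 0.9, 0.5, 0.3, 0.8]))   # -> "41"
```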

Researchers observed that R1-Distilled versions of Llama-3.3-70B significantly outperform their original Instruct counterparts. Despite using complex inference-time scaling methods, non-reasoning models fail to match the performance of purpose-built reasoning models. This empirical evidence suggests that, for compute-optimal approaches, investing in training specialized reasoning models may provide substantially better long-term efficiency than repeated inference-time scaling of general models. Training-free, verifier-free inference-time scaling methods offer minimal improvements for reasoning models: almost all of them underperform majority voting for both DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-Distill-Qwen-32B.

Non-reasoning models show a clear absence of correlation between response length and correctness across most tasks, with response length gaps being consistently low. The only exception is Llama-3.1-8B-Instruct, which displays a non-negligible gap on the AIME task. In contrast, reasoning models demonstrate a clearer trend where shorter, more precise responses tend to be more accurate, providing evidence of an inverse relationship between response length and accuracy. This phenomenon reflects the complex reasoning mechanisms inherent in these models. Moreover, analysis of the MATH dataset, with its natural difficulty gradient, confirms that reasoning models tend to generate more accurate responses with shorter lengths for high-difficulty problems.

In conclusion, researchers thoroughly evaluate verifier-free inference-time scaling methods for LLMs, emphasizing their efficiency and effectiveness in reasoning tasks. Despite using advanced scaling techniques and significant computational resources, non-reasoning models consistently lag behind specialized reasoning models like R1-Distilled Models. For reasoning models, simpler strategies such as majority voting often surpass more intricate methods like best-of-N or sequential revisions in performance. Moreover, the correct responses are shorter and feature fewer linguistic markers, indicating these traits could serve as predictors of accuracy. Utilizing these response characteristics and linguistic marker features to enhance inference methods can be an intriguing future direction.


Check out the Paper.

LLMs Can Now Simulate Massive Societies: Researchers from Fudan University Introduce SocioVerse, an LLM-Agent-Driven World Model for Social Simulation with a User Pool of 10 Million Real Individuals

Human behavior research strives to comprehend how individuals and groups act in social contexts, and it forms a foundational element of social science. Traditional methodologies like surveys, interviews, and observations face significant challenges, including high costs, limited sample sizes, and ethical concerns. These challenges have pushed researchers toward alternative approaches for studying human behavior. Social simulation, for example, is an effective alternative: it uses agents to model human behavior, observes their reactions, and translates the findings into meaningful insights.

Recent studies have explored social simulation across various levels, from mimicking specific individuals to modeling large-scale social dynamics. However, these simulations consistently face a critical challenge of maintaining alignment between the simulated environment and the real world. This alignment issue manifests across multiple dimensions and raises the following questions:

  • How should the simulated environment be aligned with the real world?
  • How should the simulated agents be precisely aligned with target users?
  • How should the interaction mechanism be aligned with the real world among different scenarios?
  • How should the behavioral pattern be aligned with the real-world groups?

Researchers from Fudan University, Shanghai Innovation Institute, University of Rochester, Indiana University, and Xiaohongshu Inc. have proposed SocioVerse, a world model for social simulation powered by LLM-based agents built upon a large-scale real-world user pool. Modular components are designed to address the above four questions. The Social Environment component incorporates up-to-date external real-world information into simulations, while the User Engine and Scenario Engine reconstruct realistic user contexts and arrange simulation processes to align with reality. Based on this rich contextual setup, the Behavior Engine drives agents to reproduce human behaviors. To support this framework, researchers have constructed a massive user pool containing 10 million individuals based on real social media data, comparable to the entire populations of Hungary or Greece. 

SocioVerse is validated through three simulations: presidential election prediction, breaking news feedback, and a national economic survey. For the U.S. presidential election prediction, researchers designed a questionnaire based on established polls from various media outlets and research institutes; its evaluation metrics are accuracy rate and Root Mean Square Error (RMSE). The breaking news feedback simulation utilizes the ABC attitude model (Affect, Behavior, Cognition) combined with a 5-point Likert scale, and its evaluation metrics are Normalized RMSE (NRMSE) and KL-divergence. For the national economic survey of China, spending details from the China Statistical Yearbook 2024 are categorized into eight parts, such as food, clothing, and housing; the evaluation metrics are again NRMSE and KL-divergence.
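
For reference, here is a minimal sketch of the two reported metrics computed between simulated and observed response distributions. Normalizing the RMSE by the range of the observed values is one common convention; the paper's exact normalization may differ.

```python
import numpy as np

def nrmse(simulated, observed):
    """Root-mean-square error normalized by the range of the observed values."""
    simulated, observed = np.asarray(simulated, float), np.asarray(observed, float)
    rmse = np.sqrt(np.mean((simulated - observed) ** 2))
    return rmse / (observed.max() - observed.min() + 1e-12)

def kl_divergence(p_observed, q_simulated, eps=1e-12):
    """KL(P_observed || Q_simulated) between answer-option distributions (e.g., Likert shares)."""
    p = np.asarray(p_observed, float) + eps
    q = np.asarray(q_simulated, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: shares of a 5-point Likert item, real survey vs. simulated agents.
real = [0.10, 0.20, 0.30, 0.25, 0.15]
sim = [0.12, 0.18, 0.28, 0.27, 0.15]
print(nrmse(sim, real), kl_divergence(real, sim))
```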

For the presidential election prediction, GPT-4o-mini and Qwen2.5-72B show competitive performance on the accuracy and RMSE metrics. Following the winner-takes-all rule, over 90% of state voting results are predicted correctly, achieving high-precision macroscopic alignment with real-world election outcomes. In the breaking news feedback scenario, GPT-4o and Qwen2.5-72B align most closely with real-world perspectives on KL-divergence and NRMSE, successfully capturing public trends and opinions. For the national economic survey, Llama3-70B shows superior performance. Models generally perform better in developed regions (top 10 GDP regions) than overall, showing SocioVerse’s ability to reproduce individual spending habits accurately.

In conclusion, researchers introduce a generalized social simulation framework called SocioVerse and evaluate its performance across three distinct real-world scenarios. Their findings indicate that state-of-the-art LLMs show a notable ability to simulate human responses in complex social contexts. Future research needs to incorporate a broader range of scenarios and develop more fine-grained evaluations built upon the current analytic engine to explore and expand the boundaries of LLMs’ simulation capabilities further. Such efforts could pave the way for establishing LLMs as reliable tools for large-scale social simulation, transforming how researchers approach the study of human behavior in diverse social environments.


Check out the Paper and GitHub Page.

LLMs Can Now Retain High Accuracy at 2-Bit Precision: Researchers from UNC Chapel Hill Introduce TACQ, a Task-Aware Quantization Approach that Preserves Critical Weight Circuits for Compression Without Performance Loss

LLMs show impressive capabilities across numerous applications, yet they face challenges due to computational demands and memory requirements. This challenge is acute in scenarios requiring local deployment for privacy reasons, such as processing sensitive patient records, or in compute-constrained environments like real-time customer service systems and edge devices. Post-training quantization (PTQ) is a promising solution that allows efficient compression of pre-trained models, reducing memory consumption by 2-4 times. However, current methods hit a bottleneck at 4-bit compression, with substantial performance degradation when attempting 2- or 3-bit precision. Most PTQ methods rely on small mini-batches of general-purpose pre-training data to account for activation changes resulting from quantization.

Current methods for LLM compression primarily fall into three categories. Uniform quantization represents the most basic approach, where weights stored as 16-bit float tensors are compressed by treating each row independently, mapping floats to integers based on maximum and minimum values within each channel. GPTQ-based quantization techniques advance this concept by focusing on layerwise reconstruction, aiming to minimize reconstruction loss after quantization. Further, Mixed-precision quantization methods offer a more nuanced strategy, moving beyond fixed precision for all weights. These techniques assign bit-width based on weight importance to maintain performance, with some approaches preserving high-sensitivity “outlier” weights at higher precision.
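
A minimal sketch of this baseline, per-row asymmetric min-max quantization, makes the idea concrete. It returns dequantized ("fake-quantized") weights so the induced error is easy to inspect; the bit-width and rounding details are generic choices rather than any specific method's.

```python
import torch

def quantize_rowwise(weight, bits=4):
    """Asymmetric min-max uniform quantization with one scale and zero-point per row."""
    qmax = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)

    codes = torch.clamp(torch.round(weight / scale + zero_point), 0, qmax)  # integer codes
    return (codes - zero_point) * scale                                      # dequantized weights

w = torch.randn(8, 16)
print((w - quantize_rowwise(w, bits=2)).abs().mean())   # the error grows sharply at 2 bits
```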

Researchers from UNC Chapel Hill have proposed a novel mixed-precision post-training quantization approach called Task-Circuit Quantization (TACQ). The method shows similarities to automated circuit discovery by directly conditioning the quantization process on specific weight circuits, defined as sets of weights associated with downstream task performance. TACQ compares unquantized model weights with uniformly quantized ones to estimate the expected weight changes from quantization, then uses gradient information to predict the impact on task performance, enabling preservation of task-specific weights. TACQ consistently outperforms baselines with the same calibration data and lower weight budgets, and achieves significant improvements in the challenging 2-bit and 3-bit regimes.

TACQ is defined by a saliency metric that identifies critical weights to preserve during quantization, building on concepts from model interpretability like automatic circuit discovery, knowledge localization, and input attribution. This metric uses two components:

  • Quantization-aware Localization (QAL): Traces how model performance is affected by estimating the expected weight changes caused by quantization.
  • Magnitude-sharpened Gradient (MSG): A generalized metric for absolute weight importance adapted from input attribution techniques.

MSG helps stabilize TACQ and addresses biases from QAL’s estimations. These factors combine into a unified saliency metric that can be efficiently evaluated for every weight in a single backward pass, allowing preservation of the top p% highest-scoring weights at 16-bit precision.
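
To make the saliency computation concrete, the simplified sketch below combines a QAL-like term with an MSG-like term and keeps the top-scoring fraction of weights at 16-bit precision. The exact way the two terms are combined and the keep fraction are illustrative simplifications of the paper's metric.

```python
import torch

def tacq_mixed_precision(weight, grad, quantize_fn, p_keep=0.005):
    """Sketch of a TACQ-style saliency score and mixed-precision weight tensor.

    weight: full-precision weights; grad: gradient of the task loss w.r.t. weight
    (from one backward pass on calibration data); quantize_fn: the low-bit quantizer.
    """
    delta_w = quantize_fn(weight) - weight              # expected change caused by quantization
    qal = (grad * delta_w).abs()                        # Quantization-aware Localization term
    msg = (grad * weight).abs()                         # Magnitude-sharpened Gradient term
    saliency = qal * msg                                # simplified combination of the two terms

    k = max(1, int(p_keep * weight.numel()))            # e.g., keep the top 0.5% of weights
    threshold = saliency.flatten().topk(k).values.min()
    keep_fp16 = saliency >= threshold                   # these weights stay at 16-bit precision
    return torch.where(keep_fp16, weight, quantize_fn(weight)), keep_fp16
```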

In the challenging 2-bit setting, TACQ outperforms SliM-LLM with absolute margin improvements of 16.0% (from 20.1% to 36.1%) on GSM8k, 14.1% (from 34.8% to 49.2%) on MMLU, and 21.9% (from 0% to 21.9%) on Spider. Other baseline methods like GPTQ, SqueezeLLM, and SPQR deteriorate to near-random performance at this compression level. At 3-bit precision, TACQ preserves approximately 91%, 96%, and 89% of the unquantized accuracy on GSM8k, MMLU, and Spider, respectively, while outperforming the strongest baseline, SliM-LLM, by 1-2% across most datasets. TACQ’s advantages become evident in generation tasks requiring sequential token outputs, where it is the only method capable of recovering non-negligible performance in the 2-bit setting for the Spider text-to-SQL task.

In conclusion, researchers introduced TACQ, a significant advancement in task-aware post-training quantization. It improves model performance at ultra-low bit-widths (2- to 3-bits) where previous methods degrade to near-random outputs. TACQ aligns with automatic circuit discovery research by selectively preserving only a small fraction of salient weights at 16-bit precision, indicating that sparse weight “circuits” disproportionately influence specific tasks. Moreover, experiments on Spider show that TACQ better preserves model generation capabilities, making it suitable for program-prediction tasks. This also applies to situations involving agents, where models frequently generate many executable outputs, and where efficiency is a concern.


Check out the Paper and GitHub Page.

ReTool: A Tool-Augmented Reinforcement Learning Framework for Optimizing LLM Reasoning with Computational Tools

Reinforcement learning (RL) is a powerful technique for enhancing the reasoning capabilities of LLMs, enabling them to develop and refine long Chain-of-Thought (CoT). Models like OpenAI o1 and DeepSeek R1 have shown great performance in text-based reasoning tasks; however, they face limitations on tasks that require precise numerical calculations or symbolic manipulations, such as geometric reasoning, complex computations, or equation solving. Recent research has explored prompting and supervised fine-tuning methods to equip LLMs with tool-use capabilities, but these are constrained by their reliance on imitating curated data distributions. This often results in poor generalization beyond seen patterns and an inability to determine when and how to invoke external tools.

Recent advancements in LLMs show progress toward human-like metacognition through CoT prompting. Research has evolved from train-time scaling to test-time scaling, allocating additional computational resources during inference to generate intermediate reasoning steps. Techniques like stepwise preference optimization, Monte Carlo Tree Search, and RL have improved multi-step mathematical reasoning, as evidenced by models like OpenAI-o1 and DeepSeek-R1. In addition to CoT, Program-of-Thought reasoning integrates external computational tools such as Python interpreters to simplify complex reasoning steps. Further, tool-integrated reasoning was initially introduced to help LLMs solve computationally intensive problems through programming strategies.

Researchers from ByteDance Seed have proposed ReTool, a code-interpreter-powered (CI-powered) RL framework designed to address math problem-solving tasks. It enhances long-form reasoning with tool-integrated learning through two key features. First, it enables dynamic interleaving of real-time code execution within natural language reasoning processes. Second, it implements an automated RL technique that allows policy rollouts with multi-turn real-time code execution, teaching the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework that begins with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models.

ReTool consists of two primary stages: cold-start supervised fine-tuning, followed by RL with interleaved code-execution rollouts. The data pipeline begins with collecting high-quality mathematical reasoning data from diverse sources, including open-source datasets like OpenThoughts. A dual-verification approach combining human expert curation and DeepSeek-R1 evaluation filters out invalid data. From this foundation, code-integrated reasoning data is automatically constructed. The VeRL framework is employed with PPO as the RL method for training. The maximum sequence length is set to 16384 tokens, with a 512 mini-batch size and a KL coefficient of 0.0, using Qwen2.5-32B-Instruct as the main backbone.
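
A simplified rollout loop illustrates how generation and real-time code execution can be interleaved. The `generate` and `run_sandboxed` helpers and the `<code>`/`<interpreter>` tag convention are hypothetical stand-ins used only to show the control flow, not ReTool's actual interface.

```python
def rollout_with_interpreter(generate, run_sandboxed, question, max_turns=8):
    """Sketch of an interleaved rollout: the model reasons in text, opens a <code>
    block when it wants computation, receives the sandbox output as <interpreter>
    feedback, and continues until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        chunk = generate(transcript, stop=["</code>"])   # generation pauses when a code block closes
        transcript += chunk
        if "<code>" in chunk:                            # the model asked for real execution
            code = chunk.split("<code>", 1)[1]
            result = run_sandboxed(code)
            transcript += f"</code>\n<interpreter>{result}</interpreter>\n"
            continue                                      # feed the output back and keep reasoning
        if "Final Answer:" in chunk:                      # answer reached without another tool call
            break
    return transcript
```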

ReTool enables the LLM to utilize the code interpreter flexibly during the RL stage, leading to substantial performance improvements. ReTool (Qwen2.5-32B-Instruct) achieves accuracies of 67.0% on AIME2024 and 49.3% on AIME2025 with only 400 training steps. This outperforms the text-based RL baseline (Qwen2.5-32B-Instruct), which attains 40.0% and 36.7% on the respective benchmarks despite using over 1000 training steps. Moreover, on AIME2024, ReTool (Qwen2.5-32B-Instruct) surpasses the competitive baseline s1-32B by 10.3%. Similarly, on AIME2025, it achieves an 11.4% gain over OpenAI’s o1-preview. When combined with a more advanced backbone, ReTool (DeepSeek-R1-Distill-Qwen-32B) further improves performance with scores of 72.5% on AIME2024 and 54.3% on AIME2025.

In conclusion, researchers introduced ReTool, a novel RL framework that empowers LLMs to self-enhance their mathematical reasoning capabilities through effective Code Interpreter utilization. Experiments on AIME2024 and AIME2025 show that ReTool achieves superior accuracy compared to conventional text-based RL approaches and converges with significantly fewer training steps. Through careful data curation and a specialized tool-using pipeline, ReTool enables models to develop complex computational intervention strategies, paving the way for more efficient and powerful tool-augmented reasoning in LLMs. The results demonstrate that tool-integrated RL represents a promising direction for advancing mathematical reasoning capabilities in LLMs for tasks requiring precise computation and symbolic manipulation.



The post ReTool: A Tool-Augmented Reinforcement Learning Framework for Optimizing LLM Reasoning with Computational Tools appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2025/04/20/retool-a-tool-augmented-reinforcement-learning-framework-for-optimizing-llm-reasoning-with-computational-tools/feed/ 0 70681
Fourier Neural Operators Just Got a Turbo Boost: Researchers from UC Riverside Introduce TurboFNO, a Fully Fused FFT-GEMM-iFFT Kernel Achieving Up to 150% Speedup over PyTorch https://www.marktechpost.com/2025/04/20/fourier-neural-operators-just-got-a-turbo-boost-researchers-from-uc-riverside-introduce-turbofno-a-fully-fused-fft-gemm-ifft-kernel-achieving-up-to-150-speedup-over-pytorch/ https://www.marktechpost.com/2025/04/20/fourier-neural-operators-just-got-a-turbo-boost-researchers-from-uc-riverside-introduce-turbofno-a-fully-fused-fft-gemm-ifft-kernel-achieving-up-to-150-speedup-over-pytorch/#respond Sun, 20 Apr 2025 20:16:44 +0000 https://www.marktechpost.com/?p=70669 Fourier Neural Operators (FNO) are powerful tools for learning partial differential equation solution operators, but lack architecture-aware optimizations, with their Fourier layer executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, resulting in multiple kernel launches and excessive global memory traffic. The FFT -> GEMM -> iFFT computational pattern has received inadequate attention […]

The post Fourier Neural Operators Just Got a Turbo Boost: Researchers from UC Riverside Introduce TurboFNO, a Fully Fused FFT-GEMM-iFFT Kernel Achieving Up to 150% Speedup over PyTorch appeared first on MarkTechPost.

]]>
Fourier Neural Operators (FNO) are powerful tools for learning partial differential equation solution operators, but they lack architecture-aware optimizations: their Fourier layer executes FFT, filtering, GEMM, zero padding, and iFFT as separate stages, resulting in multiple kernel launches and excessive global memory traffic. The FFT -> GEMM -> iFFT computational pattern has received inadequate attention regarding GPU kernel fusion and memory layout optimization. Current scientific-computing codes like Quantum ESPRESSO, Octopus, and CP2K make separate calls to FFT and BLAS routines. These approaches share three limitations: partial frequency utilization with additional memory-copy operations, the lack of native frequency-filtering capabilities in cuFFT, and excessive memory transactions between processing stages.

FNO implements a pipeline that begins with a forward FFT on input feature maps, applies spectral filtering, and reconstructs the output through an inverse FFT. This process requires frequency-domain truncation and zero-padding steps, which current frameworks like PyTorch execute as separate memory-copy kernels because cuFFT lacks native input/output trimming support. Leading FFT libraries such as cuFFT and VkFFT offer no built-in data truncation capabilities. Traditional 2D FFTs apply both 1D-FFT stages along spatial dimensions, but FNO applies spectral weights across the channel dimension, suggesting an opportunity to decouple the FFT stages: the first 1D FFT stays along the spatial axes while the second stage is reinterpreted along the hidden dimension.
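
For reference, the 1D version of this pipeline can be written in a few lines of PyTorch. The sketch below follows the standard FNO spectral-convolution recipe (it is illustrative, not TurboFNO's code) and makes the separate truncation and zero-padding copies explicit.

```python
# Simplified 1D FNO spectral layer in PyTorch (illustrative, generic FNO conventions;
# not TurboFNO's implementation). Truncation and zero padding are separate tensor ops
# that PyTorch lowers to extra memory-copy kernels.
import torch

def spectral_conv1d(x: torch.Tensor, w: torch.Tensor, modes: int) -> torch.Tensor:
    """x: (batch, c_in, n) real signal; w: (c_in, c_out, modes) complex spectral weights."""
    n = x.shape[-1]
    x_ft = torch.fft.rfft(x, dim=-1)                 # forward FFT: (batch, c_in, n//2 + 1)
    x_ft = x_ft[..., :modes]                         # frequency truncation (extra copy)
    out_ft = torch.einsum("bim,iom->bom", x_ft, w)   # spectral filtering = per-mode GEMM over channels
    full = torch.zeros(x.shape[0], w.shape[1], n // 2 + 1,
                       dtype=out_ft.dtype, device=x.device)
    full[..., :modes] = out_ft                       # zero padding (another copy)
    return torch.fft.irfft(full, n=n, dim=-1)        # inverse FFT back to physical space

x = torch.randn(8, 32, 128)                          # batch of 8, 32 channels, 128 grid points
w = torch.randn(32, 32, 16, dtype=torch.cfloat)      # 16 retained Fourier modes
y = spectral_conv1d(x, w, modes=16)                  # -> (8, 32, 128)
```

Each of the slicing and zero-filling steps above becomes its own memory-copy kernel in standard frameworks, which is precisely the traffic that TurboFNO's fusion targets.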

Researchers from the University of California, Riverside have proposed TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. The approach begins with developing FFT and GEMM kernels from scratch that achieve performance comparable to or faster than the closed-source state-of-the-art cuBLAS and cuFFT. An FFT variant is introduced to fuse the FFT and GEMM workloads effectively: a single thread block iterates over the hidden dimension, aligning with the k-loop in GEMM. Moreover, two shared-memory swizzling patterns are designed to achieve 100% memory bank utilization when forwarding FFT output to GEMM and to let the iFFT retrieve GEMM results directly from shared memory.

TurboFNO integrates optimized implementations of FFT and CGEMM kernels to enable effective fusion and built-in FFT optimizations. The kernel fusion strategy progresses through three levels: FFT-GEMM fusion, GEMM-iFFT fusion, and full FFT-GEMM-iFFT fusion. Each stage involves aligning the FFT workflow with GEMM, resolving data layout mismatches, and eliminating shared-memory bank conflicts. Key techniques include modifying the FFT output layout to match GEMM's input format, applying thread swizzling for conflict-free shared-memory access, and integrating the inverse FFT as an epilogue stage of the CGEMM to bypass intermediate global-memory writes and improve memory locality.
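
The actual kernel is written in CUDA, but the fused dataflow can be emulated at a high level in Python. The following sketch is a conceptual emulation only (no tiling, shared memory, or swizzling is modeled): the FFT output feeds the GEMM k-loop directly, and the inverse FFT runs as an epilogue, so no truncated or zero-padded intermediate is ever written to a global buffer.

```python
# Conceptual NumPy emulation of the fused FFT -> GEMM -> iFFT dataflow (illustration
# only: the real TurboFNO is a single CUDA kernel with tiling, shared memory, and
# swizzling, none of which is modeled here).
import numpy as np

def fused_spectral_block(x: np.ndarray, w: np.ndarray, modes: int) -> np.ndarray:
    """x: (c_in, n) real input tile; w: (c_in, c_out, modes) complex spectral weights."""
    c_in, n = x.shape
    c_out = w.shape[1]
    acc = np.zeros((c_out, modes), dtype=complex)    # GEMM accumulator (stand-in for shared memory)
    for k in range(c_in):                            # k-loop over the hidden (channel) dimension
        # Stage 1 (fused FFT): transform one input channel; keeping only `modes`
        # frequencies emulates the kernel's built-in frequency pruning.
        xk_ft = np.fft.rfft(x[k])[:modes]
        # Stage 2 (GEMM): the FFT output is consumed immediately, never written
        # to an intermediate global buffer.
        acc += w[k] * xk_ft[np.newaxis, :]
    # Stage 3 (iFFT epilogue): zero-pad the retained modes and invert directly
    # from the accumulator instead of launching a separate kernel.
    full = np.zeros((c_out, n // 2 + 1), dtype=complex)
    full[:, :modes] = acc
    return np.fft.irfft(full, n=n, axis=-1)

x = np.random.randn(32, 128)
w = np.random.randn(32, 32, 16) + 1j * np.random.randn(32, 32, 16)
y = fused_spectral_block(x, w, modes=16)             # -> (32, 128)
```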

TurboFNO delivers strong performance in both 1D and 2D FNO evaluations. In 1D FNO tests, the optimized FFT-CGEMM-iFFT workflow achieves up to 100% speedup over PyTorch, with an average improvement of 50%. These gains come from FFT pruning, which reduces computation by 25%-67.5%. The fully fused FFT-CGEMM-iFFT kernel delivers up to 150% speedup over PyTorch and provides an additional 10%-20% improvement over partial fusion strategies. Similarly, in 2D FNO, the optimized workflow outperforms PyTorch with average speedups above 50% and maximum improvements reaching 100%. The 2D fully fused kernel achieves 50%-105% speedup over PyTorch without performance degradation, despite the additional overhead of aligning the FFT workload layout with the CGEMM dataflow.

In conclusion, the researchers introduced TurboFNO, the first fully fused GPU kernel that integrates FFT, CGEMM, and iFFT for accelerating Fourier Neural Operators. They developed a series of architecture-aware optimizations to overcome inefficiencies in conventional FNO implementations, such as excessive kernel launches and global memory traffic. These include a custom FFT kernel with built-in frequency filtering and zero padding, a GEMM-compatible FFT variant that mimics k-loop behavior, and shared-memory swizzling strategies that improve bank utilization from 25% to 100%. TurboFNO achieves up to 150% speedup and maintains an average 67% performance gain across all tested configurations.



The post Fourier Neural Operators Just Got a Turbo Boost: Researchers from UC Riverside Introduce TurboFNO, a Fully Fused FFT-GEMM-iFFT Kernel Achieving Up to 150% Speedup over PyTorch appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2025/04/20/fourier-neural-operators-just-got-a-turbo-boost-researchers-from-uc-riverside-introduce-turbofno-a-fully-fused-fft-gemm-ifft-kernel-achieving-up-to-150-speedup-over-pytorch/feed/ 0 70669