HPC-AI Tech Releases Open-Sora 2.0: An Open-Source SOTA-Level Video Generation Model Trained for Just $200K

AI-generated videos from text descriptions or images hold immense potential for content creation, media production, and entertainment. Recent advancements in deep learning, particularly in transformer-based architectures and diffusion models, have propelled this progress. However, training these models remains resource-intensive, requiring large datasets, extensive computing power, and significant financial investment. These challenges limit access to cutting-edge video generation technologies, making them primarily available to well-funded research groups and organizations.

Training AI video models is expensive and computationally demanding. High-performance models require millions of training samples and powerful GPU clusters, making them difficult to develop without significant funding. Large-scale models, such as OpenAI’s Sora, push video generation quality to new heights but demand enormous computational resources. The high cost of training restricts access to advanced AI-driven video synthesis, limiting innovation to a few major organizations. Addressing these financial and technical barriers is essential to making AI video generation more widely available and encouraging broader adoption.

Different approaches have been developed to handle the computational demands of AI video generation. Proprietary models like Runway Gen-3 Alpha feature highly optimized architectures but are closed-source, restricting broader research contributions. Open-source models like HunyuanVideo and Step-Video-T2V offer transparency but require significant computing power. Many rely on extensive datasets, autoencoder-based compression, and hierarchical diffusion techniques to enhance video quality. However, each approach comes with trade-offs between efficiency and performance. While some models focus on high-resolution output and motion accuracy, others prioritize lower computational costs, resulting in varying performance levels across evaluation metrics. Researchers continue to seek an optimal balance that preserves video quality while reducing financial and computational burdens.

HPC-AI Tech researchers introduce Open-Sora 2.0, a commercial-level AI video generation model that achieves state-of-the-art performance while significantly reducing training costs. This model was developed with an investment of only $200,000, making it five to ten times more cost-efficient than competing models such as MovieGen and Step-Video-T2V. Open-Sora 2.0 is designed to democratize AI video generation by making high-performance technology accessible to a wider audience. Unlike previous high-cost models, this approach integrates multiple efficiency-driven innovations, including improved data curation, an advanced autoencoder, a novel hybrid transformer framework, and highly optimized training methodologies.

The research team implemented a hierarchical data filtering system that refines video datasets into progressively higher-quality subsets, ensuring optimal training efficiency. A significant breakthrough was the introduction of the Video DC-AE autoencoder, which improves video compression while reducing the number of tokens required for representation. The model’s architecture incorporates full attention mechanisms, multi-stream processing, and a hybrid diffusion transformer approach to enhance video quality and motion accuracy. Training efficiency was maximized through a three-stage pipeline: text-to-video learning on low-resolution data, image-to-video adaptation for improved motion dynamics, and high-resolution fine-tuning. This structured approach allows the model to understand complex motion patterns and spatial consistency while maintaining computational efficiency.
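To make the staged schedule concrete, below is a minimal sketch of how such a three-stage pipeline could be organized. The stage names, resolutions, and step counts are illustrative assumptions rather than values reported by HPC-AI Tech; only the ordering (low-resolution text-to-video, image-to-video adaptation, high-resolution fine-tuning) follows the description above.

```python
# Hypothetical sketch of a three-stage video-generation training schedule.
# All concrete numbers below are illustrative assumptions, not reported values.
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    objective: str               # "t2v" (text-to-video) or "i2v" (image-to-video)
    resolution: tuple[int, int]  # training resolution for this stage
    num_steps: int

PIPELINE = [
    # Stage 1: learn text-to-video alignment cheaply on low-resolution clips.
    TrainingStage("t2v_low_res", "t2v", (256, 256), 100_000),
    # Stage 2: adapt to image-to-video conditioning to improve motion dynamics.
    TrainingStage("i2v_adapt", "i2v", (256, 256), 50_000),
    # Stage 3: fine-tune at high resolution, where each step is most expensive.
    TrainingStage("high_res_finetune", "t2v", (768, 768), 20_000),
]

def run_pipeline(train_one_stage):
    """Run the stages in order, warm-starting each from the previous checkpoint."""
    checkpoint = None
    for stage in PIPELINE:
        checkpoint = train_one_stage(stage, init_from=checkpoint)
    return checkpoint
```

The point of the staged layout is that most optimization steps happen where they are cheapest, leaving the expensive high-resolution budget for a short final phase.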

The model was tested across multiple dimensions: visual quality, prompt adherence, and motion realism. Human preference evaluations showed that Open-Sora 2.0 outperforms proprietary and open-source competitors in at least two categories. In VBench evaluations, the performance gap between Open-Sora and OpenAI’s Sora was reduced from 4.52% to just 0.69%, demonstrating substantial improvement. Open-Sora 2.0 also achieved a higher VBench score than HunyuanVideo and CogVideo, establishing itself as a strong contender among current open-source models. The model additionally integrates advanced training optimizations such as parallelized processing, activation checkpointing, and automated failure recovery, ensuring continuous operation and maximizing GPU efficiency.

Key takeaways from the research on Open-Sora 2.0 include:

  1. Open-Sora 2.0 was trained for only $200,000, making it five to ten times more cost-efficient than comparable models.
  2. The hierarchical data filtering system refines video datasets through multiple stages, improving training efficiency.
  3. The Video DC-AE autoencoder significantly reduces token counts while maintaining high reconstruction fidelity.
  4. The three-stage training pipeline optimizes learning from low-resolution data to high-resolution fine-tuning.
  5. Human preference evaluations indicate that Open-Sora 2.0 outperforms leading proprietary and open-source models in at least two performance categories.
  6. The model reduced the performance gap with OpenAI’s Sora from 4.52% to 0.69% in VBench evaluations.
  7. Advanced system optimizations, such as activation checkpointing and parallelized training, maximize GPU efficiency and reduce hardware overhead.
  8. Open-Sora 2.0 demonstrates that high-performance AI video generation can be achieved with controlled costs, making the technology more accessible to researchers and developers worldwide.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

From Sparse Rewards to Precise Mastery: How DEMO3 is Revolutionizing Robotic Manipulation

Long-horizon robotic manipulation tasks pose a serious challenge for reinforcement learning, owing mainly to sparse rewards, high-dimensional action-state spaces, and the difficulty of designing useful reward functions. Conventional reinforcement learning struggles with efficient exploration in these settings because the lack of feedback hinders the learning of optimal policies. The issue is especially acute in robotic control tasks that require multi-stage reasoning, where achieving sequential subgoals is essential to overall success. Poorly designed reward structures can cause agents to get stuck in local optima or exploit spurious shortcuts, leading to suboptimal learning. In addition, most existing methods have high sample complexity, requiring large amounts of training data to generalize across diverse manipulation tasks. These constraints make reinforcement learning impractical for many real-world tasks, where data efficiency and well-structured learning signals are key to success.

Earlier research addressing these issues has explored model-based reinforcement learning, demonstration-based learning, and inverse reinforcement learning. Model-based methods, such as TD-MPC2, improve sample efficiency by exploiting predictive world models but still require extensive exploration to optimize policies effectively. Demonstration-based methods, such as MoDem and CoDER, mitigate exploration problems by exploiting expert trajectories but scale poorly to high-dimensional, long-horizon tasks because they need large datasets. Inverse reinforcement learning methods attempt to learn reward functions from demonstrations but suffer from poor generalization and high computational complexity. Further, most approaches in this field do not exploit the inherent structure of multi-stage tasks and hence miss the opportunity to decompose complex objectives into more tractable subgoals.

To overcome these challenges, researchers have introduced Demonstration-Augmented Reward, Policy, and World Model Learning (DEMO3), a reinforcement learning framework that integrates structured reward acquisition, policy optimization, and model-based decision-making. The framework introduces three main innovations: the transformation of sparse stage indicators into continuous, structured rewards that provide more reliable feedback; a bi-phasic training schedule that begins with behavioral cloning and transitions into interactive reinforcement learning; and the integration of online world-model learning, allowing dynamic penalty adaptation during training. Unlike current approaches, this method enables real-time structured reward acquisition through stage-specific discriminators that evaluate the probability of progress toward subgoals. As a result, the framework focuses on attaining task goals rather than imitating demonstrations, significantly improving sample efficiency and generalization across robotic manipulation tasks.

DEMO3 builds on TD-MPC2, which learns a latent-space world model to support planning and control. The method trains a set of stage-specific discriminators, each of which predicts the probability of successfully transitioning to the next task stage. These discriminators are trained with a binary cross-entropy loss and drive online reward shaping, producing richer learning signals than conventional sparse rewards. Training follows a two-phase process. In the pre-training phase, a policy and an encoder are learned via behavioral cloning from a small set of expert demonstrations. In the second phase, the agent refines the policy through online reinforcement learning, interacting with the environment while relying on the derived dense rewards. An annealing schedule gradually shifts the learning signal from behavioral cloning toward autonomous learning, enabling a smooth transition from demonstration-induced imitation to independent policy improvement. The approach is tested on sixteen challenging robotic manipulation tasks from Meta-World, Robosuite, and ManiSkill3, and achieves substantial gains in learning efficiency and robustness compared with state-of-the-art alternatives.
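To make the mechanics above concrete, here is a minimal sketch of discriminator-based reward shaping and the behavioral-cloning-to-RL annealing, assuming simple MLP discriminators over latent states; the network sizes, reward composition, and annealing schedule are illustrative assumptions rather than the paper's exact design.

```python
# Sketch of DEMO3-style dense rewards from stage discriminators, plus a loss
# that anneals from behavioral cloning (BC) to reinforcement learning (RL).
import torch
import torch.nn as nn

class StageDiscriminator(nn.Module):
    """Predicts the probability of transitioning to the next task stage.

    Trained separately with binary cross-entropy on (latent state, label)
    pairs, where the label marks whether the next stage was reached.
    """
    def __init__(self, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(z)).squeeze(-1)

def shaped_reward(discriminators, z, stage_idx):
    # Dense reward: each completed stage contributes 1; the current stage
    # contributes the discriminator's estimated success probability.
    return stage_idx + discriminators[stage_idx](z)

def total_loss(rl_loss, bc_loss, step, anneal_steps=50_000):
    # Early in training, imitation dominates; later, autonomous RL takes over.
    alpha = max(0.0, 1.0 - step / anneal_steps)
    return (1.0 - alpha) * rl_loss + alpha * bc_loss
```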

DEMO3 outperforms state-of-the-art reinforcement learning algorithms, delivering significant improvements in sample efficiency, learning time, and overall task success rates. The method achieves an average 40% improvement in data efficiency over competing methods, rising to as much as 70% on very difficult, long-horizon challenges. It consistently reaches high success rates with as few as five demonstrations, whereas competing methods require much larger datasets for comparable results. By handling multi-stage sparse-reward settings effectively, the system excels at precise robotic manipulation tasks such as peg insertion and cube stacking within tight interaction budgets. Computational costs remain moderate, averaging around 5.19 hours per 100,000 interaction steps, making DEMO3 more efficient than competing reinforcement learning models while yielding superior results on complex robotic skills.

DEMO3 is a significant advance in reinforcement learning for robotic control, effectively addressing long-horizon tasks with sparse rewards. By combining online dense-reward learning, structured policy optimization, and model-based decision-making, the framework achieves both high performance and efficiency. Its two-phase training procedure and dynamic reward adaptation yield data-efficiency improvements of 40-70% over existing methodologies across a variety of manipulation tasks. By improving reward shaping, optimizing policy learning, and reducing dependence on large demonstration datasets, this method lays the groundwork for more efficient and scalable robotic learning. Future research can explore more advanced demonstration-sampling approaches and adaptive reward-shaping techniques to further enhance data efficiency and accelerate reinforcement learning in real-world robotic tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Hugging Face Releases OlympicCoder: A Series of Open Reasoning AI Models that can Solve Olympiad-Level Programming Problems

In the realm of competitive programming, both human participants and artificial intelligence systems encounter a set of unique challenges. Many existing code generation models struggle to consistently meet the high standards required for solving complex, olympiad-level problems. A recurring issue is the difficulty in processing long chain-of-thought reasoning, often leading to solutions that pass only simplified test cases while failing under more stringent contest conditions. Datasets available today frequently capture only a fragment of the problems seen on platforms like CodeForces or in international competitions such as the International Olympiad in Informatics (IOI). This situation calls for models that can not only generate syntactically correct code but also follow a logical reasoning path that mirrors the careful thought process required in real contests.

Meet OlympicCoder

Hugging Face has recently introduced OlympicCoder, a series of models specifically designed to tackle the demands of olympiad-level programming challenges. This series consists of two fine-tuned models—OlympicCoder-7B and OlympicCoder-32B—that have been refined using a carefully curated dataset known as CodeForces-CoTs, which contains nearly 100,000 high-quality chain-of-thought samples. Notably, these models outperform closed-source frontier models like Claude 3.7 Sonnet on IOI problems, demonstrating that open-source models can compete with, and even exceed, the performance of larger proprietary systems. By integrating detailed explanations and multiple correct solutions into the training data, the OlympicCoder models are well-equipped to address the nuances of coding tasks that involve complex reasoning and problem-solving.

Technical Details and Benefits

Both OlympicCoder-7B and OlympicCoder-32B build on the foundation of the Qwen2.5-Coder Instruct model and are refined using a decontaminated version of the CodeForces dataset. For instance, OlympicCoder-7B, which contains approximately 7.6 billion parameters, is trained without employing sample packing—a technique that can inadvertently truncate lengthy reasoning chains. Instead, the training process uses a higher learning rate of 4e-5 combined with a cosine learning rate scheduler, ensuring that long-context solutions are preserved and fully utilized. Meanwhile, OlympicCoder-32B, a larger model with about 32.8 billion parameters, leverages distributed training methods with a focus on maintaining a long context window. These technical adjustments allow the models to better accommodate long and intricate reasoning sequences, which are crucial for accurately addressing the multi-layered challenges presented in competitive programming.
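As a rough illustration of this recipe, the sketch below configures a comparable supervised fine-tuning run with the Hugging Face TRL library. The dataset identifier and batch settings are assumptions for illustration; the 4e-5 learning rate, cosine scheduler, disabled sample packing, and Qwen2.5-Coder base model follow the description above.

```python
# Hedged sketch of an OlympicCoder-7B-style SFT run using TRL.
# Dataset name and batch sizes are assumptions; hyperparameters stated in the
# text (learning rate, scheduler, no sample packing) are kept as described.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-r1/codeforces-cots", split="train")  # assumed dataset id

config = SFTConfig(
    output_dir="olympiccoder-7b-sft",
    learning_rate=4e-5,            # higher learning rate noted in the article
    lr_scheduler_type="cosine",    # cosine learning-rate scheduler
    packing=False,                 # avoid truncating long chain-of-thought samples
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",  # base model per the article
    train_dataset=dataset,
    args=config,
)
trainer.train()
```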

Results and Insights

The performance of these models has been evaluated on benchmarks such as LiveCodeBench and the IOI 2024 problems. In these assessments, the models are put through rigorous submission strategies that closely mimic real contest conditions by generating multiple submissions for individual subtasks. This method ensures that the most coherent chain-of-thought is selected for evaluation. The evaluation results confirm that both OlympicCoder-7B and OlympicCoder-32B not only deliver robust performance but, in the case of the 32B model, also achieve results that surpass those of some leading closed-source systems. Detailed analyses indicate that avoiding sample packing and applying a higher learning rate are critical factors that enhance performance, while the use of a carefully curated dataset helps capture the complexity of competitive programming problems.

Conclusion

In conclusion, OlympicCoder represents a thoughtful step forward in developing open reasoning models for competitive programming. With two fine-tuned models that excel even against larger, closed-source systems, these models exemplify how careful dataset curation and methodical fine-tuning can lead to significant advances in code generation. OlympicCoder offers valuable insights for both researchers and practitioners, paving the way for future innovations in AI-driven problem solving while maintaining a balanced and rigorous approach to model development.


Check out the 7B Model and 32B Model on Hugging Face, and the technical details. All credit for this research goes to the researchers of this project.

From Genes to Genius: Evolving Large Language Models with Nature’s Blueprint

Large language models (LLMs) have transformed artificial intelligence with their strong performance across tasks such as natural language understanding and complex reasoning. However, adapting these models to new tasks remains a significant challenge, as traditional fine-tuning methods require large labeled datasets and heavy computational resources. Existing methods for combining multiple LLMs lack the required flexibility and struggle to generalize to new tasks. Moreover, their dependence on gradient-based optimization limits efficiency and scalability, making real-time adaptation impractical. There is therefore a pressing need for an approach that lets LLMs adapt dynamically, with minimal data, and improve performance without a heavy computational price.

Several methods have been proposed to boost LLM adaptation, yet each has significant drawbacks. Expert Fusion merges fine-tuned models by averaging their parameters according to pre-defined rules, but it cannot adapt dynamically to a specific task. Other methods, such as LoraHub and Model Swarms, use evolutionary algorithms such as genetic programming or particle swarm optimization to combine models adaptively; however, they need labeled adaptation data and tend to degrade in performance when scaling across multiple tasks. Parameter-merging methods such as DARE, TIES, and Pack of LLMs align and merge weights from various models yet cannot adapt dynamically to changing demands. The shared drawbacks of these methods are high computational complexity, fixed adaptation mechanisms, and weak generalization in zero-shot settings. These limitations emphasize the need for a more sophisticated solution that can continuously evolve and adapt to tasks without extensive retraining.

Researchers from Northeastern University and Shanghai Artificial Intelligence Laboratory propose GENOME (GENetic Optimization for Model Evolution), a population-based evolutionary framework designed to enhance LLM adaptation. The approach borrows mechanisms from genetics, namely crossover, mutation, selection, and succession, to evolve a population of models dynamically. Unlike conventional gradient-based fine-tuning, GENOME evolves models effectively even when adaptation data is sparse. Crossover merges high-performing models to create offspring with better capabilities, while mutation injects randomness to discover new capabilities. Selection retains only the most effective models and discards suboptimal ones, and succession enables knowledge transfer by letting newly developed models inherit the strengths and weaknesses of earlier versions. A variant called GENOME+ adds an ensemble mechanism that combines the predictions of top-performing models to improve robustness and accuracy. These innovations allow LLMs to adapt rapidly to new tasks while minimizing computational resources, offering a scalable alternative to conventional model-adaptation approaches.

The evolutionary procedure is applied to a population of LLMs seeded from gemma-2-2b-it and fine-tuned on 10 domains using the Tulu-v2-SFT dataset. Evolutionary operations are applied iteratively over 10 generations, optimizing model parameters with a fitness function that monitors validation accuracy. The system runs with population sizes between 10 and 40 models, a crossover rate of 30%, and a mutation probability of 20%. Accuracy, exact match, F1-score, and BLEU score serve as evaluation metrics for multilingual tasks. The approach is computationally efficient and executed on a single RTX 4090 GPU, making it a practical alternative to traditional fine-tuning approaches.
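The loop below is a minimal, self-contained sketch of such an evolutionary procedure over flattened weight vectors. The uniform crossover, Gaussian mutation, and survivor fraction are illustrative assumptions; the 30% crossover rate, 20% mutation probability, 10 generations, and validation-accuracy fitness follow the description above.

```python
# Minimal sketch of a GENOME-style evolutionary loop over model weight vectors.
import numpy as np

rng = np.random.default_rng(0)

def crossover(parent_a, parent_b):
    # Child takes each parameter from one of the two parents at random.
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(weights, scale=0.01):
    # Add small Gaussian noise to explore new capabilities.
    return weights + rng.normal(0.0, scale, weights.shape)

def evolve(population, fitness_fn, generations=10,
           crossover_rate=0.3, mutation_rate=0.2):
    for _ in range(generations):
        scores = [fitness_fn(w) for w in population]  # e.g. validation accuracy
        # Selection: keep the top half of the population.
        ranked = [w for _, w in sorted(zip(scores, population),
                                       key=lambda p: p[0], reverse=True)]
        survivors = ranked[: len(population) // 2]
        # Succession: offspring are derived from surviving models.
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = rng.choice(len(survivors), 2, replace=False)
            child = survivors[a]
            if rng.random() < crossover_rate:
                child = crossover(survivors[a], survivors[b])
            if rng.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = survivors + children
    return population
```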

Large-scale evaluations confirm that this approach surpasses state-of-the-art model adaptation and combination methods on several benchmarks, with improved accuracy, reasoning capacity, and scalability. The system achieves an average 24.06% gain over the top-performing single expert model and 10.75% over Model Swarms, with notable gains on demanding reasoning tasks. In contrast to other adaptive methods, which struggle when faced with multiple tasks, this evolutionary approach exhibits consistent performance across a broad range of domains. It also achieves strong zero-shot generalization, successfully transferring learned representations to new tasks without additional training data. Scalability experiments confirm that increasing the population size from 10 to 40 models leads to further performance improvements. Ablation studies confirm the importance of each evolutionary operation, with the selection and ensemble mechanisms playing a critical role in overall effectiveness.

By applying population-based evolution to LLMs, this paper presents a gradient-free, adaptive, and scalable optimization method that allows continuous improvement in low-data conditions. Following the principles of genetic algorithms, the method allows models to evolve dynamically, outperform traditional adaptation methods, and generalize well to novel tasks. With its cost-effective implementation on current hardware, GENOME+ presents an economical, real-world alternative to traditional fine-tuning and model-fusion methods, enabling AI systems to improve and adapt continuously.


Check out the Paper. All credit for this research goes to the researchers of this project.

This AI Paper from Google Unveils an AI System that Masters Disease Management and Medication Reasoning Better than Ever

Applying large language models (LLMs) to clinical disease management poses numerous critical challenges. Although the models have proven effective in diagnostic reasoning, their application to longitudinal disease management, drug prescription, and multi-visit patient care remains largely untested. The main difficulties are limited context understanding across numerous visits, inconsistent adherence to clinical guidelines, and the complexity of medication reasoning. Delivering real-time, high-quality patient interactions with acceptable computational efficiency is a further challenge. Overcoming these obstacles is essential for developing AI-based systems that can assist healthcare professionals in providing accurate, evidence-based, and personalized disease management.

Earlier AI-based clinical models have predominantly focused on diagnostic reasoning, employing structured datasets to generate differential diagnoses. These approaches, however, encounter significant limitations in real-world disease management environments. Most fail to track patient history adequately across visits, resulting in disconnected and inconsistent care recommendations. Many models also struggle to conform to established clinical guidelines, reducing the reliability of their management plans. Medication reasoning is a further challenge: existing techniques tend to produce inconsistencies in drug choice, dosing, and interactions, undermining their reliability for safe prescribing. Finally, real-time decision-making in medical environments requires rapid processing of enormous amounts of clinical data, a computational bottleneck for most systems based on large language models.

Google researchers present an LLM-based agentic system designed for clinical disease management across multi-visit patient encounters. The solution improves AI-based medical reasoning through a series of innovations. It is built as a multi-agent system in which a Dialogue Agent conducts natural, empathetic conversation and tracks patient history from visit to visit, while a Management Reasoning (Mx) Agent reasons over clinical guidelines, patient history, and test results to create structured treatment plans. The system uses Gemini’s extended-context capabilities to remain aligned with current clinical guidelines and drug formularies. In contrast to legacy AI models that operate in static, single-visit settings, this solution dynamically manages real-time, multi-visit interactions, letting recommendations evolve with patient progress and test results. The researchers also introduce RxQA, a new multiple-choice benchmark for assessing medication-reasoning accuracy. Built from two national drug formularies (US and UK), the dataset probes the ability to handle complex pharmacological queries, and the system shows improved performance over human clinicians on high-difficulty drug-related tasks.

The system combines several methodologies to enhance performance. A blinded, randomized virtual Objective Structured Clinical Examination (OSCE) compared the AI-enhanced method against 21 primary care physicians across 100 multi-visit case scenarios grounded in UK NICE Guidance and BMJ Best Practice guidelines. For medication-reasoning assessment, the RxQA benchmark comprises 600 multiple-choice questions drawn from OpenFDA and the British National Formulary (BNF) and validated by board-certified pharmacists. Architecturally, the system includes a Dialogue Agent based on Gemini 1.5 Flash, optimized for multi-visit medical dialogues, and an Mx Agent that uses structured retrieval and reasoning to generate detailed management plans. A structured generation framework with specified constraints ensures consistent output and faithful citation of clinical guidelines. To support real-time patient engagement, the model is designed to respond within one minute while drawing on an evaluation corpus of 627 clinical guidelines comprising 10.5 million tokens, which requires optimized retrieval methods to handle such vast data effectively.
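A hypothetical sketch of how the two agents could be orchestrated over a single visit is shown below. The class and method names and the retrieval helper are assumptions for illustration; only the Dialogue Agent / Mx Agent division of labor, the guideline retrieval, and the multi-visit state tracking follow the article.

```python
# Hypothetical orchestration of a Dialogue Agent and a Management Reasoning
# (Mx) Agent over one patient visit. Agents are duck-typed placeholders.
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    history: list[str] = field(default_factory=list)      # notes from prior visits
    test_results: list[str] = field(default_factory=list)

def run_visit(dialogue_agent, mx_agent, guidelines, record: PatientRecord,
              patient_messages: list[str]) -> str:
    # 1. Dialogue Agent conducts the conversation, conditioned on prior visits.
    transcript = dialogue_agent.converse(patient_messages, context=record.history)
    # 2. Mx Agent retrieves relevant guidelines and drafts a structured plan.
    relevant = guidelines.retrieve(transcript)             # long-context retrieval
    plan = mx_agent.plan(
        transcript=transcript,
        history=record.history,
        test_results=record.test_results,
        guidelines=relevant,
    )
    # 3. Persist this visit so the next one can build on it.
    record.history.append(transcript)
    return plan
```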

The AI system exhibited performance non-inferior to primary care physicians in disease-management reasoning and outperformed them in critical areas such as treatment accuracy, medication reasoning, and guideline compliance. In the multi-visit OSCE study, it produced more structured and accurate management plans, with better compliance with clinical guidelines and greater specificity in treatment and investigation recommendations. Its medication-reasoning ability also surpassed that of human clinicians, especially on high-difficulty drug-related queries, by successfully drawing on external drug formularies for improved accuracy. Moreover, ratings from specialist physicians and patient actors reflected the AI’s capacity to monitor and update management plans, ensuring structured and patient-centered decision-making across multiple visits. These findings point to its potential to improve AI-based clinical decision support with accurate, evidence-based, and effective disease-management solutions.

This AI system is a remarkable leap in disease management, moving from simple diagnostic functions to holistic patient care across visits and systematic treatment planning. With deep contextual reasoning, coordination of multiple agents, and real-time retrieval of clinical guidelines, it achieves decision-making capabilities on par with physicians for complex cases. Its ability to recommend accurate treatments, augment pharmacological reasoning, and strictly follow established protocols demonstrates its potential for AI-aided clinical practice. While additional research is needed before deployment in real-world environments, this work is a notable step toward bridging gaps in primary care, improving the consistency of treatment, and maximizing healthcare delivery through AI-powered automation.


Check out the Paper and Blog Article. All credit for this research goes to the researchers of this project.

Rethinking MoE Architectures: A Measured Look at the Chain-of-Experts Approach

Large language models have significantly advanced our understanding of artificial intelligence, yet scaling these models efficiently remains challenging. Traditional Mixture-of-Experts (MoE) architectures activate only a subset of experts per token to economize on computation. However, this design leads to two notable issues. First, experts process tokens in isolation: each expert works independently without any cross-communication. This separation can limit the model’s ability to harness diverse perspectives during processing. Second, although MoE architectures use a sparse activation pattern, they still require considerable memory because the overall parameter count is high even if only a few are active at a time. These challenges suggest that while MoE models are a step forward in scalability, their inherent design may limit both performance and resource efficiency.

The Chain-of-Experts (CoE) Approach

Chain-of-Experts (CoE) offers a thoughtful reexamination of MoE architectures by introducing a mechanism for sequential communication among experts. In contrast to the independent processing seen in traditional MoE models, CoE allows tokens to be processed in a series of iterations within each layer. In this arrangement, the output of one expert serves as the input for the next, thereby creating a communicative chain that enables experts to build upon one another’s work. This sequential interaction does not simply stack layers; it facilitates a more integrated approach to token processing, where each expert refines the interpretation of the token based on previous outputs. The result is a model that leverages the collaborative potential of its experts while aiming to use memory more efficiently.

Technical Details and Benefits

At the heart of the CoE method is an iterative process that redefines how experts interact. For instance, consider a configuration described as CoE-2(4/64): the model operates with two iterations per token, with four experts selected from a pool of 64 available experts at each cycle. This design contrasts with traditional MoE setups, which rely on a single pass through a pre-selected group of experts.

A key technical element in CoE is the independent gating mechanism. In conventional MoE models, the gating function selects which experts should process a token, but these decisions are made once per token per layer. CoE extends this idea by allowing each expert’s gating decision to be made independently during each iteration. This flexibility encourages a form of specialization, where an expert can adjust its processing based on information received from earlier iterations.

Additionally, the use of inner residual connections in CoE further improves the model. Instead of simply adding the original token back after the entire sequence of processing (an outer residual connection), CoE integrates residual connections within each iteration. This design helps maintain the integrity of the token’s information while allowing for incremental improvements at every step.
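A minimal sketch of such a layer is shown below, with simple feed-forward experts; the gating network, expert architecture, and dimensions are illustrative assumptions, while the per-iteration independent gating and the inner residual connection follow the description above.

```python
# Minimal sketch of a Chain-of-Experts layer with simple feed-forward experts.
import torch
import torch.nn as nn

class ChainOfExpertsLayer(nn.Module):
    def __init__(self, d_model=512, num_experts=64, top_k=4, iterations=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # consulted at every iteration
        self.top_k = top_k
        self.iterations = iterations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each iteration routes the *current* hidden
        # state, so gating decisions are made independently per iteration.
        h = x
        for _ in range(self.iterations):
            weights, idx = torch.topk(self.gate(h).softmax(dim=-1),
                                      self.top_k, dim=-1)
            out = torch.zeros_like(h)
            for k in range(self.top_k):
                for e in range(len(self.experts)):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k, None] * self.experts[e](h[mask])
            h = h + out  # inner residual connection within each iteration
        return h

# Example: a CoE-2(4/64)-style layer processing a batch of 8 token vectors.
layer = ChainOfExpertsLayer()
y = layer(torch.randn(8, 512))
```

Note how the gate is consulted again on the updated hidden state at every iteration, so the set of active experts can change as the token's representation is refined.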

These technical innovations collectively contribute to a model that not only maintains performance with fewer resources but also provides a more nuanced processing pathway that could be particularly valuable for tasks that require layered reasoning.

Experimental Results and Insights

Empirical studies underscore the potential of the Chain-of-Experts method. In controlled experiments, such as pretraining on math-related tasks, configurations like CoE-2(4/64) have demonstrated a reduction in validation loss (from 1.20 to 1.12) when compared with traditional MoE models operating under the same computational conditions. This improvement is achieved without increasing the overall memory or computational cost, as the sequential communication enables a more effective use of each expert’s capacity.

Further evaluations have shown that increasing the iteration count in CoE can yield benefits that are comparable to or even exceed those obtained by increasing the number of experts selected in a single pass. For instance, even when memory and compute budgets are held constant, CoE configurations exhibit up to an 18% reduction in memory usage while achieving similar or better performance outcomes.

Moreover, the sequential design of CoE opens up a substantially larger number of expert combinations, by as much as 823 times more than traditional methods. This dramatic increase in possible expert pathways means that the model has a richer set of options when processing each token, potentially leading to more robust and specialized outputs.

These findings suggest that CoE provides a pathway for rethinking how large language models can be both efficient and effective, paving the way for more sustainable AI applications in the future.

Conclusion

The Chain-of-Experts framework represents a measured evolution in the design of sparse neural networks. By introducing sequential communication among experts, CoE addresses the limitations of independent token processing and high memory usage inherent in traditional MoE models. The technical innovations, particularly the independent gating mechanism and inner residual connections, enable a more efficient and flexible approach to scaling large language models.

The experimental results, though preliminary, suggest that CoE can achieve modest yet meaningful improvements in performance and resource utilization. This approach invites further exploration, particularly in how iterative communication might be extended or refined in future model architectures. As research in this area continues, CoE stands as a thoughtful step toward achieving a balance between computational efficiency and model performance, one that may ultimately contribute to more accessible and sustainable AI systems.


Check out the technical details and GitHub Page. All credit for this research goes to the researchers of this project.

MedHELM: A Comprehensive Healthcare Benchmark to Evaluate Language Models on Real-World Clinical Tasks Using Real Electronic Health Records

Large Language Models (LLMs) are widely used in medicine, supporting diagnostic decision-making, patient triage, clinical reporting, and medical research workflows. Although they perform exceedingly well on controlled medical tests such as the United States Medical Licensing Examination (USMLE), their utility in real-world settings remains poorly examined. Most existing evaluations rely on synthetic benchmarks that fail to reflect the complexities of clinical practice. A study last year found that a mere 5% of LLM evaluations use real-world patient data, revealing a large gap in testing for real-world usability. This makes it hard to ascertain how reliably these models function in medical decision-making and calls their safety and effectiveness in actual clinical settings into question.

State-of-the-art evaluation methods mostly score language models with synthetic datasets, structured knowledge exams, and formal medical exams. Although these examinations test theoretical knowledge, they do not reflect real patient scenarios with complex interactions. Most tests produce single-metric results, without attention to critical details such as factual correctness, clinical applicability, and likelihood of response bias. Furthermore, widely used public datasets are homogeneous, compromising generalization across different medical specialties and patient populations. Another major setback is that most models trained against these benchmarks overfit to test paradigms and therefore lose much of their performance in dynamic healthcare environments. The lack of whole-system frameworks that embrace real-world patient interactions further erodes confidence in employing them for practical medical use.

Researchers developed MedHELM, a thorough evaluation framework designed to test LLMs on real medical tasks, with multi-metric assessment and expert-revised benchmarks, to address these gaps. It builds upon Stanford’s Holistic Evaluation of Language Models (HELM) and incorporates a systematic evaluation across five primary areas:

1. Clinical Decision Support
2. Clinical Note Generation
3. Patient Communication and Education
4. Medical Research Assistance
5. Administration and Workflow

A total of 22 subcategories and 121 specific medical tasks ensure broad coverage of critical healthcare applications. Compared with earlier benchmarks, MedHELM employs actual clinical data, assesses models on both structured and open-ended tasks, and applies multi-aspect scoring paradigms. This holistic coverage means it measures not only knowledge recall but also clinical applicability, reasoning precision, and practical everyday utility.

An extensive dataset infrastructure underpins the benchmarking process, comprising a total of 31 datasets. This collection includes 11 newly developed medical datasets alongside 20 that have been obtained from pre-existing clinical records. The datasets encompass various medical domains, thereby guaranteeing that assessments accurately represent real-world healthcare challenges rather than contrived testing scenarios.

The conversion of datasets into standardized benchmark instances is a systematic process involving the following components (a sketch of one such instance follows the list):

• Context Definition: The specific data segment the model must analyze (e.g., clinical notes).
• Prompting Strategy: A predefined instruction directing model behavior (e.g., “Determine the patient’s HAS-BLED score”).
• Reference Response: A clinically validated output for comparison (e.g., classification labels, numerical values, or text-based diagnoses).
• Scoring Metrics: A combination of exact match, classification accuracy, BLEU, ROUGE, and BERTScore for text similarity evaluations.
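As referenced above, here is a minimal sketch of how one such benchmark instance might be represented and scored; the field names and the scoring helper are illustrative assumptions, while the four components mirror the list.

```python
# Minimal sketch of a MedHELM-style benchmark instance and its scoring.
from dataclasses import dataclass

@dataclass
class BenchmarkInstance:
    context: str    # e.g. the clinical note the model must analyze
    prompt: str     # e.g. "Determine the patient's HAS-BLED score"
    reference: str  # clinically validated answer
    metric: str     # "exact_match", "accuracy", "bleu", "rouge", ...

def score(instance: BenchmarkInstance, model_output: str) -> float:
    if instance.metric == "exact_match":
        return float(model_output.strip() == instance.reference.strip())
    # Text-similarity metrics (BLEU, ROUGE, BERTScore) would be computed here
    # with the corresponding libraries; omitted to keep the sketch self-contained.
    raise NotImplementedError(instance.metric)

example = BenchmarkInstance(
    context="72-year-old with atrial fibrillation, hypertension, prior stroke...",
    prompt="Determine the patient's HAS-BLED score.",
    reference="3",
    metric="exact_match",
)
print(score(example, "3"))  # 1.0
```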


One example of this approach is in MedCalc-Bench, which tests how well a model can execute clinically significant numerical computations. Every data input contains a patient’s clinical history, a diagnostic question, and a solution verified by an expert, thus enabling a rigorous test of medical reasoning and precision.

Assessments conducted on six LLMs of varying sizes revealed distinct strengths and weaknesses based on task complexity. Large models like GPT-4o and Gemini 1.5 Pro performed well in medical reasoning and computational tasks and showed enhanced accuracy in tasks like clinical risk estimation and bias identification. Mid-size models like Llama-3.3-70B-instruct performed competitively in predictive healthcare tasks like hospital readmission risk prediction. Small models like Phi-3.5-mini-instruct and Qwen-2.5-7B-instruct fared poorly in domain-intensive knowledge tests, especially in mental health counseling and advanced medical diagnosis.

Beyond accuracy, adherence to structured response formats also varied. Some models declined to answer medically sensitive questions or failed to answer in the desired format, at the expense of their overall performance. The evaluation also uncovered shortcomings in current automated metrics, as conventional NLP scoring mechanisms tend to overlook real clinical accuracy. In the majority of benchmarks, the performance disparity between models remained negligible when using BERTScore-F1 as the metric, indicating that current automated evaluation procedures may not fully capture clinical usability. The results emphasize the need for stricter evaluation procedures that incorporate fact-based scoring and explicit clinician feedback to make assessments more reliable.

With the advent of a clinically guided, multi-metric assessment framework, MedHELM offers a holistic and trustworthy method of assessing language models in the healthcare domain. Its methodology guarantees that LLMs will be assessed on actual clinical tasks, organized reasoning tests, and varied datasets, instead of artificial tests or truncated benchmarks. Its main contributions are:

• A structured taxonomy of 121 real-world medical tasks, improving the scope of AI evaluation in clinical settings.
• The use of real patient data to enhance model assessments beyond theoretical knowledge testing.
• Rigorous evaluation of six state-of-the-art LLMs, identifying strengths and areas requiring improvement.
• A call for improved evaluation methodologies, emphasizing fact-based scoring, steerability adjustments, and direct clinician validation.

Subsequent research efforts will concentrate on the improvement of MedHELM by introducing more specialized datasets, streamlining evaluation processes, and implementing direct feedback from healthcare professionals. Overcoming significant limitations in artificial intelligence evaluation, this framework establishes a solid foundation for the secure, effective, and clinically relevant integration of large language models into contemporary healthcare systems.


Check out the full leaderboard, details, and GitHub Page. All credit for this research goes to the researchers of this project.

Tencent AI Lab Introduces Unsupervised Prefix Fine-Tuning (UPFT): An Efficient Method that Trains Models on only the First 8-32 Tokens of Single Self-Generated Solutions

Recent work by researchers at Tencent AI Lab and The Chinese University of Hong Kong introduces Unsupervised Prefix Fine-Tuning (UPFT), a more efficient approach to fine-tuning reasoning in large language models. The method refines a model’s reasoning abilities by focusing solely on the first 8 to 32 tokens of its generated responses, rather than processing complete solution trajectories. By doing so, UPFT aims to capture the critical early steps of reasoning that are common across multiple solution paths while significantly reducing computational overhead.

Large language models have excelled in tasks such as language understanding and generation, yet enhancing their reasoning capabilities remains a complex challenge. Traditional fine-tuning techniques rely on either large amounts of annotated data or on procedures that generate multiple complete responses and then filter out errors through rejection sampling. These conventional methods are both resource intensive and dependent on the availability of reliable, labeled data. Moreover, extensive processing of full-length responses may include redundant information; the most informative content for reasoning appears in the early stages of the model’s output. Recognizing this, UPFT narrows the focus to the initial tokens, the part where reasoning begins and common structural elements emerge, thus addressing both efficiency and the dependence on expensive supervision.

Introducing Unsupervised Prefix Fine-Tuning

The work begins with an observation termed Prefix Self-Consistency. It was noted that, across various solution trajectories generated for the same problem, the initial reasoning steps tend to be remarkably similar. These early tokens often provide a shared foundation, even when later parts of the reasoning diverge. UPFT builds on this insight by training models using only these minimal prefixes. The method eliminates the need for detailed annotations or for generating and filtering multiple full responses, allowing the model to focus on establishing a robust reasoning framework early on. In essence, UPFT leverages the naturally occurring consistency in the model’s first few tokens to guide its learning process.
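A minimal sketch of this prefix-only data construction, assuming a Hugging Face tokenizer and standard causal-LM label masking, is shown below; the model name reuse, prefix length, and masking convention are illustrative assumptions within the 8-32-token range described above.

```python
# Sketch of UPFT-style data construction: keep only the first N tokens of a
# single self-generated solution per problem and fine-tune on that prefix.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")

def make_prefix_example(problem: str, generated_solution: str, prefix_len: int = 32):
    """Build one training example from the first `prefix_len` solution tokens."""
    solution_ids = tokenizer(generated_solution, add_special_tokens=False).input_ids
    prefix_ids = solution_ids[:prefix_len]        # first 8-32 tokens only
    prompt_ids = tokenizer(problem).input_ids
    input_ids = prompt_ids + prefix_ids
    # Standard causal-LM labels: ignore the prompt, supervise only the prefix.
    labels = [-100] * len(prompt_ids) + prefix_ids
    return {"input_ids": input_ids, "labels": labels}
```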

    Technical Details and Advantages

    At its core, UPFT reframes the training process using principles from Bayesian reasoning. Instead of considering entire reasoning traces, the method breaks down the probability of arriving at a correct answer into two components: coverage and accuracy. Coverage refers to the range of possible reasoning paths that stem from a given prefix, while accuracy indicates the likelihood that, once a particular prefix is established, the remaining tokens will lead to a correct answer. By training on these early tokens, UPFT maximizes the benefits of both elements, striking a balance between exploring diverse reasoning approaches and ensuring reliable outcomes.
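
    In notation of our own choosing (the paper’s exact formulation may differ), the decomposition reads

    \[
    P(\text{correct} \mid q) \;=\; \sum_{r} \underbrace{P(r \mid q)}_{\text{coverage}} \; \underbrace{P(\text{correct} \mid q, r)}_{\text{accuracy}},
    \]

    where \(q\) is the question and \(r\) ranges over candidate prefixes. Training on shared early tokens concentrates probability mass on prefixes that both admit many reasoning continuations and tend to end in correct answers.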

    Practically, this method offers clear benefits. Focusing on the prefix significantly reduces the amount of token data needed during training. Empirical studies suggest that UPFT can cut the number of tokens processed by up to 95% compared to full-token approaches. Furthermore, by dispensing with the need for rejection sampling, the method simplifies the training pipeline, reducing both time and memory requirements. This approach is particularly appealing for applications where computational resources are limited or where large labeled datasets are not available.
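
    As a rough illustration (the response length here is our assumption, not a figure from the paper): if a full solution averages about 500 tokens, training on a 16-token prefix touches only 16/500 ≈ 3% of the generated tokens, which is consistent with the reported reduction of up to 95%.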

    Empirical Insights and Data

    The performance of UPFT has been evaluated across several established reasoning benchmarks, including GSM8K, MATH500, AIME2024, and GPQA. In these tests, models fine-tuned with UPFT performed comparably to those trained using conventional, more resource-intensive methods. For instance, when applied to the Qwen2.5-Math-7B-Instruct model, UPFT achieved an improvement in average accuracy while using significantly fewer tokens during both training and inference. On benchmarks that demand complex reasoning, such as AIME2024, the method demonstrated a marked enhancement in performance, suggesting that early reasoning steps contain the essential cues needed for problem-solving.

    Additionally, UPFT’s efficiency in reducing computational costs is noteworthy. By working with substantially shorter token sequences, the training process becomes faster and less demanding on hardware, which could be particularly beneficial in scenarios where quick deployment or lower energy consumption is a priority.

    Conclusion

    The introduction of Unsupervised Prefix Fine-Tuning represents a thoughtful step toward more efficient and accessible methods for enhancing reasoning in large language models. By concentrating on the initial tokens—those that encapsulate the core of the reasoning process—this approach reduces the need for extensive labeled datasets and complex sampling strategies. Rather than relying on large-scale annotations or rejection sampling to correct errors later in the reasoning process, UPFT refines models by focusing on the parts of the response that are both consistent and informative.

    In questioning the necessity of expensive labeled data and rejection sampling, UPFT offers a more streamlined alternative: a minimal, unsupervised fine-tuning process that can yield significant improvements in reasoning performance. This refined approach not only makes training more resource-efficient but also opens the door to developing self-improving reasoning models in a more accessible manner, challenging some conventional assumptions about what effective model training requires.


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


    The post Tencent AI Lab Introduces Unsupervised Prefix Fine-Tuning (UPFT): An Efficient Method that Trains Models on only the First 8-32 Tokens of Single Self-Generated Solutions appeared first on MarkTechPost.

    Elevating AI Reasoning: The Art of Sampling for Learnability in LLM Training https://www.marktechpost.com/2025/02/27/elevating-ai-reasoning-the-art-of-sampling-for-learnability-in-llm-training/ Fri, 28 Feb 2025 03:53:45 +0000

    Reinforcement learning (RL) has become a core component in training large language models (LLMs) for reasoning tasks, particularly mathematical problem-solving. Yet a considerable inefficiency arises during training: many questions are either always solved or never solved, and this lack of variability in success rates means they yield no gradient signal and thus cannot improve the model’s performance. Traditional RL-based fine-tuning strategies therefore incur high computational costs, increased energy usage, and inefficient use of resources. Correcting this is necessary to improve training efficiency and to make language models learn from the problems that genuinely sharpen their reasoning.

    The standard training regimen for large language models (LLMs) uses policy gradient techniques, such as Proximal Policy Optimization (PPO), in which the model attempts each query repeatedly and is updated based on success or failure signals. One of the greatest drawbacks of this approach, however, is that the majority of training examples cluster at the extremes: either always correct or always incorrect. When an example is always solved correctly, repeated attempts provide no further learning information; conversely, an impossible query provides no feedback for improvement at all. As a result, precious computational resources are wasted on uninformative training scenarios. Curriculum-learning techniques such as Unsupervised Environment Design (UED) have attempted to control training difficulty dynamically, but they rely on heuristics like regret-based selection that poorly anticipate optimal problem difficulty and generalize weakly to the reasoning tasks relevant to LLM training.

    To address this inefficiency, a novel training policy has been proposed that prioritizes samples with a high variance of success rates, concentrating training on questions that are neither too easy nor too hard. By identifying problems on which the model performs inconsistently, the approach focuses training on the scenarios that provide the most informative learning signals. Unlike previous policies that assembled training batches by random sampling, this systematic selection method improves update efficiency by eliminating problems that offer no room for meaningful improvement. The procedure adapts during training, continuously re-optimizing question selection as the model’s abilities shift. By targeting instances of moderate difficulty, the approach enables faster learning and better generalization to novel tasks.

    The structured selection process operates through a multi-step pipeline that begins with the identification of candidate questions at each training iteration. Multiple rollouts are generated to estimate each problem’s probability of success, and the variance of the success rate is computed as p(1 − p), where p is the estimated likelihood of a correct solution; this score is maximized at p = 0.5, i.e., for questions the model solves about half the time. The most learnable questions, those with moderate success probabilities, are prioritized and stored in a dynamic buffer. Training batches are then formed by combining high-variance problems from this buffer with additional randomly sampled examples from the dataset, and the resulting batch is used to compute policy gradients and update the model parameters. The efficacy of this strategy is validated with two reinforcement learning algorithms, PPO and VinePPO, on two mathematical reasoning datasets: MATH, comprising 12,000 competition-level problems, and GSM8K, comprising 8,000 grade-school-level questions. Additional tests on the CollegeMath and OlympiadBench datasets quantify generalization beyond the original training distribution. The full framework combines VinePPO with systems-level optimizations such as gradient accumulation, multi-rollout estimation, and DeepSpeed ZeRO for scalable performance.
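
    A minimal sketch of this selection step follows; the rollout interface, buffer size, number of rollouts, and mixing ratio are illustrative assumptions rather than the authors’ implementation.

    import heapq
    import random

    def learnability(question, rollout, n_rollouts: int = 8) -> float:
        """Estimate p(1 - p), which peaks at p = 0.5 (moderate difficulty)."""
        p = sum(rollout(question) for _ in range(n_rollouts)) / n_rollouts
        return p * (1.0 - p)

    def select_batch(candidates, rollout, buffer_size=64, batch_size=32, mix=0.5):
        # Score every candidate and keep the highest-variance questions in a buffer.
        scored = [(learnability(q, rollout), q) for q in candidates]
        buffer = [q for _, q in heapq.nlargest(buffer_size, scored, key=lambda s: s[0])]
        # Mix buffered high-variance problems with randomly sampled ones,
        # mirroring the paper's batch-composition step.
        n_hard = int(batch_size * mix)
        easy = random.sample(candidates, batch_size - n_hard)
        return random.sample(buffer, n_hard) + easy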

    The learnability-driven selection mechanism greatly improves both the speed and the efficiency of model training. Models trained with this curriculum reach the accuracy of conventionally trained models in roughly a quarter of the training steps, a marked improvement in convergence rate. Performance improves consistently across datasets, with higher test accuracy on both GSM8K and MATH, and the structured curriculum also transfers to out-of-distribution tasks such as CollegeMath and OlympiadBench. Eliminating questions with zero learning signal optimizes training batch composition, and the approach is computationally economical because sample generation scales efficiently without redundant model updates. The combination of faster convergence, better generalization, and lower computational overhead makes this adaptive learning process a valuable and efficient tool for reinforcement learning-based LLM fine-tuning.

    This paradigm of targeting questions that offer high-variance learning opportunities effectively addresses the inefficiencies of reinforcement learning-based fine-tuning of language models. Focusing on the problems that produce the most informative training signals maximizes learning efficiency, yielding faster improvement and better adaptability to new samples. Large-scale experiments confirm that the strategy improves training speed, test accuracy, and generalization across multiple datasets. The findings highlight the promise of structured sample selection for refining model training and optimizing computational resources. Future work could investigate the strategy’s applicability to other reinforcement learning settings, such as reward model optimization, preference-based fine-tuning, and generalized decision-making tasks in AI.


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


    The post Elevating AI Reasoning: The Art of Sampling for Learnability in LLM Training appeared first on MarkTechPost.

    Microsoft AI Releases Phi-4-multimodal and Phi-4-mini: The Newest Models in Microsoft’s Phi Family of Small Language Models (SLMs) https://www.marktechpost.com/2025/02/27/microsoft-ai-releases-phi-4-multimodal-and-phi-4-mini-the-newest-models-in-microsofts-phi-family-of-small-language-models-slms/ Thu, 27 Feb 2025 20:45:13 +0000

    In today’s rapidly evolving technological landscape, developers and organizations often grapple with a series of practical challenges. One of the most significant hurdles is the efficient processing of diverse data types—text, speech, and vision—within a single system. Traditional approaches have typically required separate pipelines for each modality, leading to increased complexity, higher latency, and greater computational costs. In many applications—from healthcare diagnostics to financial analytics—these limitations can hinder the development of responsive and adaptive AI solutions. The need for models that balance robustness with efficiency is more pressing than ever. In this context, Microsoft’s recent work on small language models (SLMs) provides a promising approach by striving to consolidate capabilities in a compact, versatile package.

    Microsoft AI has recently introduced Phi-4-multimodal and Phi-4-mini, the newest additions to its Phi family of SLMs. These models have been developed with a clear focus on streamlining multimodal processing. Phi-4-multimodal is designed to handle text, speech, and visual inputs concurrently, all within a unified architecture. This integrated approach means that a single model can now interpret and generate responses based on varied data types without the need for separate, specialized systems.

    In contrast, Phi-4-mini is tailored specifically for text-based tasks. Despite being more compact, it has been engineered to excel in reasoning, coding, and instruction following. Both models are made accessible via platforms like Azure AI Foundry and Hugging Face, ensuring that developers from a range of industries can experiment with and integrate these models into their applications. This balanced release represents a thoughtful step towards making advanced AI more practical and accessible.

    Technical Details and Benefits

    At the technical level, Phi-4-multimodal is a 5.6-billion-parameter model that incorporates a mixture-of-LoRAs—a method that allows the integration of speech, vision, and text within a single representation space. This design significantly simplifies the architecture by removing the need for separate processing pipelines. As a result, the model not only reduces computational overhead but also achieves lower latency, which is particularly beneficial for real-time applications.
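
    As a conceptual illustration only (the rank, dimensions, and routing-by-modality scheme below are our assumptions; Microsoft’s implementation is not reproduced here), a modality-routed LoRA layer could look like this:

    import torch
    import torch.nn as nn

    class MixtureOfLoRALinear(nn.Module):
        """A frozen shared projection with one low-rank adapter per modality."""
        def __init__(self, d_in=4096, d_out=4096, rank=16,
                     modalities=("text", "vision", "speech")):
            super().__init__()
            self.base = nn.Linear(d_in, d_out, bias=False)
            self.base.weight.requires_grad_(False)  # the shared backbone stays frozen
            self.lora_A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False)
                                         for m in modalities})
            self.lora_B = nn.ModuleDict({m: nn.Linear(rank, d_out, bias=False)
                                         for m in modalities})

        def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
            # The input's modality selects which low-rank update is added,
            # so all modalities share a single representation space.
            return self.base(x) + self.lora_B[modality](self.lora_A[modality](x))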

    Phi-4-mini, with its 3.8-billion parameters, is built as a dense, decoder-only transformer. It features grouped-query attention and boasts a vocabulary of 200,000 tokens, enabling it to handle sequences of up to 128,000 tokens. Despite its smaller size, Phi-4-mini performs remarkably well in tasks that require deep reasoning and language understanding. One of its standout features is the capability for function calling—allowing it to interact with external tools and APIs, thus extending its practical utility without requiring a larger, more resource-intensive model.
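
    Function calling typically follows the familiar pattern of advertising tool schemas to the model and parsing a structured call from its reply. The sketch below shows that generic pattern only; the exact prompt format Phi-4-mini expects is model-specific, and the tool, schema layout, and lookup_price helper are hypothetical.

    import json

    tools = [{
        "name": "get_stock_price",  # hypothetical tool for illustration
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {"ticker": {"type": "string"}},
    }]

    system_prompt = (
        "You may call one of the following tools by replying with a JSON object "
        f"of the form {{\"name\": ..., \"arguments\": ...}}:\n{json.dumps(tools)}"
    )

    def lookup_price(ticker: str) -> str:
        return f"{ticker}: 123.45"  # stub standing in for a real data source

    def maybe_dispatch(model_reply: str) -> str:
        """Execute the tool call if the reply is one; otherwise pass text through."""
        try:
            call = json.loads(model_reply)
            if isinstance(call, dict) and call.get("name") == "get_stock_price":
                return lookup_price(call["arguments"]["ticker"])
        except json.JSONDecodeError:
            pass
        return model_reply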

    Both models have been optimized for on-device execution. This optimization is particularly important for applications in environments with limited compute resources or in edge computing scenarios. The models’ reduced computational requirements make them a cost-effective choice, ensuring that advanced AI functionalities can be deployed even on devices that do not have extensive processing capabilities.
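
    Since both models ship through Hugging Face, a minimal text-generation sketch for Phi-4-mini could look like the following; the repository ID is an assumption based on the announced release and may differ.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-4-mini-instruct"  # assumed repo ID; check the actual release
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))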

    Performance Insights and Benchmark Data

    Benchmark results give a clear view of how these models perform in practice. In automatic speech recognition (ASR), for instance, Phi-4-multimodal achieved a word error rate (WER) of 6.14%, a modest improvement over previous models such as WhisperV3, which reported a WER of 6.5%. Such gains are particularly significant in applications where speech recognition accuracy is critical.

    Beyond ASR, Phi-4-multimodal also shows robust performance in tasks such as speech translation and summarization. Its ability to process visual inputs is notable in tasks like document reasoning, chart understanding, and optical character recognition (OCR). In several benchmarks—ranging from synthetic speech interpretation on visual data to document analysis—the model’s performance consistently aligns with or exceeds that of larger, more resource-intensive models.

    Similarly, Phi-4-mini has been evaluated on a variety of language benchmarks, where it holds its own despite its more compact design. Its aptitude for reasoning, handling complex mathematical problems, and coding tasks underlines its versatility in text-based applications. The inclusion of a function-calling mechanism further enriches its potential, enabling the model to draw on external data and tools seamlessly. These results underscore a measured and thoughtful improvement in multimodal and language processing capabilities, providing clear benefits without overstating its performance.

    Conclusion

    The introduction of Phi-4-multimodal and Phi-4-mini by Microsoft marks an important evolution in the field of AI. Rather than relying on bulky, resource-demanding architectures, these models offer a refined balance between efficiency and performance. By integrating multiple modalities in a single, cohesive framework, Phi-4-multimodal simplifies the complexity inherent in multimodal processing. Meanwhile, Phi-4-mini provides a robust solution for text-intensive tasks, proving that smaller models can indeed offer significant capabilities.


    Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


    The post Microsoft AI Releases Phi-4-multimodal and Phi-4-mini: The Newest Models in Microsoft’s Phi Family of Small Language Models (SLMs) appeared first on MarkTechPost.
