New Releases Category - MarkTechPost

LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
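
To make the modular pipeline above concrete, here is a minimal, illustrative PyTorch sketch. Toy modules and dimensions stand in for Whisper-large-v3, Qwen2.5, and the CosyVoice2-style decoder; the downsampling stride and the sigmoid gate are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions; the real model uses Whisper-large-v3 features and Qwen2.5 hidden sizes.
ENC_DIM, LLM_DIM, N_SPEECH_TOKENS = 64, 128, 500

class SpeechAdapter(nn.Module):
    """Downsample encoder frames, then project into the LLM embedding space."""
    def __init__(self, enc_dim, llm_dim, stride=2):
        super().__init__()
        self.stride = stride
        self.ffn = nn.Sequential(nn.Linear(enc_dim * stride, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
    def forward(self, feats):                       # feats: (B, T, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.stride
        grouped = feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.ffn(grouped)                    # (B, T/stride, llm_dim)

class GatedFusion(nn.Module):
    """Fuse LLM hidden states with text embeddings before speech synthesis."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
    def forward(self, hidden, text_emb):
        g = torch.sigmoid(self.gate(torch.cat([hidden, text_emb], dim=-1)))
        return g * hidden + (1 - g) * text_emb

# Dummy stand-ins for Whisper, Qwen2.5, and the autoregressive TTS decoder.
speech_encoder = nn.Linear(80, ENC_DIM)             # mel frames -> acoustic features
adapter = SpeechAdapter(ENC_DIM, LLM_DIM)
core_llm = nn.TransformerEncoder(nn.TransformerEncoderLayer(LLM_DIM, 4, batch_first=True), 2)
fusion = GatedFusion(LLM_DIM)
tts_head = nn.Linear(LLM_DIM, N_SPEECH_TOKENS)      # predicts discrete speech tokens

mel = torch.randn(1, 300, 80)                        # ~3 s of 10 ms mel frames
hidden = core_llm(adapter(speech_encoder(mel)))      # speech-conditioned LLM states
text_emb = torch.randn_like(hidden)                  # embeddings of the generated text
speech_logits = tts_head(fusion(hidden, text_emb))   # tokens for the vocoder stage
print(speech_logits.shape)                           # torch.Size([1, 150, 500])
```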

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
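
A toy scheduler illustrates the read-write idea. The token strings and placeholder speech tokens below are invented; in the real system the streaming TTS decoder produces the W speech tokens from the buffered text.

```python
def read_write_schedule(llm_tokens, R=3, W=10):
    """Toy read-write scheduler: after every R text tokens are 'read' from the LLM,
    'write' W speech tokens so audio can start playing before the reply finishes."""
    pending = []
    for tok in llm_tokens:
        pending.append(tok)
        if len(pending) == R:
            speech = [f"<speech_{i}>" for i in range(W)]  # stand-in for TTS decoder output
            yield pending, speech
            pending = []
    if pending:                                           # flush any trailing text tokens
        yield pending, [f"<speech_{i}>" for i in range(W)]

for text_chunk, speech_chunk in read_write_schedule("The weather today is sunny .".split()):
    print(text_chunk, "->", len(speech_chunk), "speech tokens")
```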

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

Model               Llama Q (S2S)   Web Q (S2S)   GPT-4o Score   ASR-WER   Latency (ms)
GLM-4-Voice (9B)    50.7            15.9          4.09           3.48      1562.8
LLaMA-Omni (8B)     49.0            23.7          3.52           3.67      346.7
LLaMA-Omni2-7B      60.7            31.3          4.15           3.26      582.9

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


Check out the Paper, the model on Hugging Face, and the GitHub page.

NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second

NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering real-time factor (RTF) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy

At the heart of Parakeet TDT 0.6B’s appeal is its unmatched speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, a performance that’s over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER)—the best-in-class among open models.

This performance represents a significant leap forward for enterprise-grade speech applications, including real-time transcription, voice-based analytics, call center intelligence, and audio content indexing.

Technical Overview

Parakeet TDT 0.6B builds on a transformer-based architecture fine-tuned with high-quality transcription data and optimized for inference on NVIDIA hardware. Here are the key highlights:

  • 600M parameter encoder-decoder model
  • Quantized and fused kernels for maximum inference efficiency
  • Optimized for the TDT (Token-and-Duration Transducer) architecture
  • Supports accurate timestamp formatting, numerical formatting, and punctuation restoration
  • Pioneers song-to-lyrics transcription, a rare capability in ASR models

The model’s high-speed inference is powered by NVIDIA’s TensorRT and FP8 quantization, enabling it to reach an RTF of 3386, meaning it processes audio 3,386 times faster than real time.
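
A quick back-of-the-envelope check connects the RTF figure to the one-hour-in-one-second claim:

```python
audio_seconds = 60 * 60            # one hour of audio
rtf = 3386                         # real-time factor reported for Parakeet TDT 0.6B
processing_seconds = audio_seconds / rtf
print(f"{processing_seconds:.2f} s")   # ~1.06 s, i.e. roughly one second per hour of audio
```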

Benchmark Leadership

On the Hugging Face Open ASR Leaderboard—a standardized benchmark for evaluating speech models across public datasets—Parakeet TDT 0.6B leads with the lowest WER recorded among open-source models. This positions it well above comparable models like Whisper from OpenAI and other community-driven efforts.

Leaderboard data as of May 5, 2025.

This performance makes Parakeet V2 not only a leader in quality but also in deployment readiness for latency-sensitive applications.

Beyond Conventional Transcription

Parakeet is not just about speed and word error rate. NVIDIA has embedded unique capabilities into the model:

  • Song-to-lyrics transcription: Unlocks transcription for sung content, expanding use cases into music indexing and media platforms.
  • Numerical and timestamp formatting: Improves readability and usability in structured contexts like meeting notes, legal transcripts, and health records.
  • Punctuation restoration: Enhances natural readability for downstream NLP applications.

These features elevate the quality of transcripts and reduce the burden on post-processing or human editing, especially in enterprise-grade deployments.

Strategic Implications

The release of Parakeet TDT 0.6B represents another step in NVIDIA’s strategic investment in AI infrastructure and open ecosystem leadership. With strong momentum in foundational models (e.g., Nemotron for language and BioNeMo for protein design), NVIDIA is positioning itself as a full-stack AI company—from GPUs to state-of-the-art models.

For the AI developer community, this open release could become the new foundation for building speech interfaces in everything from smart devices and virtual assistants to multimodal AI agents.

Getting Started

Parakeet TDT 0.6B is available now on Hugging Face, complete with model weights, tokenizer, and inference scripts. It runs optimally on NVIDIA GPUs with TensorRT, but support is also available for CPU environments with reduced throughput.
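
A minimal usage sketch with NVIDIA's NeMo toolkit, the usual interface for Parakeet checkpoints, might look like the following. The repository id and the exact return type of transcribe() are assumptions to verify against the model card.

```python
# pip install -U nemo_toolkit['asr']
import nemo.collections.asr as nemo_asr

# Model id assumed from the release naming; check the Hugging Face page for the exact repo.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a local 16 kHz mono WAV file; depending on the NeMo version the result is a
# plain string or a hypothesis object carrying text plus timestamps.
output = asr_model.transcribe(["sample_audio.wav"])
print(output[0])
```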

Whether you’re building transcription services, annotating massive audio datasets, or integrating voice into your product, Parakeet TDT 0.6B offers a compelling open-source alternative to commercial APIs.


Check out the model on Hugging Face.

Meta AI Releases Llama Prompt Ops: A Python Toolkit for Prompt Optimization on Llama Models

Meta AI has released Llama Prompt Ops, a Python package designed to streamline the process of adapting prompts for Llama models. This open-source tool is built to help developers and researchers improve prompt effectiveness by transforming inputs that work well with other large language models (LLMs) into forms that are better optimized for Llama. As the Llama ecosystem continues to grow, Llama Prompt Ops addresses a critical gap: enabling smoother and more efficient cross-model prompt migration while enhancing performance and reliability.

Why Prompt Optimization Matters

Prompt engineering plays a crucial role in the effectiveness of any LLM interaction. However, prompts that perform well on one model—such as GPT, Claude, or PaLM—may not yield similar results on another. This discrepancy is due to architectural and training differences across models. Without tailored optimization, prompt outputs can be inconsistent, incomplete, or misaligned with user expectations.

Llama Prompt Ops solves this challenge by introducing automated and structured prompt transformations. The package makes it easier to fine-tune prompts for Llama models, helping developers unlock their full potential without relying on trial-and-error tuning or domain-specific knowledge.

What Is Llama Prompt Ops?

At its core, Llama Prompt Ops is a library for systematic prompt transformation. It applies a set of heuristics and rewriting techniques to existing prompts, optimizing them for better compatibility with Llama-based LLMs. The transformations consider how different models interpret prompt elements such as system messages, task instructions, and conversation history.

This tool is particularly useful for:

  • Migrating prompts from proprietary or incompatible models to open Llama models.
  • Benchmarking prompt performance across different LLM families.
  • Fine-tuning prompt formatting for improved output consistency and relevance.

Features and Design

Llama Prompt Ops is built with flexibility and usability in mind. Its key features include:

  • Prompt Transformation Pipeline: The core functionality is organized into a transformation pipeline. Users can specify the source model (e.g., gpt-3.5-turbo) and target model (e.g., llama-3) to generate an optimized version of a prompt. These transformations are model-aware and encode best practices that have been observed in community benchmarks and internal evaluations.
  • Support for Multiple Source Models: While optimized for Llama as the output model, Llama Prompt Ops supports inputs from a wide range of common LLMs, including OpenAI’s GPT series, Google’s Gemini, and Anthropic’s Claude.
  • Test Coverage and Reliability: The repository includes a suite of prompt transformation tests that ensure transformations are robust and reproducible. This ensures confidence for developers integrating it into their workflows.
  • Documentation and Examples: Clear documentation accompanies the package, making it easy for developers to understand how to apply transformations and extend the functionality as needed.

How It Works

The tool applies modular transformations to the prompt’s structure. Each transformation rewrites parts of the prompt, such as:

  • Replacing or removing proprietary system message formats.
  • Reformatting task instructions to suit Llama’s conversational logic.
  • Adapting multi-turn histories into formats more natural for Llama models.

The modular nature of these transformations allows users to understand what changes are made and why, making it easier to iterate and debug prompt modifications.
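
For intuition, here is a hand-rolled example of the kind of rewrite described above: converting an OpenAI-style message list into the Llama 3 chat format. This is not the llama-prompt-ops API, only an illustration of one such transformation.

```python
# Illustrative only: NOT the llama-prompt-ops API, just a hand-written example of the kind
# of model-aware rewrite the pipeline performs (GPT-style messages -> Llama 3 chat format).
def to_llama3_prompt(messages):
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n{msg['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")  # cue the model to answer
    return "".join(parts)

gpt_style = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the benefits of prompt optimization."},
]
print(to_llama3_prompt(gpt_style))
```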

Conclusion

As large language models continue to evolve, the need for prompt interoperability and optimization grows. Meta’s Llama Prompt Ops offers a practical, lightweight, and effective solution for improving prompt performance on Llama models. By bridging the formatting gap between Llama and other LLMs, it simplifies adoption for developers while promoting consistency and best practices in prompt engineering.


Check out the GitHub page.

IBM AI Releases Granite 4.0 Tiny Preview: A Compact Open-Language Model Optimized for Long-Context and Instruction Tasks

IBM has introduced a preview of Granite 4.0 Tiny, the smallest member of its upcoming Granite 4.0 family of language models. Released under the Apache 2.0 license, this compact model is designed for long-context tasks and instruction-following scenarios, striking a balance between efficiency, transparency, and performance. The release reflects IBM’s continued focus on delivering open, auditable, and enterprise-ready foundation models.

Granite 4.0 Tiny Preview includes two key variants: the Base-Preview, which showcases a novel decoder-only architecture, and the Tiny-Preview (Instruct), which is fine-tuned for dialog and multilingual applications. Despite its reduced parameter footprint, Granite 4.0 Tiny demonstrates competitive results on reasoning and generation benchmarks—underscoring the benefits of its hybrid design.

Architecture Overview: A Hybrid MoE with Mamba-2-Style Dynamics

At the core of Granite 4.0 Tiny lies a hybrid Mixture-of-Experts (MoE) structure, with 7 billion total parameters and only 1 billion active parameters per forward pass. This sparsity allows the model to deliver scalable performance while significantly reducing computational overhead—making it well-suited for resource-constrained environments and edge inference.

The Base-Preview variant employs a decoder-only architecture augmented with Mamba-2-style layers—a linear recurrent alternative to traditional attention mechanisms. This architectural shift enables the model to scale more efficiently with input length, enhancing its suitability for long-context tasks such as document understanding, dialogue summarization, and knowledge-intensive QA.

Another notable design decision is the use of NoPE (No Positional Encodings). Instead of fixed or learned positional embeddings, the model integrates position handling directly into its layer dynamics. This approach improves generalization across varying input lengths and helps maintain consistency in long-sequence generation.
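
For readers unfamiliar with sparse MoE layers, the toy PyTorch layer below shows the basic routing idea: each token is sent to its top-k experts, so only a fraction of the layer's parameters is active per forward pass. The expert count, k, and dimensions are arbitrary, and the sketch omits the Mamba-2-style layers and NoPE details that distinguish Granite 4.0 Tiny.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts layer: route each token to its top-k experts,
    so only a fraction of the total parameters is active per forward pass."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                                   # x: (tokens, dim)
        scores = self.router(x)                             # (tokens, n_experts)
        weights, idx = torch.topk(scores.softmax(-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # dense loops for clarity, not speed
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoELayer()(tokens).shape)                         # torch.Size([10, 64])
```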

Benchmark Performance: Efficiency Without Compromise

Despite being a preview release, Granite 4.0 Tiny already exhibits meaningful performance gains over prior models in IBM’s Granite series. On benchmark evaluations, the Base-Preview demonstrates:

  • +5.6 improvement on DROP (Discrete Reasoning Over Paragraphs), a benchmark for multi-hop QA
  • +3.8 on AGIEval, which assesses general language understanding and reasoning

These improvements are attributed to both the model’s architecture and its extensive pretraining—reportedly on 2.5 trillion tokens, spanning diverse domains and linguistic structures.

Instruction-Tuned Variant: Designed for Dialogue, Clarity, and Multilingual Reach

The Granite-4.0-Tiny-Preview (Instruct) variant extends the base model through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), using a Tülu-style dataset consisting of both open and synthetic dialogues. This variant is tailored for instruction-following and interactive use cases.

Supporting 8,192 token input windows and 8,192 token generation lengths, the model maintains coherence and fidelity across extended interactions. Unlike encoder–decoder hybrids that often trade off interpretability for performance, the decoder-only setup here yields clearer and more traceable outputs—a valuable feature for enterprise and safety-critical applications.

Evaluation Scores:

  • 86.1 on IFEval, indicating strong performance in instruction-following benchmarks
  • 70.05 on GSM8K, for grade-school math problem solving
  • 82.41 on HumanEval, measuring Python code generation accuracy

Moreover, the instruct model supports multilingual interaction across 12 languages, making it viable for global deployments in customer service, enterprise automation, and educational tools.

Open-Source Availability and Ecosystem Integration

IBM has made both models publicly available on Hugging Face. They are accompanied by full model weights, configuration files, and sample usage scripts under the Apache 2.0 license, encouraging transparent experimentation, fine-tuning, and integration across downstream NLP workflows.

Outlook: Laying the Groundwork for Granite 4.0

Granite 4.0 Tiny Preview serves as an early glimpse into IBM’s broader strategy for its next-generation language model suite. By combining efficient MoE architectures, long-context support, and instruction-focused tuning, the model family aims to deliver state-of-the-art capabilities in a controllable and resource-efficient package.

As more variants of Granite 4.0 are released, we can expect IBM to deepen its investment in responsible, open AI—positioning itself as a key player in shaping the future of transparent, high-performance language models for enterprise and research.


Check out the technical details, the Granite 4.0 Tiny Base Preview, and the Granite 4.0 Tiny Instruct Preview.

JetBrains Open Sources Mellum: A Developer-Centric Language Model for Code-Related Tasks

JetBrains has officially open-sourced Mellum, a purpose-built 4-billion-parameter language model tailored for software development tasks. Developed from the ground up, Mellum reflects JetBrains’ engineering-first approach, offering a domain-specialized model trained for practical usage across codebases and programming environments. With its release on Hugging Face under the Apache 2.0 license, JetBrains extends an invitation to the broader research and developer community to experiment, adapt, and advance Mellum’s capabilities.

A Focal Model for Code Understanding

Unlike general-purpose LLMs, Mellum is classified by JetBrains as a “focal model”—a term they use to describe models with a narrow yet deep specialization. Mellum is optimized specifically for programming-related tasks such as autocompletion, infilling, and structural understanding of source code. This focused design avoids the overhead of broader linguistic modeling and enables the model to perform efficiently in IDE-like environments.

The model supports a wide array of languages including Java, Kotlin, Python, Go, PHP, C, C++, C#, JavaScript, TypeScript, CSS, HTML, Rust, and Ruby—reflecting the polyglot nature of modern development teams.

Model Architecture and Training Pipeline

Mellum follows a LLaMA-style architecture and was trained from scratch using over 4.2 trillion tokens drawn from code-rich sources such as The Stack, StarCoder, CommitPack, and English Wikipedia. It features an 8K token context window and was trained using bf16 mixed precision across a high-throughput cluster of 256 NVIDIA H200 GPUs connected via Infiniband.

The training process spanned approximately 20 days and leveraged modern infrastructure for scalable model development. The architecture and training procedure were designed with reproducibility and deployment flexibility in mind, making Mellum usable in both cloud inference setups (e.g., vLLM) and on local environments (e.g., llama.cpp, Ollama).

Benchmarking and Evaluation

JetBrains evaluated Mellum across a range of benchmarks that reflect its primary use cases—code infilling and completion. The model’s performance indicates strong alignment with the design goals:

  • RepoBench v1.1 (8K context):
    • Python EM: 27.97%
    • Java EM: 31.08%
  • SAFIM (Syntax-Aware Fill-in-the-Middle):
    • pass@1: 38.11%
  • HumanEval Infilling:
    • Single-line: 66.21%
    • Multi-line: 38.52%
    • Random-span: 29.70%

These results reflect Mellum’s specialization for structured code understanding, especially in scenarios involving partial or interrupted code, which are common in real-world development workflows.

Rationale for Open Sourcing

JetBrains’ decision to release Mellum as open-source is grounded in several practical motivations:

  • Transparency: Enables scrutiny of both training data and architectural decisions.
  • Reusability: Supports integration in custom development environments and research experiments.
  • Community Collaboration: Facilitates contribution from external developers to refine model behavior.
  • Pedagogical Value: Provides educators and students with a hands-on artifact for understanding how domain-specific LLMs are constructed and applied.

The release includes both the base model (Mellum-4b-base) and a fine-tuned variant for Python (Mellum-4b-sft-python).
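
A minimal loading sketch with Hugging Face Transformers is shown below; the repository id is inferred from the announced model name and should be confirmed on the JetBrains Hugging Face page.

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the announced model name; confirm on the JetBrains Hugging Face page.
model_id = "JetBrains/Mellum-4b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Simple code-completion style prompt: let the model continue a partial function.
prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```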

Implications for Developer Tooling

The availability of a compact, performant model optimized for source code opens new opportunities in the IDE space and beyond. JetBrains envisions Mellum as part of a broader strategy involving multiple focal models, each optimized for specific programming tasks such as diff generation or code review assistance. This approach aligns with the growing need for deployable, cost-effective, and context-aware AI tooling that can augment developer productivity without introducing opaque or oversized general-purpose models.

Conclusion

Mellum represents a deliberate shift toward smaller, specialized language models that prioritize utility, transparency, and efficiency. By making the model openly available, JetBrains offers a high-quality foundation for building the next generation of AI-assisted developer tools. Its architecture, training methodology, and benchmark performance signal a practical step forward in the evolving space of LLMs tailored for software engineering.


Meta and Booz Allen Deploy Space Llama: Open-Source AI Heads to the ISS for Onboard Decision-Making

In a significant step toward enabling autonomous AI systems in space, Meta and Booz Allen Hamilton have announced the deployment of Space Llama, a customized instance of Meta’s open-source large language model, Llama 3.2, aboard the International Space Station (ISS) U.S. National Laboratory. This initiative marks one of the first practical integrations of an LLM in a remote, bandwidth-limited, space-based environment.

Addressing Disconnection and Autonomy Challenges

Unlike terrestrial applications, AI systems deployed in orbit face strict constraints—limited compute resources, constrained bandwidth, and high-latency communication links with ground stations. Space Llama has been designed to function entirely offline, allowing astronauts to access technical assistance, documentation, and maintenance protocols without requiring live support from mission control.

To address these constraints, the AI model had to be optimized for onboard deployment, incorporating the ability to reason over mission-specific queries, retrieve context from local data stores, and interact with astronauts in natural language—all without internet connectivity.

Technical Framework and Integration Stack

The deployment leverages a combination of commercially available and mission-adapted technologies:

  • Llama 3.2: Meta’s latest open-source LLM serves as the foundation, fine-tuned for contextual understanding and general reasoning tasks in edge environments. Its open architecture enables modular adaptation for aerospace-grade applications.
  • A2E2™ (AI for Edge Environments): Booz Allen’s AI framework provides containerized deployment and modular orchestration tailored to constrained environments like the ISS. It abstracts complexity in model serving and resource allocation across diverse compute layers.
  • HPE Spaceborne Computer-2: This edge computing platform, developed by Hewlett Packard Enterprise, provides reliable high-performance processing hardware for space. It supports real-time inference workloads and model updates when necessary.
  • NVIDIA CUDA-capable GPUs: These enable the accelerated execution of transformer-based inference tasks while staying within the ISS’s strict power and thermal budgets.

This integrated stack ensures that the model operates within the limits of orbital infrastructure, delivering utility without compromising reliability.

Open-Source Strategy for Aerospace AI

The selection of an open-source model like Llama 3.2 aligns with growing momentum around transparency and adaptability in mission-critical AI. The benefits include:

  • Modifiability: Engineers can tailor the model to meet specific operational requirements, such as natural language understanding in mission terminology or handling multi-modal astronaut inputs.
  • Data Sovereignty: With all inference running locally, sensitive data never needs to leave the ISS, ensuring compliance with NASA and partner agency privacy standards.
  • Resource Optimization: Open access to the model’s architecture allows for fine-grained control over memory and compute use—critical for environments where system uptime and resilience are prioritized.
  • Community-Based Validation: Using a widely studied open-source model promotes reproducibility, transparency in behavior, and better testing under mission simulation conditions.

Toward Long-Duration and Autonomous Missions

Space Llama is not just a research demonstration—it lays the groundwork for embedding AI systems into longer-term missions. In future scenarios like lunar outposts or deep-space habitats, where round-trip communication latency with Earth spans minutes or hours, onboard intelligent systems must assist with diagnostics, operations planning, and real-time problem-solving.

Furthermore, the modular nature of Booz Allen’s A2E2 platform opens up the potential for expanding the use of LLMs to non-space environments with similar constraints—such as polar research stations, underwater facilities, or forward operating bases in military applications.

Conclusion

The Space Llama initiative represents a methodical advancement in deploying AI systems to operational environments beyond Earth. By combining Meta’s open-source LLMs with Booz Allen’s edge deployment expertise and proven space computing hardware, the collaboration demonstrates a viable approach to AI autonomy in space.

Rather than aiming for generalized intelligence, the model is engineered for bounded, reliable utility in mission-relevant contexts—an important distinction in environments where robustness and interpretability take precedence over novelty.

As space systems become more software-defined and AI-assisted, efforts like Space Llama will serve as reference points for future AI deployments in autonomous exploration and off-Earth habitation.


Check out the details here.

Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning

Large language models (LLMs) face significant challenges when trained as autonomous agents in interactive environments. Unlike static tasks, agent settings require sequential decision-making, cross-turn memory maintenance, and adaptation to stochastic environmental feedback. These capabilities are essential for developing effective planning assistants, robotics applications, and tutoring agents that can self-improve through experience. While reinforcement learning (RL) has been applied to LLMs using rule-based rewards, training self-evolving agents that can reason and adapt remains underexplored. Current approaches suffer from training instability, complex reward signal interpretation, and limited generalisation across varying prompts or changing environments, particularly during multi-turn interactions with unpredictable feedback. The fundamental question emerges: which design elements are crucial for creating LLM agents that learn effectively and maintain stability throughout their evolution?

Through diverse methodologies, RL has significantly advanced LLMs’ reasoning capabilities. PPO maintains training stability by clipping policy updates, while GRPO enhances systematic problem-solving abilities. SAC employs entropy-regularised objectives for robust exploration, and meta tokens facilitate structured thinking. PRM and MCTS-based approaches have further improved systematic reasoning. Simultaneously, chain-of-thought techniques like STaR iteratively utilise small rationale examples alongside larger datasets. At the same time, DAPO, Dr. GRPO, and Open Reasoner Zero demonstrate that minimalist RL techniques with decoupled clipping and simple reward schemes can substantially enhance reasoning performance.

LLM agent architectures have evolved from basic reasoning-action frameworks to structured planning approaches and complex multi-agent systems. Testing environments range from specialised platforms like Sokoban and FrozenLake to general-purpose frameworks like HuggingGPT, enabling applications from web navigation to coding assistance and embodied tasks. Despite these advances, challenges persist in architectural complexity and self-correction, particularly for diverse multi-step reasoning tasks where maintaining coherence across interactions remains problematic.

Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified framework for trajectory-level agent training with flexible control over reasoning processes, reward mechanisms, and prompt structures. Building on this framework, they developed RAGEN, a modular system implementing complete training loops for analysing LLM agent dynamics in multi-turn stochastic environments. To isolate learning factors from confounding variables like pretrained knowledge, evaluation focuses on three controlled gaming environments: Bandit (single-turn, stochastic), Sokoban (multi-turn, deterministic), and Frozen Lake (multi-turn, stochastic). These minimalistic environments require policy learning through interaction rather than relying on pre-existing knowledge. The analysis reveals three critical dimensions of agent learning: gradient stability issues in multi-turn reinforcement learning, the importance of rollout frequency and diversity in shaping agent evolution, and the need for carefully designed reward signals to develop genuine reasoning capabilities rather than shallow action selection or hallucinated thinking processes.

StarPO represents a unique framework designed specifically for optimising multi-turn interaction trajectories in LLM agents. Unlike traditional approaches that treat each action independently, StarPO optimises entire trajectories—including observations, reasoning traces, actions, and feedback—as coherent units. This trajectory-level approach is particularly suited for interactive environments where agents must maintain memory across turns and adapt to stochastic feedback. StarPO’s objective function focuses on maximising expected rewards across complete trajectories rather than individual steps, making it directly compatible with autoregressive LLMs through decomposition into token-level likelihoods. The framework integrates reasoning-guided structured outputs that combine both intermediate thinking processes and executable actions, enabling agents to develop more sophisticated decision-making capabilities while maintaining learning stability in complex environments.
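
Schematically, the trajectory-level objective described above can be written as follows; the notation is a paraphrase for illustration, not the paper's exact formulation.

```latex
% Maximize the expected return of whole rollouts rather than single steps; each action a_t
% (a reasoning trace plus an executable action) factors into token-level likelihoods of the
% autoregressive LLM policy.
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[ R(\tau) \big],
\qquad
\pi_\theta\big(a_t \mid s_t\big) \;=\; \prod_{j=1}^{|a_t|} \pi_\theta\big(a_{t,j} \mid s_t,\, a_{t,<j}\big).
```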

Experimental results reveal that StarPO-S significantly outperforms vanilla StarPO across multiple agent tasks. By implementing uncertainty-based instance filtering, KL term removal, and asymmetric clipping, StarPO-S effectively delays performance collapse and enhances final task outcomes. The stabilised approach demonstrates particular effectiveness in complex environments like FrozenLake and Sokoban, where retaining only 25-50% of high-variance rollouts dramatically improves training stability while reducing computational requirements by up to 50%.

Task diversity and interaction granularity significantly impact performance. Models trained with higher task diversity and 4-6 actions per turn demonstrate superior generalisation capabilities across novel vocabulary and larger environments. Also, frequent rollout updates prove critical for maintaining alignment between optimisation targets and policy behavior. Agents trained with up-to-date rollouts every 1-10 updates achieve faster convergence and higher success rates compared to those relying on outdated trajectory data.

Symbolic reasoning benefits vary substantially between single-turn and multi-turn tasks. While reasoning traces significantly improve generalisation in single-turn Bandit environments, they provide limited advantage in complex multi-turn settings like Sokoban and FrozenLake. Analysis shows reasoning length consistently declines during training, suggesting models gradually suppress their thought processes when rewards are sparse and delayed. This highlights the need for reward mechanisms that directly reinforce intermediate reasoning steps rather than relying solely on outcome-based feedback.

This research establishes reinforcement learning as a viable approach for training language agents in complex, stochastic environments. StarPO-S represents a significant advancement in stabilising multi-turn agent training through uncertainty-based sampling and exploration encouragement. By transitioning from human supervision to verifiable outcome-based rewards, this framework creates opportunities for developing more capable AI systems across theorem proving, software engineering, and scientific discovery. Future work should focus on multi-modal inputs, enhanced training efficiency, and applications to increasingly complex domains with verifiable objectives.


Check out the Paper and the GitHub page.

DeepSeek-AI Released DeepSeek-Prover-V2: An Open-Source Large Language Model Designed for Formal Theorem Proving through Subgoal Decomposition and Reinforcement Learning

Formal mathematical reasoning has evolved into a specialized subfield of artificial intelligence that requires strict logical consistency. Unlike informal problem solving, which allows for intuition and loosely defined heuristics, formal theorem proving relies on every step being fully described, precise, and verifiable by computational systems. Proof assistants, such as Lean, Coq, and Isabelle, serve as the structural frameworks within which these formal proofs are constructed. Their operation demands logical soundness with no space for omissions, approximations, or unstated assumptions. This makes the challenge particularly demanding for AI systems, especially large language models, which excel in producing coherent natural language responses but typically lack the rigor to produce verifiable formal proofs. However, the desire to blend these strengths, AI’s fluency in informal reasoning and the structure of formal verification, has led to new innovations at the interface of language modeling and formal logic automation.

A major issue arises from the inability of current language models to bridge the conceptual divide between informal and formal reasoning. Language models typically excel at generating human-like explanations and solving math problems written in natural language. However, this reasoning is inherently informal and often lacks the structural precision required by formal logic systems. While humans can intuitively leap from one deductive step to another, proof assistants require a fully specified sequence of steps, free of ambiguity. Thus, the challenge is to guide AI models to produce logically coherent formal outputs from their otherwise informal and intuitive internal reasoning processes. This problem becomes increasingly complex when handling advanced theorems from domains such as number theory or geometry, where precision is crucial.

Recent efforts have attempted to address this issue by guiding models first to generate natural language proof sketches, which are then manually or semi-automatically translated into formal proof steps. A known strategy includes decomposing a complex theorem into smaller subgoals. Each subgoal represents a lemma that can be tackled independently and later combined to form a complete proof. Frameworks like “Draft, Sketch, and Prove” have applied this idea, using language models to generate proof outlines that are then translated into formal language. Another method employs hierarchical reinforcement learning, breaking down complex mathematical problems into simpler layers. However, these models often struggle to produce fully verifiable outputs in Lean or Coq environments. Moreover, the training data for these models is usually limited, and proof attempts frequently fail to yield successful outcomes that provide useful learning signals.

A team of researchers from DeepSeek-AI has introduced a new model, DeepSeek-Prover-V2, designed to generate formal mathematical proofs by leveraging subgoal decomposition and reinforcement learning. The core of their approach utilizes DeepSeek-V3 to break down a complex theorem into manageable subgoals, each of which is translated into a “have” statement in Lean 4 with a placeholder indicating that the proof is incomplete. These subgoals are then passed to a 7B-sized prover model that completes each proof step. Once all steps are resolved, they are synthesized into a complete Lean proof and paired with the original natural language reasoning generated by DeepSeek-V3. This forms a rich cold-start dataset for reinforcement learning. Importantly, the model’s training is entirely bootstrapped from synthetic data, with no human-annotated proof steps used.
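
The following toy Lean 4 snippet, which is not taken from the paper, illustrates the decomposition pattern: each subgoal appears as a `have` statement whose proof is left as a placeholder for the 7B prover to complete.

```lean
import Mathlib

-- Illustrative only: a toy theorem decomposed the way the pipeline decomposes targets,
-- with each subgoal stated as a `have` and left unproved via `sorry` so the 7B prover
-- model can fill in the placeholders one at a time.
theorem toy_sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := by sorry   -- subgoal 1, handed to the prover model
  have h2 : 0 ≤ b ^ 2 := by sorry   -- subgoal 2, handed to the prover model
  exact add_nonneg h1 h2            -- subgoals recombined into the complete proof
```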

The cold-start pipeline begins by prompting DeepSeek-V3 to create proof sketches in natural language. These sketches are transformed into formal theorem statements with unresolved parts. A key innovation lies in recursively solving each subgoal using the 7B prover, reducing computation costs while maintaining formal rigor. Researchers constructed a curriculum learning framework that increased the complexity of training tasks over time. They also implemented two types of subgoal theorems, one incorporating preceding subgoals as premises, and one treating them independently. This dual structure was embedded into the model’s expert iteration stage to train it on progressively more challenging problem sets. The model’s capability was then reinforced through a consistency-based reward system during training, ensuring that all decomposed lemmas were correctly incorporated into the final formal proof.

On the MiniF2F-test benchmark, the model achieved an 88.9% pass rate with high sampling (Pass@8192), compared to 82.0% for Kimina-Prover and 64.7% for Goedel-Prover. It also solved 49 of the 658 problems in PutnamBench, a benchmark of challenging competition mathematics. On the newly introduced ProverBench dataset, comprising 325 formalized problems, the model solved 6 of the 15 problems drawn from the AIME (American Invitational Mathematics Examination) competitions for 2024 and 2025. These benchmarks highlight the model’s generalization ability across multiple formal reasoning tasks. Even when compared to DeepSeek-V3, which employs natural-language reasoning, the new model demonstrates competitive performance, solving a comparable number of AIME problems while ensuring formal verifiability.

Several Key Takeaways from the Research on DeepSeek-Prover-V2:

  • DeepSeek-Prover-V2 achieved an 88.9% pass rate on the MiniF2F-test (Pass@8192), the highest reported among formal reasoning models so far. 
  • The model successfully solved 49 out of 658 problems from the PutnamBench dataset, which contains advanced mathematical challenges. 
  • It tackled 6 out of 15 problems from the recent AIME 2024–2025 competitions, showcasing real-world applicability.
  • A new benchmark, ProverBench, comprising 325 formal problems, has been introduced for evaluating formal reasoning models. 
  • The pipeline unifies natural language proof sketching and formal proof construction by combining DeepSeek-V3 and a 7B prover model.  
  • Two types of subgoal decompositions—one with and one without dependent premises—were used to train the model in a structured, curriculum-guided manner. 
  • Reinforcement learning with a consistency-based reward significantly improved proof accuracy by enforcing structural alignment between sketch and solution.  
  • The entire training strategy relies on synthetic cold-start data, eliminating dependence on manually labeled proofs.

Check out the Paper and the GitHub page.

Microsoft AI Released Phi-4-Reasoning: A 14B Parameter Open-Weight Reasoning Model that Achieves Strong Performance on Complex Reasoning Tasks

Despite notable advancements in large language models (LLMs), effective performance on reasoning-intensive tasks—such as mathematical problem solving, algorithmic planning, or coding—remains constrained by model size, training methodology, and inference-time capabilities. Models that perform well on general NLP benchmarks often lack the ability to construct multi-step reasoning chains or reflect on intermediate problem-solving states. Furthermore, while scaling up model size can improve reasoning capacity, it introduces prohibitive computational and deployment costs, especially for applied use in education, engineering, and decision-support systems.

Microsoft Releases Phi-4 Reasoning Model Suite

Microsoft recently introduced the Phi-4 reasoning family, consisting of three models—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. These models are derived from the Phi-4 base (14B parameters) and are specifically trained to handle complex reasoning tasks in mathematics, scientific domains, and software-related problem solving. Each variant addresses different trade-offs between computational efficiency and output precision. Phi-4-reasoning is optimized via supervised fine-tuning, while Phi-4-reasoning-plus extends this with outcome-based reinforcement learning, particularly targeting improved performance in high-variance tasks such as competition-level mathematics.

The open-weight models were released with transparent training details and evaluation logs, including benchmark design, and are hosted on Hugging Face for reproducibility and public access.

Technical Composition and Methodological Advances

The Phi-4-reasoning models build upon the Phi-4 architecture with targeted improvements to model behavior and training regime. Key methodological decisions include:

  • Structured Supervised Fine-Tuning (SFT): Over 1.4M prompts were curated with a focus on “boundary” cases—problems at the edge of Phi-4’s baseline capabilities. Prompts were sourced and filtered to emphasize multi-step reasoning rather than factual recall, and responses were synthetically generated using o3-mini in high-reasoning mode.
  • Chain-of-Thought Format: To facilitate structured reasoning, models were trained to generate output using explicit <think> tags, encouraging separation between reasoning traces and final answers (a minimal parsing sketch follows this list).
  • Extended Context Handling: The RoPE base frequency was modified to support a 32K token context window, allowing for deeper solution traces, particularly relevant in multi-turn or long-form question formats.
  • Reinforcement Learning (Phi-4-reasoning-plus): Using Group Relative Policy Optimization (GRPO), Phi-4-reasoning-plus was further refined on a small curated set of ∼6,400 math-focused problems. A reward function was crafted to favor correct, concise, and well-structured outputs, while penalizing verbosity, repetition, and format violations (a schematic reward sketch appears below).
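
To make the output format concrete, here is a minimal parsing sketch that assumes only the <think> tag convention described above; the function name and the expectation of a single reasoning block are illustrative and are not code from the Phi-4 release.

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning trace from the final answer.

    Assumes the model emits at most one <think> block followed by the answer,
    per the chain-of-thought format described above.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No explicit trace: treat the whole output as the answer.
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 = 4, so the parity is even.</think> The answer is even."
)
print(answer)  # "The answer is even."
```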

This data-centric and format-aware training regime supports better inference-time utilization and model generalization across domains, including unseen symbolic reasoning problems.
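
The sketch below illustrates, in spirit only, the kind of outcome-based reward described for Phi-4-reasoning-plus: correctness dominates, while verbosity, repetition, and format violations are penalized. The weights, length budget, and heuristics are assumptions made for illustration, not the reward used in Microsoft's GRPO training.

```python
import re

def outcome_reward(completion: str, reference_answer: str,
                   max_tokens: int = 4096) -> float:
    """Hypothetical reward in the spirit of the description above: favor
    correct, concise, well-formatted outputs; penalize verbosity, repetition,
    and format violations. Weights and thresholds are illustrative only."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = completion[match.end():].strip() if match else completion.strip()

    # Correctness dominates the signal.
    reward = 1.0 if answer == reference_answer.strip() else -1.0

    # Format violation: missing explicit reasoning trace.
    if not reasoning:
        reward -= 0.5

    # Verbosity: soft penalty past a length budget (whitespace tokens).
    n_tokens = len(completion.split())
    if n_tokens > max_tokens:
        reward -= 0.1 * (n_tokens - max_tokens) / max_tokens

    # Repetition: identical sentences repeated in the reasoning trace.
    sentences = [s.strip() for s in reasoning.split(".") if s.strip()]
    if len(sentences) != len(set(sentences)):
        reward -= 0.2

    return reward
```

In GRPO, per-sample rewards of this kind are normalized within the group of completions sampled for the same prompt, and the normalized values act as advantages for the policy update.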

Evaluation and Comparative Performance

Across a broad range of reasoning benchmarks, Phi-4-reasoning and Phi-4-reasoning-plus deliver competitive results relative to significantly larger open-weight models.

Phi-4-reasoning-plus not only shows strong performance on domain-specific evaluations but also generalizes well to planning and combinatorial problems such as TSP and 3SAT, despite no explicit training in these areas. Performance gains were also observed in instruction-following (IFEval) and long-context QA (FlenQA), suggesting the chain-of-thought formulation improves broader model utility.

Importantly, Microsoft reports full variance distributions across 50+ generation runs for variance-sensitive datasets like AIME 2025, revealing that Phi-4-reasoning-plus matches or exceeds the performance consistency of models like o3-mini, while remaining disjoint from smaller baseline distributions like DeepSeek-R1-Distill.

Conclusion and Implications

The Phi-4 reasoning models represent a methodologically rigorous effort to advance small model capabilities in structured reasoning. By combining data-centric training, architectural tuning, and minimal but well-targeted reinforcement learning, Microsoft demonstrates that 14B-scale models can match or outperform much larger systems in tasks requiring multi-step inference and generalization.

The models’ open weight availability and transparent benchmarking set a precedent for future development in small LLMs, particularly for applied domains where interpretability, cost, and reliability are paramount. Future work is expected to extend the reasoning capabilities into additional STEM fields, improve decoding strategies, and explore scalable reinforcement learning on longer horizons.


Check out the Paper, HuggingFace Page, and Microsoft Blog.

The post Microsoft AI Released Phi-4-Reasoning: A 14B Parameter Open-Weight Reasoning Model that Achieves Strong Performance on Complex Reasoning Tasks appeared first on MarkTechPost.

Meta AI Introduces ReasonIR-8B: A Reasoning-Focused Retriever Optimized for Efficiency and RAG Performance https://www.marktechpost.com/2025/04/30/meta-ai-introduces-reasonir-8b-a-reasoning-focused-retriever-optimized-for-efficiency-and-rag-performance/ https://www.marktechpost.com/2025/04/30/meta-ai-introduces-reasonir-8b-a-reasoning-focused-retriever-optimized-for-efficiency-and-rag-performance/#respond Thu, 01 May 2025 06:21:30 +0000 https://www.marktechpost.com/?p=70990 Addressing the Challenges in Reasoning-Intensive Retrieval Despite notable progress in retrieval-augmented generation (RAG) systems, retrieving relevant information for complex, multi-step reasoning tasks remains a significant challenge. Most retrievers today are trained on datasets composed of short factual questions, which align well with document-level lexical or semantic overlaps. However, they fall short when faced with longer, […]

Addressing the Challenges in Reasoning-Intensive Retrieval

Despite notable progress in retrieval-augmented generation (RAG) systems, retrieving relevant information for complex, multi-step reasoning tasks remains a significant challenge. Most retrievers today are trained on datasets composed of short factual questions, which align well with document-level lexical or semantic overlaps. However, they fall short when faced with longer, abstract, or cross-domain queries that require synthesizing dispersed knowledge. In such cases, retrieval errors can propagate through the pipeline, impairing downstream reasoning by large language models (LLMs). While LLM-based rerankers can improve relevance, their substantial computational cost often renders them impractical in real-world deployments.

Meta AI Introduces ReasonIR-8B, a Retriever Built for Reasoning

Meta AI has released ReasonIR-8B, a retriever model designed explicitly for reasoning-intensive information retrieval. Trained from LLaMA3.1-8B, the model establishes new performance standards on the BRIGHT benchmark, achieving a normalized Discounted Cumulative Gain (nDCG@10) of 36.9 when used with a lightweight Qwen2.5 reranker. Notably, it surpasses leading reranking models such as Rank1-32B while offering 200× lower inference-time compute, making it significantly more practical for scaled RAG applications.
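
For reference, nDCG@10 discounts each retrieved document's graded relevance by its rank position and normalizes by the best achievable ordering; the scores quoted here are on a 0–100 scale. Below is a standard, minimal implementation of the metric (not code from the ReasonIR release).

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """nDCG@k for one query. `relevances` are graded relevance labels of the
    retrieved documents in ranked order (position 0 = top hit)."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: relevant documents returned at ranks 1 and 4.
print(round(ndcg_at_k([1, 0, 0, 1, 0], k=10), 3))
```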

ReasonIR-8B is trained using a novel data generation pipeline, ReasonIR-SYNTHESIZER, which constructs synthetic queries and document pairs that mirror the challenges posed by real-world reasoning tasks. The model is released open-source on Hugging Face, along with training code and synthetic data tools, enabling further research and reproducibility.

Model Architecture, Training Pipeline, and Key Innovations

ReasonIR-8B employs a bi-encoder architecture, where queries and documents are encoded independently into embeddings and scored via cosine similarity. The model’s training relies heavily on synthetically generated data tailored to reasoning scenarios. The ReasonIR-SYNTHESIZER pipeline produces two primary types of training instances:

  • Varied-Length (VL) Queries: These are long, information-rich queries (up to 2000 tokens), paired with corresponding documents, encouraging the retriever to handle extended contexts effectively.
  • Hard Queries (HQ): Derived from curated documents with high educational value, these queries are designed to require logical inference. Multi-turn prompts are used to construct hard negatives—documents that appear superficially relevant but do not contain the necessary reasoning pathways.

This approach contrasts with conventional negative sampling methods, which often rely on lexical overlap and are less effective for abstract or multi-hop questions.
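
To make the bi-encoder setup and the role of hard negatives concrete, the sketch below shows cosine-similarity scoring with an in-batch contrastive (InfoNCE-style) objective that also includes one mined hard negative per query. The temperature, batch layout, and function names are illustrative assumptions, not the ReasonIR training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     pos_doc_emb: torch.Tensor,
                     hard_neg_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Schematic bi-encoder objective: each query should score its positive
    document above both in-batch negatives and curated hard negatives.

    Shapes: query_emb and pos_doc_emb are (B, D); hard_neg_emb is (B, D),
    one mined hard negative per query. All hyperparameters are placeholders.
    """
    q = F.normalize(query_emb, dim=-1)        # cosine similarity via
    d_pos = F.normalize(pos_doc_emb, dim=-1)  # normalized dot products
    d_neg = F.normalize(hard_neg_emb, dim=-1)

    # (B, B) in-batch similarities plus (B, B) hard-negative similarities.
    sim_pos = q @ d_pos.T / temperature
    sim_neg = q @ d_neg.T / temperature
    logits = torch.cat([sim_pos, sim_neg], dim=1)   # (B, 2B)

    # The positive for query i is column i of the first block.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

At inference time, retrieval then reduces to encoding each query once and ranking documents by the same cosine similarity.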

Additionally, the model’s attention mask is modified from LLaMA’s causal configuration to a bi-directional one, allowing the encoder to consider the full query context symmetrically, which is beneficial for non-sequential semantic alignment.
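
The causal-to-bidirectional change amounts to swapping a lower-triangular attention mask for an all-ones mask, so every query token can attend to every other token. A minimal illustration, independent of any particular model implementation:

```python
import torch

seq_len = 6

# Causal (decoder-style) mask: token i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Bi-directional (encoder-style) mask: every token attends to every position,
# so the resulting embedding reflects the full query context symmetrically.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```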

Empirical Results on IR and RAG Benchmarks

ReasonIR-8B achieves strong performance across several benchmarks:

  • BRIGHT Benchmark (Reasoning-Intensive Retrieval):
    • 24.4 nDCG@10 on original queries
    • 29.9 with GPT-4 rewritten queries
    • 36.9 with Qwen2.5 reranking, outperforming larger LLM rerankers at a fraction of the cost
  • Retrieval-Augmented Generation (RAG) Tasks:
    • +6.4% improvement on MMLU over a closed-book baseline
    • +22.6% improvement on GPQA

These gains are consistent across both standard and rewritten queries, with further improvements observed when combining ReasonIR-8B with a sparse retriever like BM25 or a lightweight reranker.
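
One common way to combine a dense retriever with a sparse retriever such as BM25 is score interpolation after per-query normalization; the fusion scheme and weight below are illustrative assumptions rather than the combination used in the ReasonIR paper.

```python
def fuse_scores(dense: dict[str, float], bm25: dict[str, float],
                alpha: float = 0.5) -> dict[str, float]:
    """Interpolate min-max-normalized dense and BM25 scores per document.
    `alpha` is an illustrative weight, not a value from the ReasonIR paper."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, b = normalize(dense), normalize(bm25)
    docs = set(d) | set(b)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in docs}

fused = fuse_scores({"doc1": 0.82, "doc2": 0.40}, {"doc1": 11.2, "doc3": 9.5})
print(sorted(fused, key=fused.get, reverse=True))  # "doc1" ranks first
```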

Importantly, the model continues to improve as query lengths scale, unlike other retrievers whose performance plateaus or declines. This suggests that ReasonIR-8B can better exploit information-rich queries, making it particularly well-suited for test-time techniques such as query rewriting.

Conclusion

ReasonIR-8B addresses a key bottleneck in reasoning-focused information retrieval by introducing a retriever optimized not only for relevance but also for computational efficiency. Its design—rooted in synthetic training tailored for reasoning, coupled with architectural and data-centric improvements—enables consistent gains in both retrieval and RAG tasks.

By releasing the model, codebase, and training data generation pipeline as open-source tools, Meta AI encourages the research community to extend this work toward more robust, multilingual, and multimodal retrievers. For applications requiring cost-effective and high-quality retrieval under reasoning constraints, ReasonIR-8B represents a compelling and practical solution.


Check out the Paper, HuggingFace Page, and GitHub Page.

The post Meta AI Introduces ReasonIR-8B: A Reasoning-Focused Retriever Optimized for Efficiency and RAG Performance appeared first on MarkTechPost.
