Large Language Model Category - MarkTechPost

LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
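
To make the fusion step concrete, here is a minimal sketch of a gated fusion layer of the kind described above. The module name, layer sizes, and gating formula are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse LLM hidden states with text embeddings via a learned gate.

    Illustrative sketch only; names and dimensions are assumptions,
    not the LLaMA-Omni2 implementation.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, llm_hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Gate in [0, 1] decides, per dimension, how much of each stream to pass on.
        gate = torch.sigmoid(self.gate_proj(torch.cat([llm_hidden, text_emb], dim=-1)))
        return gate * llm_hidden + (1.0 - gate) * text_emb
```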

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
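
The read-write schedule can be illustrated with a toy loop that alternates between taking R text tokens from the LLM and emitting W speech tokens from the TTS decoder. The callback names below are placeholders for the two generators, not the paper's API.

```python
def interleaved_stream(next_text_token, next_speech_tokens, r=3, w=10, max_steps=100):
    """Toy read-write scheduler: after every `r` text tokens from the LLM,
    emit `w` speech tokens from the TTS decoder. Callbacks are placeholders."""
    text, speech = [], []
    for _ in range(max_steps):
        # "Read" phase: take up to r new text tokens from the LLM.
        chunk = [next_text_token() for _ in range(r)]
        text.extend(t for t in chunk if t is not None)
        # "Write" phase: emit w speech tokens conditioned on the text so far.
        speech.extend(next_speech_tokens(text, w))
        if None in chunk:  # LLM signalled end of sequence
            break
    return text, speech
```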

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

| Model | Llama Q (S2S) | Web Q (S2S) | GPT-4o Score | ASR-WER | Latency (ms) |
|---|---|---|---|---|---|
| GLM-4-Voice (9B) | 50.7 | 15.9 | 4.09 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 49.0 | 23.7 | 3.52 | 3.67 | 346.7 |
| LLaMA-Omni2-7B | 60.7 | 31.3 | 4.15 | 3.26 | 582.9 |

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


Check out the Paper, Model on Hugging Face and GitHub Page.

How AI Agents Store, Forget, and Retrieve? A Fresh Look at Memory Operations for the Next-Gen LLMs

Memory plays a crucial role in LLM-based AI systems, supporting sustained, coherent interactions over time. While earlier surveys have explored memory in LLM-based systems, they often pay little attention to the fundamental operations that govern memory functions. Key components like memory storage, retrieval, and memory-grounded generation have been studied in isolation, but a unified framework that systematically integrates these processes remains underdeveloped. Although a few recent efforts have proposed operational views of memory to categorize existing work, the field still lacks cohesive memory architectures that clearly define how these atomic operations interact.

Furthermore, existing surveys tend to address only specific subtopics within the broader memory landscape, such as long-context handling, long-term memory, personalization, or knowledge editing. These fragmented approaches often miss essential operations like indexing and fail to offer comprehensive overviews of memory dynamics. Additionally, most prior work does not establish a clear research scope or provide structured benchmarks and tool coverage, limiting their practical value for guiding future advancements in memory for AI systems. 

Researchers from the Chinese University, the University of Edinburgh, HKUST, and the Poisson Lab at Huawei UK R&D Ltd. present a detailed survey on memory in AI systems. They classify memory into parametric, contextual-structured, and contextual-unstructured types, distinguishing between short-term and long-term memory inspired by cognitive psychology. Six fundamental operations—consolidation, updating, indexing, forgetting, retrieval, and compression—are defined and mapped to key research areas, including long-term memory, long-context modeling, parametric modification, and multi-source integration. Based on an analysis of over 30,000 papers using the Relative Citation Index, the survey also outlines tools, benchmarks, and future directions. 

The researchers first develop a three‐part taxonomy of AI memory—parametric (model weights), contextual‐structured (e.g., indexed dialogue histories), and contextual‐unstructured (raw text or embeddings)—and distinguish short‐ versus long‐term spans. They then define six core memory operations: consolidation (storing new information), updating (modifying existing entries), indexing (organizing for fast access), forgetting (removing stale data), retrieval (fetching relevant content), and compression (distilling memories). To ground this framework, they mined over 30,000 top‐tier AI papers (2022–2025), ranked them by Relative Citation Index, and clustered high‐impact works into four themes—long‐term memory, long‐context modeling, parametric editing, and multi‐source integration—thereby mapping each operation and memory type to active research areas and highlighting key benchmarks and tools. 
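
To see how the six operations fit together, the following toy class sketches one possible interface. The flat list-of-strings store and the naive method bodies are simplifications for illustration, not the survey's formal definitions.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy illustration of the survey's six memory operations."""
    entries: list = field(default_factory=list)

    def consolidate(self, item: str) -> None:          # store new information
        self.entries.append(item)

    def update(self, index: int, item: str) -> None:   # modify an existing entry
        self.entries[index] = item

    def index(self) -> dict:                           # organize for fast access
        return {i: e for i, e in enumerate(self.entries)}

    def forget(self, index: int) -> None:              # remove stale data
        self.entries.pop(index)

    def retrieve(self, query: str, k: int = 3) -> list:  # fetch relevant content
        return [e for e in self.entries if query.lower() in e.lower()][:k]

    def compress(self, max_chars: int = 80) -> None:     # distill memories
        self.entries = [e[:max_chars] for e in self.entries]
```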

The study describes a layered ecosystem of memory-centric AI systems that support long-term context management, user modeling, knowledge retention, and adaptive behavior. This ecosystem is structured across four tiers: foundational components (such as vector stores, large language models like Llama and GPT-4, and retrieval mechanisms like FAISS and BM25), frameworks for memory operations (e.g., LangChain and LlamaIndex), memory layer systems for orchestration and persistence (such as Memary and Memobase), and end-user-facing products (including Me.bot and ChatGPT). These tools provide infrastructure for memory integration, enabling capabilities like grounding, similarity search, long-context understanding, and personalized AI interactions.
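
As a concrete example of the foundational retrieval tier, the snippet below builds a small FAISS index and runs a nearest-neighbor lookup. The embedding dimension and random vectors are stand-ins for real embeddings of stored memories.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                                   # embedding dimension (assumption)
corpus_vecs = np.random.rand(1000, d).astype("float32")   # stand-in for real memory embeddings

index = faiss.IndexFlatL2(d)      # exact L2 search over the memory store
index.add(corpus_vecs)

query_vec = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query_vec, 5)   # top-5 nearest memories
print(ids[0])                                  # indices of retrieved entries
```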

The survey also discusses open challenges and future research directions in AI memory. It highlights the importance of spatio-temporal memory, which balances historical context with real-time updates for adaptive reasoning. Key challenges include parametric memory retrieval, lifelong learning, and efficient knowledge management across memory types. Additionally, the paper draws inspiration from biological memory models, emphasizing dual-memory architectures and hierarchical memory structures. Future work should focus on unifying memory representations, supporting multi-agent memory systems, and addressing security concerns, particularly memory safety and malicious attacks in machine learning techniques. 


Check out the Paper.

RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Methods such as linear attention, state space models like Mamba, linear RNNs like DeltaNet, and RWKV address this problem. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but experiences rapid performance degradation beyond this point. Even with continual pretraining using 128K-length data, long-context limitations persist. This issue extends beyond RWKV to other architectures like Mamba, representing a fundamental challenge for this class of models.

Linear complexity language models have emerged as alternatives to transformer-based architectures that suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer parallelizability during training with RNN-like recurrent state representation. RWKV has evolved through multiple iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each blend attention with linear-complexity components in different ways. Further, Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention approaches include SeerAttention and Block Attention (MoBA).

Researchers from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and zero-initialization mechanism inspired by LLaMA Pro. The training follows a two-stage process:

  • First, the model trains on short 1024-token contexts from the MiniPile dataset while freezing all parameters except the newly added blocks. 
  • The second stage involves long-context continual pretraining using the ProLong-64K dataset and a context length of 64K tokens, processing approximately 1 billion tokens total. During this phase, all parameters are unfrozen and jointly optimized. Training employs a Long-context Cross-Entropy (LongCE) loss that dynamically weights tokens based on their importance (see the sketch below).
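
The snippet below sketches a token-weighted cross-entropy of the kind LongCE represents. The actual weighting scheme is defined in the RWKV-X paper; here `token_weights` is an externally supplied importance score per token, so this is an illustrative assumption rather than the paper's loss.

```python
import torch
import torch.nn.functional as F

def weighted_token_ce(logits, targets, token_weights):
    """Token-weighted cross-entropy in the spirit of LongCE (illustrative only).

    logits:        (batch, seq_len, vocab)
    targets:       (batch, seq_len)
    token_weights: (batch, seq_len), larger = more important
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * token_weights).sum() / token_weights.sum()
```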

The Short-context evaluation reveals that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37 times speedup over Flash-Attention v3, with this advantage expanding as context length increases.

In summary, RWKV-X is a hybrid language model that combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, employs a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering efforts are needed to optimize performance.


Check out the Paper.

Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse tasks, with Reinforcement Learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have shown particular success in mathematical reasoning and coding domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulties in ensuring cross-domain generalisation.

Evolution of Reasoning in LLMs

The development of Chain-of-Thought (CoT) methodology marked a significant advancement in LLM reasoning capabilities. CoT has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning processes before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.

While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior research works suggest that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, systematic investigation into how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, impacts RL training effectiveness still represents a significant research gap.

Challenges in Diversifying Reasoning Domains

Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of various sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains lacking deterministic solutions. Domain-specific reasoning processes—whether rule-based and symbolic in mathematics or contextual and heuristic in fields like law and history—require different cognitive approaches. In addition to that, question formats (open-ended versus multiple-choice) demand distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly enhance LLMs’ broad cognitive capabilities.

Nemotron-CrossThink: A Multi-Domain Approach

Researchers from NVIDIA, Carnegie Mellon University, and Boston University introduce Nemotron-CrossThink, representing a systematic framework for incorporating multi-domain corpora into RL training to enhance cross-task generalisation. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs across STEM, humanities, law, and social sciences. By applying templated formats (MCQ/Open-Ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.

Key Results and Innovations

Nemotron-CrossThink significantly enhances LLM reasoning capabilities by integrating multi-domain data with different question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies—generating concise answers for general-purpose questions while providing detailed responses for mathematical problems—thereby optimising inference costs while maintaining task-specific precision.

The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer space diversity. It also provides an efficient filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more challenging samples amplifies RL impact across all domains. These innovations have led to substantial performance gains in both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%).

Comprehensive Data Curation

Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data from CommonCrawl and publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesised QA pairs spanning STEM fields, economics, social sciences, and humanities, while mathematical reasoning incorporates datasets like MATH and Numina-Math alongside synthetically generated problems.

Template Application and Data Filtering

To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer formats: Multiple Choice Questions (MCQ) and Open-Ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based reward functions, discarding MCQs where correct answers aren’t among the choices and open-ended responses exceeding ten words.
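
A minimal version of this filtering logic might look like the following; the field names (`format`, `choices`, `answer`) are assumptions for illustration, not the paper's schema.

```python
def filter_sample(sample: dict) -> bool:
    """Keep only samples that a rule-based reward can verify."""
    if sample["format"] == "mcq":
        # The gold answer must be one of the listed choices.
        return sample["answer"] in sample["choices"]
    if sample["format"] == "open_ended":
        # Long free-form answers are too hard to score with simple rules.
        return len(sample["answer"].split()) <= 10
    return False

samples = [
    {"format": "mcq", "question": "2+2?", "choices": ["3", "4"], "answer": "4"},
    {"format": "open_ended", "question": "Capital of France?", "answer": "Paris"},
    {"format": "open_ended", "question": "Explain WWII.", "answer": "A very long essay " * 20},
]
kept = [s for s in samples if filter_sample(s)]
print(len(kept))  # 2
```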

Strategic Data Blending and Reinforcement Learning

Nemotron-CrossThink employs Group Relative Policy Optimisation (GRPO) for reinforcement learning, which improves efficiency by estimating baselines from group scores rather than using a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data usefulness through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.
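
At the heart of GRPO is the group-relative advantage: each sampled response is scored against the statistics of its own group rather than against a learned value function. A minimal sketch of that step (omitting the clipped policy objective and KL regularization) follows.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each response's reward by the mean
    and standard deviation of its own sampled group (no critic model)."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four responses sampled for the same prompt, scored by a verifiable reward.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```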

Technical Contributions

The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:

  1. Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
  2. Strategic data-blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
  3. Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B.

These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional focus on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.

Experiments and Results

Experimental results demonstrate that different datasets significantly impact model performance across reasoning benchmarks. NuminaMath produced the highest overall average, outperforming the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answering data improved performance by approximately 1.0%, showing strong accuracy in MMLU-PRO, AGIEVAL, and MATH-500 tasks, confirming that synthetically generated instruction-style data can effectively generalize when aligned with benchmark distributions.

The Nemotron-CrossThink approach consistently outperformed the base model across various blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding OPEN-REASONER-ZERO by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-PRO, +15.12% on AGIEVAL). Though Bonly_math performed slightly better on strictly mathematical tasks, it lagged on non-mathematical reasoning benchmarks, demonstrating Bgpr↑’s superior versatility through strong cross-domain transfer.

Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data showed transferability to structured reasoning tasks, while general-purpose data proved less effective in isolation. This counterintuitive finding confirms that optimal general-purpose reasoning performance requires including mathematical problems in training blends.

Conclusion

Nemotron-CrossThink introduces a scalable framework that enhances LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research demonstrates that data diversity, not merely volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.


Check out the Paper and Project Page.

Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

RAG has proven effective in enhancing the factual accuracy of LLMs by grounding their outputs in external, relevant information. However, most existing RAG implementations are limited to text-based corpora, which restricts their applicability to real-world scenarios where queries may require diverse types of information, ranging from textual definitions to spatial understanding from images or temporal reasoning from videos. While some recent approaches have extended RAG to handle different modalities like images and videos, these systems are often constrained to operate within a single modality-specific corpus. This limits their ability to effectively respond to a wide spectrum of user queries that demand multimodal reasoning. Moreover, current RAG methods usually retrieve from all modalities without discerning which is most relevant for a given query, making the process inefficient and less adaptive to specific information needs.

To address this, recent research emphasizes the need for adaptive RAG systems to determine the appropriate modality and retrieval granularity based on the query context. Strategies include routing queries based on complexity, such as deciding between no retrieval, single-step, or multi-step retrieval, and using model confidence to trigger retrieval only when needed. Furthermore, the granularity of retrieval plays a crucial role, as studies have shown that indexing corpora at finer levels, like propositions or specific video clips, can significantly improve retrieval relevance and system performance. Hence, for RAG to truly support complex, real-world information needs, it must handle multiple modalities and adapt its retrieval depth and scope to the specific demands of each query. 

Researchers from KAIST and DeepAuto.ai introduce UniversalRAG, a RAG framework that retrieves and integrates knowledge from various modality-specific sources (text, image, video) and multiple granularity levels. Unlike traditional approaches that embed all modalities into a shared space, leading to modality bias, UniversalRAG uses a modality-aware routing mechanism to select the most relevant corpus dynamically based on the query. It further enhances retrieval precision by organizing each modality into granularity-specific corpora, such as paragraphs or video clips. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms unified and modality-specific baselines, demonstrating its adaptability to diverse query needs. 

UniversalRAG is a retrieval-augmented generation framework that handles queries across various modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels. A routing module first determines the optimal modality and granularity for a given query, choosing among options like paragraphs, full documents, video clips, or full video, and retrieves relevant information accordingly. This router can be either a training-free LLM-based classifier or a trained model using heuristic labels from benchmark datasets. An LVLM then uses the selected content to generate the final response. 
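
The routing step can be approximated with a training-free, prompt-based classifier like the sketch below. The prompt wording, route labels, and the `llm_complete`, `retrievers`, and `generate` callables are illustrative assumptions, not UniversalRAG's actual components.

```python
ROUTES = ["none", "paragraph", "document", "image", "clip", "video"]

ROUTER_PROMPT = """You are a retrieval router. Given a user query, answer with exactly one of:
none, paragraph, document, image, clip, video.
Query: {query}
Route:"""

def route_query(query: str, llm_complete) -> str:
    """Training-free, LLM-based router in the spirit of UniversalRAG."""
    raw = llm_complete(ROUTER_PROMPT.format(query=query)).strip().lower()
    return raw if raw in ROUTES else "paragraph"  # safe default on parse failure

def answer(query: str, llm_complete, retrievers: dict, generate) -> str:
    route = route_query(query, llm_complete)
    context = "" if route == "none" else retrievers[route](query)
    return generate(query, context)
```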

The experimental setup assesses UniversalRAG across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. For no-retrieval, MMLU tests general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA handles multi-hop document retrieval. Image-based queries come from WebQA, and video-related ones are sourced from LVBench and VideoRAG datasets, split into clip- and full-video levels. Corresponding retrieval corpora are curated for each modality—Wikipedia-based for text, WebQA for images, and YouTube videos for video tasks. This comprehensive benchmark ensures robust evaluation across varied modalities and retrieval granularities.


In conclusion, UniversalRAG is a Retrieval-Augmented Generation framework that can retrieve knowledge from multiple modalities and levels of granularity. Unlike existing RAG methods that rely on a single, often text-only, corpus or a single-modality source, UniversalRAG dynamically routes queries to the most appropriate modality- and granularity-specific corpus. This approach addresses issues like modality gaps and rigid retrieval structures. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms both unified and modality-specific baselines. The study also emphasizes the benefits of fine-grained retrieval and highlights how both trained and train-free routing mechanisms contribute to robust, flexible multimodal reasoning. 


Check out the Paper.

Google Researchers Advance Diagnostic AI: AMIE Now Matches or Outperforms Primary Care Physicians Using Multimodal Reasoning with Gemini 2.0 Flash

LLMs have shown impressive promise in conducting diagnostic conversations, particularly through text-based interactions. However, their evaluation and application have largely ignored the multimodal nature of real-world clinical settings, especially in remote care delivery, where images, lab reports, and other medical data are routinely shared through messaging platforms. While systems like the Articulate Medical Intelligence Explorer (AMIE) have matched or surpassed primary care physicians in text-only consultations, this format falls short of reflecting telemedicine environments. Multimodal communication is essential in modern care, as patients often share photographs, documents, and other visual artifacts that cannot be fully conveyed through text alone. Limiting AI systems to textual inputs risks omitting critical clinical information, increasing diagnostic errors, and creating accessibility barriers for patients with lower health or digital literacy. Despite the widespread use of multimedia messaging apps in global healthcare, there has been little research into how LLMs can reason over such diverse data during diagnostic interactions.

Research in diagnostic conversational agents began with rule-based systems like MYCIN, but recent developments have focused on LLMs capable of emulating clinical reasoning. While multimodal AI systems, such as vision-language models, have demonstrated success in radiology and dermatology, integrating these capabilities into conversational diagnostics remains challenging. Effective AI-based diagnostic tools must handle the complexity of multimodal reasoning and uncertainty-driven information gathering, a step beyond merely answering isolated questions. Evaluation frameworks like OSCEs and platforms such as AgentClinic provide useful starting points, yet tailored metrics are still needed to assess performance in multimodal diagnostic contexts. Moreover, while messaging apps are increasingly used in low-resource settings for sharing clinical data, concerns about data privacy, integration with formal health systems, and policy compliance persist. 

Google DeepMind and Google Research have enhanced AMIE with multimodal capabilities for improved conversational diagnosis and management. Using Gemini 2.0 Flash, AMIE employs a state-aware dialogue framework that adapts conversation flow based on patient state and diagnostic uncertainty, allowing strategic, structured history-taking with multimodal inputs like skin images, ECGs, and documents. AMIE outperformed or matched primary care physicians in a randomized OSCE-style study with 105 scenarios and 25 patient actors across 29 of 32 clinical metrics and 7 of 9 multimodal-specific criteria, demonstrating strong diagnostic accuracy, reasoning, communication, and empathy.

The study enhances the AMIE diagnostic system by incorporating multimodal perception and a state-aware dialogue framework that guides conversations through phases of history taking, diagnosis, and follow-up. Gemini 2.0 Flash powers the system and dynamically adapts based on evolving patient data, including text, images, and clinical documents. A structured patient profile and differential diagnosis are updated throughout the interaction, with targeted questions and multimodal data requests guiding clinical reasoning. Evaluation includes automated perception tests on isolated artifacts, simulated dialogues rated by auto-evaluators, and expert OSCE-style assessments, ensuring robust diagnostic performance and clinical realism. 
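
As a rough illustration of a state-aware dialogue controller, the sketch below tracks a phase, a patient profile, and a differential diagnosis across turns. The phase-transition heuristics and the `llm` callable are placeholders; AMIE's actual control logic is more sophisticated and is described in the paper.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Phase(Enum):
    HISTORY_TAKING = auto()
    DIAGNOSIS = auto()
    FOLLOW_UP = auto()

@dataclass
class DialogueState:
    phase: Phase = Phase.HISTORY_TAKING
    patient_profile: dict = field(default_factory=dict)
    differential: list = field(default_factory=list)

def step(state: DialogueState, patient_turn: str, artifacts: list, llm) -> str:
    """One turn of a state-aware diagnostic dialogue (illustrative only)."""
    state.patient_profile[f"turn_{len(state.patient_profile)}"] = patient_turn
    reply = llm(phase=state.phase.name, profile=state.patient_profile,
                differential=state.differential, artifacts=artifacts)
    # Placeholder transitions: advance once enough history has been gathered.
    if state.phase is Phase.HISTORY_TAKING and len(state.patient_profile) >= 5:
        state.phase = Phase.DIAGNOSIS
    elif state.phase is Phase.DIAGNOSIS and state.differential:
        state.phase = Phase.FOLLOW_UP
    return reply
```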

The results show that the multimodal AMIE system performs on par with or better than primary care physicians (PCPs) across multiple clinical tasks in simulated text-chat consultations. In OSCE-style assessments, AMIE consistently outperformed PCPs in diagnostic accuracy, especially when interpreting multimodal data such as images and clinical documents. It also demonstrated greater robustness when image quality was poor and showed fewer hallucinations. Patient actors rated AMIE’s communication skills highly, including empathy and trust. Automated evaluations confirmed that AMIE’s advanced reasoning framework, built on the Gemini 2.0 Flash model, significantly improved diagnosis and conversation quality, validating its design and effectiveness in real-world clinical scenarios.

In conclusion, the study advances conversational diagnostic AI by enhancing AMIE to integrate multimodal reasoning within patient dialogues. Using a novel state-aware inference-time strategy with Gemini 2.0 Flash, AMIE can interpret and reason about medical artifacts like images or ECGs in real-time clinical conversations. Evaluated through a multimodal OSCE framework, AMIE outperformed or matched primary care physicians in diagnostic accuracy, empathy, and artifact interpretation, even in complex cases. Despite limitations tied to chat-based interfaces and the need for real-world testing, these findings highlight AMIE’s potential as a robust, context-aware diagnostic assistant for future telehealth applications. 


Check out the Paper and Technical details.

IBM AI Releases Granite 4.0 Tiny Preview: A Compact Open-Language Model Optimized for Long-Context and Instruction Tasks

IBM has introduced a preview of Granite 4.0 Tiny, the smallest member of its upcoming Granite 4.0 family of language models. Released under the Apache 2.0 license, this compact model is designed for long-context tasks and instruction-following scenarios, striking a balance between efficiency, transparency, and performance. The release reflects IBM’s continued focus on delivering open, auditable, and enterprise-ready foundation models.

Granite 4.0 Tiny Preview includes two key variants: the Base-Preview, which showcases a novel decoder-only architecture, and the Tiny-Preview (Instruct), which is fine-tuned for dialog and multilingual applications. Despite its reduced parameter footprint, Granite 4.0 Tiny demonstrates competitive results on reasoning and generation benchmarks—underscoring the benefits of its hybrid design.

Architecture Overview: A Hybrid MoE with Mamba-2-Style Dynamics

At the core of Granite 4.0 Tiny lies a hybrid Mixture-of-Experts (MoE) structure, with 7 billion total parameters and only 1 billion active parameters per forward pass. This sparsity allows the model to deliver scalable performance while significantly reducing computational overhead—making it well-suited for resource-constrained environments and edge inference.
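
The sparsity pattern described here (many total parameters, only a few active per token) is the standard top-k MoE routing idea. The sketch below uses made-up sizes and expert counts, not Granite 4.0's actual configuration.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Minimal top-k MoE layer: only k experts run per token.

    Illustrative sketch; dimensions and expert counts are assumptions."""
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts fire per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```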

The Base-Preview variant employs a decoder-only architecture augmented with Mamba-2-style layers—a linear recurrent alternative to traditional attention mechanisms. This architectural shift enables the model to scale more efficiently with input length, enhancing its suitability for long-context tasks such as document understanding, dialogue summarization, and knowledge-intensive QA.
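
A Mamba-2-style layer replaces attention with a gated linear recurrence over the sequence. The toy scan below shows the basic recurrence and why the cost grows linearly with length; it omits the selective state-space parameterization and the parallel scan used in practice, so treat it as a conceptual sketch only.

```python
import torch

def linear_recurrent_scan(x, a, b):
    """Sequential form of a gated linear recurrence: h_t = a_t * h_{t-1} + b_t * x_t.

    Constant work per token and a fixed-size state are what make this family
    attractive for long inputs. Shapes: x, a, b are (seq_len, dim); a_t in (0, 1).
    """
    h = torch.zeros(x.size(-1))
    states = []
    for t in range(x.size(0)):
        h = a[t] * h + b[t] * x[t]   # constant work per token
        states.append(h)
    return torch.stack(states)       # (seq_len, dim)

seq_len, dim = 16, 8
x = torch.randn(seq_len, dim)
a = torch.sigmoid(torch.randn(seq_len, dim))  # decay gates
b = torch.randn(seq_len, dim)
print(linear_recurrent_scan(x, a, b).shape)   # torch.Size([16, 8])
```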

Another notable design decision is the use of NoPE (No Positional Encodings). Instead of fixed or learned positional embeddings, the model integrates position handling directly into its layer dynamics. This approach improves generalization across varying input lengths and helps maintain consistency in long-sequence generation.

Benchmark Performance: Efficiency Without Compromise

Despite being a preview release, Granite 4.0 Tiny already exhibits meaningful performance gains over prior models in IBM’s Granite series. On benchmark evaluations, the Base-Preview demonstrates:

  • +5.6 improvement on DROP (Discrete Reasoning Over Paragraphs), a benchmark for multi-hop QA
  • +3.8 on AGIEval, which assesses general language understanding and reasoning

These improvements are attributed to both the model’s architecture and its extensive pretraining—reportedly on 2.5 trillion tokens, spanning diverse domains and linguistic structures.

Instruction-Tuned Variant: Designed for Dialogue, Clarity, and Multilingual Reach

The Granite-4.0-Tiny-Preview (Instruct) variant extends the base model through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), using a Tülu-style dataset consisting of both open and synthetic dialogues. This variant is tailored for instruction-following and interactive use cases.

Supporting 8,192 token input windows and 8,192 token generation lengths, the model maintains coherence and fidelity across extended interactions. Unlike encoder–decoder hybrids that often trade off interpretability for performance, the decoder-only setup here yields clearer and more traceable outputs—a valuable feature for enterprise and safety-critical applications.

Evaluation Scores:

  • 86.1 on IFEval, indicating strong performance in instruction-following benchmarks
  • 70.05 on GSM8K, for grade-school math problem solving
  • 82.41 on HumanEval, measuring Python code generation accuracy

Moreover, the instruct model supports multilingual interaction across 12 languages, making it viable for global deployments in customer service, enterprise automation, and educational tools.

Open-Source Availability and Ecosystem Integration

IBM has made both models publicly available on Hugging Face.

The models are accompanied by full model weights, configuration files, and sample usage scripts under the Apache 2.0 license, encouraging transparent experimentation, fine-tuning, and integration across downstream NLP workflows.

Outlook: Laying the Groundwork for Granite 4.0

Granite 4.0 Tiny Preview serves as an early glimpse into IBM’s broader strategy for its next-generation language model suite. By combining efficient MoE architectures, long-context support, and instruction-focused tuning, the model family aims to deliver state-of-the-art capabilities in a controllable and resource-efficient package.

As more variants of Granite 4.0 are released, we can expect IBM to deepen its investment in responsible, open AI—positioning itself as a key player in shaping the future of transparent, high-performance language models for enterprise and research.


Check out the Technical details, Granite 4.0 Tiny Base Preview, and Granite 4.0 Tiny Instruct Preview.

From ELIZA to Conversation Modeling: Evolution of Conversational AI Systems and Paradigms

TL;DR: Conversational AI has transformed from ELIZA’s simple rule-based systems in the 1960s to today’s sophisticated platforms. The journey progressed through scripted bots in the 80s-90s, hybrid ML-rule frameworks like Rasa in the 2010s, and the revolutionary large language models of the 2020s that enabled natural, free-form interactions. Now, cutting-edge conversation modeling platforms like Parlant combine LLMs’ generative power with structured guidelines, creating experiences that are both richly interactive and practically deployable—offering developers unprecedented control, iterative flexibility, and real-world scalability.

ELIZA: The Origin of Conversational Agents (1960s)

The lineage of conversational AI begins with ELIZA, created by Joseph Weizenbaum at MIT in 1966.

ELIZA was a rule-based chatbot that used simple pattern matching and substitution rules to simulate conversation. Weizenbaum’s most famous script for ELIZA, called “DOCTOR,” parodied a Rogerian psychotherapist: it would reflect the user’s inputs back as questions or prompts. For example, if a user said “I feel stressed about work,” ELIZA might reply, “Why do you feel stressed about work?” This gave an illusion of understanding without any real comprehension of meaning.
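
A few lines of pattern matching are enough to reproduce the flavor of the DOCTOR script. The rules below are a tiny illustrative subset, not Weizenbaum's original script.

```python
import re

# A few DOCTOR-style reflection rules; the real ELIZA script had many more.
RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I),   "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I),     "Tell me more about your {0}."),
]

def eliza_reply(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Please go on."

print(eliza_reply("I feel stressed about work"))  # Why do you feel stressed about work?
```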

ELIZA was one of the first programs to attempt the Turing Test (engaging in dialogue indistinguishable from a human). While it was a very simple system, ELIZA proved that humans could be momentarily convinced they were chatting with an understanding entity – a phenomenon later dubbed the “Eliza effect.” This early success sparked widespread interest and laid the foundation for chatbot development, even though ELIZA’s capabilities were rudimentary and entirely scripted.

Scripted Chatbots: Menu-Driven Systems and AIML (1980s–1990s)

After ELIZA, conversational systems remained largely rule-based but grew more sophisticated.

Many early customer service bots and phone IVR systems in the 1980s and 1990s were essentially menu-driven – they guided users through predefined options (e.g. “Press 1 for account info, 2 for support”) rather than truly “understanding” free text.

Around the same time, more advanced text-based bots used bigger rule sets and pattern libraries to appear conversational. A landmark was A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), introduced in 1995 by Richard Wallace. ALICE employed an XML-based markup language called AIML (Artificial Intelligence Markup Language) to manage conversation rules. Instead of hard-coding every response, AIML let developers define patterns and template replies. As a result, ALICE had an enormous base of about 41,000 predefined templates and pattern-response pairs. This allowed it to engage in more varied, natural-sounding chats than ELIZA’s simple keyword tricks. ALICE was even awarded the Loebner Prize (a conversational AI contest) multiple times in the early 2000s.

Despite these improvements, bots like ALICE and its contemporaries still relied on static scripts. They lacked true understanding and could be easily led off-track by inputs outside their scripted patterns. In practice, developers often had to anticipate countless phrasings or guide users to stay within expected inputs (hence the popularity of menu-driven designs for reliability). By the late 1990s, the paradigm in industry was that chatbots were essentially expert systems: large collections of if-then rules or decision trees. These systems worked for narrowly defined tasks (like tech support FAQs or simple dialog games) but were brittle and labor-intensive to expand. Still, this era demonstrated that with enough rules, a chatbot could handle surprisingly complex dialogues – a stepping stone toward more data-driven approaches.

The Rise of ML and Hybrid NLU Frameworks (2010s)

The 2010s saw a shift toward machine learning (ML) in conversational AI, aiming to make chatbots less brittle and easier to build. Instead of manually writing thousands of rules, developers began using statistical Natural Language Understanding (NLU) techniques to interpret user input.

Frameworks like Google’s Dialogflow and the open-source Rasa platform (open-sourced in 2017) exemplified this hybrid approach. They let developers define intents (user’s goals) and entities (key information), and then train ML models on example phrases. The ML model generalizes from those examples, so the bot can recognize a user request even if it’s phrased in an unforeseen way. For instance, whether a user says “Book me a flight for tomorrow” or “I need to fly out tomorrow,” an intent classification model can learn to map both to the same “BookFlight” intent. This significantly reduced the need to hand-craft every possible pattern.
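
In spirit, the intent classifier is just a supervised text classifier trained on example utterances. The toy version below uses TF-IDF features and logistic regression rather than the production-grade pipelines in Dialogflow or Rasa; the intents and example phrases are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of example utterances per intent, in the style of NLU training data.
examples = [
    ("Book me a flight for tomorrow", "BookFlight"),
    ("I need to fly out tomorrow", "BookFlight"),
    ("Reserve a seat to Paris next week", "BookFlight"),
    ("What's my account balance?", "CheckBalance"),
    ("How much money do I have?", "CheckBalance"),
    ("Show my current balance", "CheckBalance"),
]
texts, intents = zip(*examples)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, intents)

print(clf.predict(["can you get me on a plane tomorrow"]))  # likely ['BookFlight']
```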

Over time, these NLU models incorporated Transformer-based innovations to boost accuracy. For example, Rasa introduced the DIET (Dual Intent and Entity Transformer) architecture, a lightweight transformer network for intent classification and entity extraction. Such models approach the language-understanding performance of large pre-trained transformers like BERT, but are tailored to the specific intents/entities of the chatbot. Meanwhile, the dialogue management in these frameworks was still often rule-based or followed story graphs defined by developers. In Dialogflow, one would design conversational flows with contexts and transitions. In Rasa, one could write stories or rules that specify how the bot should respond or which action to take next given the recognized intent and dialogue state.

This combination of ML + rules was a major step up. It allowed chatbots to handle more natural language variation while maintaining controlled flows for business logic. Many virtual assistants and customer support bots deployed in the late 2010s (on platforms like Facebook Messenger, Slack, or bank websites) were built this way. However, challenges remained. Designing and maintaining the conversation flows could become complex as an assistant’s scope grew. Every new feature or edge case might require adding new intents, more training data, and more dialogue branches – which risked turning into a tangle of states (a “graph-based” framework that can become overwhelmingly complex as the agent grows).

Moreover, while these systems were more flexible than pure rules, they still could fail if users went truly off-script or asked something outside the trained data.

The LLM Era: Prompt-Based Conversations and RAG (2020s)

A watershed moment came with the advent of Large Language Models (LLMs) in the early 2020s. Models like OpenAI’s GPT-3 (2020) and later ChatGPT (2022) demonstrated that a single, massive neural network trained on internet-scale data could engage in remarkably fluent open-ended conversations.

ChatGPT, for instance, can generate responses that are often difficult to distinguish from human-written text, and it can carry on a dialogue spanning many turns without explicit rules scripted by a developer. Instead of defining intents or writing dialogue trees, developers could now provide a prompt (e.g. a starting instruction like “You are a helpful customer service agent…”) and let the LLM generate the conversation. This approach flips the old paradigm: rather than the developer explicitly mapping out the conversation, the model has learned conversational patterns from its training data and can dynamically produce answers.
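In code, the shift is striking: instead of intents, entities, and flow definitions, the developer supplies a system prompt plus the running message history. The sketch below uses the OpenAI Python SDK; the model name and wording are placeholders, and any chat-completion-style API would look similar.

```python
# Sketch of the prompt-driven paradigm: no intents or dialogue tree, just a system
# prompt plus the conversation so far. Model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system", "content": "You are a helpful customer service agent for Acme Hotels."},
    {"role": "user", "content": "Hi, I need a room in New York this weekend."},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```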

However, using LLMs for reliable conversational agents brought new challenges. First, large models have a fixed knowledge cutoff (ChatGPT's base knowledge, for example, only went up to 2021 in its initial release). Second, they are prone to “hallucinations” – confidently generating incorrect or fabricated information when asked about something outside their knowledge.

To tackle this, a technique called Retrieval-Augmented Generation (RAG) became popular. RAG pairs the LLM with an external knowledge source: when a user asks a question, the system first retrieves relevant documents (from a database or search index) and then feeds those into the model’s context so it can base its answer on up-to-date, factual information. This method helps address the knowledge gap and reduces hallucinations by grounding the LLM’s responses in real data. Many modern QA bots and enterprise assistants use RAG – for example, a customer support chatbot might retrieve policy documents or user account info so that the LLM’s answer is accurate and personalized.
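A bare-bones RAG loop looks roughly like the sketch below: retrieve the most relevant documents, then ask the model to answer using only that context. The "knowledge base" and keyword-overlap scoring are toy stand-ins for a real vector index, and the model name is a placeholder.

```python
# Minimal RAG sketch: retrieve relevant documents, then ground the LLM's answer in
# them. The in-memory DOCS list and naive scoring stand in for a real vector index
# (e.g. FAISS or a hosted search service).
from openai import OpenAI

DOCS = [
    "Refund policy: purchases can be refunded within 30 days with a receipt.",
    "Shipping: standard delivery takes 3-5 business days within the US.",
    "Support hours: live agents are available 9am-6pm ET, Monday to Friday.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring; a production system would use embeddings.
    q = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do I have to get a refund?"))
```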

Another tool in this era is the use of system prompts and few-shot examples to steer LLM behavior. By providing instructions like “Always respond in a formal tone,” or giving examples of desired Q&A pairs, developers attempt to guide the model’s style and compliance with rules. This is powerful but not foolproof: LLMs often ignore instructions when a conversation grows long or a prompt becomes complex, as earlier instructions effectively fall out of the model’s attention.

Essentially, pure prompting lacks guarantees – it’s still the model’s learned behavior that decides the outcome. And while RAG can inject facts, it “can’t guide behavior” or enforce complex dialogue flows. For instance, RAG will help a bot cite the correct price from a database, but it won’t ensure the bot follows a company’s escalation protocol or keeps a consistent persona beyond what the prompt suggests.

By late 2024, developers had a mix of approaches for conversational AI:

  • Fine-tuning an LLM on custom data to specialize it (which can be expensive and inflexible, often requiring re-training the whole model for small changes).
  • Prompt engineering and RAG to leverage pre-trained LLMs without full retraining (quick to prototype, but needing careful tweaking and still lacking strong runtime control and consistency).
  • Traditional frameworks (intents/flows or graphical dialog builders), which offer deterministic behavior but at the cost of flexibility and significant manual work, especially as complexity grows.

Each approach had trade-offs. Many teams found themselves combining methods and still encountering issues with consistency and maintainability. This set the stage for a new paradigm aiming to capture the best of both worlds – the knowledge and linguistic fluency of LLMs with the control and predictability of rule-based systems. This emerging paradigm is what we refer to as Conversation Modeling.

Conversation Modeling with Parlant.io: A New Paradigm

The latest development in conversational AI is the rise of Conversation Modeling platforms, with Parlant as a prime example. Parlant is an open-source Conversation Modeling Engine designed to build user-facing agents that are adaptive, yet predictable and accurate. In essence, it provides a structured way to shape an LLM-driven conversation without reverting to rigid workflows or expensive model retraining. Instead of coding up dialogue flows or endlessly tweaking prompts, a developer using Parlant focuses on writing guidelines that direct the AI’s behavior.

Guideline-Driven Conversations

Guidelines in Parlant are like contextual rules or principles that the AI agent should follow. Each guideline has a condition (when it applies) and an action (what it should make the agent do).

For example, a guideline might be: When the user is asking to book a hotel room and they haven’t specified the number of guests, then ask for the number of guests. This “when X, then Y” format encapsulates business logic or conversation policy in a flexible, declarative way. The crucial difference from old-school rules is that guidelines don’t script out the exact wording of the bot’s response or a fixed path – they simply set expectations that the generative model must adhere to.
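Conceptually, a guideline is just a condition/action pair. The sketch below shows that shape in plain Python; it is a generic illustration of the “when X, then Y” format, not Parlant’s actual API, and the wording of the guidelines is invented.

```python
# A guideline as a simple condition/action pair. This mirrors the "when X, then Y"
# shape described above; it is a generic sketch, not Parlant's actual API.
from dataclasses import dataclass

@dataclass
class Guideline:
    condition: str  # when this applies, stated in natural language
    action: str     # what the agent should do when it applies

GUIDELINES = [
    Guideline(
        condition="The user is asking to book a hotel room and hasn't said how many guests",
        action="Ask for the number of guests before proceeding",
    ),
    Guideline(
        condition="Any user message",
        action="Respond in an enthusiastic, upbeat tone",
    ),
]
```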

Parlant’s engine takes care of enforcing these guidelines during the conversation. It does so by dynamically injecting the relevant guidelines into the LLM’s context at the right time.

In our hotel booking example, if the user says, “I need a hotel in New York this weekend,” Parlant would recognize that the “ask about number of guests” guideline’s condition is met. It would then load that guideline into the prompt for the LLM, so the AI’s response would be guided to, say, “Certainly! I can help with that. How many guests will be staying?” instead of the model’s default response, which might have omitted the guest count question. If another guideline says the agent should always respond enthusiastically, that guideline would also be activated, ensuring the tone is upbeat. This way, multiple guidelines can shape each response.

Importantly, Parlant keeps the model’s “cognitive load” light by only including guidelines that are contextually relevant, given the current conversation state. An agent could have dozens of guidelines defined, but the model isn’t flooded with irrelevant instructions – the system is smart about which rules apply when.
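A simplified picture of this selective injection is sketched below: evaluate which guideline conditions currently hold, and add only those actions to the prompt. It reuses the Guideline list from the previous sketch, and the LLM-as-judge check is just one plausible way to evaluate conditions; how Parlant’s engine actually does this is internal to the platform.

```python
# Continuing the previous sketch (Guideline, GUIDELINES): decide which guideline
# conditions currently hold, then inject only those into the system prompt.
from openai import OpenAI

client = OpenAI()

def condition_holds(condition: str, conversation: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Conversation so far:\n{conversation}\n\n"
                       f"Does this condition currently apply: '{condition}'? Answer yes or no.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def build_system_prompt(conversation: str) -> str:
    active = [g.action for g in GUIDELINES if condition_holds(g.condition, conversation)]
    rules = "\n".join(f"- {a}" for a in active)
    return f"You are a hotel booking assistant. Follow these guidelines:\n{rules}"

print(build_system_prompt("User: I need a hotel in New York this weekend."))
```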

This dynamic approach allows richer interactions than a static flowchart: the conversation can go in many directions, but whenever a situation arises that has a guideline, the model will consistently follow that instruction. In effect, the LLM becomes more grounded and consistent in its behavior, without losing its natural language flexibility.

Reliability, Enforcement, and Explainability

A standout feature of Parlant’s conversation modeling is how it checks and explains the agent’s decisions.

Traditional chatbots might log which intent was matched or which rule fired, but Parlant goes further. It actually supervises the AI’s output before it reaches the user to ensure that the guidelines were followed. One novel technique the Parlant team developed is called Attentive Reasoning Queries (ARQs).

In simplified terms, ARQs are an internal query the system poses (via the LLM’s reasoning capabilities) to double-check that the response satisfies the active guidelines. If something is off – say the model produced an answer that violates a guideline or contradicts a prior instruction – Parlant can catch that and correct course. This might involve instructing the model to try again or adjusting the context. The result is an extra layer of assurance that the agent’s answers are on-policy and safe before the user sees them.
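The general pattern behind such a verification step can be sketched as a focused yes/no query about the draft reply, as below. This is only an illustration of the idea; Parlant’s actual ARQ mechanism is more sophisticated and is described in the team’s own materials.

```python
# Simplified illustration of the idea behind an attentive reasoning query: before a
# draft reply reaches the user, ask the model a focused verification question.
# This is not Parlant's actual ARQ implementation, just a sketch of the pattern.
from openai import OpenAI

client = OpenAI()

def passes_guidelines(draft_reply: str, active_guidelines: list[str]) -> bool:
    checklist = "\n".join(f"- {g}" for g in active_guidelines)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here is a draft reply from a customer-service agent:\n"
                f"{draft_reply}\n\n"
                "Does it satisfy ALL of the following guidelines?\n"
                f"{checklist}\n\nAnswer only yes or no."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

draft = "Sure, I can help with that."
rules = ["Ask for the number of guests before proceeding", "Respond in an upbeat tone"]
if not passes_guidelines(draft, rules):
    print("Guideline violation detected - regenerate with corrective instructions.")
```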

From a developer’s perspective, this yields a high degree of predictability and makes it easier to debug conversations. Parlant provides extensive feedback on the agent’s decisions and interpretations. One can trace which guideline triggered at a given turn, what the model “thought” the user meant, and why it chose a certain reply.

This level of transparency is rarely available in pure LLM solutions (which can feel like a black box) and even in many ML-based frameworks. If a conversation went wrong, you can quickly see if a guideline was missing or mis-specified, or if the AI misunderstood because no guideline covered a scenario, and then adjust accordingly.

Faster Iteration and Scalable Testing

Conversation modeling also dramatically improves the development lifecycle for AI agents. In older approaches, if a business stakeholder said “Our chatbot should change its behavior in X scenario,” implementing that could mean re-writing parts of a flow, collecting new training data, or even fine-tuning a model – and then testing extensively to ensure nothing else broke. With Parlant, that request usually translates to simply adding or editing a guideline.

For instance, if the sales team decides that during holidays the bot should offer a 10% discount, a developer can implement a guideline: When it is a holiday, then the agent should offer a discount. There’s no need to retrain the language model or overhaul the dialog tree; the guideline is a modular addition.

Parlant was built so that developers can iterate quickly in response to business needs, updating the conversational behavior at the pace of changing requirements. This agility is akin to how a human manager might update a customer service script or policies, and immediately all agents follow the new policy – here, the “policies” are guidelines, and the AI agent follows them immediately once updated.

Because guidelines are discrete and declarative, it’s also easier to test and scale conversational agents built this way. Each guideline can be seen as a testable unit: one can devise example dialogues to verify that the guideline triggers properly and that the agent’s response meets expectations. Parlant’s deterministic injection of guidelines means the agent will behave consistently for a given scenario, which makes automated testing feasible (you won’t get a completely different response every time, as a raw LLM might produce).
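In practice, that can look like an ordinary unit test. The sketch below assumes the hypothetical build_system_prompt helper from the earlier selection sketch; with a real conversation-modeling engine you would test against whatever scenario API it exposes.

```python
# Treating a guideline as a testable unit: given a booking scenario with no guest
# count, the "ask about number of guests" action should become active.
# build_system_prompt is the hypothetical helper from the earlier sketch.
def test_guest_count_guideline_triggers():
    prompt = build_system_prompt("User: I need a hotel in New York this weekend.")
    assert "number of guests" in prompt.lower()
```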

The platform’s emphasis on explainability also means you can catch regressions or unintended effects early – you’ll see if a new guideline conflicts with an existing one, for example. This approach lends itself to more robust, enterprise-grade deployments where reliability and compliance are crucial.

Integration with Business Logic and Tools

Another way Parlant stands apart is in how it separates conversational behavior from back-end logic.

Earlier chatbot frameworks sometimes entangled the two – for example, a dialog flow node might both decide what to say and invoke an API call. Parlant encourages a clean separation: use guidelines for conversation design, and use tool functions (external APIs or code) for any business logic or data retrieval.

Guidelines can trigger those tools, but they don’t contain the logic themselves. This means you can have a guideline like “When the customer asks to track an order, then retrieve the order status and communicate it.”

The actual act of looking up the order status is done by a deterministic function (so no uncertainty there), and the guideline ensures the AI knows when to call it and how to incorporate the result into the conversation. By not embedding complex computations or database queries into the AI’s prompt, Parlant avoids the pitfalls of LLMs struggling with multi-step reasoning or math.
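A minimal sketch of that separation: the order lookup is a plain deterministic function, and the guideline only records when to call it. The data, function, and guideline wording below are illustrative, not Parlant’s actual tool-registration API.

```python
# Business logic lives in a plain, deterministic function; the conversation layer
# only decides *when* to call it and how to phrase the result. The data source and
# guideline wording here are illustrative.

ORDERS = {"A1001": "shipped - arriving Thursday", "A1002": "processing"}

def get_order_status(order_id: str) -> str:
    """Deterministic tool: no LLM involved, so the lookup itself can't hallucinate."""
    return ORDERS.get(order_id, "order not found")

TRACK_ORDER_GUIDELINE = {
    "condition": "The customer asks to track an order and provides an order id",
    "action": "Call get_order_status and communicate the result clearly",
    "tools": [get_order_status],
}

print(get_order_status("A1001"))  # -> "shipped - arriving Thursday"
```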

This division of labor leads to more maintainable and reliable systems: developers can update business logic in code without touching the conversational guidelines, and vice versa. It’s a design paradigm that scales well as projects grow.

Real-World Impact and Use Cases

All these capabilities make conversation modeling suitable for applications that were previously very challenging for conversational AI.

Parlant emphasizes use cases like regulated industries and high-stakes customer interactions. For example, in financial services or legal assistance, an AI agent must strictly follow compliance guidelines and wording protocols – a single off-script response can have serious consequences. Parlant’s approach ensures the agent reliably follows prescribed protocols in such domains.

In healthcare communications, accuracy and consistency are paramount; an agent should stick to approved responses and escalate when unsure. Guidelines can encode those requirements (e.g. “if user mentions a medical symptom, always provide the disclaimer and suggest scheduling an appointment”).

Brand-sensitive customer service is another area: companies want AI that reflects their brand voice and policies exactly. With conversation modeling, the brand team can literally read the guidelines as if they are a policy document for the AI. This is a big improvement over hoping an ML model “learned” the desired style from training examples.

Teams using Parlant have noted that it enables richer interactions without sacrificing control. Users aren’t forced down rigid conversational menus; instead, they can ask things naturally and the AI can handle it, because the generative model is free to respond creatively as long as it follows the playbook defined by guidelines.

At the same time, the development overhead is lower – you manage a library of guidelines (which are human-readable and modular) instead of a spaghetti of code. And when the AI does something unexpected, you have the tools to diagnose why and fix it systematically.

In short, Parlant’s conversation modeling represents a convergence of the two historical threads in chatbot evolution: the free-form flexibility of advanced AI language models with the governed reliability of rule-based systems. This paradigm is poised to define the next generation of conversational agents that are both intelligent and trustworthy, from virtual customer assistants to automated advisors across industries.


Disclaimer: The views and opinions expressed in this guest article are those of the author and do not necessarily reflect the official policy or position of Marktechpost.

The post From ELIZA to Conversation Modeling: Evolution of Conversational AI Systems and Paradigms appeared first on MarkTechPost.

JetBrains Open Sources Mellum: A Developer-Centric Language Model for Code-Related Tasks

JetBrains has officially open-sourced Mellum, a purpose-built 4-billion-parameter language model tailored for software development tasks. Developed from the ground up, Mellum reflects JetBrains’ engineering-first approach, offering a domain-specialized model trained for practical usage across codebases and programming environments. With its release on Hugging Face under the Apache 2.0 license, JetBrains extends an invitation to the broader research and developer community to experiment, adapt, and advance Mellum’s capabilities.

A Focal Model for Code Understanding

Unlike general-purpose LLMs, Mellum is classified by JetBrains as a “focal model”—a term they use to describe models with a narrow yet deep specialization. Mellum is optimized specifically for programming-related tasks such as autocompletion, infilling, and structural understanding of source code. This focused design avoids the overhead of broader linguistic modeling and enables the model to perform efficiently in IDE-like environments.

The model supports a wide array of languages including Java, Kotlin, Python, Go, PHP, C, C++, C#, JavaScript, TypeScript, CSS, HTML, Rust, and Ruby—reflecting the polyglot nature of modern development teams.

Model Architecture and Training Pipeline

Mellum follows a LLaMA-style architecture and was trained from scratch using over 4.2 trillion tokens drawn from code-rich sources such as The Stack, StarCoder, CommitPack, and English Wikipedia. It features an 8K token context window and was trained using bf16 mixed precision across a high-throughput cluster of 256 NVIDIA H200 GPUs connected via Infiniband.

The training process spanned approximately 20 days and leveraged modern infrastructure for scalable model development. The architecture and training procedure were designed with reproducibility and deployment flexibility in mind, making Mellum usable in both cloud inference setups (e.g., vLLM) and on local environments (e.g., llama.cpp, Ollama).
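For local experimentation, loading the released base model through Hugging Face transformers is straightforward. The sketch below assumes the model id JetBrains/Mellum-4b-base as published with the release; generation settings are illustrative.

```python
# Minimal sketch of running the released base model locally with Hugging Face
# transformers. The model id follows the naming given in the announcement
# (JetBrains/Mellum-4b-base); generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JetBrains/Mellum-4b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package to place weights on GPU/CPU.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "def fibonacci(n):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```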

Benchmarking and Evaluation

JetBrains evaluated Mellum across a range of benchmarks that reflect its primary use cases—code infilling and completion. The model’s performance indicates strong alignment with the design goals:

  • RepoBench v1.1 (8K context):
    • Python EM: 27.97%
    • Java EM: 31.08%
  • SAFIM (Syntax-Aware Fill-in-the-Middle):
    • pass@1: 38.11%
  • HumanEval Infilling:
    • Single-line: 66.21%
    • Multi-line: 38.52%
    • Random-span: 29.70%

These results reflect Mellum’s specialization for structured code understanding, especially in scenarios involving partial or interrupted code, which are common in real-world development workflows.

Rationale for Open Sourcing

JetBrains’ decision to release Mellum as open-source is grounded in several practical motivations:

  • Transparency: Enables scrutiny of both training data and architectural decisions.
  • Reusability: Supports integration in custom development environments and research experiments.
  • Community Collaboration: Facilitates contribution from external developers to refine model behavior.
  • Pedagogical Value: Provides educators and students with a hands-on artifact for understanding how domain-specific LLMs are constructed and applied.

The release includes both the base model (Mellum-4b-base) and a fine-tuned variant for Python (Mellum-4b-sft-python).

Implications for Developer Tooling

The availability of a compact, performant model optimized for source code opens new opportunities in the IDE space and beyond. JetBrains envisions Mellum as part of a broader strategy involving multiple focal models, each optimized for specific programming tasks such as diff generation or code review assistance. This approach aligns with the growing need for deployable, cost-effective, and context-aware AI tooling that can augment developer productivity without introducing opaque or oversized general-purpose models.

Conclusion

Mellum represents a deliberate shift toward smaller, specialized language models that prioritize utility, transparency, and efficiency. By making the model openly available, JetBrains offers a high-quality foundation for building the next generation of AI-assisted developer tools. Its architecture, training methodology, and benchmark performance signal a practical step forward in the evolving space of LLMs tailored for software engineering.


Meta and Booz Allen Deploy Space Llama: Open-Source AI Heads to the ISS for Onboard Decision-Making

In a significant step toward enabling autonomous AI systems in space, Meta and Booz Allen Hamilton have announced the deployment of Space Llama, a customized instance of Meta’s open-source large language model, Llama 3.2, aboard the International Space Station (ISS) U.S. National Laboratory. This initiative marks one of the first practical integrations of an LLM in a remote, bandwidth-limited, space-based environment.

Addressing Disconnection and Autonomy Challenges

Unlike terrestrial applications, AI systems deployed in orbit face strict constraints—limited compute resources, constrained bandwidth, and high-latency communication links with ground stations. Space Llama has been designed to function entirely offline, allowing astronauts to access technical assistance, documentation, and maintenance protocols without requiring live support from mission control.

To address these constraints, the AI model had to be optimized for onboard deployment, incorporating the ability to reason over mission-specific queries, retrieve context from local data stores, and interact with astronauts in natural language—all without internet connectivity.
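The offline constraint can be illustrated with a standard transformers loading pattern that never touches the network: the weights must already sit on local storage. The model id below is an assumed Llama 3.2 variant, not necessarily the one packaged for the mission, and the sample question is invented.

```python
# Sketch of the offline constraint: loading is forced to use only locally cached
# files, so no network access is needed at inference time. The model id is a
# placeholder for whichever Llama 3.2 variant was deployed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed variant; weights pre-downloaded
tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, local_files_only=True, torch_dtype="auto", device_map="auto"
)

question = "Summarize the checkout steps for the water reclamation filter swap."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}], return_tensors="pt", add_generation_prompt=True
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))
```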

Technical Framework and Integration Stack

The deployment leverages a combination of commercially available and mission-adapted technologies:

  • Llama 3.2: Meta’s latest open-source LLM serves as the foundation, fine-tuned for contextual understanding and general reasoning tasks in edge environments. Its open architecture enables modular adaptation for aerospace-grade applications.
  • A2E2™ (AI for Edge Environments): Booz Allen’s AI framework provides containerized deployment and modular orchestration tailored to constrained environments like the ISS. It abstracts complexity in model serving and resource allocation across diverse compute layers.
  • HPE Spaceborne Computer-2: This edge computing platform, developed by Hewlett Packard Enterprise, provides reliable high-performance processing hardware for space. It supports real-time inference workloads and model updates when necessary.
  • NVIDIA CUDA-capable GPUs: These enable the accelerated execution of transformer-based inference tasks while staying within the ISS’s strict power and thermal budgets.

This integrated stack ensures that the model operates within the limits of orbital infrastructure, delivering utility without compromising reliability.

Open-Source Strategy for Aerospace AI

The selection of an open-source model like Llama 3.2 aligns with growing momentum around transparency and adaptability in mission-critical AI. The benefits include:

  • Modifiability: Engineers can tailor the model to meet specific operational requirements, such as natural language understanding in mission terminology or handling multi-modal astronaut inputs.
  • Data Sovereignty: With all inference running locally, sensitive data never needs to leave the ISS, ensuring compliance with NASA and partner agency privacy standards.
  • Resource Optimization: Open access to the model’s architecture allows for fine-grained control over memory and compute use—critical for environments where system uptime and resilience are prioritized.
  • Community-Based Validation: Using a widely studied open-source model promotes reproducibility, transparency in behavior, and better testing under mission simulation conditions.

Toward Long-Duration and Autonomous Missions

Space Llama is not just a research demonstration—it lays the groundwork for embedding AI systems into longer-term missions. In future scenarios like lunar outposts or deep-space habitats, where round-trip communication latency with Earth spans minutes or hours, onboard intelligent systems must assist with diagnostics, operations planning, and real-time problem-solving.

Furthermore, the modular nature of Booz Allen’s A2E2 platform opens up the potential for expanding the use of LLMs to non-space environments with similar constraints—such as polar research stations, underwater facilities, or forward operating bases in military applications.

Conclusion

The Space Llama initiative represents a methodical advancement in deploying AI systems to operational environments beyond Earth. By combining Meta’s open-source LLMs with Booz Allen’s edge deployment expertise and proven space computing hardware, the collaboration demonstrates a viable approach to AI autonomy in space.

Rather than aiming for generalized intelligence, the model is engineered for bounded, reliable utility in mission-relevant contexts—an important distinction in environments where robustness and interpretability take precedence over novelty.

As space systems become more software-defined and AI-assisted, efforts like Space Llama will serve as reference points for future AI deployments in autonomous exploration and off-Earth habitation.

