AI Shorts Category - MarkTechPost

LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
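The article does not give the exact form of this gate, but a common way to realize such a fusion is a learned sigmoid gate over the concatenated features. The sketch below is a minimal PyTorch illustration under that assumption; the module name, dimensions, and gate formulation are illustrative rather than taken from the LLaMA-Omni2 code.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Illustrative gated fusion of LLM hidden states and text embeddings.
    The exact formulation in LLaMA-Omni2 may differ; this shows the common
    sigmoid-gate pattern suggested by the description above."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # hidden, text_emb: (batch, seq_len, dim)
        gate = torch.sigmoid(self.gate_proj(torch.cat([hidden, text_emb], dim=-1)))
        # Convex combination: the gate decides how much contextual LLM state
        # versus textual embedding is passed to the TTS decoder.
        return gate * hidden + (1.0 - gate) * text_emb
```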

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
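To make the schedule concrete, the sketch below interleaves the two decoders in the R:W pattern described above. `next_text_token` and `next_speech_token` are hypothetical stand-ins for the LLM decoder and the streaming TTS decoder; only the interleaving logic reflects the paper's description.

```python
def stream_speech(next_text_token, next_speech_token, R: int = 3, W: int = 10):
    """Sketch of the read-write schedule: for every R text tokens produced by
    the LLM ("read"), W speech tokens are generated ("write"), so audio can
    start playing before the full text response is finished. The two callables
    are placeholders, not part of any released code."""
    text, speech = [], []
    done = False
    while not done:
        for _ in range(R):                      # "read": advance the LLM
            tok = next_text_token(text)
            if tok is None:                     # LLM finished its response
                done = True
                break
            text.append(tok)
        for _ in range(W):                      # "write": emit speech tokens
            speech.append(next_speech_token(text, speech))
    return text, speech
```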

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

Model              Llama Q (S2S)   Web Q (S2S)   GPT-4o Score   ASR-WER   Latency (ms)
GLM-4-Voice (9B)   50.7            15.9          4.09           3.48      1562.8
LLaMA-Omni (8B)    49.0            23.7          3.52           3.67      346.7
LLaMA-Omni2-7B     60.7            31.3          4.15           3.26      582.9

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


How AI Agents Store, Forget, and Retrieve? A Fresh Look at Memory Operations for the Next-Gen LLMs

Memory plays a crucial role in LLM-based AI systems, supporting sustained, coherent interactions over time. While earlier surveys have explored memory in the context of LLMs, they often lack attention to the fundamental operations governing memory functions. Key components like memory storage, retrieval, and memory-grounded generation have been studied in isolation, but a unified framework that systematically integrates these processes remains underdeveloped. Although a few recent efforts have proposed operational views of memory to categorize existing work, the field still lacks cohesive memory architectures that clearly define how these atomic operations interact.

Furthermore, existing surveys tend to address only specific subtopics within the broader memory landscape, such as long-context handling, long-term memory, personalization, or knowledge editing. These fragmented approaches often miss essential operations like indexing and fail to offer comprehensive overviews of memory dynamics. Additionally, most prior work does not establish a clear research scope or provide structured benchmarks and tool coverage, limiting its practical value for guiding future advancements in memory for AI systems.

Researchers from the Chinese University, the University of Edinburgh, HKUST, and the Poisson Lab at Huawei UK R&D Ltd. present a detailed survey on memory in AI systems. They classify memory into parametric, contextual-structured, and contextual-unstructured types, distinguishing between short-term and long-term memory inspired by cognitive psychology. Six fundamental operations—consolidation, updating, indexing, forgetting, retrieval, and compression—are defined and mapped to key research areas, including long-term memory, long-context modeling, parametric modification, and multi-source integration. Based on an analysis of over 30,000 papers using the Relative Citation Index, the survey also outlines tools, benchmarks, and future directions. 

The researchers first develop a three‐part taxonomy of AI memory—parametric (model weights), contextual‐structured (e.g., indexed dialogue histories), and contextual‐unstructured (raw text or embeddings)—and distinguish short‐ versus long‐term spans. They then define six core memory operations: consolidation (storing new information), updating (modifying existing entries), indexing (organizing for fast access), forgetting (removing stale data), retrieval (fetching relevant content), and compression (distilling memories). To ground this framework, they mined over 30,000 top‐tier AI papers (2022–2025), ranked them by Relative Citation Index, and clustered high‐impact works into four themes—long‐term memory, long‐context modeling, parametric editing, and multi‐source integration—thereby mapping each operation and memory type to active research areas and highlighting key benchmarks and tools. 
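To make the six operations concrete, the sketch below shows what a memory layer exposing them might look like. The class, method names, and tag-based index are assumptions chosen for illustration; the survey defines the operations conceptually rather than as a specific API.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class MemoryEntry:
    key: str
    content: Any
    metadata: dict = field(default_factory=dict)

class MemoryStore:
    """Illustrative interface mirroring the six operations defined in the
    survey (consolidation, updating, indexing, forgetting, retrieval,
    compression). Names and signatures are assumptions, not an API proposed
    by the authors."""
    def __init__(self):
        self._entries: dict[str, MemoryEntry] = {}
        self._index: dict[str, set] = {}                 # tag -> entry keys

    def consolidate(self, entry: MemoryEntry) -> None:   # store new information
        self._entries[entry.key] = entry

    def update(self, key: str, content: Any) -> None:    # modify an existing entry
        self._entries[key].content = content

    def index(self, key: str, tag: str) -> None:         # organize for fast access
        self._index.setdefault(tag, set()).add(key)

    def forget(self, key: str) -> None:                  # remove stale data
        self._entries.pop(key, None)

    def retrieve(self, tag: str) -> list:                # fetch relevant content
        return [self._entries[k] for k in self._index.get(tag, ()) if k in self._entries]

    def compress(self, key: str, summarizer) -> None:    # distill a memory
        self._entries[key].content = summarizer(self._entries[key].content)
```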

The study describes a layered ecosystem of memory-centric AI systems that support long-term context management, user modeling, knowledge retention, and adaptive behavior. This ecosystem is structured across four tiers: foundational components (such as vector stores, large language models like Llama and GPT-4, and retrieval mechanisms like FAISS and BM25), frameworks for memory operations (e.g., LangChain and LlamaIndex), memory layer systems for orchestration and persistence (such as Memary and Memobase), and end-user-facing products (including Me.bot and ChatGPT). These tools provide infrastructure for memory integration, enabling capabilities like grounding, similarity search, long-context understanding, and personalized AI interactions.

The survey also discusses open challenges and future research directions in AI memory. It highlights the importance of spatio-temporal memory, which balances historical context with real-time updates for adaptive reasoning. Key challenges include parametric memory retrieval, lifelong learning, and efficient knowledge management across memory types. Additionally, the paper draws inspiration from biological memory models, emphasizing dual-memory architectures and hierarchical memory structures. Future work should focus on unifying memory representations, supporting multi-agent memory systems, and addressing security concerns, particularly memory safety and defenses against malicious attacks on memory components.


RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

LLMs built on Transformer architectures face significant scaling challenges due to their quadratic complexity in sequence length when processing long-context inputs. Methods like Linear Attention models, State Space Models like Mamba, Linear RNNs like DeltaNet, and RWKV address this problem. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but experiences rapid performance degradation beyond this point. Even with continual pretraining using 128K-length data, long-context limitations persist. This issue extends beyond RWKV to other architectures like Mamba, representing a fundamental challenge for this class of models.

Linear-complexity language models have emerged as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer parallelizability during training with RNN-like recurrent state representation, and has evolved through multiple iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each combine attention with recurrent or state-space components in their own way. Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse-attention approaches include SeerAttention and Mixture of Block Attention (MoBA).

Researchers from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Hohai University, Nanjing, Shenzhen University, and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV’s efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and zero-initialization mechanism inspired by LLaMA Pro. The training follows a two-stage process:

  • First, the model trains on short 1024-token contexts from the MiniPile dataset while freezing all parameters except the newly added blocks (see the sketch after this list). 
  • The second stage involves long-context continual pretraining using the ProLong-64K dataset and a context length of 64K tokens, processing approximately 1 billion tokens total. During this phase, all parameters are unfrozen and jointly optimized. The training employs Long-context Cross-Entropy (LongCE) loss, which dynamically weights tokens based on their importance.
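A minimal sketch of the staged parameter freezing is shown below. It assumes a PyTorch-style model in which the newly inserted sparse-attention blocks can be identified by a name prefix; that prefix, and the optimizer settings, are assumptions for illustration rather than details from the released code.

```python
import torch

def configure_stage(model: torch.nn.Module, stage: int):
    """Illustrative two-stage schedule: Stage 1 trains only the newly added
    blocks on short contexts; Stage 2 unfreezes everything for 64K-token
    continual pretraining. The "sparse_attn" prefix is an assumed naming
    convention."""
    for name, param in model.named_parameters():
        if stage == 1:
            # Stage 1: only newly added sparse-attention blocks are trainable.
            param.requires_grad = name.startswith("sparse_attn")
        else:
            # Stage 2: all parameters are jointly optimized.
            param.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4)   # learning rate is illustrative
```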

Short-context evaluation reveals that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7’s 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X’s effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X’s superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37 times speedup over Flash-Attention v3, with this advantage expanding as context length increases.

In this paper, researchers introduced RWKV-X, which emerges as a hybrid language model that successfully combines RWKV’s efficiency for modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context modeling. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, employs a heuristic approach that may overlook semantically relevant dependencies. Second, the current implementation shows sparse attention decoding running slower than vanilla RWKV, indicating that further engineering efforts are needed to optimize performance.


How the Model Context Protocol (MCP) Standardizes, Simplifies, and Future-Proofs AI Agent Tool Calling Across Models for Scalable, Secure, Interoperable Workflows

Traditional Approaches to AI–Tool Integration

Before MCP, LLMs relied on ad-hoc, model-specific integrations to access external tools. Approaches like ReAct interleave chain-of-thought reasoning with explicit function calls, while Toolformer trains the model to learn when and how to invoke APIs. Libraries such as LangChain and LlamaIndex provide agent frameworks that wrap LLM prompts around custom Python or REST connectors, and systems like Auto-GPT decompose goals into sub-tasks by repeatedly calling bespoke services. Because each new data source or API requires its own wrapper, and the agent must be trained to use it, these methods produce fragmented, difficult-to-maintain codebases. In short, prior paradigms enable tool calling but impose isolated, non-standard workflows, motivating the search for a unified solution.

Model Context Protocol (MCP): An Overview  

The Model Context Protocol (MCP) was introduced to standardize how AI agents discover and invoke external tools and data sources. MCP is an open protocol that defines a common JSON-RPC-based API layer between LLM hosts and servers. In effect, MCP acts like a “USB-C port for AI applications”, a universal interface that any model can use to access tools. MCP enables secure, two-way connections between an organization’s data sources and AI-powered tools, replacing the piecemeal connectors of the past. Crucially, MCP decouples the model from the tools. Instead of writing model-specific prompts or hard-coding function calls, an agent simply connects to one or more MCP servers, each of which exposes data or capabilities in a standardized way. The agent (or host) retrieves a list of available tools, including their names, descriptions, and input/output schemas, from the server. The model can then invoke any tool by name. This standardization and reuse are a core advantage over prior approaches.

MCP’s open specification defines three core roles:

  • Host – The LLM application or user interface (e.g., a chat UI, IDE, or agent orchestration engine) that the user interacts with. The host embeds the LLM and acts as an MCP client.
  • Client – The software module within the host that implements the MCP protocol (typically via SDKs). The client handles messaging, authentication, and marshalling model prompts and responses.
  • Server – A service (local or remote) that provides context and tools. Each MCP server may wrap a database, API, codebase, or other system, and it advertises its capabilities to the client.

MCP was explicitly inspired by the Language Server Protocol (LSP) used in IDEs: just as LSP standardizes how editors query language features, MCP standardizes how LLMs query contextual tools. By using a common JSON-RPC 2.0 message format, any client and server that adheres to MCP can interoperate, regardless of the programming language or LLM used.

Technical Design and Architecture of MCP  

MCP relies on JSON-RPC 2.0 to carry three message types (requests, responses, and notifications), allowing agents to perform both synchronous tool calls and receive asynchronous updates. In local deployments, the client often spawns a subprocess and communicates over stdin/stdout (the stdio transport). In contrast, remote servers typically use HTTP with Server-Sent Events (SSE) to stream messages in real-time. This flexible messaging layer ensures that tools can be invoked and results delivered without blocking the host application’s main workflow. 

Under the MCP specification, every server exposes three standardized entities: resources, tools, and prompts. Resources are fetchable pieces of context, such as text files, database tables, or cached documents, that the client can retrieve by ID. Tools are named functions with well-defined input and output schemas, whether that’s a search API, a calculator, or a custom data-processing routine. Prompts are optional, higher-level templates or workflows that guide the model through multi-step interactions. By providing JSON schemas for each entity, MCP enables any capable large language model (LLM) to interpret and invoke these capabilities without requiring bespoke parsing or hard-coded integrations. 
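For a sense of what these entities look like on the wire, the snippet below writes simplified MCP JSON-RPC messages as Python dicts. The `tools/list` and `tools/call` method names follow the public MCP specification, but the payloads here are trimmed and the tool itself (`search_docs`) is invented for illustration; consult the specification for the authoritative schemas.

```python
# Simplified shapes of MCP JSON-RPC 2.0 messages, written as Python dicts.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",          # client asks the server what it offers
}

list_tools_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_docs",                    # hypothetical tool
                "description": "Full-text search over a document corpus",
                "inputSchema": {                          # JSON Schema for arguments
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }
        ]
    },
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",          # client invokes a tool by name
    "params": {"name": "search_docs", "arguments": {"query": "vacation policy"}},
}
```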

The MCP architecture cleanly separates concerns across three roles. The host embeds the LLM and orchestrates conversation flow, passing user queries into the model and handling its outputs. The client implements the MCP protocol itself, managing all message marshalling, authentication, and transport details. The server advertises available resources and tools, executes incoming requests (for example, listing tools or performing a query), and returns structured results. This modular design, encompassing AI and UI in the host, protocol logic in the client, and execution in the server, ensures that systems remain maintainable, extensible, and easy to evolve.

Interaction Model and Agent Workflows  

Using MCP in an agent follows a simple pattern of discovery and execution. When the agent connects to an MCP server, it first calls the ‘list_tools()’ method to retrieve all available tools and resources. The client then integrates these descriptions into the LLM’s context (e.g., by formatting them into the prompt). The model now knows that these tools exist and what parameters they take. When the agent decides to use a tool (often prompted by a user’s query), the LLM emits a structured call (e.g., a JSON object with ‘”call”: “tool_name”, “args”: {…}’). The host recognizes this as a tool invocation, and the client issues a corresponding ‘call_tool()’ request to the server. The server executes the tool and sends back the result. The client then feeds this result into the model’s next prompt, making it appear as additional context.

This workflow replaces brittle ad-hoc parsing. The Agents SDK will call ‘list_tools()’ on MCP servers each time the agent is run, making the LLM aware of the server’s tools. When the LLM calls a tool, the SDK calls the ‘call_tool()’ function on the server behind the scenes. This protocol transparently handles the loop of discover→prompt→tool→respond. Furthermore, MCP supports composable workflows. Servers can define multi-step prompt templates, where the output of one tool serves as the input for another, enabling the agent to execute complex sequences. Newer versions of MCP and related SDKs are already adding features such as long-running sessions, stateful interactions, and scheduled tasks.
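The sketch below captures this discover → prompt → tool → respond loop in schematic Python. The `llm` and `mcp_client` objects are placeholders; `list_tools()` and `call_tool()` mirror the method names mentioned above, but the surrounding glue is illustrative rather than any particular SDK's API.

```python
def run_agent_turn(llm, mcp_client, user_query: str) -> str:
    """Schematic discover -> prompt -> tool -> respond loop. `llm` is assumed
    to return a dict with either a "call"/"args" pair (tool invocation) or a
    final "text" answer, matching the structured-call format described above."""
    tools = mcp_client.list_tools()                       # discovery
    prompt = format_prompt(user_query, tools)             # tool specs enter the context
    while True:
        output = llm.generate(prompt)
        if output.get("call"):                            # model emitted a tool call
            result = mcp_client.call_tool(output["call"], output.get("args", {}))
            prompt += f"\nTool {output['call']} returned: {result}"   # feed result back
        else:
            return output["text"]                         # final answer

def format_prompt(query: str, tools) -> str:
    # Each tool is assumed to carry at least a name and description.
    tool_lines = "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    return f"Available tools:\n{tool_lines}\n\nUser: {query}"
```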

Implementations and Ecosystem  

MCP is implementation-agnostic. The official specification is maintained on GitHub, and multiple language SDKs are available, including TypeScript, Python, Java, Kotlin, and C#. Developers can write MCP clients or servers in their preferred stack. For example, the OpenAI Agents SDK includes classes that enable easy connection to standard MCP servers from Python. InfraCloud’s tutorial demonstrates setting up a Node.js-based file-system MCP server to allow an LLM to browse local files.

A growing number of MCP servers have been published as open source. Anthropic has released connectors for many popular services, including Google Drive, Slack, GitHub, Postgres, MongoDB, and web browsing with Puppeteer, among others. Once one team builds a server for Jira or Salesforce, any compliant agent can use it without rework. On the client/host side, many agent platforms have integrated MCP support. Claude Desktop can attach to MCP servers. Google’s Agent Development Kit treats MCP servers as tool providers for Gemini models. Cloudflare’s Agents SDK added an McpAgent class so that agents built with it can act as MCP clients with built-in auth support. Even auto-agents like Auto-GPT can plug into MCP: instead of coding a specific function for each API, the agent uses an MCP client library to call tools. This trend toward universal connectors promises a more modular autonomous agent architecture.

In practice, this ecosystem enables any given AI assistant to connect to multiple data sources simultaneously. One can imagine an agent that, in one session, uses an MCP server for corporate docs, another for CRM queries, and yet another for on-device file search. MCP even handles naming collisions gracefully: if two servers each have a tool called ‘analyze’, clients can namespace them (e.g., ‘ImageServer.analyze’ vs ‘CodeServer.analyze’) so both remain available without conflict.

Advantages of MCP Over Prior Paradigms  

MCP brings several key benefits that earlier methods lack:

  • Standardized Integration: MCP provides a single protocol for all tools. Whereas each framework or model previously had its way of defining tools, MCP means that the tool servers and clients agree on JSON schemas. This eliminates the need for separate connectors per model or per agent, streamlining development and eliminating the need for custom parsing logic for each tool’s output.
  • Dynamic Tool Discovery: Agents can discover tools at runtime by calling ‘list_tools()’ and dynamically learning about available capabilities. There is no need to restart or reprogram the model when a new tool is added. This flexibility stands in contrast to frameworks where available tools are hardcoded at startup.
  • Interoperability and Reuse: Because MCP is model-agnostic, the same tool server can serve multiple LLM clients. With MCP, an organization can implement a single connector for a service and have it work with any compliant LLM, thereby avoiding vendor lock-in and reducing duplicate engineering efforts.
  • Scalability and Maintenance: MCP dramatically reduces duplicated work. Rather than writing ten different file-search functions for ten models, developers write one MCP file-search server. Updates and bug fixes to that server benefit all agents across all models.
  • Composable Ecosystem: MCP enables a marketplace of independently developed servers. Companies can publish MCP connectors for their software, allowing any AI to integrate with their data. This encourages an open ecosystem of connectors analogous to web APIs.
  • Security and Control: The protocol supports clear authorization flows. MCP servers describe their tools and required scopes, and hosts must obtain user consent before exposing data. This explicit approach improves auditability and security compared to free-form prompting.

Industry Impact and Real-World Applications  

MCP adoption is growing rapidly. Major vendors and frameworks have publicly invested in MCP or related agent standards. Organizations are exploring MCP to integrate internal systems, such as CRM, knowledge bases, and analytics platforms, into AI assistants.

Concrete use cases include:

  • Developer Tools: Code editors and search platforms (e.g., Zed, Replit, Sourcegraph) utilize MCP to enable assistants to query code repositories, documentation, and commit history, resulting in richer code completion and refactoring suggestions.
  • Enterprise Knowledge & Chatbots: Helpdesk bots can access Zendesk or SAP data via MCP servers, answering questions about open tickets or generating reports based on real-time enterprise data, all with built-in authorization and audit trails.
  • Enhanced Retrieval-Augmented Generation: RAG agents can combine embedding-based retrieval with specialized MCP tools for database queries or graph searches, thereby overcoming the limitations of LLMs in terms of factual accuracy and arithmetic.
  • Proactive Assistants: Event-driven agents monitor email or task streams and autonomously schedule meetings or summarize action items by calling calendar and note-taking tools through MCP.

In each scenario, MCP enables agents to scale across diverse systems without requiring the rewriting of integration code, delivering maintainable, secure, and interoperable AI solutions.

Comparisons with Prior Paradigms  

  • Versus ReAct: ReAct-style prompting embeds action instructions directly into free text, requiring developers to parse model outputs and manually handle each action. MCP provides the model with a formal interface using JSON schemas, enabling clients to manage execution seamlessly.
  • Versus Toolformer: Toolformer ties tool knowledge to the model’s training data, necessitating retraining for new tools. MCP externalizes tool interfaces entirely from the model, enabling zero-shot support for any registered tool without retraining.
  • Versus Framework Libraries: Libraries like LangChain simplify building agent loops but still require hardcoded connectors. MCP shifts integration logic into a reusable protocol, making agents more flexible and reducing code duplication.
  • Versus Autonomous Agents: Auto-GPT agents typically bake tool wrappers and loop logic into Python scripts. By using MCP clients, such agents need no bespoke code for new services, instead relying on dynamic discovery and JSON-RPC calls.
  • Versus Function-Calling APIs: While modern LLM APIs offer function-calling capabilities, they remain model-specific and are limited to single turns. MCP generalizes function calling across any client and server, with support for streaming, discovery, and multiplexed services.

MCP thus unifies and extends previous approaches, offering dynamic discovery, standardized schemas, and cross-model interoperability in a single protocol.

Limitations and Challenges  

Despite its promise, MCP is still maturing:

  • Authentication and Authorization: The spec leaves auth schemes to implementations. Current solutions require layering OAuth or API keys externally, which can complicate deployments without a unified auth standard.
  • Multi-step Workflows: MCP focuses on discrete tool calls. Orchestrating long-running, stateful workflows often still relies on external schedulers or prompt chaining, as the protocol lacks a built-in session concept.
  • Discovery at Scale: Managing many MCP server endpoints can be burdensome in large environments. Proposed solutions include well-known URLs, service registries, and a central connector marketplace, but these are not yet standardized.
  • Ecosystem Maturity: MCP is new, so not every tool or data source has an existing connector. Developers may need to build custom servers for niche systems, although the protocol’s simplicity keeps that effort relatively low.
  • Development Overhead: For single, simple tool calls, the MCP setup can feel heavyweight compared to a quick, direct API call. MCP’s benefits accrue most in multi-tool, long-lived production systems rather than short experiments.

Many of these gaps are already being addressed by contributors and vendors, with plans to add standardized auth extensions, session management, and discovery infrastructure.

In conclusion, the Model Context Protocol represents a significant milestone in AI agent design, offering a unified, extensible, and interoperable approach for LLMs to access external tools and data sources. By standardizing discovery, invocation, and messaging, MCP eliminates the need for custom connectors per model or framework, enabling agents to integrate diverse services seamlessly. Early adopters across development tools, enterprise chatbots, and proactive assistants are already reaping the benefits of maintainability, scalability, and security that MCP offers. As MCP evolves, adding richer auth, session support, and registry services, it is poised to become the universal standard for AI connectivity, much like HTTP did for the web. For researchers, developers, and technology leaders alike, MCP opens the door to more powerful, flexible, and future-proof AI solutions.

Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse tasks, with Reinforcement Learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have shown particular success in mathematical reasoning and coding domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulties in ensuring cross-domain generalisation.

Evolution of Reasoning in LLMs

The development of Chain-of-Thought (CoT) methodology marked a significant advancement in LLM reasoning capabilities. CoT has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning processes before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.

While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior research works suggest that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, systematic investigation into how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, impacts RL training effectiveness still represents a significant research gap.

Challenges in Diversifying Reasoning Domains

Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of various sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains lacking deterministic solutions. Domain-specific reasoning processes—whether rule-based and symbolic in mathematics or contextual and heuristic in fields like law and history—require different cognitive approaches. In addition to that, question formats (open-ended versus multiple-choice) demand distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly enhance LLMs’ broad cognitive capabilities.

Nemotron-CrossThink: A Multi-Domain Approach

Researchers from NVIDIA, Carnegie Mellon University, and Boston University introduce Nemotron-CrossThink, representing a systematic framework for incorporating multi-domain corpora into RL training to enhance cross-task generalisation. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs across STEM, humanities, law, and social sciences. By applying templated formats (MCQ/Open-Ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.

Key Results and Innovations

Nemotron-CrossThink significantly enhances LLM reasoning capabilities by integrating multi-domain data with different question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies—generating concise answers for general-purpose questions while providing detailed responses for mathematical problems—thereby optimising inference costs while maintaining task-specific precision.

The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer space diversity. It also provides an efficient filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more challenging samples amplifies RL impact across all domains. These innovations have led to substantial performance gains in both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%).

Comprehensive Data Curation

Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data from CommonCrawl and publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesised QA pairs spanning STEM fields, economics, social sciences, and humanities, while mathematical reasoning incorporates datasets like MATH and Numina-Math alongside synthetically generated problems.

Template Application and Data Filtering

To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer formats: Multiple Choice Questions (MCQ) and Open-Ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based reward functions, discarding MCQs where correct answers aren’t among the choices and open-ended responses exceeding ten words.
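The filtering rule itself is simple enough to state in a few lines. The sketch below assumes each sample is a dict with `format`, `choices`, and `answer` fields; those field names are illustrative, but the two conditions mirror the ones described above.

```python
def keep_sample(sample: dict) -> bool:
    """Rule-based filter described above: drop MCQs whose gold answer is not
    among the listed choices, and open-ended samples whose reference answer
    exceeds ten words. The dict keys are assumptions for illustration."""
    if sample["format"] == "mcq":
        return sample["answer"] in sample["choices"]
    if sample["format"] == "open_ended":
        return len(sample["answer"].split()) <= 10
    return False

# Toy usage: only samples passing the rule survive into the RL training blend.
dataset = [
    {"format": "mcq", "choices": ["Paris", "Rome"], "answer": "Paris"},
    {"format": "open_ended", "answer": "The law of supply and demand"},
]
filtered = [s for s in dataset if keep_sample(s)]
```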

Strategic Data Blending and Reinforcement Learning

Nemotron-CrossThink employs Group Relative Policy Optimisation (GRPO) for reinforcement learning, which improves efficiency by estimating baselines from group scores rather than using a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data usefulness through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.
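The group-relative baseline at the heart of GRPO can be sketched in a few lines: each sampled response is scored against the mean (and standard deviation) of its own group, so no separate critic model is needed. The formulation below is the standard one; any normalization details specific to Nemotron-CrossThink are not shown.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: the baseline is estimated from the group's
    own scores rather than from a learned value function."""
    baseline = group_rewards.mean()
    return (group_rewards - baseline) / (group_rewards.std() + eps)

# Example: 4 sampled responses to one prompt, scored by a rule-based reward.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(grpo_advantages(rewards))   # positive for correct responses, negative otherwise
```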

Technical Contributions

The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:

  1. Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
  2. Strategic data-blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
  3. Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B.

These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional focus on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.

Experiments and Results

Experimental results demonstrate that different datasets significantly impact model performance across reasoning benchmarks. NuminaMath produced the highest overall average, outperforming the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answering data improved performance by approximately 1.0%, showing strong accuracy in MMLU-PRO, AGIEVAL, and MATH-500 tasks, confirming that synthetically generated instruction-style data can effectively generalize when aligned with benchmark distributions.

The Nemotron-CrossThink approach consistently outperformed the base model across various blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding OPEN-REASONER-ZERO by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-PRO, +15.12% on AGIEVAL). Though Bonly_math performed slightly better on strictly mathematical tasks, it lagged on non-mathematical reasoning benchmarks, demonstrating Bgpr↑’s superior versatility through strong cross-domain transfer.

Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data showed transferability to structured reasoning tasks, while general-purpose data proved less effective in isolation. This counterintuitive finding confirms that optimal general-purpose reasoning performance requires including mathematical problems in training blends.

Conclusion

Nemotron-CrossThink introduces a scalable framework that enhances LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research demonstrates that data diversity, not merely volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.


Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

RAG has proven effective in enhancing the factual accuracy of LLMs by grounding their outputs in external, relevant information. However, most existing RAG implementations are limited to text-based corpora, which restricts their applicability to real-world scenarios where queries may require diverse types of information, ranging from textual definitions to spatial understanding from images or temporal reasoning from videos. While some recent approaches have extended RAG to handle different modalities like images and videos, these systems are often constrained to operate within a single modality-specific corpus. This limits their ability to effectively respond to a wide spectrum of user queries that demand multimodal reasoning. Moreover, current RAG methods usually retrieve from all modalities without discerning which is most relevant for a given query, making the process inefficient and less adaptive to specific information needs.

To address this, recent research emphasizes the need for adaptive RAG systems to determine the appropriate modality and retrieval granularity based on the query context. Strategies include routing queries based on complexity, such as deciding between no retrieval, single-step, or multi-step retrieval, and using model confidence to trigger retrieval only when needed. Furthermore, the granularity of retrieval plays a crucial role, as studies have shown that indexing corpora at finer levels, like propositions or specific video clips, can significantly improve retrieval relevance and system performance. Hence, for RAG to truly support complex, real-world information needs, it must handle multiple modalities and adapt its retrieval depth and scope to the specific demands of each query. 

Researchers from KAIST and DeepAuto.ai introduce UniversalRAG, a RAG framework that retrieves and integrates knowledge from various modality-specific sources (text, image, video) and multiple granularity levels. Unlike traditional approaches that embed all modalities into a shared space, leading to modality bias, UniversalRAG uses a modality-aware routing mechanism to select the most relevant corpus dynamically based on the query. It further enhances retrieval precision by organizing each modality into granularity-specific corpora, such as paragraphs or video clips. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms unified and modality-specific baselines, demonstrating its adaptability to diverse query needs. 

UniversalRAG is a retrieval-augmented generation framework that handles queries across various modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels. A routing module first determines the optimal modality and granularity for a given query, choosing among options like paragraphs, full documents, video clips, or full video, and retrieves relevant information accordingly. This router can be either a training-free LLM-based classifier or a trained model using heuristic labels from benchmark datasets. An LVLM then uses the selected content to generate the final response. 
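The sketch below illustrates the training-free variant of this router: an LLM classifies each query into one of the six retrieval options and the corresponding corpus is searched before generation. The prompt wording and the `llm`, `lvlm`, and `retrievers` interfaces are assumptions for illustration, not the authors' implementation.

```python
ROUTES = ["none", "paragraph", "document", "image", "clip", "video"]

def route_query(llm, query: str) -> str:
    """Training-free routing in the spirit of UniversalRAG: an LLM picks the
    modality and granularity most likely to answer the query."""
    prompt = (
        "Decide which source best answers the question. "
        f"Options: {', '.join(ROUTES)}.\n"
        f"Question: {query}\nAnswer with one option only:"
    )
    choice = llm(prompt).strip().lower()
    return choice if choice in ROUTES else "paragraph"   # fall back to text retrieval

def answer(llm, lvlm, retrievers, query: str) -> str:
    route = route_query(llm, query)
    context = None if route == "none" else retrievers[route].search(query)
    return lvlm.generate(query=query, context=context)   # LVLM produces the response
```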

The experimental setup assesses UniversalRAG across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. For no-retrieval, MMLU tests general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA handles multi-hop document retrieval. Image-based queries come from WebQA, and video-related ones are sourced from LVBench and VideoRAG datasets, split into clip- and full-video levels. Corresponding retrieval corpora are curated for each modality—Wikipedia-based for text, WebQA for images, and YouTube videos for video tasks. This comprehensive benchmark ensures robust evaluation across varied modalities and retrieval granularities.


In conclusion, UniversalRAG is a Retrieval-Augmented Generation framework that can retrieve knowledge from multiple modalities and levels of granularity. Unlike existing RAG methods that rely on a single, often text-only, corpus or a single-modality source, UniversalRAG dynamically routes queries to the most appropriate modality- and granularity-specific corpus. This approach addresses issues like modality gaps and rigid retrieval structures. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms both unified and modality-specific baselines. The study also emphasizes the benefits of fine-grained retrieval and highlights how both trained and train-free routing mechanisms contribute to robust, flexible multimodal reasoning. 


Google Researchers Advance Diagnostic AI: AMIE Now Matches or Outperforms Primary Care Physicians Using Multimodal Reasoning with Gemini 2.0 Flash

LLMs have shown impressive promise in conducting diagnostic conversations, particularly through text-based interactions. However, their evaluation and application have largely ignored the multimodal nature of real-world clinical settings, especially in remote care delivery, where images, lab reports, and other medical data are routinely shared through messaging platforms. While systems like the Articulate Medical Intelligence Explorer (AMIE) have matched or surpassed primary care physicians in text-only consultations, this format falls short of reflecting telemedicine environments. Multimodal communication is essential in modern care, as patients often share photographs, documents, and other visual artifacts that cannot be fully conveyed through text alone. Limiting AI systems to textual inputs risks omitting critical clinical information, increasing diagnostic errors, and creating accessibility barriers for patients with lower health or digital literacy. Despite the widespread use of multimedia messaging apps in global healthcare, there has been little research into how LLMs can reason over such diverse data during diagnostic interactions.

Research in diagnostic conversational agents began with rule-based systems like MYCIN, but recent developments have focused on LLMs capable of emulating clinical reasoning. While multimodal AI systems, such as vision-language models, have demonstrated success in radiology and dermatology, integrating these capabilities into conversational diagnostics remains challenging. Effective AI-based diagnostic tools must handle the complexity of multimodal reasoning and uncertainty-driven information gathering, a step beyond merely answering isolated questions. Evaluation frameworks like OSCEs and platforms such as AgentClinic provide useful starting points, yet tailored metrics are still needed to assess performance in multimodal diagnostic contexts. Moreover, while messaging apps are increasingly used in low-resource settings for sharing clinical data, concerns about data privacy, integration with formal health systems, and policy compliance persist. 

Google DeepMind and Google Research have enhanced AMIE with multimodal capabilities for improved conversational diagnosis and management. Using Gemini 2.0 Flash, AMIE employs a state-aware dialogue framework that adapts the conversation flow based on the patient state and diagnostic uncertainty, allowing strategic, structured history-taking with multimodal inputs such as skin images, ECGs, and documents. In a randomized OSCE-style study with 105 scenarios and 25 patient actors, AMIE outperformed or matched primary care physicians on 29 of 32 clinical metrics and 7 of 9 multimodal-specific criteria, demonstrating strong diagnostic accuracy, reasoning, communication, and empathy.

The study enhances the AMIE diagnostic system by incorporating multimodal perception and a state-aware dialogue framework that guides conversations through phases of history taking, diagnosis, and follow-up. Gemini 2.0 Flash powers the system and dynamically adapts based on evolving patient data, including text, images, and clinical documents. A structured patient profile and differential diagnosis are updated throughout the interaction, with targeted questions and multimodal data requests guiding clinical reasoning. Evaluation includes automated perception tests on isolated artifacts, simulated dialogues rated by auto-evaluators, and expert OSCE-style assessments, ensuring robust diagnostic performance and clinical realism. 
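
As a rough, non-authoritative illustration of how such a state-aware loop might be organized, the sketch below cycles through history taking, diagnosis, and follow-up while a structured patient state is updated after every turn. The helper names, phase labels, and the llm/patient interfaces are assumptions made for this sketch; the paper does not release this code.

# Hypothetical sketch of a state-aware diagnostic dialogue loop (not Google's implementation)
from dataclasses import dataclass, field
from typing import List

PHASES = ["history_taking", "diagnosis", "follow_up"]

@dataclass
class PatientState:
    findings: List[str] = field(default_factory=list)      # free-text findings gathered so far
    differential: List[str] = field(default_factory=list)  # ranked candidate diagnoses
    phase: str = "history_taking"

def run_consultation(llm, patient, max_turns: int = 20) -> PatientState:
    state = PatientState()
    for _ in range(max_turns):
        # The model proposes the next step given the current state: ask a question,
        # request an artifact (e.g., a skin photo or an ECG), or advance the phase.
        action = llm(f"Phase: {state.phase}. Known findings: {state.findings}. "
                     f"Differential: {state.differential}. What should the doctor do next?")
        reply = patient.respond(action)                     # patient text and/or an uploaded artifact
        state.findings.append(reply)
        # Re-rank the differential diagnosis after every new piece of evidence.
        state.differential = llm(f"Rank likely diagnoses given: {state.findings}").splitlines()
        if state.phase != "follow_up" and "ADVANCE_PHASE" in action:
            state.phase = PHASES[PHASES.index(state.phase) + 1]
    return state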

The results show that the multimodal AMIE system performs on par with or better than primary care physicians (PCPs) across multiple clinical tasks in simulated text-chat consultations. In OSCE-style assessments, AMIE consistently outperformed PCPs in diagnostic accuracy, especially when interpreting multimodal data such as images and clinical documents. It also demonstrated greater robustness when image quality was poor and showed fewer hallucinations. Patient actors rated AMIE's communication skills highly, including empathy and trust. Automated evaluations confirmed that AMIE's advanced reasoning framework, built on the Gemini 2.0 Flash model, significantly improved diagnosis and conversation quality, validating its design and effectiveness in real-world clinical scenarios.

In conclusion, the study advances conversational diagnostic AI by enhancing AMIE to integrate multimodal reasoning within patient dialogues. Using a novel state-aware inference-time strategy with Gemini 2.0 Flash, AMIE can interpret and reason about medical artifacts like images or ECGs in real-time clinical conversations. Evaluated through a multimodal OSCE framework, AMIE outperformed or matched primary care physicians in diagnostic accuracy, empathy, and artifact interpretation, even in complex cases. Despite limitations tied to chat-based interfaces and the need for real-world testing, these findings highlight AMIE’s potential as a robust, context-aware diagnostic assistant for future telehealth applications. 


Check out the Paper and Technical details.


The post Google Researchers Advance Diagnostic AI: AMIE Now Matches or Outperforms Primary Care Physicians Using Multimodal Reasoning with Gemini 2.0 Flash appeared first on MarkTechPost.

Meta AI Releases Llama Prompt Ops: A Python Toolkit for Prompt Optimization on Llama Models https://www.marktechpost.com/2025/05/03/meta-ai-releases-llama-prompt-ops-a-python-toolkit-for-prompt-optimization-on-llama-models/ Sun, 04 May 2025 04:20:04 +0000

Meta AI has released Llama Prompt Ops, a Python package designed to streamline the process of adapting prompts for Llama models. This open-source tool is built to help developers and researchers improve prompt effectiveness by transforming inputs that work well with other large language models (LLMs) into forms that are better optimized for Llama. As the Llama ecosystem continues to grow, Llama Prompt Ops addresses a critical gap: enabling smoother and more efficient cross-model prompt migration while enhancing performance and reliability.

Why Prompt Optimization Matters

Prompt engineering plays a crucial role in the effectiveness of any LLM interaction. However, prompts that perform well on one model—such as GPT, Claude, or PaLM—may not yield similar results on another. This discrepancy is due to architectural and training differences across models. Without tailored optimization, prompt outputs can be inconsistent, incomplete, or misaligned with user expectations.

Llama Prompt Ops solves this challenge by introducing automated and structured prompt transformations. The package makes it easier to fine-tune prompts for Llama models, helping developers unlock their full potential without relying on trial-and-error tuning or domain-specific knowledge.

What Is Llama Prompt Ops?

At its core, Llama Prompt Ops is a library for systematic prompt transformation. It applies a set of heuristics and rewriting techniques to existing prompts, optimizing them for better compatibility with Llama-based LLMs. The transformations consider how different models interpret prompt elements such as system messages, task instructions, and conversation history.

This tool is particularly useful for:

  • Migrating prompts from proprietary or incompatible models to open Llama models.
  • Benchmarking prompt performance across different LLM families.
  • Fine-tuning prompt formatting for improved output consistency and relevance.

Features and Design

Llama Prompt Ops is built with flexibility and usability in mind. Its key features include:

  • Prompt Transformation Pipeline: The core functionality is organized into a transformation pipeline. Users can specify the source model (e.g., gpt-3.5-turbo) and target model (e.g., llama-3) to generate an optimized version of a prompt. These transformations are model-aware and encode best practices that have been observed in community benchmarks and internal evaluations.
  • Support for Multiple Source Models: While optimized for Llama as the output model, Llama Prompt Ops supports inputs from a wide range of common LLMs, including OpenAI’s GPT series, Google’s Gemini (formerly Bard), and Anthropic’s Claude.
  • Test Coverage and Reliability: The repository includes a suite of prompt transformation tests that verify transformations are robust and reproducible, giving developers confidence when integrating the tool into their workflows.
  • Documentation and Examples: Clear documentation accompanies the package, making it easy for developers to understand how to apply transformations and extend the functionality as needed.

How It Works

The tool applies modular transformations to the prompt’s structure. Each transformation rewrites parts of the prompt, such as:

  • Replacing or removing proprietary system message formats.
  • Reformatting task instructions to suit Llama’s conversational logic.
  • Adapting multi-turn histories into formats more natural for Llama models.

The modular nature of these transformations allows users to understand what changes are made and why, making it easier to iterate and debug prompt modifications.
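
To illustrate the kind of modular pipeline described above, here is a minimal conceptual sketch. It is not the Llama Prompt Ops API: the class name, the rule functions, and the prompt formats are invented purely for illustration of how composable, inspectable transformations can be chained.

# Conceptual sketch of a modular prompt-transformation pipeline
# (illustrative only, not the actual Llama Prompt Ops API)
from typing import Callable, List

Transformation = Callable[[str], str]

def strip_proprietary_system_format(prompt: str) -> str:
    # Example rule: drop a vendor-specific system-message wrapper.
    return prompt.replace("<<SYS>>", "").replace("<</SYS>>", "").strip()

def reformat_instructions(prompt: str) -> str:
    # Example rule: make the task instruction explicit and put it first.
    return f"Instruction:\n{prompt}\n\nRespond concisely."

class PromptPipeline:
    def __init__(self, transformations: List[Transformation]):
        self.transformations = transformations

    def run(self, prompt: str) -> str:
        for transform in self.transformations:
            prompt = transform(prompt)   # each step rewrites one aspect of the prompt
        return prompt

pipeline = PromptPipeline([strip_proprietary_system_format, reformat_instructions])
print(pipeline.run("<<SYS>>You are helpful.<</SYS>> Summarize the attached report."))

Because each rule is a plain function, individual transformations can be inspected, reordered, or removed, which mirrors the debuggability benefit described above.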

Conclusion

As large language models continue to evolve, the need for prompt interoperability and optimization grows. Meta’s Llama Prompt Ops offers a practical, lightweight, and effective solution for improving prompt performance on Llama models. By bridging the formatting gap between Llama and other LLMs, it simplifies adoption for developers while promoting consistency and best practices in prompt engineering.


Check out the GitHub Page.


The post Meta AI Releases Llama Prompt Ops: A Python Toolkit for Prompt Optimization on Llama Models appeared first on MarkTechPost.

IBM AI Releases Granite 4.0 Tiny Preview: A Compact Open-Language Model Optimized for Long-Context and Instruction Tasks https://www.marktechpost.com/2025/05/03/ibm-ai-releases-granite-4-0-tiny-preview-a-compact-open-language-model-optimized-for-long-context-and-instruction-tasks/ Sun, 04 May 2025 01:36:20 +0000

IBM has introduced a preview of Granite 4.0 Tiny, the smallest member of its upcoming Granite 4.0 family of language models. Released under the Apache 2.0 license, this compact model is designed for long-context tasks and instruction-following scenarios, striking a balance between efficiency, transparency, and performance. The release reflects IBM’s continued focus on delivering open, auditable, and enterprise-ready foundation models.

Granite 4.0 Tiny Preview includes two key variants: the Base-Preview, which showcases a novel decoder-only architecture, and the Tiny-Preview (Instruct), which is fine-tuned for dialog and multilingual applications. Despite its reduced parameter footprint, Granite 4.0 Tiny demonstrates competitive results on reasoning and generation benchmarks—underscoring the benefits of its hybrid design.

Architecture Overview: A Hybrid MoE with Mamba-2-Style Dynamics

At the core of Granite 4.0 Tiny lies a hybrid Mixture-of-Experts (MoE) structure, with 7 billion total parameters and only 1 billion active parameters per forward pass. This sparsity allows the model to deliver scalable performance while significantly reducing computational overhead—making it well-suited for resource-constrained environments and edge inference.
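
As a generic illustration of why only a fraction of parameters is active per token in a sparse MoE layer, consider the minimal top-k routing sketch below. It is not Granite's architecture; the expert count, hidden sizes, and k are arbitrary choices for the example.

# Generic top-k Mixture-of-Experts routing sketch (illustrative; not Granite's actual design)
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                       # x: [tokens, d_model]
        scores = self.router(x)                 # [tokens, n_experts]
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Only k experts run per token, so most expert parameters stay inactive.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(4, 512))   # 4 tokens through the sparse layer
print(y.shape)                        # torch.Size([4, 512])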

The Base-Preview variant employs a decoder-only architecture augmented with Mamba-2-style layers—a linear recurrent alternative to traditional attention mechanisms. This architectural shift enables the model to scale more efficiently with input length, enhancing its suitability for long-context tasks such as document understanding, dialogue summarization, and knowledge-intensive QA.

Another notable design decision is the use of NoPE (No Positional Encodings). Instead of fixed or learned positional embeddings, the model integrates position handling directly into its layer dynamics. This approach improves generalization across varying input lengths and helps maintain consistency in long-sequence generation.

Benchmark Performance: Efficiency Without Compromise

Despite being a preview release, Granite 4.0 Tiny already exhibits meaningful performance gains over prior models in IBM’s Granite series. On benchmark evaluations, the Base-Preview demonstrates:

  • +5.6 improvement on DROP (Discrete Reasoning Over Paragraphs), a benchmark for multi-hop QA
  • +3.8 on AGIEval, which assesses general language understanding and reasoning

These improvements are attributed to both the model’s architecture and its extensive pretraining—reportedly on 2.5 trillion tokens, spanning diverse domains and linguistic structures.

Instruction-Tuned Variant: Designed for Dialogue, Clarity, and Multilingual Reach

The Granite-4.0-Tiny-Preview (Instruct) variant extends the base model through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), using a Tülu-style dataset consisting of both open and synthetic dialogues. This variant is tailored for instruction-following and interactive use cases.

Supporting 8,192 token input windows and 8,192 token generation lengths, the model maintains coherence and fidelity across extended interactions. Unlike encoder–decoder hybrids that often trade off interpretability for performance, the decoder-only setup here yields clearer and more traceable outputs—a valuable feature for enterprise and safety-critical applications.

Evaluation Scores:

  • 86.1 on IFEval, indicating strong performance in instruction-following benchmarks
  • 70.05 on GSM8K, for grade-school math problem solving
  • 82.41 on HumanEval, measuring Python code generation accuracy

Moreover, the instruct model supports multilingual interaction across 12 languages, making it viable for global deployments in customer service, enterprise automation, and educational tools.

Open-Source Availability and Ecosystem Integration

IBM has made both models publicly available on Hugging Face. They ship with full model weights, configuration files, and sample usage scripts under the Apache 2.0 license, encouraging transparent experimentation, fine-tuning, and integration across downstream NLP workflows.
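
A typical way to try such a checkpoint with the Hugging Face transformers library is sketched below. The model identifier is a placeholder assumption; substitute the repository name from the official Granite 4.0 Tiny model card.

# Minimal sketch for loading an instruction-tuned checkpoint with transformers.
# The model_id below is a placeholder; use the exact name published on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-tiny-preview"   # placeholder, check the official model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Summarize the benefits of sparse Mixture-of-Experts models."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))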

Outlook: Laying the Groundwork for Granite 4.0

Granite 4.0 Tiny Preview serves as an early glimpse into IBM’s broader strategy for its next-generation language model suite. By combining efficient MoE architectures, long-context support, and instruction-focused tuning, the model family aims to deliver state-of-the-art capabilities in a controllable and resource-efficient package.

As more variants of Granite 4.0 are released, we can expect IBM to deepen its investment in responsible, open AI—positioning itself as a key player in shaping the future of transparent, high-performance language models for enterprise and research.


Check out the Technical details, Granite 4.0 Tiny Base Preview and Granite 4.0 Tiny Instruct Preview.


The post IBM AI Releases Granite 4.0 Tiny Preview: A Compact Open-Language Model Optimized for Long-Context and Instruction Tasks appeared first on MarkTechPost.

Vision Foundation Models: Implementation and Business Applications https://www.marktechpost.com/2025/05/03/vision-foundation-models-implementation-and-business-applications/ Sat, 03 May 2025 19:59:58 +0000

In this tutorial, we’ll explore implementing various vision foundation models for business applications. We’ll focus on practical code implementation, technical details, and business use cases rather than theoretical aspects.

Setup and Environment Configuration

First, let’s set up our environment and install the necessary libraries:

!pip install torch torchvision transformers timm pillow matplotlib opencv-python tensorflow-hub tensorflow
!pip install huggingface_hub sentence-transformers ftfy regex tqdm
!pip install accelerate

# Verify CUDA availability for GPU acceleration

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
   print(f"CUDA device: {torch.cuda.get_device_name(0)}")

1. CLIP: Contrastive Language-Image Pre-training

CLIP by OpenAI excels at connecting images with natural language, making it powerful for zero-shot image classification and retrieval tasks.

Business Applications:

  • Product image search and recommendation
  • Content moderation
  • Visual brand monitoring
  • Cross-modal retrieval systems
import torch
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
import matplotlib.pyplot as plt
import numpy as np


# Load model and processor
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)


# Function to get image embeddings
def get_clip_image_embedding(image_path):
   image = Image.open(image_path) if isinstance(image_path, str) else image_path
   inputs = processor(images=image, return_tensors="pt")
   with torch.no_grad():
       image_features = model.get_image_features(**inputs)
   return image_features


# Function to perform zero-shot classification
def classify_image_with_clip(image_path, categories):
   image = Image.open(image_path) if isinstance(image_path, str) else image_path
   inputs = processor(
       text=categories,
       images=image,
       return_tensors="pt",
       padding=True
   )


   with torch.no_grad():
       outputs = model(**inputs)
       logits_per_image = outputs.logits_per_image
       probs = logits_per_image.softmax(dim=1)


   # Return dict of categories and probabilities
   return {categories[i]: probs[0][i].item() for i in range(len(categories))}


# Example: Product categorization
url = "https://images.unsplash.com/photo-1542291026-7eec264c27ff?q=80&w=1470&auto=format&fit=crop"
image = Image.open(requests.get(url, stream=True).raw)


product_categories = [
   "sneakers", "formal shoes", "sandals", "boots",
   "sports equipment", "casual wear", "luxury item"
]


results = classify_image_with_clip(image, product_categories)


# Sort results by probability
sorted_results = dict(sorted(results.items(), key=lambda x: x[1], reverse=True))


# Display the image and classification results
plt.figure(figsize=(12, 6))


# Plot the image on the left
plt.subplot(1, 2, 1)
plt.imshow(np.array(image))
plt.title("Input Image")
plt.axis("off")


# Plot the classification results on the right
plt.subplot(1, 2, 2)
categories = list(sorted_results.keys())
scores = list(sorted_results.values())


y_pos = np.arange(len(categories))
plt.barh(y_pos, scores, align="center")
plt.yticks(y_pos, categories)
plt.xlabel("Probability")
plt.title("CLIP Classification Results")


plt.tight_layout()
plt.show()


# Also print results to console
print("Classification Results:")
for category, score in sorted_results.items():
   print(f"{category}: {score:.4f}")
Output
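
Building on the CLIP model and processor loaded above, the following sketch shows how the product-search use case could work in practice: embed a small catalog of images once, embed a text query, and rank by cosine similarity. The catalog URLs and the query string are illustrative placeholders.

# Text-to-image product retrieval with the CLIP model loaded above (illustrative catalog URLs)
def get_clip_text_embedding(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)

catalog_urls = [
    "https://images.unsplash.com/photo-1542291026-7eec264c27ff?w=500",   # placeholder product photos
    "https://images.unsplash.com/photo-1560769629-975ec94e6a86?w=500",
]
catalog_images = [Image.open(requests.get(u, stream=True).raw) for u in catalog_urls]
image_embs = torch.cat([get_clip_image_embedding(img) for img in catalog_images])
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

query = "red running shoes"
text_emb = get_clip_text_embedding([query])
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the query text and every catalog image
scores = (text_emb @ image_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match for '{query}': image {best} (score {scores[best]:.3f})")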

2. DINO v2: Self-supervised Vision Transformer

DINO v2 by Meta AI Research provides powerful visual features without requiring labeled data, making it excellent for various downstream tasks.

Business Applications:

  • Visual similarity search
  • Anomaly detection
  • Product clustering
  • Image feature extraction for downstream ML tasks
import torch
import torchvision.transforms as T
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from torch.nn import functional as F
import requests
from io import BytesIO


# Load DINOv2 model
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vits14.eval()


# Preprocess images for DINOv2
transform = T.Compose([
   T.Resize(256),
   T.CenterCrop(224),
   T.ToTensor(),
   T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


# Function to extract features
def extract_dinov2_features(image_path):
   image = Image.open(image_path).convert('RGB') if isinstance(image_path, str) else image_path
   img_tensor = transform(image).unsqueeze(0)


   with torch.no_grad():
       features = dinov2_vits14(img_tensor)


   return features


# Function to compute similarity between images
def compute_similarity(img1_path, img2_path):
   feat1 = extract_dinov2_features(img1_path)
   feat2 = extract_dinov2_features(img2_path)


   # Normalize features
   feat1 = F.normalize(feat1, dim=1)
   feat2 = F.normalize(feat2, dim=1)


   # Compute cosine similarity
   similarity = torch.mm(feat1, feat2.transpose(0, 1)).item()
   return similarity


# Function to download image from URL
def download_image(url):
   response = requests.get(url, stream=True)
   return Image.open(BytesIO(response.content)).convert('RGB')


# Function to visualize image pair with similarity score
def visualize_similarity(img1_path, img2_path, title=None):
   # Load images
   if img1_path.startswith(('http://', 'https://')):
       img1 = download_image(img1_path)
   else:
       img1 = Image.open(img1_path).convert('RGB')


   if img2_path.startswith(('http://', 'https://')):
       img2 = download_image(img2_path)
   else:
       img2 = Image.open(img2_path).convert('RGB')


   # Compute similarity
   similarity = compute_similarity(img1, img2)


   # Create figure for visualization
   fig, axes = plt.subplots(1, 2, figsize=(12, 6))


   # Display images
   axes[0].imshow(np.array(img1))
   axes[0].set_title("Image 1")
   axes[0].axis("off")


   axes[1].imshow(np.array(img2))
   axes[1].set_title("Image 2")
   axes[1].axis("off")


   # Add similarity score as figure title
   fig_title = f"Similarity Score: {similarity:.4f}"
   if title:
       fig_title = f"{title}n{fig_title}"
   fig.suptitle(fig_title, fontsize=16)


   plt.tight_layout()
   plt.show()


   return similarity


# Example: Use direct URLs instead of downloading files first
# Sample sneaker images from Unsplash
url1 = "https://images.unsplash.com/photo-1560769629-975ec94e6a86?w=500"  # Red sneaker
url2 = "https://images.unsplash.com/photo-1600185365926-3a2ce3cdb9eb?w=500"  # White sneaker
url3 = "https://images.unsplash.com/photo-1491553895911-0055eca6402d?w=500"  # Another sneaker


# Visualize pairs with similarity scores
print("Comparing Product 1 and Product 2:")
similarity_1_2 = visualize_similarity(url1, url2, "Red Sneaker vs White Sneaker")


print("nComparing Product 1 and Product 3:")
similarity_1_3 = visualize_similarity(url1, url3, "Red Sneaker vs Another Sneaker")


print("nComparing Product 2 and Product 3:")
similarity_2_3 = visualize_similarity(url2, url3, "White Sneaker vs Another Sneaker")


# Print summary of all similarities
print("nSummary of Similarity Scores:")
print(f"Similarity between product 1 and 2: {similarity_1_2:.4f}")
print(f"Similarity between product 1 and 3: {similarity_1_3:.4f}")
print(f"Similarity between product 2 and 3: {similarity_2_3:.4f}")
Output
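
The same extract_dinov2_features helper can back a simple visual similarity search, the first business application listed above: embed a small catalog once, then rank items against a query image by cosine similarity. This is a minimal in-memory sketch reusing the sneaker URLs defined earlier; a production system would typically persist the embeddings in a vector index.

# Minimal visual similarity search on top of extract_dinov2_features (in-memory sketch)
def build_feature_index(urls):
    feats = [F.normalize(extract_dinov2_features(download_image(u)), dim=1) for u in urls]
    return torch.cat(feats)                     # [num_items, feature_dim]

def search_similar(query_url, urls, index, top_k=2):
    q = F.normalize(extract_dinov2_features(download_image(query_url)), dim=1)
    sims = (q @ index.T).squeeze(0)             # cosine similarity to every catalog item
    best = torch.topk(sims, k=min(top_k, len(urls)))
    return [(urls[i], sims[i].item()) for i in best.indices]

catalog = [url1, url2, url3]                    # reuse the sneaker URLs defined above
index = build_feature_index(catalog)
for url, score in search_similar(url1, catalog, index, top_k=3):
    print(f"{score:.4f}  {url}")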

3. Segment Anything Model (SAM): Advanced Image Segmentation

SAM by Meta AI provides powerful zero-shot segmentation capabilities for various business applications.

Business Applications:

  • Automated image cataloging
  • Precise product measurement in retail
  • Medical image analysis
  • Agricultural crop monitoring
  • Content creation and editing

# Install required libraries for SAM
!pip install git+https://github.com/facebookresearch/segment-anything.git


import torch
import numpy as np
import matplotlib.pyplot as plt
from segment_anything import sam_model_registry, SamPredictor
import cv2
from PIL import Image
import requests


# Download SAM checkpoint
!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth


# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
device = "cuda" if torch.cuda.is_available() else "cpu"
sam.to(device)
predictor = SamPredictor(sam)


# Function to perform automatic segmentation
def segment_image(image_path):
   # Load image
   image = cv2.imread(image_path)
   image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


   # Set image for SAM
   predictor.set_image(image_rgb)


   # Generate automatic masks
   masks, scores, logits = predictor.predict(
       point_coords=None,
       point_labels=None,
       multimask_output=True,
       box=None
   )


   return image_rgb, masks, scores


# Function to visualize segmentation results
def visualize_segmentation(image, masks, scores, limit=5):
   plt.figure(figsize=(15, 10))


   # Display original image
   plt.subplot(1, limit+1, 1)
   plt.imshow(image)
   plt.title("Original Image")
   plt.axis('off')


   # Display top masks
   top_indices = np.argsort(scores)[-limit:][::-1]
   for i, idx in enumerate(top_indices):
       plt.subplot(1, limit+1, i+2)
       plt.imshow(image)
       plt.imshow(masks[idx], alpha=0.7, cmap='jet')
       plt.title(f"Mask {i+1}nScore: {scores[idx]:.3f}")
       plt.axis('off')


   plt.tight_layout()
   plt.show()


# Example: Product segmentation for e-commerce
!wget -q -O product_image.jpg "https://images.unsplash.com/photo-1525966222134-fcfa99b8ae77?w=800"


image_rgb, masks, scores = segment_image("product_image.jpg")
visualize_segmentation(image_rgb, masks, scores)


# Business application: Calculate precise product measurements
def calculate_object_dimensions(mask):
   # Find contours in the mask
   contours, _ = cv2.findContours((mask * 255).astype(np.uint8),
                                  cv2.RETR_EXTERNAL,
                                  cv2.CHAIN_APPROX_SIMPLE)


   if not contours:
       return None


   # Get the largest contour
   largest_contour = max(contours, key=cv2.contourArea)


   # Get bounding rectangle
   x, y, w, h = cv2.boundingRect(largest_contour)


   # Calculate aspect ratio
   aspect_ratio = w / h


   # Calculate area in pixels
   area_pixels = cv2.contourArea(largest_contour)


   return {
       'width': w,
       'height': h,
       'aspect_ratio': aspect_ratio,
       'area_pixels': area_pixels
   }


# Apply to the highest scoring mask
best_mask_idx = np.argmax(scores)
dimensions = calculate_object_dimensions(masks[best_mask_idx])


print("Product Dimensions:")
print(f"Width: {dimensions['width']} pixels")
print(f"Height: {dimensions['height']} pixels")
print(f"Aspect Ratio: {dimensions['aspect_ratio']:.2f}")
print(f"Area: {dimensions['area_pixels']} square pixels")
Output
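
For the automated image cataloging use case, the segment-anything package also provides SamAutomaticMaskGenerator, which proposes masks over the whole image without any prompts. The sketch below reuses the sam model loaded above; sorting by mask area is simply one reasonable convention for picking the most prominent regions.

# Prompt-free mask proposals with SamAutomaticMaskGenerator, reusing the sam model loaded above
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(sam)
image_bgr = cv2.imread("product_image.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

auto_masks = mask_generator.generate(image_rgb)          # list of dicts, one per proposed region
auto_masks = sorted(auto_masks, key=lambda m: m["area"], reverse=True)

print(f"Found {len(auto_masks)} regions")
for m in auto_masks[:5]:
    x, y, w, h = m["bbox"]
    print(f"area={m['area']:>8}  bbox=({x}, {y}, {w}, {h})  predicted_iou={m['predicted_iou']:.3f}")

# Overlay the largest regions for a quick visual catalog check
plt.figure(figsize=(8, 6))
plt.imshow(image_rgb)
for m in auto_masks[:5]:
    plt.imshow(m["segmentation"], alpha=0.35, cmap="jet")
plt.axis("off")
plt.title("Top automatic mask proposals")
plt.show()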

4. BLIP-2: Vision-Language Model for Business Intelligence

BLIP-2 provides advanced vision-language capabilities for multimodal business applications.

Business Applications:

  • Automated product description generation
  • Image-based customer service automation
  • Visual content analysis for marketing
  • Social media content understanding
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
import numpy as np
from io import BytesIO


# Load BLIP-2 model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)


if torch.cuda.is_available():
   model = model.to("cuda")


# Function to download image from URL
def download_image(url):
   response = requests.get(url, stream=True)
   return Image.open(BytesIO(response.content)).convert('RGB')


# Function for image captioning
def generate_caption(image_path):
   # Load image from path or URL
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   inputs = processor(images=image, return_tensors="pt")


   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}


   generated_ids = model.generate(**inputs, max_new_tokens=50)
   generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()


   return generated_text


# Function for visual question answering
def visual_qa(image_path, question):
   # Load image from path or URL
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # FIX: Properly format the question for the model
   # BLIP-2 needs a specific prompt format for QA
   prompt = f"Question: {question} Answer:"
   inputs = processor(images=image, text=prompt, return_tensors="pt")


   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}


   generated_ids = model.generate(
       **inputs,
       max_new_tokens=30,
       do_sample=False  # Use greedy decoding for more precise answers
   )


   answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
   # Remove the prompt part from the answer
   answer = answer.replace(prompt, "").strip()


   return answer


# Function to visualize image with caption and QA
def visualize_product_analysis(image_path, questions=None):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Generate caption
   caption = generate_caption(image)


   # Default questions if none provided
   if questions is None:
       questions = [
           "What color is this product?",
           "What material is this product made of?",
           "What is the target demographic for this product?",
           "What is a key feature of this product?"
       ]


   # Get answers
   answers = []
   for question in questions:
       answer = visual_qa(image, question)
       answers.append((question, answer))


   # Create visualization
   plt.figure(figsize=(12, 10))


   # Display image
   plt.subplot(2, 1, 1)
   plt.imshow(np.array(image))
   plt.title("Product Image", fontsize=14)
   plt.axis('off')


   # Display caption and Q&A
   plt.subplot(2, 1, 2)
   plt.axis('off')


   text_content = f"Generated Description: {caption}nn"
   text_content += "Product Analysis:n"
   for q, a in answers:
       text_content += f"Q: {q}nA: {a}nn"


   plt.text(0.01, 0.99, text_content, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top', wrap=True)


   plt.tight_layout()
   plt.show()


   return caption, answers


# Business application: Automated product listing
def create_product_listing(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Get basic caption
   caption = generate_caption(image)


   # Extract product attributes with more specific prompting
   color = visual_qa(image, "What colors are visible in this product?")
   material = visual_qa(image, "What material does this product appear to be made of?")
   use_case = visual_qa(image, "What would be the main use case for this product?")
   unique_features = visual_qa(image, "What are any unique or notable features of this product?")


   # Create structured listing
   listing = {
       "title": caption,
       "attributes": {
           "color": color,
           "material": material,
           "primary_use": use_case,
           "unique_features": unique_features
       }
   }


   # Visualize the listing
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Product Image", fontsize=14)
   plt.axis('off')


   # Display listing details
   plt.subplot(1, 2, 2)
   plt.axis('off')


   listing_text = f"PRODUCT LISTINGnn"
   listing_text += f"Title: {listing['title']}nn"
   listing_text += "Product Attributes:n"
   for attr, value in listing['attributes'].items():
       listing_text += f"{attr.replace('_', ' ').title()}: {value}n"


   plt.text(0.01, 0.99, listing_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return listing


# Function for marketing content analysis
def analyze_marketing_content(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Marketing-specific questions
   marketing_questions = [
       "What emotions does this image evoke?",
       "What brand values are communicated in this image?",
       "What target audience would this image appeal to?",
       "What call to action would pair well with this image?",
       "What marketing channel would this image be most effective on?"
   ]


   # Get answers
   marketing_insights = {}
   for question in marketing_questions:
       answer = visual_qa(image, question)
       key = question.split("?")[0].strip().lower().replace(" ", "_")
       marketing_insights[key] = answer


   # Visualize the analysis
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Marketing Visual", fontsize=14)
   plt.axis('off')


   # Display marketing insights
   plt.subplot(1, 2, 2)
   plt.axis('off')


   insights_text = "MARKETING CONTENT ANALYSISnn"
   for question, key in zip(marketing_questions, marketing_insights.keys()):
       insights_text += f"{question}n{marketing_insights[key]}nn"


   plt.text(0.01, 0.99, insights_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return marketing_insights


# Function for social media understanding
def analyze_social_media_content(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Generate caption
   caption = generate_caption(image)


   # Social media specific analysis
   engagement_potential = visual_qa(image, "How likely is this image to engage viewers on social media?")
   suggested_hashtags = visual_qa(image, "What hashtags would be appropriate for this image on social media?")
   platform_fit = visual_qa(image, "Which social media platform would this image perform best on?")
   content_type = visual_qa(image, "What type of social media post would this image be suitable for?")


   # Create analysis dict
   social_analysis = {
       "caption": caption,
       "engagement_potential": engagement_potential,
       "suggested_hashtags": suggested_hashtags,
       "platform_fit": platform_fit,
       "content_type": content_type
   }


   # Visualize the analysis
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Social Media Content", fontsize=14)
   plt.axis('off')


   # Display social media insights
   plt.subplot(1, 2, 2)
   plt.axis('off')


   insights_text = "SOCIAL MEDIA CONTENT ANALYSISnn"
   insights_text += f"Caption: {social_analysis['caption']}nn"
   insights_text += f"Engagement Potential: {social_analysis['engagement_potential']}nn"
   insights_text += f"Suggested Hashtags: {social_analysis['suggested_hashtags']}nn"
   insights_text += f"Best Platform: {social_analysis['platform_fit']}nn"
   insights_text += f"Content Type: {social_analysis['content_type']}n"


   plt.text(0.01, 0.99, insights_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return social_analysis


# Example usage
if __name__ == "__main__":
   # Example: E-commerce product analysis
   product_url = "https://images.unsplash.com/photo-1598033129183-c4f50c736f10?w=800"


   print("1. Basic Product Analysis")
   caption, qa_results = visualize_product_analysis(product_url)


   print("n2. Creating Automated Product Listing")
   product_listing = create_product_listing(product_url)


   print("n3. Marketing Content Analysis")
   marketing_url = "https://images.unsplash.com/photo-1581252584837-9f0b1d3bf82c?ixlib=rb-4.0.3&q=80"
   marketing_insights = analyze_marketing_content(marketing_url)


   print("n4. Social Media Content Analysis")
   social_url = "https://images.unsplash.com/photo-1534442072653-dbbf80c5e1ae?ixlib=rb-4.0.3&q=80"
   social_analysis = analyze_social_media_content(social_url)
Output 1
Output 2

Conclusion

This tutorial provides hands-on implementation guidance for deploying four key computer vision foundation models into business applications: CLIP (zero-shot classification), DINO v2 (self-supervised learning), SAM (image segmentation), and BLIP-2 (vision-language tasks). Future experimentation could explore model ensemble techniques, fine-tuning on domain-specific datasets, edge deployment optimization, and integration with business intelligence platforms to maximize ROI on vision AI investments.


Check out the Notebook here.


The post Vision Foundation Models: Implementation and Business Applications appeared first on MarkTechPost.
