LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. The work presents a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates as an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

  • Speech Encoder: Utilizes Whisper-large-v3 to transform input speech into token-level acoustic representations.
  • Speech Adapter: Processes encoder outputs using a downsampling layer and a feed-forward network to align with the language model’s input space.
  • Core LLM: The Qwen2.5 models serve as the main reasoning engine.
  • Streaming TTS Decoder: Converts LLM outputs into speech tokens using an autoregressive Transformer and then generates mel spectrograms through a causal flow matching model inspired by CosyVoice2.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
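
To make the fusion step concrete, below is a minimal PyTorch sketch of a gated fusion module in this spirit. The class name, dimensions, and exact gating formula are illustrative assumptions, not LLaMA-Omni2’s actual implementation.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Illustrative sketch: fuse LLM hidden states with text embeddings
    # through a learned sigmoid gate. Names and dimensions are assumed,
    # not taken from the LLaMA-Omni2 codebase.
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([hidden, text_emb], dim=-1)))
        # Convex combination: g decides how much contextual signal from the
        # LLM versus literal textual signal reaches the TTS decoder.
        return g * hidden + (1 - g) * text_emb

fused = GatedFusion(dim=896)(torch.randn(1, 12, 896), torch.randn(1, 12, 896))
print(fused.shape)  # torch.Size([1, 12, 896])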

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
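
The scheduling logic itself is easy to sketch. The toy loop below shows the R:W interleaving pattern; generate_speech is a placeholder for the autoregressive speech-token decoder, not code from the paper.

def stream_read_write(llm_tokens, generate_speech, R=3, W=10):
    # Toy sketch of the read-write schedule: after every R text tokens
    # "read" from the LLM, "write" W speech tokens. `generate_speech`
    # stands in for the real autoregressive speech decoder.
    text_buffer = []
    for i, tok in enumerate(llm_tokens, start=1):
        text_buffer.append(tok)
        yield ("text", tok)
        if i % R == 0:  # a full read chunk is ready
            for s in generate_speech(text_buffer, n_tokens=W):
                yield ("speech", s)
    if len(text_buffer) % R:  # flush speech for a trailing partial chunk
        yield from (("speech", s) for s in generate_speech(text_buffer, n_tokens=W))

# Demo with a dummy synthesizer:
dummy = lambda text, n_tokens: [f"<s{i}>" for i in range(n_tokens)]
for kind, tok in stream_read_write(list("hello"), dummy, R=3, W=2):
    print(kind, tok)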

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.

Training is executed in two stages:

  • Stage I: Independently optimizes the speech-to-text and text-to-speech modules.
  • Stage II: Fine-tunes the speech-to-speech generation path, including the gating and autoregressive decoding components.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

Model            | Llama Q (S2S) | Web Q (S2S) | GPT-4o Score | ASR-WER | Latency (ms)
GLM-4-Voice (9B) | 50.7          | 15.9        | 4.09         | 3.48    | 1562.8
LLaMA-Omni (8B)  | 49.0          | 23.7        | 3.52         | 3.67    | 346.7
LLaMA-Omni2-7B   | 60.7          | 31.3        | 4.15         | 3.26    | 582.9

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

  • Gate Fusion Module: Removing the gating mechanism increases ASR-WER and reduces speech quality, confirming its role in aligning textual and contextual signals.
  • TTS Pretraining: Initializing the TTS model from Qwen2.5 and fine-tuning in a streaming setup yields the best performance. Training from scratch fails to converge effectively.
  • Read/Write Strategies: Adjusting the R:W ratio impacts latency and quality. Larger W improves UTMOS but at the cost of response delay.

Additionally, the study demonstrates that multi-turn dialogue data is more effective than single-turn data in training speech interaction capabilities, and that performance plateaus around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


Check out the Paper, the Model on Hugging Face, and the GitHub Page.

NVIDIA Open Sources Parakeet TDT 0.6B: A New Standard for Automatic Speech Recognition (ASR) That Transcribes an Hour of Audio in One Second

NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering inverse real-time factor (RTFx) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy

At the heart of Parakeet TDT 0.6B’s appeal is its unmatched speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, a performance that’s over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER)—the best-in-class among open models.

This performance represents a significant leap forward for enterprise-grade speech applications, including real-time transcription, voice-based analytics, call center intelligence, and audio content indexing.

Technical Overview

Parakeet TDT 0.6B builds on a transformer-based architecture fine-tuned with high-quality transcription data and optimized for inference on NVIDIA hardware. Here are the key highlights:

  • 600M parameter encoder-decoder model
  • Quantized and fused kernels for maximum inference efficiency
  • Optimized for the TDT (Token-and-Duration Transducer) architecture
  • Supports accurate timestamp formatting, numerical formatting, and punctuation restoration
  • Pioneers song-to-lyrics transcription, a rare capability in ASR models

The model’s high-speed inference is powered by NVIDIA’s TensorRT and FP8 quantization, enabling it to reach an inverse real-time factor of RTFx = 3386, meaning it processes audio 3386 times faster than real time.
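
The headline claim follows directly from that figure; as a quick sanity check:

audio_seconds = 60 * 60   # one hour of audio
rtfx = 3386               # audio duration divided by processing time
print(f"{audio_seconds / rtfx:.2f} s")  # ~1.06 s, i.e. about one second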

Benchmark Leadership

On the Hugging Face Open ASR Leaderboard—a standardized benchmark for evaluating speech models across public datasets—Parakeet TDT 0.6B leads with the lowest WER recorded among open-source models. This positions it well above comparable models like Whisper from OpenAI and other community-driven efforts.

(Leaderboard data as of May 5, 2025.)

This performance makes Parakeet V2 not only a leader in quality but also in deployment readiness for latency-sensitive applications.

Beyond Conventional Transcription

Parakeet is not just about speed and word error rate. NVIDIA has embedded unique capabilities into the model:

  • Song-to-lyrics transcription: Unlocks transcription for sung content, expanding use cases into music indexing and media platforms.
  • Numerical and timestamp formatting: Improves readability and usability in structured contexts like meeting notes, legal transcripts, and health records.
  • Punctuation restoration: Enhances natural readability for downstream NLP applications.

These features elevate the quality of transcripts and reduce the burden on post-processing or human editing, especially in enterprise-grade deployments.

Strategic Implications

The release of Parakeet TDT 0.6B represents another step in NVIDIA’s strategic investment in AI infrastructure and open ecosystem leadership. With strong momentum in foundational models (e.g., Nemotron for language and BioNeMo for protein design), NVIDIA is positioning itself as a full-stack AI company—from GPUs to state-of-the-art models.

For the AI developer community, this open release could become the new foundation for building speech interfaces in everything from smart devices and virtual assistants to multimodal AI agents.

Getting Started

Parakeet TDT 0.6B is available now on Hugging Face, complete with model weights, tokenizer, and inference scripts. It runs optimally on NVIDIA GPUs with TensorRT, but support is also available for CPU environments with reduced throughput.
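
For a quick start, loading the checkpoint through NVIDIA’s NeMo toolkit follows the standard ASRModel pattern sketched below. The exact Hugging Face model ID is our assumption based on the release naming, so verify it on the model card.

# pip install -U "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Model ID assumed from the release naming; check the model card.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Transcribe a local 16 kHz mono WAV file. Depending on the NeMo
# version, each result is a plain string or a Hypothesis object.
transcripts = asr_model.transcribe(["meeting_recording.wav"])
print(transcripts[0])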

Whether you’re building transcription services, annotating massive audio datasets, or integrating voice into your product, Parakeet TDT 0.6B offers a compelling open-source alternative to commercial APIs.


Check out the Model on Hugging Face.

OpenAI Releases a Strategic Guide for Enterprise AI Adoption: Practical Lessons from the Field

OpenAI has published a comprehensive 24-page document titled AI in the Enterprise, offering a pragmatic framework for organizations navigating the complexities of large-scale AI deployment. Rather than focusing on abstract theories, the report presents seven implementation strategies based on field-tested insights from collaborations with leading companies including Morgan Stanley, Klarna, Lowe’s, and Mercado Libre.

The document reads less like promotional material and more like an operational guidebook—emphasizing systematic evaluation, infrastructure readiness, and domain-specific integration.

1. Establish a Rigorous Evaluation Process

The first recommendation is to initiate AI adoption through well-defined evaluations (“evals”) that benchmark model performance against targeted use cases. Morgan Stanley applied this approach by assessing language translation, summarization, and knowledge retrieval in financial advisory contexts. The outcome was measurable: improved document access, reduced search latency, and broader AI adoption among advisors.

Evals not only validate models for deployment but also help refine workflows with empirical feedback loops, enhancing both safety and model alignment.
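
As a minimal illustration of what such an eval loop looks like in practice, consider the sketch below; the dataset, pass/fail checker, and call_model stub are placeholders of ours, not anything prescribed by the OpenAI report.

def run_eval(call_model, cases):
    # Score a model against fixed (prompt, checker) pairs. `call_model`
    # wraps whatever LLM API you use; `checker` encodes the pass/fail
    # criterion for one targeted use case. Both are placeholders.
    passed = sum(bool(checker(call_model(prompt))) for prompt, checker in cases)
    return passed / len(cases)

# Example: a summarization eval that requires key facts to survive.
cases = [
    ("Summarize: Revenue rose 12% to $4.1B in Q3.",
     lambda out: "12%" in out and "$4.1" in out),
]
print(run_eval(lambda p: "Revenue rose 12% to $4.1B.", cases))  # 1.0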

2. Integrate AI at the Product Layer

Rather than treating AI as an auxiliary function, the report stresses embedding it directly into user-facing experiences. For instance, Indeed utilized GPT-4o mini to personalize job matching, supplementing recommendations with contextual “why” statements. This increased user engagement and hiring success rates while maintaining cost-efficiency through fine-tuned, token-optimized models.

The key takeaway: model performance alone is insufficient—impact scales when AI is embedded into product logic and tailored to domain-specific needs.

3. Invest Early to Capture Compounding Returns

Klarna’s early investment in AI yielded substantial gains in operational efficiency. A GPT-powered assistant now handles two-thirds of support chats, reducing resolution times from 11 minutes to 2. The company also reports that 90% of employees are using AI in their workflows, a level of adoption that enables rapid iteration and organizational learning.

This illustrates how early engagement not only improves tooling but accelerates institutional adaptation and compound value capture.

4. Leverage Fine-Tuning for Contextual Precision

Generic models can deliver strong baselines, but domain adaptation often requires customization. Lowe’s achieved notable improvements in product search relevance by fine-tuning GPT models on their internal product data. The result: a 20% increase in tagging accuracy and a 60% improvement in error detection.

OpenAI highlights this approach as a low-latency pathway to achieve brand consistency, domain fluency, and efficiency across content generation and search tasks.

5. Empower Internal Experts, Not Just Technologists

BBVA exemplifies a decentralized AI adoption model by enabling non-technical employees to build custom GPT-based tools. In just five months, over 2,900 internal GPTs were created, addressing legal, compliance, and customer service needs without requiring engineering support.

This bottom-up strategy empowers subject-matter experts to iterate directly on their workflows, yielding more relevant solutions and reducing development cycles.

6. Streamline Developer Workflows with Dedicated Platforms

Engineering bandwidth remains a bottleneck in many organizations. Mercado Libre addressed this by building Verdi, a platform powered by GPT-4o mini, enabling 17,000 developers to prototype and deploy AI applications using natural language interfaces. The system integrates guardrails, APIs, and reusable components—allowing faster, standardized development.

The platform now supports high-value functions such as fraud detection, multilingual translation, and automated content tagging, demonstrating how internal infrastructure can accelerate AI velocity.

7. Automate Deliberately and Systematically

OpenAI emphasizes setting clear automation targets. Internally, they developed an automation platform that integrates with tools like Gmail to draft support responses and trigger actions. This system now handles hundreds of thousands of tasks monthly, reducing manual workload and enhancing responsiveness.

Their broader vision includes Operator, a browser-agent capable of autonomously interacting with web-based interfaces to complete multi-step processes—signaling a move toward agent-based, API-free automation.

Final Observations

The report concludes with a central theme: effective AI adoption requires iterative deployment, cross-functional alignment, and a willingness to refine strategies through experimentation. While the examples are enterprise-scale, the core principles—starting with evals, integrating deeply, and customizing with context—are broadly applicable.

Security and data governance are also addressed explicitly. OpenAI reiterates that enterprise data is not used for training, offers SOC 2 and CSA STAR compliance, and provides granular access control for regulated environments.

In an increasingly AI-driven landscape, OpenAI’s guide serves as both a mirror and a map—reflecting current best practices and helping enterprises chart a more structured, sustainable path forward.


Check out the Full Guide.

Building AI Agents Using Agno’s Multi-Agent Teaming Framework for Comprehensive Market Analysis and Risk Reporting

In today’s fast-paced financial landscape, leveraging specialized AI agents to handle discrete aspects of analysis is key to delivering timely, accurate insights. Agno’s lightweight, model-agnostic framework empowers developers to rapidly spin up purpose-built agents, such as our Finance Agent for structured market data and Risk Assessment Agent for volatility and sentiment analysis, without boilerplate or complex orchestration code. By defining clear instructions and composing a multi-agent “Finance-Risk Team,” Agno handles the coordination, tool invocation, and context management behind the scenes, enabling each agent to focus on its domain expertise while seamlessly collaborating to produce a unified report.

!pip install -U agno google-genai duckduckgo-search yfinance

We install and upgrade the core Agno framework, Google’s GenAI SDK for Gemini integration, the DuckDuckGo search library for querying live information, and YFinance for seamless access to stock market data. Running this at the start of the Colab session ensures all necessary dependencies are available and up to date for building and running the finance and risk assessment agents.

from getpass import getpass
import os


os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

The above code securely prompts you to enter your Google API key in Colab without echoing it to the screen, then stores it in the GOOGLE_API_KEY environment variable. With this variable set, Agno’s Gemini model wrapper and the Google GenAI SDK can automatically authenticate subsequent API calls.

from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.reasoning import ReasoningTools
from agno.tools.yfinance import YFinanceTools


agent = Agent(
    model=Gemini(id="gemini-1.5-flash"),  
    tools=[
        ReasoningTools(add_instructions=True),
        YFinanceTools(
            stock_price=True,
            analyst_recommendations=True,
            company_info=True,
            company_news=True
        ),
    ],
    instructions=[
        "Use tables to display data",
        "Only output the report, no other text",
    ],
    markdown=True,
)


agent.print_response(
    "Write a report on AAPL",
    stream=True,
    show_full_reasoning=True,
    stream_intermediate_steps=True
)

We initialize an Agno agent powered by Google’s Gemini (1.5 Flash) model, equip it with reasoning capabilities and YFinance tools to fetch stock data, analyst recommendations, company information, and news, and then stream a step-by-step, fully transparent report on AAPL, complete with chained reasoning and intermediate tool calls, directly to the Colab output.

finance_agent = Agent(
    name="Finance Agent",
    model=Gemini(id="gemini-1.5-flash"),
    tools=[
        YFinanceTools(
            stock_price=True,
            analyst_recommendations=True,
            company_info=True,
            company_news=True
        )
    ],
    instructions=[
        "Use tables to display stock price, analyst recommendations, and company info.",
        "Only output the financial report without additional commentary."
    ],
    markdown=True
)


risk_agent = Agent(
    name="Risk Assessment Agent",
    model=Gemini(id="gemini-1.5-flash"),
    tools=[
        YFinanceTools(
            stock_price=True,
            company_news=True
        ),
        ReasoningTools(add_instructions=True)
    ],
    instructions=[
        "Analyze recent price volatility and news sentiment to provide a risk assessment.",
        "Use tables where appropriate and only output the risk assessment section."
    ],
    markdown=True
)

These definitions create two specialized Agno agents using Google’s Gemini (1.5 Flash) model: the Finance Agent fetches and tabulates stock prices, analyst recommendations, company info, and news to deliver a concise financial report, while the Risk Assessment Agent analyzes price volatility and news sentiment, leveraging reasoning tools where needed, to generate a focused risk assessment section.

from agno.team.team import Team
from textwrap import dedent


team = Team(
    name="Finance-Risk Team",
    mode="coordinate",
    model=Gemini(id="gemini-1.5-flash"),
    members=[finance_agent, risk_agent],
    tools=[ReasoningTools(add_instructions=True)],
    instructions=[
        "Delegate financial analysis requests to the Finance Agent.",
        "Delegate risk assessment requests to the Risk Assessment Agent.",
        "Combine their outputs into one comprehensive report."
    ],
    markdown=True,
    show_members_responses=True,
    enable_agentic_context=True
)


task = dedent("""
1. Provide a financial overview of AAPL.
2. Provide a risk assessment for AAPL based on volatility and recent news.
""")


response = team.run(task)
print(response.content)

We assemble a coordinated “Finance-Risk Team” using Agno and Google Gemini. It delegates financial analyses to the Finance Agent and volatility/news assessments to the Risk Assessment Agent, then synthesizes their outputs into a single, comprehensive report. By calling team.run on a two-part AAPL task, it transparently orchestrates each expert agent and prints the unified result.

team.print_response(
    task,
    stream=True,
    stream_intermediate_steps=True,
    show_full_reasoning=True
)

We instruct the Finance-Risk Team to execute the AAPL task in real time, streaming each agent’s internal reasoning, tool invocations, and partial outputs as they happen. By enabling stream_intermediate_steps and show_full_reasoning, we’ll see exactly how Agno coordinates the Finance and Risk Assessment Agents step-by-step before delivering the final, combined report.

In conclusion, harnessing Agno’s multi-agent teaming capabilities transforms what would traditionally be a monolithic AI workflow into a modular, maintainable system of experts. Each agent in the team can specialize in fetching financial metrics, parsing analyst sentiment, or evaluating risk factors, while Agno’s Team API orchestrates delegation, context-sharing, and final synthesis. The result is a robust, extensible architecture that scales from simple two-agent setups to complex ensembles with minimal code changes and maximal clarity.


Check out the Colab Notebook.

Meta AI Releases Llama Prompt Ops: A Python Toolkit for Prompt Optimization on Llama Models

Meta AI has released Llama Prompt Ops, a Python package designed to streamline the process of adapting prompts for Llama models. This open-source tool is built to help developers and researchers improve prompt effectiveness by transforming inputs that work well with other large language models (LLMs) into forms that are better optimized for Llama. As the Llama ecosystem continues to grow, Llama Prompt Ops addresses a critical gap: enabling smoother and more efficient cross-model prompt migration while enhancing performance and reliability.

Why Prompt Optimization Matters

Prompt engineering plays a crucial role in the effectiveness of any LLM interaction. However, prompts that perform well on one model—such as GPT, Claude, or PaLM—may not yield similar results on another. This discrepancy is due to architectural and training differences across models. Without tailored optimization, prompt outputs can be inconsistent, incomplete, or misaligned with user expectations.

Llama Prompt Ops solves this challenge by introducing automated and structured prompt transformations. The package makes it easier to fine-tune prompts for Llama models, helping developers unlock their full potential without relying on trial-and-error tuning or domain-specific knowledge.

What Is Llama Prompt Ops?

At its core, Llama Prompt Ops is a library for systematic prompt transformation. It applies a set of heuristics and rewriting techniques to existing prompts, optimizing them for better compatibility with Llama-based LLMs. The transformations consider how different models interpret prompt elements such as system messages, task instructions, and conversation history.

This tool is particularly useful for:

  • Migrating prompts from proprietary or incompatible models to open Llama models.
  • Benchmarking prompt performance across different LLM families.
  • Fine-tuning prompt formatting for improved output consistency and relevance.

Features and Design

Llama Prompt Ops is built with flexibility and usability in mind. Its key features include:

  • Prompt Transformation Pipeline: The core functionality is organized into a transformation pipeline. Users can specify the source model (e.g., gpt-3.5-turbo) and target model (e.g., llama-3) to generate an optimized version of a prompt. These transformations are model-aware and encode best practices that have been observed in community benchmarks and internal evaluations.
  • Support for Multiple Source Models: While optimized for Llama as the output model, Llama Prompt Ops supports inputs from a wide range of common LLMs, including OpenAI’s GPT series, Google’s Gemini (formerly Bard), and Anthropic’s Claude.
  • Test Coverage and Reliability: The repository includes a suite of prompt transformation tests that ensure transformations are robust and reproducible. This ensures confidence for developers integrating it into their workflows.
  • Documentation and Examples: Clear documentation accompanies the package, making it easy for developers to understand how to apply transformations and extend the functionality as needed.

How It Works

The tool applies modular transformations to the prompt’s structure. Each transformation rewrites parts of the prompt, such as:

  • Replacing or removing proprietary system message formats.
  • Reformatting task instructions to suit Llama’s conversational logic.
  • Adapting multi-turn histories into formats more natural for Llama models.

The modular nature of these transformations allows users to understand what changes are made and why, making it easier to iterate and debug prompt modifications.
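
To make the idea concrete, here is a hand-rolled transformation in the same spirit: it rewrites an OpenAI-style message list into Llama 3’s chat template. This illustrates the kind of rewrite the toolkit automates; it is not the Llama Prompt Ops API itself.

def to_llama3_prompt(messages):
    # Rewrite OpenAI-style chat messages into Llama 3's chat template.
    # Illustrative only -- not the llama-prompt-ops API.
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = to_llama3_prompt([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain prompt migration in one line."},
])
print(prompt)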

Conclusion

As large language models continue to evolve, the need for prompt interoperability and optimization grows. Meta’s Llama Prompt Ops offers a practical, lightweight, and effective solution for improving prompt performance on Llama models. By bridging the formatting gap between Llama and other LLMs, it simplifies adoption for developers while promoting consistency and best practices in prompt engineering.


Check out the GitHub Page.

A Step-by-Step Tutorial on Connecting Claude Desktop to Real-Time Web Search and Content Extraction via Tavily AI and Smithery using Model Context Protocol (MCP)

In this hands-on tutorial, we’ll learn how to seamlessly connect Claude Desktop to real-time web search and content-extraction capabilities using Tavily AI’s Model Context Protocol (MCP) server and the Smithery client. We’ll begin by reviewing the Tavily homepage and dashboard, where you’ll generate your Developer API key. Next, we’ll explore the Tavily MCP server in Smithery’s interface, install and configure the tavily-mcp package for Claude via the Smithery “Add Server” flow, and verify the installation with a simple PowerShell command. Finally, you’ll see how Claude can invoke Tavily tools, tavily-search and tavily-extract, to fetch and parse live content from sites. By the end of this tutorial, we’ll have a fully integrated pipeline that empowers your AI workflows with up-to-the-minute information directly from the web.

Step 1: Go to the Tavily AI homepage to sign up and access the Tavily API, which you will use to set up the MCP server in Claude Desktop.

Step 2: Here you see the Tavily dashboard under the “Researcher” plan, with an API usage bar (0/1,000 credits) and the generated dev key (tvly-dev-…) ready to be copied for authenticating your requests.

Step 3: In Smithery’s server list, the Tavily MCP Server appears as a remote, scanned integration, with its two primary tools, tavily-search and tavily-extract, detailed under the Tools section.

Step 4: Clicking “Add Server” opens Smithery’s client selector in Auto mode, listing supported integrations such as Claude Desktop, Cursor, VS Code, and more.

Step 5: The Claude Desktop configuration modal shows the “Personal” profile selected by default and prompts you to enter your Tavily API key to enable the MCP connection.

Step 6: A Windows PowerShell window confirms successful resolution and installation of the Tavily MCP package for the Claude client, indicating you can now trust and use this server integration.
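
Behind the scenes, Smithery writes an entry for the server into Claude Desktop’s claude_desktop_config.json. A typical entry looks roughly like the following; the package name and key placeholder are our assumptions, so consult the Tavily MCP documentation for the exact values.

{
  "mcpServers": {
    "tavily-mcp": {
      "command": "npx",
      "args": ["-y", "tavily-mcp"],
      "env": {
        "TAVILY_API_KEY": "tvly-dev-YOUR-KEY-HERE"
      }
    }
  }
}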

Step 7: Tavily MCP is now set up in Claude. Close and exit the Claude desktop app, then restart it to see the server in settings.

Step 8: The tool-toggle menu in Claude lets you enable or disable tavily-search and tavily-extract on the fly, offering granular control over which MCP tools the assistant may call.

Step 9: Within Claude’s chat UI, you can observe the assistant invoking the tavily-search and tavily-extract tool calls inline as it searches marktechpost.com for recent AI articles and extracts their content.

In conclusion, integrating Tavily’s MCP server with Claude Desktop via Smithery unlocks a powerful synergy of real-time web search and content extraction within your AI workflows. This setup doesn’t just keep your models up to date; it empowers them to source, analyze, and synthesize fresh information on the fly, whether you’re conducting market research, fueling a RAG pipeline, or automating domain-specific insights. To take full advantage, revisit the Tavily dashboard and Smithery tool configuration to fine-tune query parameters, combine tavily-search and tavily-extract in your prompts, and explore advanced features like custom filters or scheduled queries.


IBM AI Releases Granite 4.0 Tiny Preview: A Compact Open-Language Model Optimized for Long-Context and Instruction Tasks

IBM has introduced a preview of Granite 4.0 Tiny, the smallest member of its upcoming Granite 4.0 family of language models. Released under the Apache 2.0 license, this compact model is designed for long-context tasks and instruction-following scenarios, striking a balance between efficiency, transparency, and performance. The release reflects IBM’s continued focus on delivering open, auditable, and enterprise-ready foundation models.

Granite 4.0 Tiny Preview includes two key variants: the Base-Preview, which showcases a novel decoder-only architecture, and the Tiny-Preview (Instruct), which is fine-tuned for dialog and multilingual applications. Despite its reduced parameter footprint, Granite 4.0 Tiny demonstrates competitive results on reasoning and generation benchmarks—underscoring the benefits of its hybrid design.

Architecture Overview: A Hybrid MoE with Mamba-2-Style Dynamics

At the core of Granite 4.0 Tiny lies a hybrid Mixture-of-Experts (MoE) structure, with 7 billion total parameters and only 1 billion active parameters per forward pass. This sparsity allows the model to deliver scalable performance while significantly reducing computational overhead—making it well-suited for resource-constrained environments and edge inference.

The Base-Preview variant employs a decoder-only architecture augmented with Mamba-2-style layers—a linear recurrent alternative to traditional attention mechanisms. This architectural shift enables the model to scale more efficiently with input length, enhancing its suitability for long-context tasks such as document understanding, dialogue summarization, and knowledge-intensive QA.

Another notable design decision is the use of NoPE (No Positional Encodings). Instead of fixed or learned positional embeddings, the model integrates position handling directly into its layer dynamics. This approach improves generalization across varying input lengths and helps maintain consistency in long-sequence generation.

Benchmark Performance: Efficiency Without Compromise

Despite being a preview release, Granite 4.0 Tiny already exhibits meaningful performance gains over prior models in IBM’s Granite series. On benchmark evaluations, the Base-Preview demonstrates:

  • +5.6 improvement on DROP (Discrete Reasoning Over Paragraphs), a benchmark for multi-hop QA
  • +3.8 on AGIEval, which assesses general language understanding and reasoning

These improvements are attributed to both the model’s architecture and its extensive pretraining—reportedly on 2.5 trillion tokens, spanning diverse domains and linguistic structures.

Instruction-Tuned Variant: Designed for Dialogue, Clarity, and Multilingual Reach

The Granite-4.0-Tiny-Preview (Instruct) variant extends the base model through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), using a Tülu-style dataset consisting of both open and synthetic dialogues. This variant is tailored for instruction-following and interactive use cases.

Supporting 8,192 token input windows and 8,192 token generation lengths, the model maintains coherence and fidelity across extended interactions. Unlike encoder–decoder hybrids that often trade off interpretability for performance, the decoder-only setup here yields clearer and more traceable outputs—a valuable feature for enterprise and safety-critical applications.

Evaluation Scores:

  • 86.1 on IFEval, indicating strong performance in instruction-following benchmarks
  • 70.05 on GSM8K, for grade-school math problem solving
  • 82.41 on HumanEval, measuring Python code generation accuracy

Moreover, the instruct model supports multilingual interaction across 12 languages, making it viable for global deployments in customer service, enterprise automation, and educational tools.

Open-Source Availability and Ecosystem Integration

IBM has made both models, the Base-Preview and the instruction-tuned Tiny-Preview, publicly available on Hugging Face.

The models are accompanied by full model weights, configuration files, and sample usage scripts under the Apache 2.0 license, encouraging transparent experimentation, fine-tuning, and integration across downstream NLP workflows.
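
A minimal sketch of loading the instruct variant with Hugging Face transformers follows; the repo ID is inferred from the release naming and assumes a transformers version with Granite 4.0 support, so verify both against the model card.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID inferred from the release naming; verify on the model card.
model_id = "ibm-granite/granite-4.0-tiny-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Granite 4.0 Tiny design."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))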

Outlook: Laying the Groundwork for Granite 4.0

Granite 4.0 Tiny Preview serves as an early glimpse into IBM’s broader strategy for its next-generation language model suite. By combining efficient MoE architectures, long-context support, and instruction-focused tuning, the model family aims to deliver state-of-the-art capabilities in a controllable and resource-efficient package.

As more variants of Granite 4.0 are released, we can expect IBM to deepen its investment in responsible, open AI—positioning itself as a key player in shaping the future of transparent, high-performance language models for enterprise and research.


Check out the Technical details, Granite 4.0 Tiny Base Preview, and Granite 4.0 Tiny Instruct Preview.

Building a Zapier AI-Powered Cursor Agent to Read, Search, and Send Gmail Messages using Model Context Protocol (MCP) Server

In this tutorial, we’ll learn how to harness the power of the Model Context Protocol (MCP) alongside Zapier AI to build a responsive email agent directly on Cursor, no complex coding required. We’ll walk through configuring MCP connectors to bridge Cursor and Zapier AI, connecting your Gmail account, defining intents for reading, searching, and sending messages, and training the agent to recognize and act on your commands via MCP’s unified interface. By the end of this guide, you’ll have a fully functional MCP-enabled Cursor AI agent that can automatically draft replies, fetch important threads, and dispatch emails on your behalf, streamlining your day-to-day communication so you can focus on what truly matters.

Step 1: Download and install the Cursor application on your desktop.

Step 2: Go to the left pane in Cursor and click on MCP.

Step 3: Then, click on Add new global MCP Server.

Step 4: Add the code copied from the Zapier MCP page (it includes your unique server URL) and save the file.

{
  "mcpServers": {
    "Zapier MCP": {
      "url": "Add your URL here"
    }
  }
}

Code Sample

Step 5: Now, go to My Actions on Zapier’s actions page and click on Edit MCP Actions.

Step 6: Add the action you want your MCP server to perform here.

Step 7: Select the options from the drop-down menu to add the action, and grant the required permissions by giving access to your Google account.

Step 8: Refresh the MCP server in Cursor to see the Zapier actions your agent can now perform.

Step 9: Finally, type the action you want your MCP server to perform into Cursor’s chat. In our case, we sent an email.

In conclusion, by integrating MCP into your Zapier AI and Cursor setup, you’ve created an email agent that speaks the same protocol language across all services, ensuring reliable, scalable automation. With your MCP-powered agent in place, you’ll enjoy greater efficiency, faster response times, and seamless communication, all without lifting a finger. Keep refining your MCP triggers and Zapier workflows to adapt to evolving needs, and watch as your email management becomes smarter, more consistent, and entirely hands-off.


AI Agents Are Here—So Are the Threats: Unit 42 Unveils the Top 10 AI Agent Security Risks

As AI agents transition from experimental systems to production-scale applications, their growing autonomy introduces novel security challenges. In a comprehensive new report, “AI Agents Are Here. So Are the Threats,” Palo Alto Networks’ Unit 42 reveals how today’s agentic architectures—despite their innovation—are vulnerable to a wide range of attacks, most of which stem not from the frameworks themselves, but from the way agents are designed, deployed, and connected to external tools.

To evaluate the breadth of these risks, Unit 42 researchers constructed two functionally identical AI agents—one built using CrewAI and the other with AutoGen. Despite architectural differences, both systems exhibited the same vulnerabilities, confirming that the underlying issues are not framework-specific. Instead, the threats arise from misconfigurations, insecure prompt design, and insufficiently hardened tool integrations—issues that transcend implementation choices.

Understanding the Threat Landscape

The report outlines ten core findings, spanning both the threats that expose AI agents to data leakage, tool exploitation, and remote code execution, and the defensive practices those threats demand:

  1. Prompt Injection and Overly Broad Prompts
    Prompt injection remains a potent vector, enabling attackers to manipulate agent behavior, override instructions, and misuse integrated tools. Even without classic injection syntax, loosely defined prompts are prone to exploitation.
  2. Framework-Agnostic Risk Surfaces
    The majority of vulnerabilities originate not in the frameworks (e.g., CrewAI or AutoGen), but in application-layer design: insecure role delegation, improper tool access policies, and ambiguous prompt scoping.
  3. Unsafe Tool Integrations
    Many agentic applications integrate tools (e.g., code execution modules, SQL clients, web scrapers) with minimal access control. These integrations, when not properly sanitized, dramatically expand the agent’s attack surface.
  4. Credential Exposure
    Agents can inadvertently expose service credentials, tokens, or API keys—allowing attackers to escalate privileges or impersonate agents across environments.
  5. Unrestricted Code Execution
    Code interpreters within agents, if not sandboxed, permit execution of arbitrary payloads. Attackers can use these to access file systems, networks, or metadata services—frequently bypassing traditional security layers.
  6. Lack of Layered Defense
    Single-point mitigations are insufficient. A robust security posture demands defense-in-depth strategies that combine prompt hardening, runtime monitoring, input validation, and container-level isolation.
  7. Prompt Hardening
    Agents must be configured with strict role definitions, rejecting requests that fall outside predefined scopes. This reduces the likelihood of successful goal manipulation or instruction disclosure.
  8. Runtime Content Filtering
    Real-time input and output inspection—such as filtering prompts for known attack patterns—is critical for detecting and mitigating dynamic threats as they emerge.
  9. Tool Input Sanitization
    Structured input validation—checking formats, enforcing types, and limiting values—is essential to prevent SQL injections, malformed payloads, or cross-agent misuse; a minimal sketch appears after this list.
  10. Code Executor Sandboxing
    Execution environments must restrict network access, drop unnecessary system capabilities, and isolate temporary storage to reduce the impact of potential breaches.
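
Below is a minimal sketch of the kind of structured validation item 9 calls for, using pydantic (v2); the tool, field names, and limits are illustrative, not taken from the Unit 42 report.

from pydantic import BaseModel, Field, ValidationError

class QueryToolInput(BaseModel):
    # Schema-validated input for a hypothetical database lookup tool.
    user_id: int = Field(gt=0)
    table: str = Field(pattern=r"^(orders|invoices)$")  # allow-list, not free text
    limit: int = Field(default=10, ge=1, le=100)

def run_query_tool(raw_args: dict):
    # Reject anything outside the schema before the tool ever runs.
    try:
        args = QueryToolInput(**raw_args)
    except ValidationError as err:
        return f"rejected: {err.errors()[0]['msg']}"
    # Values are bound as parameters, never string-formatted into SQL.
    return ("SELECT * FROM " + args.table + " WHERE user_id = %s LIMIT %s",
            (args.user_id, args.limit))

print(run_query_tool({"user_id": 42, "table": "orders"}))
print(run_query_tool({"user_id": 42, "table": "users; DROP TABLE orders"}))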

Simulated Attacks and Practical Implications

To illustrate these risks, Unit 42 deployed a multi-agent investment assistant and simulated nine attack scenarios. These included:

  • Extracting Agent Instructions and Tool Schemas
    By leveraging prompt engineering, attackers could enumerate all internal agents, retrieve their task definitions, and understand tool APIs—facilitating downstream attacks.
  • Credential Theft via Metadata Services
    Using malicious Python scripts injected into code interpreters, attackers accessed GCP metadata endpoints and exfiltrated service account tokens.
  • SQL Injection and BOLA Exploits
    Agents relying on unvalidated input for database queries were susceptible to both SQL injection and broken object-level authorization (BOLA), allowing attackers to read arbitrary user data.
  • Indirect Prompt Injection
    Malicious websites embedded instructions that caused agents to send user conversation histories to attacker-controlled domains, highlighting risks tied to autonomous browsing or reading tools.

Each of these scenarios exploited common design oversights, not novel zero-days. This underscores the urgent need for standardized threat modeling and secure agent development practices.

Defense Strategies: Moving Beyond Patchwork Fixes

The report emphasizes that mitigating these threats requires holistic controls:

  • Prompt hardening should limit instruction leakage, restrict tool access, and enforce task boundaries.
  • Content filtering must be applied both pre- and post-inference, detecting anomalous patterns in agent interactions.
  • Tool integrations should be rigorously tested using static (SAST), dynamic (DAST), and dependency (SCA) analysis.
  • Code execution environments must employ strict sandboxing, including network egress filtering, syscall restrictions, and memory capping.

Palo Alto Networks recommends its AI Runtime Security and AI Access Security platforms as part of a layered defense approach. These solutions provide visibility into agent behaviors, monitor for misuse of third-party generative AI tools, and enforce enterprise-level policies on agent interactions.

Conclusion

The rise of AI agents marks a significant evolution in autonomous systems. But as Unit 42’s findings reveal, their security must not be an afterthought. Agentic applications extend the vulnerability surface of LLMs by integrating external tools, enabling self-modification, and introducing complex communication patterns—any of which can be exploited without sufficient safeguards.

Securing these systems demands more than robust frameworks—it requires deliberate design choices, continuous monitoring, and layered defenses. As enterprises begin to adopt AI agents at scale, now is the time to establish security-first development practices that evolve alongside the intelligence they’re building.


Check out the Full Guide.

JetBrains Open Sources Mellum: A Developer-Centric Language Model for Code-Related Tasks

JetBrains has officially open-sourced Mellum, a purpose-built 4-billion-parameter language model tailored for software development tasks. Developed from the ground up, Mellum reflects JetBrains’ engineering-first approach, offering a domain-specialized model trained for practical usage across codebases and programming environments. With its release on Hugging Face under the Apache 2.0 license, JetBrains extends an invitation to the broader research and developer community to experiment, adapt, and advance Mellum’s capabilities.

A Focal Model for Code Understanding

Unlike general-purpose LLMs, Mellum is classified by JetBrains as a “focal model”—a term they use to describe models with a narrow yet deep specialization. Mellum is optimized specifically for programming-related tasks such as autocompletion, infilling, and structural understanding of source code. This focused design avoids the overhead of broader linguistic modeling and enables the model to perform efficiently in IDE-like environments.

The model supports a wide array of languages including Java, Kotlin, Python, Go, PHP, C, C++, C#, JavaScript, TypeScript, CSS, HTML, Rust, and Ruby—reflecting the polyglot nature of modern development teams.

Model Architecture and Training Pipeline

Mellum follows a LLaMA-style architecture and was trained from scratch using over 4.2 trillion tokens drawn from code-rich sources such as The Stack, StarCoder, CommitPack, and English Wikipedia. It features an 8K token context window and was trained using bf16 mixed precision across a high-throughput cluster of 256 NVIDIA H200 GPUs connected via Infiniband.

The training process spanned approximately 20 days and leveraged modern infrastructure for scalable model development. The architecture and training procedure were designed with reproducibility and deployment flexibility in mind, making Mellum usable in both cloud inference setups (e.g., vLLM) and on local environments (e.g., llama.cpp, Ollama).

Benchmarking and Evaluation

JetBrains evaluated Mellum across a range of benchmarks that reflect its primary use cases—code infilling and completion. The model’s performance indicates strong alignment with the design goals:

  • RepoBench v1.1 (8K context):
    • Python EM: 27.97%
    • Java EM: 31.08%
  • SAFIM (Syntax-Aware Fill-in-the-Middle):
    • pass@1: 38.11%
  • HumanEval Infilling:
    • Single-line: 66.21%
    • Multi-line: 38.52%
    • Random-span: 29.70%

These results reflect Mellum’s specialization for structured code understanding, especially in scenarios involving partial or interrupted code, which are common in real-world development workflows.

Rationale for Open Sourcing

JetBrains’ decision to release Mellum as open-source is grounded in several practical motivations:

  • Transparency: Enables scrutiny of both training data and architectural decisions.
  • Reusability: Supports integration in custom development environments and research experiments.
  • Community Collaboration: Facilitates contribution from external developers to refine model behavior.
  • Pedagogical Value: Provides educators and students with a hands-on artifact for understanding how domain-specific LLMs are constructed and applied.

The release includes both the base model (Mellum-4b-base) and a fine-tuned variant for Python (Mellum-4b-sft-python).
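
Loading the base model for plain left-to-right code completion with Hugging Face transformers looks roughly like this; the repo ID follows the naming above and should be verified on the Hugging Face page.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID follows the release naming; verify on the Hugging Face page.
model_id = "JetBrains/Mellum-4b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

code = "def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n"
inputs = tokenizer(code, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))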

Implications for Developer Tooling

The availability of a compact, performant model optimized for source code opens new opportunities in the IDE space and beyond. JetBrains envisions Mellum as part of a broader strategy involving multiple focal models, each optimized for specific programming tasks such as diff generation or code review assistance. This approach aligns with the growing need for deployable, cost-effective, and context-aware AI tooling that can augment developer productivity without introducing opaque or oversized general-purpose models.

Conclusion

Mellum represents a deliberate shift toward smaller, specialized language models that prioritize utility, transparency, and efficiency. By making the model openly available, JetBrains offers a high-quality foundation for building the next generation of AI-assisted developer tools. Its architecture, training methodology, and benchmark performance signal a practical step forward in the evolving space of LLMs tailored for software engineering.

