NVIDIA AI Releases HOVER: A Breakthrough AI for Versatile Humanoid Control in Robotics

Robotics has advanced significantly in recent years. For decades, there have been expectations of human-like robots that can navigate our environments, perform complex tasks, and work alongside humans: robots conducting precise surgical procedures, building intricate structures, assisting in disaster response, and cooperating efficiently with people in settings such as factories, offices, and homes. However, actual progress toward that vision has historically been limited.

Researchers from NVIDIA, Carnegie Mellon University, UC Berkeley, UT Austin, and UC San Diego introduced HOVER, a unified neural controller aimed at enhancing humanoid robot capabilities. This research proposes a multi-mode policy distillation framework, integrating different control strategies into one cohesive policy, thereby making a notable advancement in humanoid robotics.

The Achilles Heel of Humanoid Robotics: The Control Conundrum

Imagine a robot that can execute a perfect backflip but then struggles to grasp a doorknob.

The problem? Specialization.

Humanoid robots are incredibly versatile platforms, capable of supporting a wide range of tasks, including bimanual manipulation, bipedal locomotion, and complex whole-body control. However, despite impressive advances in these areas, researchers have typically employed different control formulations designed for specific scenarios.

  • Some controllers excel at locomotion, using “root velocity tracking” to guide movement. This approach focuses on controlling the robot’s overall movement through space.
  • Others prioritize manipulation, relying on “joint angle tracking” for precise movements. This approach allows for fine-grained control of the robot’s limbs.
  • Still others use “kinematic tracking” of key points for teleoperation. This method enables a human operator to control the robot by tracking their own movements.

Each speaks a different control language, creating a fragmented landscape where robots are masters of one task and inept at others. Switching between tasks has been clunky, inefficient, and often impossible. This specialization creates practical limitations. For example, a robot designed for bipedal locomotion on uneven terrain using root velocity tracking would struggle to transition smoothly to precise bimanual manipulation tasks that require joint angle or end-effector tracking.

In addition, many pre-trained manipulation policies operate across different configuration spaces, such as joint angles and end-effector positions. These constraints highlight the need for a unified low-level humanoid controller capable of adapting to diverse control modes.

HOVER: The Unified Field Theory of Robotic Control

HOVER is a paradigm shift. It’s a “generalist policy”—a single neural network that harmonizes diverse control modes, enabling seamless transitions and unprecedented versatility. HOVER supports diverse control modes, including over 15 useful configurations for real-world applications on a 19-DOF humanoid robot. This versatile command space encompasses most of the modes used in previous research.

  • Learning from the Masters: Human Motion Imitation

    HOVER's brilliance lies in its foundation: learning from human movement itself. By training an "oracle motion imitator" on a massive dataset of human motion capture data (MoCap), HOVER absorbs the fundamental principles of balance, coordination, and efficient movement. This approach utilizes human movements' natural adaptability and efficiency, providing the policy with rich motor priors that can be reused across multiple control modes.

    The researchers ground the training process in human-like motion, allowing the policy to develop a deeper understanding of balance, coordination, and motion control, crucial elements for effective whole-body humanoid behavior.
  • From Oracle to Prodigy: Policy Distillation

The magic truly happens through “policy distillation.” The oracle policy, the master imitator, teaches a “student policy” (HOVER) its skills. Through a process involving command masking and a DAgger framework, HOVER learns to master diverse control modes, from kinematic position tracking to joint angle control and root tracking. This creates a “generalist” capable of handling any control scenario.

Through policy distillation, these motor skills are transferred from the oracle policy into a single “generalist policy” capable of handling multiple control modes. The resulting multi-mode policy supports diverse control inputs and outperforms policies trained individually for each mode. The researchers hypothesize this superior performance stems from the policy using shared physical knowledge across modes, such as maintaining balance, human-like motion, and precise limb control. These shared skills enhance generalization, leading to better performance across all modes, while single-mode policies often overfit specific reward structures and training environments.

HOVER‘s implementation involves training an Oracle policy followed by knowledge distillation to create a versatile controller. The oracle policy processes proprioceptive information, including position, orientation, velocities, and previous actions alongside reference poses, to generate optimal movements. The oracle achieves robust motion imitation using a carefully designed reward system with penalty, regularization, and task components. The student policy then learns from this oracle through a DAgger framework, incorporating model-based and sparsity-based masking techniques that allow selective tracking of different body parts. This distillation process minimizes the action difference between teacher and student, creating a unified controller capable of handling diverse control scenarios.
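
To make the distillation step concrete, here is a minimal sketch of what a DAgger-style distillation update with command masking could look like. The policy interfaces (oracle, student), the mask layout, and the loss are illustrative assumptions for this article, not the authors' actual implementation.

```python
import torch

def sample_command_mask(batch_size: int, num_commands: int) -> torch.Tensor:
    """Randomly mask command dimensions so the student is exposed to many control modes.

    Each entry is 1 (tracked) or 0 (ignored). A fully unmasked row corresponds to
    full-body kinematic tracking; sparser rows mimic modes such as root-velocity or
    joint-angle tracking. (Illustrative masking scheme, not the paper's exact one.)
    """
    return (torch.rand(batch_size, num_commands) < 0.5).float()

def distillation_step(oracle, student, proprio, full_command, optimizer):
    """One DAgger-style update: act with the student, supervise with the oracle."""
    mask = sample_command_mask(full_command.shape[0], full_command.shape[1])
    masked_command = full_command * mask          # student only sees selected targets

    with torch.no_grad():
        teacher_action = oracle(proprio, full_command)       # oracle sees everything
    student_action = student(proprio, masked_command, mask)  # student sees masked goal

    # Minimize the action difference between teacher and student.
    loss = torch.nn.functional.mse_loss(student_action, teacher_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```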

The researchers formulate humanoid control as a goal-conditioned reinforcement learning task where the policy is trained to track real-time human motion. The state includes the robot’s proprioception and a unified target goal state. Using these inputs, they define a reward function for policy optimization. The actions represent target joint positions that are fed into a PD controller. The system employs Proximal Policy Optimization (PPO) to maximize cumulative discounted rewards, essentially training the humanoid to follow target commands at each timestep.
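
Since the policy outputs target joint positions rather than torques, a standard PD controller closes the loop at the joint level. The sketch below shows that conversion; the gain values are illustrative assumptions, with the 19 degrees of freedom matching the Unitree H1 described later in the article.

```python
import numpy as np

def pd_torques(q_target, q, qd, kp=40.0, kd=2.0):
    """Convert a policy action (target joint positions) into joint torques.

    q_target: desired joint positions output by the policy, shape [num_dof]
    q, qd:    measured joint positions and velocities from proprioception
    kp, kd:   proportional and derivative gains (illustrative values)
    """
    return kp * (q_target - q) - kd * qd

num_dof = 19                                  # e.g., a 19-DOF humanoid
q_target = np.zeros(num_dof)                  # policy output at this timestep
q = np.random.uniform(-0.1, 0.1, num_dof)     # current joint positions
qd = np.random.uniform(-0.5, 0.5, num_dof)    # current joint velocities
tau = pd_torques(q_target, q, qd)             # torques sent to the motors
```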

The research methodology utilizes motion retargeting techniques to create feasible humanoid movements from human motion datasets. This three-step process begins with computing keypoint positions through forward kinematics, fitting the SMPL model to align with these key points, and retargeting the AMASS dataset by matching corresponding points between models using gradient descent. The “sim-to-data” procedure converts the large-scale human motion dataset into feasible humanoid motions, establishing a strong foundation for training the controller.
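
As a rough illustration of the final retargeting step, the sketch below fits robot joint angles so the robot's keypoints match the human keypoints by gradient descent. The differentiable forward-kinematics function is a placeholder, and the real pipeline also fits the SMPL model and its shape parameters first, which this sketch omits.

```python
import torch

def retarget_motion(human_keypoints, robot_fk, init_dofs, steps=500, lr=1e-2):
    """Fit robot joint angles so the robot's keypoints match the human keypoints.

    human_keypoints: [T, K, 3] target keypoint trajectory (e.g., derived from AMASS/SMPL)
    robot_fk:        differentiable forward kinematics mapping [T, D] DOFs -> [T, K, 3]
    init_dofs:       [T, D] initial joint angles
    """
    dofs = init_dofs.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([dofs], lr=lr)
    for _ in range(steps):
        predicted = robot_fk(dofs)
        loss = ((predicted - human_keypoints) ** 2).mean()   # keypoint-matching loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return dofs.detach()
```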

The research team designed a comprehensive command space for humanoid control that overcomes the limitations of previous approaches. Their unified framework accommodates multiple control modes simultaneously, including kinematic position tracking, joint angle tracking, and root tracking. This design satisfies key criteria of generality (supporting various input devices) and atomicity (enabling arbitrary combinations of control options).

HOVER Unleashed: Performance That Redefines Robotics

HOVER's capabilities are proven by rigorous testing:

  • Dominating the Specialists:

    HOVER outperforms specialized controllers across the board. The research team evaluated HOVER against specialist policies and alternative multi-mode training approaches through comprehensive tests in both IsaacGym simulation and real-world implementations using the Unitree H1 robot.

    To address whether HOVER could outperform specialized policies, they compared it against various specialists, including ExBody, HumanPlus, H2O, and OmniH2O – each designed for different tracking objectives such as joint angles, root velocity, or specific key points.

In evaluations using the retargeted AMASS dataset, HOVER consistently demonstrated superior generalization, outperforming specialists in at least 7 out of 12 metrics in every command mode. HOVER performed better than specialists trained for specific useful control modes like left-hand, right-hand, two-hand, and head tracking.

  • Multi-Mode Mastery: A Clean Sweep

    To compare against other multi-mode training methods, the researchers implemented a baseline that used the same masking process but was trained from scratch with reinforcement learning. Radar charts visualizing tracking errors across eight distinct control modes showed HOVER consistently achieving lower errors across all 32 metrics. This decisive victory underscores the effectiveness of distilling knowledge from an oracle policy that tracks full-body kinematics rather than training with reinforcement learning from scratch.
  • From Simulation to Reality: Real-World Validation

    HOVER's prowess is not confined to the digital world. The experimental setup included motion tracking evaluations using the retargeted AMASS dataset in simulation and 20 standing motion sequences for the real-world tests on the 19-DOF Unitree H1 platform, weighing 51.5kg and standing 1.8m tall. The experiments were structured to answer three key questions about HOVER's generalizability, comparative performance, and real-world transferability.

On the Unitree H1, HOVER flawlessly tracked complex standing motions and dynamic running movements, and transitioned smoothly between control modes during locomotion and teleoperation. Experiments conducted both in simulation and on the physical robot show that HOVER achieves seamless transitions between control modes and delivers superior multi-mode control compared to baseline approaches.

HOVER: The Future of Humanoid Potential

HOVER unlocks the vast potential of humanoid robots. The multi-mode generalist policy also enables seamless transitions between modes, making it robust and versatile.

Imagine a future where humanoids:

  • Perform intricate surgery with unparalleled precision.
  • Construct complex structures with human-like dexterity.
  • Respond to disasters with agility and resilience.
  • Collaborate seamlessly with humans in factories, offices, and homes.

The age of truly versatile, capable, and intelligent humanoids is on the horizon, and HOVER is leading the way. These evaluations collectively illustrate HOVER's ability to handle diverse real-world control modes, offering superior performance compared to specialist policies.

Sources:


Thanks to the NVIDIA team for the thought leadership and resources for this article. The NVIDIA team has supported and sponsored this content.

Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual Interactions

At NVIDIA GTC25, Gnani.ai experts unveiled groundbreaking advancements in voice AI, focusing on the development and deployment of Speech-to-Speech Foundation Models. This innovative approach promises to overcome the limitations of traditional cascaded voice AI architectures, ushering in an era of seamless, multilingual, and emotionally aware voice interactions.

The Limitations of Cascaded Architectures

The current state-of-the-art architecture powering voice agents involves a three-stage pipeline: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). While effective, this cascaded architecture suffers from significant drawbacks, primarily latency and error propagation. Each block in the pipeline adds its own latency, and the cumulative latency across the stages can range from 2.5 to 3 seconds, leading to a poor user experience. Moreover, errors introduced in the STT stage propagate through the pipeline, compounding inaccuracies. This traditional architecture also loses critical paralinguistic features such as sentiment, emotion, and tone, resulting in monotonous and emotionally flat responses.

Introducing Speech-to-Speech Foundation Models

To address these limitations, Gnani.ai presents a novel Speech-to-Speech Foundation Model. This model directly processes and generates audio, eliminating the need for intermediate text representations. The key innovation lies in training a massive audio encoder with 1.5 million hours of labeled data across 14 languages, capturing nuances of emotion, empathy, and tonality. This model employs a nested XL encoder, retrained with comprehensive data, and an input audio projector layer to map audio features into textual embeddings. For real-time streaming, audio and text features are interleaved, while non-streaming use cases utilize an embedding merge layer. The LLM layer, initially based on Llama 8B, was expanded to include 14 languages, necessitating the rebuilding of tokenizers. An output projector model generates mel spectrograms, enabling the creation of hyper-personalized voices.
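
To illustrate the dataflow described above (audio encoder, input audio projector, LLM backbone, output projector producing mel spectrograms), here is a deliberately simplified PyTorch skeleton. All module names and sizes, the vocabulary size, and the plain concatenation standing in for streaming interleaving are assumptions for illustration only, not Gnani.ai's implementation.

```python
import torch
import torch.nn as nn

class SpeechToSpeechSketch(nn.Module):
    def __init__(self, audio_dim=512, llm_dim=1024, n_mels=80, vocab_size=32000):
        super().__init__()
        # Stand-in for the large pre-trained audio encoder
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=audio_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Input audio projector: maps audio features into the LLM embedding space
        self.audio_projector = nn.Linear(audio_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the multilingual LLM backbone
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        # Output projector: predicts mel-spectrogram frames for voice synthesis
        self.mel_head = nn.Linear(llm_dim, n_mels)

    def forward(self, audio_feats, text_ids):
        a = self.audio_projector(self.audio_encoder(audio_feats))  # [B, Ta, llm_dim]
        t = self.text_embed(text_ids)                              # [B, Tt, llm_dim]
        # Real-time systems interleave audio and text features; plain concatenation
        # is a simplification of that step.
        h = self.llm(torch.cat([a, t], dim=1))
        return self.mel_head(h)                                    # [B, Ta+Tt, n_mels]

model = SpeechToSpeechSketch()
mels = model(torch.randn(1, 50, 512), torch.randint(0, 32000, (1, 10)))
```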

Key Benefits and Technical Hurdles

The Speech-to-Speech model offers several significant benefits. Firstly, it significantly reduces latency, moving from 2 seconds to approximately 850-900 milliseconds for the first token output. Secondly, it enhances accuracy by fusing ASR with the LLM layer, improving performance, especially for short and long speeches. Thirdly, the model achieves emotional awareness by capturing and modeling tonality, stress, and rate of speech. Fourthly, it enables improved interruption handling through contextual awareness, facilitating more natural interactions. Finally, the model is designed to handle low-bandwidth audio effectively, which is crucial for telephony networks.

Building this model presented several challenges, notably the massive data requirements. The team created a crowd-sourced system with 4 million users to generate emotionally rich conversational data. They also leveraged foundation models for synthetic data generation and trained on 13.5 million hours of publicly available data. The final model comprises 9 billion parameters, with 636 million for the audio input, 8 billion for the LLM, and 300 million for the TTS system.

NVIDIA’s Role in Development

The development of this model was heavily reliant on the NVIDIA stack. NVIDIA NeMo was used for training encoder-decoder models, and NeMo Curator facilitated synthetic text data generation. NVIDIA EVA was employed to generate audio pairs, combining proprietary information with synthetic data.

Use Cases 

Gnani.ai showcased two primary use cases: real-time language translation and customer support. The real-time language translation demo featured an AI engine facilitating a conversation between an English-speaking agent and a French-speaking customer. The customer support demo highlighted the model’s ability to handle cross-lingual conversations, interruptions, and emotional nuances. 

Speech-to-Speech Foundation Model

The Speech-to-Speech Foundation Model represents a significant leap forward in voice AI. By eliminating the limitations of traditional architectures, this model enables more natural, efficient, and emotionally aware voice interactions. As the technology continues to evolve, it promises to transform various industries, from customer service to global communication.

Lowe's Revolutionizes Retail with AI: From Personalized Shopping to Proactive Customer Assistance

Lowe’s, a leading home improvement retailer with 1,700 stores and 300,000 associates, is establishing itself as a pioneer in AI innovation. In a recent interview at Nvidia GTC25, Chandu Nair, Senior VP of Data, AI, and Innovation at Lowe’s, unveiled the company’s strategic vision, highlighting the transformative impact of AI on customer experience and operational efficiency.

A Holistic AI Approach: The “Hobby Shop” Strategy

Lowe's has embraced a comprehensive AI strategy centered around three pivotal pillars: enhancing the customer shopping journey, empowering store associates, and optimizing internal operations. This approach, aptly named the "hobby shop" strategy, aims to bridge the persistent information and expertise gaps inherent in home improvement. As Chandu explained, most people live in a home that takes constant effort to keep up, and their challenges are usually less about product discovery and more about solving the underlying problem, a shift from mere product acquisition to problem-solving.

Mylow: Personalized AI-Powered Shopping Assistance

A cornerstone of Lowe’s AI initiatives is Mylow, an AI-powered shopping assistant accessible via the Lowe’s app and website. Mylow provides personalized guidance for home improvement projects, addressing the common challenge of navigating complex tasks. 

Chandu elaborated that Mylow pulls from Lowe's full product catalog and curates the specific items a customer needs to actually complete their project, highlighting Mylow's ability to surface relevant products. Furthermore, Mylow integrates instructional content, product recommendations, and pertinent YouTube videos, offering a holistic support system for customers. Many customers, Chandu added, also want to watch a video to see how a project is actually done, reflecting the common practice of seeking visual guidance.

AI-Powered Store Companion for Associates

Lowe's is equipping its 270,000 associates with an AI-powered store companion, a tool designed to enhance customer interactions and provide real-time assistance. Built on the same AI engine as Mylow, this companion answers customer queries and facilitates seamless support. Integrating computer vision (CNN) technology, the store's video network identifies customers who have been dwelling in specific aisles and appear to require assistance. As Chandu detailed, a customer standing in the electrical aisle may not know exactly what they need, so the system leverages computer vision to alert an associate, illustrating the proactive nature of the system.
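
The proactive-assistance behavior can be pictured with a simple dwell-time rule: if a tracked customer stays in one aisle beyond a threshold, notify a nearby associate. The sketch below is a generic illustration under assumed data structures and thresholds, not Lowe's actual system.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DWELL_ALERT_SECONDS = 120.0  # assumed threshold, not Lowe's actual setting

@dataclass
class TrackState:
    aisle: str
    entered_at: float
    alerted: bool = False

tracks: Dict[int, TrackState] = {}

def update_track(track_id: int, aisle: str, timestamp: float) -> Optional[str]:
    """Update one person track from the vision system; return an alert if they dwell."""
    state = tracks.get(track_id)
    if state is None or state.aisle != aisle:
        # New person, or they moved to a different aisle: restart the dwell timer.
        tracks[track_id] = TrackState(aisle=aisle, entered_at=timestamp)
        return None
    if not state.alerted and timestamp - state.entered_at > DWELL_ALERT_SECONDS:
        state.alerted = True
        return f"Customer in the {aisle} aisle has been dwelling for over {DWELL_ALERT_SECONDS:.0f}s"
    return None

# Example: detections for the same tracked person at t = 0, 60, and 130 seconds
for t in (0.0, 60.0, 130.0):
    alert = update_track(track_id=7, aisle="electrical", timestamp=t)
    if alert:
        print(alert)  # in practice this would be routed to an associate's device
```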

Computer Vision for Loss Prevention and Enhanced Security

In addition to customer assistance, Lowe’s employs computer vision at self-checkout stations to mitigate store theft, enhancing security and operational efficiency.

A Scalable AI Combination Approach

Lowe's stands out for its successful deployment of a hybrid AI approach, combining traditional AI (CNN-based computer vision) with generative AI at scale. Chandu asserted that Lowe's believes it is the only retailer using a combination of generative AI and computer vision in its stores at such a scale to assist customers, underscoring the company's pioneering role. This integrated approach not only enhances customer service but also optimizes internal processes, solidifying Lowe's position as a leader in AI-driven retail innovation.

Google DeepMind's Gemini Robotics: Unleashing Embodied AI with Zero-Shot Control and Enhanced Spatial Reasoning

Google DeepMind has shattered conventional boundaries in robotics AI with the unveiling of Gemini Robotics, a suite of models built upon the formidable foundation of Gemini 2.0. This isn’t just an incremental upgrade; it’s a paradigm shift, propelling AI from the digital realm into the tangible world with unprecedented “embodied reasoning” capabilities.

Gemini Robotics: Bridging the Gap Between Digital Intelligence and Physical Action

At the heart of this innovation lies Gemini Robotics, an advanced vision-language-action (VLA) model that transcends traditional AI limitations. By introducing physical actions as a direct output modality, Gemini Robotics empowers robots to autonomously execute tasks with a level of understanding and adaptability previously unattainable. Complementing this is Gemini Robotics-ER (Embodied Reasoning), a specialized model engineered to refine spatial understanding, enabling roboticists to seamlessly integrate Gemini’s cognitive prowess into existing robotic architectures.

These models herald a new era of robotics, promising to unlock a diverse spectrum of real-world applications. Google DeepMind’s strategic partnerships with industry leaders like Apptronik, for the integration of Gemini 2.0 into humanoid robots, and collaborations with trusted testers, underscore the transformative potential of this technology.

Key Technological Advancements:

  • Unparalleled Generality: Gemini Robotics leverages Gemini’s robust world model to generalize across novel scenarios, achieving superior performance on rigorous generalization benchmarks compared to state-of-the-art VLA models.
  • Intuitive Interactivity: Built on Gemini 2.0’s language understanding, the model facilitates fluid human-robot interaction through natural language commands, dynamically adapting to environmental changes and user input.
  • Advanced Dexterity: The model demonstrates remarkable dexterity, executing complex manipulation tasks like origami folding and intricate object handling, showcasing a significant leap in robotic fine motor control.
  • Versatile Embodiment: Gemini Robotics’ adaptability extends to various robotic platforms, from bi-arm systems like ALOHA 2 and Franka arms to advanced humanoid robots like Apptronik’s Apollo.

Gemini Robotics-ER: Pioneering Spatial Intelligence

Gemini Robotics-ER elevates spatial reasoning, a critical component for effective robotic operation. By enhancing capabilities such as pointing, 3D object detection, and spatial understanding, this model enables robots to perform tasks with heightened precision and efficiency.

Gemini 2.0: Enabling Zero and Few-Shot Robot Control

A defining feature of Gemini 2.0 is its ability to facilitate zero and few-shot robot control. This eliminates the need for extensive robot action data training, enabling robots to perform complex tasks “out of the box.” By uniting perception, state estimation, spatial reasoning, planning, and control within a single model, Gemini 2.0 surpasses previous multi-model approaches.

  • Zero-Shot Control via Code Generation: Gemini Robotics-ER leverages its code generation capabilities and embodied reasoning to control robots using API commands, reacting and replanning as needed. The model’s enhanced embodied understanding results in a near 2x improvement in task completion compared to Gemini 2.0.
  • Few-Shot Control via In-Context Learning (ICL): By conditioning the model on a small number of demonstrations, Gemini Robotics-ER can quickly adapt to new behaviors.

The perception and control APIs are orchestrated agentically over the course of an episode, and this combination is what enables zero-shot control.
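
To make that orchestration concrete, here is a minimal sketch of how an embodied-reasoning model could drive a robot through perception and control calls with replanning. The API names (detect_objects, move_gripper_to, grasp) and the hard-coded plan are hypothetical stand-ins, not Google DeepMind's actual interfaces; in the real system the model generates the sequence of calls as code.

```python
import random

# --- Hypothetical perception and control APIs exposed to the model ---
def detect_objects(scene=None):
    """Return detected objects with 3D positions (stubbed for illustration)."""
    return [{"name": "cup", "xyz": (0.4, 0.1, 0.02)}]

def move_gripper_to(xyz):
    print(f"moving gripper to {xyz}")
    return random.random() > 0.2      # pretend the motion occasionally fails

def grasp():
    print("closing gripper")
    return True

# --- Agentic orchestration: perceive, act, and replan on failure ---
def run_episode(task="pick up the cup", max_replans=3):
    for attempt in range(max_replans):
        objects = detect_objects()                          # perception call
        target = next(o for o in objects if o["name"] in task)
        if move_gripper_to(target["xyz"]) and grasp():      # control calls
            print("task complete")
            return True
        print(f"attempt {attempt + 1} failed, replanning")
    return False

run_episode()
```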

Commitment to Safety 

Google DeepMind prioritizes safety through a multi-layered approach, addressing concerns from low-level motor control to high-level semantic understanding. The integration of Gemini Robotics-ER with existing safety-critical controllers and the development of mechanisms to prevent unsafe actions underscore this commitment.

The release of the ASIMOV dataset and the framework for generating data-driven “Robot Constitutions” further demonstrates Google DeepMind’s dedication to advancing robotics safety research.

Intelligent robots are getting closer…


Check out the full Gemini Robotics report. All credit for this research goes to the researchers of this project.

Aya Vision Unleashed: A Global AI Revolution in Multilingual Multimodal Power!

Cohere For AI has just dropped a bombshell: Aya Vision, an open-weights vision model that's about to redefine multilingual and multimodal communication. Prepare for a seismic shift as it shatters language barriers and unlocks the true potential of AI across the globe!

Smashing the Multilingual Multimodal Divide!

Let’s face it, AI has been speaking with a frustratingly limited vocabulary. But not anymore! Aya Vision explodes onto the scene, obliterating the performance gap between languages and modalities. This isn’t just an incremental improvement; it’s a quantum leap, extending multimodal magic to 23 languages, reaching over half the planet’s population. Imagine AI finally speaking your language, understanding the rich tapestry of your culture.

Aya Vision: Where Vision Meets Linguistic Brilliance!

This is not your average vision model. Aya Vision is a linguistic virtuoso, a visual maestro, and a global communicator all rolled into one. From crafting captivating image captions to answering complex visual questions, it's a powerhouse of multimodal understanding. Picture this: you snap a photo of a stunning piece of art from your travels, and Aya Vision instantly unveils its history, style, and cultural significance, bridging worlds with a single image.

Performance That Will Blow Your Mind!

  • Multilingual Domination: Aya Vision obliterates the competition, leaving leading open-weights models in the dust when it comes to multilingual text generation and image understanding.
  • Parameter Prowess: The 8B model is a lean, mean, performance machine, crushing giants like Qwen2.5-VL 7B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, and Pangea 7B with jaw-dropping win rates!
  • 32B Titan: The 32B model sets a new gold standard, outperforming even larger models like Llama-3.2 90B Vision, Molmo 72B, and Qwen2-VL 72B with breathtaking efficiency.
  • Efficiency Unleashed: Aya Vision proves you don’t need monstrous models to achieve monumental results, outperforming models 10x its size!
  • Algorithmic Alchemy: Secret ingredients like synthetic annotations, multilingual data scaling, and multimodal model merging have been masterfully combined to create this AI masterpiece.

Open Weights, Open Doors, Open World!

Cohere For AI isn't just building groundbreaking AI; they're democratizing it. Aya Vision's 8B and 32B models are now freely available on Kaggle and Hugging Face.

Want to contribute?

Cohere For AI invites researchers worldwide to join the Aya initiative, apply for research grants, and collaborate in their open science community. Aya Vision is a huge step forward for the future of multilingual, multimodal AI.


Check out the Aya Vision blog post, the Aya Initiative, and the model weights on Kaggle and Hugging Face. All credit for this research goes to the researchers of this project.

Limbic AI's Generative AI–Enabled Therapy Support Tool Improves Cognitive Behavioral Therapy Outcomes

Recent advancements in generative AI are creating exciting new possibilities in healthcare, especially within mental health services, where patient engagement is often a significant challenge. A recent observational study published in the Journal of Medical Internet Research has demonstrated that Limbic AI, an innovative generative AI-enabled therapy support tool, can significantly enhance patient engagement and clinical outcomes in cognitive behavioral therapy (CBT).

Limbic Care, developed by Limbic AI, is a mobile, AI-powered app that provides patients with personalized, on-demand conversational support.

The app acts as an extension of the therapist-patient relationship, supporting rather than replacing human clinicians. It helps patients by offering:

  • On-demand support between therapy sessions
  • Faster recovery times
  • Enhanced engagement and post-care support

Results:

  • Patients using Limbic Care attended more therapy sessions and missed significantly fewer sessions compared to those using traditional static materials.
  • Dropout rates among users of Limbic Care decreased by 23%.
  • Clinical improvements for Limbic Care users included a 21% higher rate of reliable improvement, 25% higher recovery rate, and 21% higher reliable recovery rate.
  • Increased engagement with the AI-driven app directly correlated with better clinical outcomes, highlighting the importance of personalized, interactive support.

The study involved 244 patients from NHS Talking Therapies services provided by Everyturn Mental Health in the UK. Patients who actively engaged with Limbic Care attended more therapy sessions and reported significantly fewer missed appointments compared to those using traditional CBT worksheets.

A qualitative analysis with 113 additional users identified the following key benefits of using Limbic Care:

  • Increased emotional support and empathy.
  • Greater clarity and awareness about personal mental health issues.
  • Enhanced mindfulness and relaxation.
  • Practical coping strategies and CBT techniques.

Participants particularly appreciated the conversational and empathetic nature of the AI, highlighting its role in providing nonjudgmental support between sessions.

Overall, this study emphasizes the promising role generative AI can play in improving mental healthcare outcomes. Limbic Care has shown the potential to significantly enhance patient engagement, boost recovery rates, and reduce therapy dropout rates, offering meaningful clinical and economic benefits for mental health services.


Check out the Paper. All credit for this research goes to the researchers of this project.

Inception Unveils Mercury: The First Commercial-Scale Diffusion Large Language Model

The landscape of generative AI and LLMs has experienced a remarkable leap forward with the launch of Mercury by the cutting-edge startup Inception Labs. Introducing the first-ever commercial-scale diffusion large language models (dLLMs), Inception Labs promises a paradigm shift in speed, cost-efficiency, and intelligence for text and code generation tasks.

Mercury: Setting New Benchmarks in AI Speed and Efficiency

Inception’s Mercury series of diffusion large language models introduces unprecedented performance, operating at speeds previously unachievable with traditional LLM architectures. Mercury achieves remarkable throughput—over 1000 tokens per second on commodity NVIDIA H100 GPUs—a performance that was formerly exclusive to custom-designed hardware like Groq, Cerebras, and SambaNova. This translates to an astonishing 5-10x speed increase compared to current leading autoregressive models.

Diffusion Models: The Future of Text Generation

Traditional autoregressive LLMs generate text sequentially, token-by-token, causing significant latency and computational costs, especially in extensive reasoning and error-correction tasks. Diffusion models, however, leverage a unique “coarse-to-fine” generation process. Unlike autoregressive models restricted by sequential generation, diffusion models iteratively refine outputs from noisy approximations, enabling parallel token updates. This method significantly enhances reasoning, error correction, and overall coherence of the generated content.
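
The contrast between sequential autoregressive decoding and a diffusion-style coarse-to-fine process can be sketched with a toy example. The propose_tokens function stands in for a trained denoiser, and the unmasking schedule is an illustrative assumption, not Mercury's actual algorithm.

```python
import random

VOCAB = ["the", "quick", "brown", "fox", "jumps"]
MASK = "<mask>"

def propose_tokens(seq):
    """Stand-in for a trained denoiser: propose a token for every position at once."""
    return [random.choice(VOCAB) for _ in seq]

def autoregressive_decode(length):
    seq = []
    for _ in range(length):                    # one model call per generated token
        seq.append(propose_tokens([MASK])[0])
    return seq

def diffusion_decode(length, steps=4):
    seq = [MASK] * length                      # start from a fully masked ("noisy") draft
    for step in range(steps):                  # a fixed, small number of parallel passes
        proposals = propose_tokens(seq)        # proposals for every position in parallel
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Coarse-to-fine: commit a fraction of the still-masked positions each step.
        n_commit = max(1, len(masked) // (steps - step))
        for i in masked[:n_commit]:
            seq[i] = proposals[i]
    return seq

print(autoregressive_decode(5))    # 5 sequential model calls
print(diffusion_decode(5))         # 4 parallel passes, regardless of length
```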

While diffusion approaches have proven revolutionary in image, audio, and video generation—powering applications like Midjourney and Sora—their application in discrete data domains such as text and code was largely unexplored until Inception’s breakthrough.

Mercury Coder: High-Speed, High-Quality Code Generation

Inception’s flagship product, Mercury Coder, is optimized specifically for coding applications. Developers now have access to a high-quality, rapid-response model capable of generating code at more than 1000 tokens per second, a dramatic improvement over existing speed-focused models.

On standard coding benchmarks, Mercury Coder doesn’t just match but often surpasses the performance of other high-performing models such as GPT-4o Mini and Claude 3.5 Haiku. Moreover, Mercury Coder Mini secured a top-ranking position on Copilot Arena, tying for second place and outperforming established models like GPT-4o Mini and Gemini-1.5-Flash. Even more impressively, Mercury accomplishes this while maintaining approximately 4x faster speeds than GPT-4o Mini.

Versatility and Integration

Mercury dLLMs function seamlessly as drop-in replacements for traditional autoregressive LLMs. They effortlessly support use-cases including Retrieval-Augmented Generation (RAG), tool integration, and agent-based workflows. The diffusion model’s parallel refinement allows multiple tokens to be updated simultaneously, ensuring swift and accurate generation suitable for enterprise environments, API integration, and on-premise deployments.

Built by AI Innovators

Inception’s technology is underpinned by foundational research at Stanford, UCLA and Cornell from its pioneering founders, recognized for their crucial contributions to the evolution of generative AI. Their combined expertise includes the original development of image-based diffusion models and innovations such as Direct Preference Optimization, Flash Attention, and Decision Transformers—techniques widely acknowledged for their transformative impact on modern AI.

Inception’s introduction of Mercury marks a pivotal moment for enterprise AI, unlocking previously impossible performance levels, accuracy, and cost-efficiency.


Check out the Playground and technical details. All credit for this research goes to the researchers of this project.

Finer-CAM Revolutionizes AI Visual Explainability: Unlocking Precision in Fine-Grained Image Classification

Researchers at The Ohio State University have introduced Finer-CAM, an innovative method that significantly improves the precision and interpretability of image explanations in fine-grained classification tasks. This advanced technique addresses key limitations of existing Class Activation Map (CAM) methods by explicitly highlighting subtle yet critical differences between visually similar categories.

Current Challenge with Traditional CAM

Conventional CAM methods typically illustrate general regions influencing a neural network’s predictions but frequently fail to distinguish fine details necessary for differentiating closely related classes. This limitation poses significant challenges in fields requiring precise differentiation, such as species identification, automotive model recognition, and aircraft type differentiation.

Finer-CAM: Methodological Breakthrough

The central innovation of Finer-CAM lies in its comparative explanation strategy. Unlike traditional CAM methods that focus solely on features predictive of a single class, Finer-CAM explicitly contrasts the target class with visually similar classes. By calculating gradients based on the difference in prediction logits between the target class and its similar counterparts, it reveals unique image features, enhancing the clarity and accuracy of visual explanations.
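
A minimal PyTorch sketch of this comparative idea: backpropagate the logit difference between the target class and a visually similar class into the last convolutional feature maps, then weight the activations Grad-CAM-style. The single-layer hook and variable names are simplifications of the paper's method, not the official implementation.

```python
import torch
import torch.nn.functional as F

def finer_cam(model, feature_layer, image, target_class, similar_class):
    """Comparative class activation map from the logit difference of two classes.

    image: a single preprocessed tensor of shape [1, 3, H, W].
    """
    activations, gradients = {}, {}

    def fwd_hook(_module, _inputs, output):
        activations["feat"] = output

    def bwd_hook(_module, _grad_in, grad_out):
        gradients["feat"] = grad_out[0]

    h1 = feature_layer.register_forward_hook(fwd_hook)
    h2 = feature_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)                                       # [1, num_classes]
        # Key step: the gradient of the *difference* suppresses features shared by
        # both classes and keeps what is discriminative for the target class.
        score = logits[0, target_class] - logits[0, similar_class]
        model.zero_grad()
        score.backward()
        weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)  # GAP over H, W
        cam = F.relu((weights * activations["feat"]).sum(dim=1))    # [1, h, w]
        cam = cam / (cam.max() + 1e-8)                              # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
    return cam
```

With a torchvision ResNet-50, for instance, one might call finer_cam(model, model.layer4, img, target_class, similar_class), where the similar class could simply be chosen as the runner-up logit.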

Finer-CAM Pipeline

The methodological pipeline of Finer-CAM involves three main stages:

  1. Feature Extraction:
    • An input image first passes through neural network encoder blocks, generating intermediate feature maps.
    • A subsequent linear classifier uses these feature maps to produce prediction logits, which quantify the confidence of predictions for various classes.
  2. Gradient Calculation (Logit Difference):
    • Standard CAM methods calculate gradients for a single class.
    • Finer-CAM computes gradients based on the difference between the prediction logits of the target class and a visually similar class.
    • This comparison identifies the subtle visual features specifically discriminative to the target class by suppressing commonly shared features.
  3. Activation Highlighting:
    • The gradients calculated from the logit difference are used to produce enhanced class activation maps that emphasize discriminative visual details crucial for distinguishing between similar categories.

Experimental Validation

B.1. Model Accuracy

Researchers evaluated Finer-CAM across two popular neural network backbones, CLIP and DINOv2. Experiments demonstrated that DINOv2 generally produces higher-quality visual embeddings, achieving superior classification accuracy compared to CLIP across all tested datasets.

B.2. Results on FishVista and Aircraft

Quantitative evaluations on the FishVista and Aircraft datasets further demonstrate Finer-CAM’s effectiveness. Compared to baseline CAM methods (Grad-CAM, Layer-CAM, Score-CAM), Finer-CAM consistently delivered improved performance metrics, notably in relative confidence drop and localization accuracy, underscoring its ability to highlight discriminative details crucial for fine-grained classification.

B.3. Results on DINOv2

Additional evaluations using DINOv2 as the backbone showed that Finer-CAM consistently outperformed baseline methods. These results indicate that Finer-CAM’s comparative method effectively enhances localization performance and interpretability. Due to DINOv2’s high accuracy, more pixels need to be masked to significantly impact predictions, resulting in larger deletion AUC values and occasionally smaller relative confidence drops compared to CLIP.

Visual and Quantitative Advantages

  • Highly Precise Localization: Clearly pinpoints discriminative visual features, such as specific coloration patterns in birds, detailed structural elements in cars, and subtle design variations in aircraft.
  • Reduction of Background Noise: Significantly reduces irrelevant background activations, increasing the relevance of explanations.
  • Quantitative Excellence: Outperforms traditional CAM approaches (Grad-CAM, Layer-CAM, Score-CAM) in metrics including relative confidence drop and localization accuracy.

Extendable to multi-modal zero-shot learning scenarios

Finer-CAM is extendable to multi-modal zero-shot learning scenarios. By intelligently comparing textual and visual features, it accurately localizes visual concepts within images, significantly expanding its applicability and interpretability.

Researchers have made Finer-CAM’s source code and colab demo available.


    Check out the Paper, GitHub, and Colab demo. All credit for this research goes to the researchers of this project.

    CASS: Injecting Object-Level Context for Advanced Open-Vocabulary Semantic Segmentation

    This paper was just accepted at CVPR 2025. In short, CASS offers an elegant solution to injecting object-level context into open-world segmentation. It outperforms several training-free approaches and even surpasses some methods that rely on extra training. The gains are especially notable in challenging setups where objects have intricate sub-parts or classes have high visual similarity. Results show that CASS consistently predicts correct labels down to the pixel level, underscoring its refined object-level awareness.

    Want to know how they did it? Read on; the code link is available at the end.

    Distilling Spectral Graphs for Object-Level Context: A Novel Leap in Training-Free Open-Vocabulary Semantic Segmentation

    Open-vocabulary semantic segmentation (OVSS) is shaking up the landscape of computer vision by allowing models to segment objects based on any user-defined prompt—without being tethered to a fixed set of categories. Imagine telling an AI to pick out every “Space Needle” in a cityscape or to detect and segment an obscure object you just coined. Traditional segmentation pipelines, typically restricted to a finite set of training classes, can’t handle such requests without extra finetuning or retraining. Enter CASS (Context-Aware Semantic Segmentation), a bold new approach that harnesses powerful large-scale, pre-trained models to achieve high-fidelity, object-aware segmentation entirely without additional training.

    The Rise of Training-Free OVSS

    Conventional supervised approaches for semantic segmentation require extensive labeled datasets. While they excel at known classes, they often struggle or overfit when faced with new classes not seen during training. In contrast, training-free OVSS methods—often powered by large-scale vision-language models like CLIP—are able to segment based on novel textual prompts in a zero-shot manner. This aligns naturally with the flexibility demanded by real-world applications, where it's impractical or extremely costly to anticipate every new object that might appear. And because they are training-free, these methods require no further annotation or data collection every time the use case changes, making this a very scalable approach for production-level solutions.

    Despite these strengths, existing training-free methods face a fundamental hurdle: object-level coherence. They often nail the broad alignment between image patches and text prompts (e.g., “car” or “dog”) but fail to unify the entire object—like grouping the wheels, roof, and windows of a truck under a single coherent mask. Without an explicit way to encode object-level interactions, crucial details end up fragmented, limiting overall segmentation quality.

    CASS: Injecting Object-Level Context for Coherent Segmentation

    To address this shortfall, the authors from Yonsei University and UC Merced introduce CASS, a system that distills rich object-level knowledge from Vision Foundation Models (VFMs) and aligns it with CLIP’s text embeddings. 

    Two core insights power this approach:

    1. Spectral Object-Level Context Distillation

    While CLIP excels at matching textual prompts with global image features, it doesn’t capture fine-grained, object-centric context. On the other hand, VFMs like DINO do learn intricate patch-level relationships but lack direct text alignment.

    CASS bridges these strengths by treating both CLIP and the VFM’s attention mechanisms as graphs and matching their attention heads via spectral decomposition. In other words, each attention head is examined through its eigenvalues, which reflect how patches correlate with one another. By pairing complementary heads—those that focus on distinct structure—CASS effectively transfers object-level context from the VFM into CLIP.

    To avoid noise, the authors apply low-rank approximation on the VFM’s attention graph, followed by dynamic eigenvalue scaling. The result is a distilled representation that highlights core object boundaries while filtering out irrelevant details—enabling CLIP to finally “see” all parts of a truck (or any object) as one entity.

    2. Object Presence Prior for Semantic Refinement

    OVSS means the user can request any prompt, but this can lead to confusion among semantically similar categories. For example, prompts like “bus” vs. “truck” vs. “RV” might cause partial mix-ups if all are somewhat likely.

    CASS tackles this by leveraging CLIP’s zero-shot classification capability. It computes an object presence prior, estimating how likely each class is to appear in the image overall. Then, it uses this prior in two ways:

    Refining Text Embeddings: It clusters semantically similar prompts and identifies which labels are most likely in the image, steering the selected text embeddings closer to the actual objects.

    Object-Centric Patch Similarity: Finally, CASS fuses the patch-text similarity scores with these presence probabilities to get sharper and more accurate predictions.

    Taken together, these strategies offer a robust solution for true open-vocabulary segmentation. No matter how new or unusual the prompt, CASS efficiently captures both the global semantics and the subtle details that group an object’s parts.

    The results are impressive. In the qualitative comparisons, where the right column shows CASS, you can clearly see coherent object-level segmentation, much better than CLIP.

    Under the Hood: Matching Attention Heads via Spectral Analysis

    One of CASS’s most innovative points is how it matches CLIP and VFM attention heads. Each attention head behaves differently; some might home in on color/texture cues while others lock onto shape or position. So, the authors perform an eigenvalue decomposition on each attention map to reveal its unique “signature.” 

    1. A cost matrix is formed by comparing these signatures using the Wasserstein distance, a technique that measures the distance between distributions in a way that captures overall shape.
    2. The matrix is fed to the Hungarian matching algorithm, which pairs heads that have contrasting structural distributions.
    3. The VFM’s matched attention heads are low-rank approximated and scaled to emphasize object boundaries.
    4. Finally, these refined heads are distilled into CLIP’s attention, augmenting its capacity to treat each object as a unified whole.

    Qualitatively, you can think of this process as selectively injecting object-level coherence: after the fusion, CLIP now “knows” a wheel plus a chassis plus a window equals one truck.
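
A small NumPy/SciPy sketch of the head-matching machinery described above: compare the eigenvalue spectra of per-head attention maps from CLIP and the VFM with the Wasserstein distance, then pair heads with the Hungarian algorithm. The use of random toy inputs, eigenvalue magnitudes, and a maximize-the-distance pairing (reflecting the 'contrasting distributions' mentioned above) is an illustrative simplification, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.optimize import linear_sum_assignment

def spectrum(attn):
    """Eigenvalue 'signature' of one attention head (magnitudes, sorted descending)."""
    eigvals = np.linalg.eigvals(attn)           # attn: [N, N] patch-to-patch weights
    return np.sort(np.abs(eigvals))[::-1]

def match_heads(clip_attn, vfm_attn):
    """Pair CLIP heads with VFM heads via a Wasserstein cost matrix + Hungarian matching.

    clip_attn, vfm_attn: arrays of shape [H, N, N], one [N, N] attention graph per head.
    Returns a list of (clip_head_index, vfm_head_index) pairs.
    """
    clip_sigs = [spectrum(a) for a in clip_attn]
    vfm_sigs = [spectrum(a) for a in vfm_attn]
    cost = np.array([[wasserstein_distance(c, v) for v in vfm_sigs] for c in clip_sigs])
    # Pair heads with contrasting (complementary) spectral signatures.
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 4 heads attending over 16 patches each
rng = np.random.default_rng(0)
pairs = match_heads(rng.random((4, 16, 16)), rng.random((4, 16, 16)))
print(pairs)
```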

    Why Training-Free Matters

    • Generalization: Because CASS doesn’t need additional training or finetuning, it generalizes far better to out-of-domain images and unanticipated classes.
    • Immediate Deployment: Industrial or robotic systems benefit from the instant adaptability—no expensive dataset curation is needed for each new scenario.
    • Efficiency: With fewer moving parts and no annotation overhead, the pipeline is remarkably efficient for real-world use.

    At the end of the day, for any production-level solution, being training-free is key to handling long-tail use cases.

    Empirical Results

    CASS undergoes thorough testing on eight benchmark datasets, including PASCAL VOC, COCO, and ADE20K, which collectively cover over 150 object categories. Two standout metrics emerge:

    1. Mean Intersection over Union (mIoU): CASS outperforms several training-free approaches and even surpasses some methods that rely on extra training. The gains are especially notable in challenging setups where objects have intricate sub-parts or classes have high visual similarity.
    2. Pixel Accuracy (pAcc): Results show that CASS consistently predicts correct labels down to the pixel level, underscoring its refined object-level awareness.

    Unlocking True Open-Vocabulary Segmentation

    The release of CASS marks a leap forward for training-free OVSS. By distilling spectral information into CLIP and by fine-tuning text prompts with an object presence prior, it achieves a highly coherent segmentation that can unify an object’s scattered parts—something many previous methods struggled to do. Whether deployed in robotics, autonomous vehicles, or beyond, this ability to recognize and segment any object the user names is immensely powerful and frankly required.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    MVGD from Toyota Research Institute: Zero-Shot 3D Scene Reconstruction

    Toyota Research Institute Researchers have unveiled Multi-View Geometric Diffusion (MVGD), a groundbreaking diffusion-based architecture that directly synthesizes high-fidelity novel RGB and depth maps from sparse, posed images, bypassing the need for explicit 3D representations like NeRF or 3D Gaussian splats. This innovation promises to redefine the frontier of 3D synthesis by offering a streamlined, robust, and scalable solution for generating realistic 3D content.

    The core challenge MVGD addresses is achieving multi-view consistency: ensuring generated novel viewpoints seamlessly integrate in 3D space. Traditional methods rely on building complex 3D models, which often suffer from memory constraints, slow training, and limited generalization. MVGD, however, integrates implicit 3D reasoning directly into a single diffusion model, generating images and depth maps that maintain scale alignment and geometric coherence with input images without intermediate 3D model construction.

    MVGD leverages the power of diffusion models, known for their high-fidelity image generation, to encode appearance and depth information simultaneously. Key innovative components include:

    • Pixel-Level Diffusion: Unlike latent diffusion models, MVGD operates at original image resolution using a token-based architecture, preserving fine details.
    • Joint Task Embeddings: A multi-task design enables the model to jointly generate RGB images and depth maps, leveraging a unified geometric and visual prior.
    • Scene Scale Normalization: MVGD automatically normalizes scene scale based on input camera poses, ensuring geometric coherence across diverse datasets (a rough sketch of this idea follows the list).
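
As a rough illustration of the scale-normalization idea referenced in the last bullet, the sketch below rescales a set of camera poses so that the scene has a canonical size, here unit average distance of the cameras from their centroid. This is a generic normalization under assumed conventions, not necessarily the exact scheme used by MVGD.

```python
import numpy as np

def normalize_scene_scale(cam_to_world: np.ndarray):
    """Rescale camera-to-world poses to a canonical scene scale.

    cam_to_world: [N, 4, 4] homogeneous poses of the input views.
    Returns (normalized_poses, scale); multiplying predicted depths by `scale`
    maps them back to the original units.
    """
    poses = cam_to_world.copy()
    centers = poses[:, :3, 3]                             # camera positions
    centroid = centers.mean(axis=0)
    scale = max(np.linalg.norm(centers - centroid, axis=1).mean(), 1e-8)
    poses[:, :3, 3] = (centers - centroid) / scale        # recenter and rescale
    return poses, scale

# Toy example with three identity-rotation cameras spread along the x-axis
poses = np.stack([np.eye(4)] * 3)
poses[:, 0, 3] = [0.0, 2.0, 4.0]
normalized, scale = normalize_scene_scale(poses)
```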

    Training on an unprecedented scale, with over 60 million multi-view image samples from real-world and synthetic datasets, empowers MVGD with exceptional generalization capabilities. This massive dataset enables:

    • Zero-Shot Generalization: MVGD demonstrates robust performance on unseen domains without explicit fine-tuning.
    • Robustness to Dynamics: Despite not explicitly modeling motion, MVGD effectively handles scenes with moving objects.

    MVGD achieves state-of-the-art performance on benchmarks like RealEstate10K, CO3Dv2, and ScanNet, surpassing or matching existing methods in both novel view synthesis and multi-view depth estimation.

    MVGD introduces incremental conditioning and scalable fine-tuning, enhancing its versatility and efficiency.

    • Incremental conditioning allows for refining generated novel views by feeding them back into the model.
    • Scalable fine-tuning enables incremental model expansion, boosting performance without extensive retraining.

    The implications of MVGD are significant:

    • Simplified 3D Pipelines: Eliminating explicit 3D representations streamlines novel view synthesis and depth estimation.
    • Enhanced Realism: Joint RGB and depth generation provides lifelike, 3D-consistent novel viewpoints.
    • Scalability and Adaptability: MVGD handles varying numbers of input views, crucial for large-scale 3D capture.
    • Rapid Iteration: Incremental fine-tuning facilitates adaptation to new tasks and complexities.

    MVGD represents a significant leap forward in 3D synthesis, merging diffusion elegance with robust geometric cues to deliver photorealistic imagery and scale-aware depth. This breakthrough signals the emergence of “geometry-first” diffusion models, poised to revolutionize immersive content creation, autonomous navigation, and spatial AI.


    Check out the Paper. All credit for this research goes to the researchers of this project.
