Computer Vision Category - MarkTechPost

Subject-Driven Image Evaluation Gets Simpler: Google Researchers Introduce REFVNLI to Jointly Score Textual Alignment and Subject Consistency Without Costly APIs

Sajjad Ansari — Fri, 02 May 2025 19:57:41 +0000

Text-to-image (T2I) generation has evolved to include subject-driven approaches, which enhance standard T2I models by incorporating reference images alongside text prompts. This advancement allows for more precise subject representation in generated images. Despite the promising applications, subject-driven T2I generation faces a significant challenge of lacking reliable automatic evaluation methods. Current metrics focus either on text-prompt alignment or subject consistency, when both are essential for effective subject-driven generation. While more correlative evaluation methods exist, they rely on costly API calls to models like GPT-4, limiting their practicality for extensive research applications.

Evaluation approaches for Visual Language Models (VLMs) include various frameworks, with text-to-image (T2I) assessments focusing on image quality, diversity, and text alignment. Researchers utilize embedding-based metrics like CLIP and DINO for subject-driven generation evaluation to measure subject preservation. Complex metrics such as VIEScore and DreamBench++ utilize GPT-4o to evaluate textual alignment and subject consistency, but at a higher computational cost. Subject-driven T2I methods have developed along two main paths: fine-tuning general models into specialized versions capturing specific subjects and styles, or enabling broader applicability through one-shot examples. These one-shot approaches include adapter-based and adapter-free techniques.

Researchers from Google Research and Ben Gurion University have proposed REFVNLI, a cost-efficient metric that simultaneously evaluates textual alignment and subject preservation in subject-driven T2I generation. It predicts two scores, textual alignment and subject consistency, in a single classification based on a triplet ref, prompt, image_tgt>. It is trained on an extensive dataset derived from video-reasoning benchmarks and image perturbations, outperforming or matching existing baselines across multiple benchmarks and subject categories. REFVNLI shows improvements of up to 6.4 points in textual alignment and 8.5 points in subject consistency. It is effective with lesser-known concepts, where it aligns with human preferences at over 87% accuracy.

For training REFVNLI, a large-scale dataset of triplets ref, prompt, image_tgt>, labeled with , is curated automatically. REFVNLI is evaluated on multiple human-labeled test sets for subject-driven generation, including DreamBench++, ImagenHub, and KITTEN. The evaluation spans diverse categories such as Humans, Animals, Objects, Landmarks, and multi-subject settings. The training process involves fine-tuning PaliGemma, a 3B Vision-Language Model, focusing on a variant adapted for multi-image inputs. During inference, the model takes two images and a prompt with special markups around the referenced subject, performing sequential binary classifications for textual alignment and subject preservation.

For subject consistency, REFVNLI ranks among the top two metrics across all categories and performs best in the Object category, exceeding the GPT4o-based DreamBench++ by 6.3 points. On ImagenHub, REFVNLI achieves top-two rankings for textual alignment in the Animals category and the highest score for Objects, outperforming the best non-finetuned model by 4 points. It also performs well in Multi-subject settings, ranking in the top three. REFVNLI achieves the highest textual alignment score on KITTEN, but has limitations in subject consistency due to its identity-sensitive training that penalizes even minor mismatches in identity-defining traits. Ablation studies reveal that joint training provides complementary benefits, with single-task training resulting in performance drops.

In this paper, researchers introduced REFVNLI, a reliable, cost-effective metric for subject-driven T2I generation that addresses both textual alignment and subject preservation challenges. Trained on an extensive auto-generated dataset, REFVNLI effectively balances robustness to identity-agnostic variations such as pose, lighting, and background with sensitivity to identity-specific traits, including facial features, object shape, and unique details. Future research directions include enhancing REFVNLI’s evaluation capabilities across artistic styles, handling textual modifications that explicitly alter identity-defining attributes, and improving the processing of multiple reference images for single and distinct subjects.

Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Subject-Driven Image Evaluation Gets Simpler: Google Researchers Introduce REFVNLI to Jointly Score Textual Alignment and Subject Consistency Without Costly APIs appeared first on MarkTechPost.

UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs

Sana Hassan — Tue, 29 Apr 2025 20:28:32 +0000

The CLIP framework has become foundational in multimodal representation learning, particularly for tasks such as image-text retrieval. However, it faces several limitations: a strict 77-token cap on text input, a dual-encoder design that separates image and text processing, and a limited compositional understanding that resembles bag-of-words models. These issues hinder its effectiveness in capturing nuanced, instruction-sensitive semantics. Although MLLMs like LLaVA, Qwen2-VL, and CogVLM offer significant advances in vision-language reasoning, their autoregressive next-token prediction objective restricts their ability to learn generalized, transferable embeddings. This has sparked growing interest in developing alternative methods that can combine the strengths of both contrastive learning and LLM-based reasoning.

Recent approaches aim to overcome these limitations by employing novel architectures and training strategies. For instance, E5-V proposes unimodal contrastive training for aligning cross-modal features, while VLM2Vec introduces the MMEB benchmark to convert advanced vision-language models into effective embedding generators. Models like LLM2Vec and NV-Embed enhance text-based representation learning by modifying the attention mechanisms in decoder-only LLMs. Despite these innovations, challenges such as handling long sequences, enabling better cross-modal fusion, and effectively distinguishing hard negatives in contrastive learning remain. As multimodal applications expand, there is a pressing need for representation learning methods that are both scalable and capable of fine-grained semantic alignment.

Researchers from institutions including The University of Sydney, DeepGlint, Tongyi Lab at Alibaba, and Imperial College London introduce UniME, a two-stage framework designed to improve multimodal representation learning using MLLMs. The first stage applies textual discriminative knowledge distillation from a strong LLM teacher to enhance the language encoder. The second stage employs hard negative enhanced instruction tuning, which involves filtering false negatives and sampling multiple challenging negatives per instance to improve the model’s discriminative and instruction-following abilities. Evaluations on the MMEB benchmark and various retrieval tasks show that UniME delivers consistent and significant improvements in both performance and compositional understanding.

The UniME framework introduces a two-stage method for learning universal multimodal embeddings using MLLMs. First, it employs textual discriminative knowledge distillation, where a student MLLM is trained using text-only prompts and supervised by a teacher model to enhance embedding quality. Then, a second stage—hard negative enhanced instruction tuning—improves cross-modal alignment and task performance by filtering false negatives and sampling hard negatives. This stage also leverages task-specific prompts to enhance instruction-following for various applications, such as retrieval and visual question answering. Together, these stages significantly boost UniME’s performance on both in- and out-of-distribution tasks.

The study evaluated UniME on Phi3.5-V and LLaVA-1.6 using PyTorch with DeepSpeed for efficient training across 8 NVIDIA A100 GPUs. Training consisted of two stages: a textual knowledge distillation phase using the NLI dataset (273,000 pairs) and a hard negative instruction tuning phase on 662,000 multimodal pairs. NV-Embed V2 served as the teacher model. UniME was evaluated on 36 MMEB benchmark datasets, achieving consistent improvements over baselines such as E5-V and VLM2Vec. Hard negatives significantly improved the model’s ability to distinguish subtle differences, thereby enhancing its performance, particularly in long-caption and compositional retrieval tasks. Ablation studies confirmed the effectiveness of both training stages and tuning parameters.

In conclusion, UniME is a two-stage framework designed to improve multimodal representation learning using MLLMs. In the first stage, UniME distills textual discriminative knowledge from a large language model to strengthen the language embeddings of the MLLM. In the second stage, it enhances learning through instruction tuning with multiple hard negatives per batch, reducing false negative interference and encouraging the model to distinguish challenging examples. Extensive evaluation on MMEB and various retrieval tasks demonstrates that UniME consistently boosts performance, offering strong discriminative and compositional abilities across tasks, thereby surpassing the limitations of prior models, such as CLIP.

Check out the Paper and Code. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs appeared first on MarkTechPost.

ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

Sana Hassan — Mon, 28 Apr 2025 20:24:36 +0000

Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and more user-friendly.

Advancements in visual-language models have significantly enhanced the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant hindrance to progress. Traditional short-form video tasks, like video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amidst substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers from Queen Mary University and Spotify introduce ViSMaP, an unsupervised method for summarising hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence for generation, evaluation, and prompt optimisation. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labelling.

The study addresses cross-domain video summarization by training on a labelled short-form video dataset and adapting to unlabelled, hour-long videos from a different domain. Initially, a model is trained to summarize 3-minute videos using TimeSFormer features, a visual-language alignment module, and a text decoder, optimized by cross-entropy and contrastive losses. To handle longer videos, they are segmented into 3-minute clips, and pseudo-captions are generated. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) refines summaries. Finally, the model is fine-tuned on these pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptation.

The study evaluates VisMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. VisMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT3.5, demonstrating competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, METEOR scores, and QA accuracy. Ablation studies highlight the benefits of meta-prompting and component modules, such as contrastive learning and SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.

In conclusion, ViSMaP is an unsupervised approach for summarizing long videos by utilizing annotated short-video datasets and a meta-prompting strategy. It first creates high-quality summaries through meta-prompting and then trains a summarization model, reducing the need for extensive annotations. Experimental results demonstrate that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo labels from a source-domain model may impact performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.

Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets appeared first on MarkTechPost.

Meta AI Introduces Token-Shuffle: A Simple AI Approach to Reducing Image Tokens in Transformers

Asif Razzaq — Sat, 26 Apr 2025 04:38:25 +0000

Autoregressive (AR) models have made significant advances in language generation and are increasingly explored for image synthesis. However, scaling AR models to high-resolution images remains a persistent challenge. Unlike text, where relatively few tokens are required, high-resolution images necessitate thousands of tokens, leading to quadratic growth in computational cost. As a result, most AR-based multimodal models are constrained to low or medium resolutions, limiting their utility for detailed image generation. While diffusion models have shown strong performance at high resolutions, they come with their own limitations, including complex sampling procedures and slower inference. Addressing the token efficiency bottleneck in AR models remains an important open problem for enabling scalable and practical high-resolution image synthesis.

Meta AI Introduces Token-Shuffle

Meta AI introduces Token-Shuffle, a method designed to reduce the number of image tokens processed by Transformers without altering the fundamental next-token prediction reach. The key insight underpinning Token-Shuffle is the recognition of dimensional redundancy in visual vocabularies used by multimodal large language models (MLLMs). Visual tokens, typically derived from vector quantization (VQ) models, occupy high-dimensional spaces but carry a lower intrinsic information density compared to text tokens. Token-Shuffle exploits this by merging spatially local visual tokens along the channel dimension before Transformer processing and subsequently restoring the original spatial structure after inference. This token fusion mechanism allows AR models to handle higher resolutions with significantly reduced computational cost while maintaining visual fidelity.

Technical Details and Benefits

Token-Shuffle consists of two operations: token-shuffle and token-unshuffle. During input preparation, spatially neighboring tokens are merged using an MLP to form a compressed token that preserves essential local information. For a shuffle window size sss, the number of tokens is reduced by a factor of s2s^2s2, leading to a substantial reduction in Transformer FLOPs. After the Transformer layers, the token-unshuffle operation reconstructs the original spatial arrangement, again assisted by lightweight MLPs.

By compressing token sequences during Transformer computation, Token-Shuffle enables the efficient generation of high-resolution images, including those at 2048×2048 resolution. Importantly, this approach does not require modifications to the Transformer architecture itself, nor does it introduce auxiliary loss functions or pretraining of additional encoders.

Furthermore, the method integrates a classifier-free guidance (CFG) scheduler specifically adapted for autoregressive generation. Rather than applying a fixed guidance scale across all tokens, the scheduler progressively adjusts guidance strength, minimizing early token artifacts and improving text-image alignment.

Results and Empirical Insights

Token-Shuffle was evaluated on two major benchmarks: GenAI-Bench and GenEval. On GenAI-Bench, using a 2.7B parameter LLaMA-based model, Token-Shuffle achieved a VQAScore of 0.77 on “hard” prompts, outperforming other autoregressive models such as LlamaGen by a margin of +0.18 and diffusion models like LDM by +0.15. In the GenEval benchmark, it attained an overall score of 0.62, setting a new baseline for AR models operating in the discrete token regime.

Large-scale human evaluation further supported these findings. Compared to LlamaGen, Lumina-mGPT, and diffusion baselines, Token-Shuffle showed improved alignment with textual prompts, reduced visual flaws, and higher subjective image quality in most cases. However, minor degradation in logical consistency was observed relative to diffusion models, suggesting avenues for further refinement.

In terms of visual quality, Token-Shuffle demonstrated the capability to produce detailed and coherent 1024×1024 and 2048×2048 images. Ablation studies revealed that smaller shuffle window sizes (e.g., 2×2) offered the best trade-off between computational efficiency and output quality. Larger window sizes provided additional speedups but introduced minor losses in fine-grained detail.

Conclusion

Token-Shuffle presents a straightforward and effective method to address the scalability limitations of autoregressive image generation. By leveraging the inherent redundancy in visual vocabularies, it achieves substantial reductions in computational cost while preserving, and in some cases improving, generation quality. The method remains fully compatible with existing next-token prediction frameworks, making it easy to integrate into standard AR-based multimodal systems.

The results demonstrate that Token-Shuffle can push AR models beyond prior resolution limits, making high-fidelity, high-resolution generation more practical and accessible. As research continues to advance scalable multimodal generation, Token-Shuffle provides a promising foundation for efficient, unified models capable of handling text and image modalities at large scales.

Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Meta AI Introduces Token-Shuffle: A Simple AI Approach to Reducing Image Tokens in Transformers appeared first on MarkTechPost.

Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning

Sana Hassan — Fri, 25 Apr 2025 21:25:59 +0000

Recent advancements in multimodal AI have highlighted a persistent challenge: achieving strong specialized reasoning capabilities while preserving generalization across diverse tasks. “Slow-thinking” models such as OpenAI-o1 and Gemini-Thinking have made strides in deliberate analytical reasoning but often exhibit compromised performance on general visual understanding tasks, with increased tendencies toward visual hallucinations. As the field progresses toward building general-purpose AI systems, reconciling this tradeoff remains a critical research problem.

Skywork AI Introduces Skywork R1V2

Skywork AI has released Skywork R1V2, a next-generation multimodal reasoning model designed to address the reasoning-generalization tradeoff systematically. Building upon the foundation of Skywork R1V, R1V2 introduces a hybrid reinforcement learning framework, combining reward-model guidance with structured rule-based signals. The model bypasses the conventional reliance on teacher-student distillation by learning directly from multimodal interactions, offering an open and reproducible advancement through its release on Hugging Face.

Technical Approach and Innovations

Skywork R1V2 incorporates Group Relative Policy Optimization (GRPO) alongside a Selective Sample Buffer (SSB) to enhance training stability and efficiency. GRPO enables relative evaluation among candidate responses within the same query group, but convergence issues can diminish effective learning signals. The SSB mechanism addresses this by maintaining a cache of informative samples, ensuring continuous access to high-value gradients.

Additionally, the model adopts a Mixed Preference Optimization (MPO) strategy, integrating reward-model-based preferences with rule-based constraints. This hybrid optimization allows Skywork R1V2 to strengthen step-by-step reasoning quality while maintaining consistency in general perception tasks. A modular training approach, utilizing lightweight adapters between a frozen Intern ViT-6B vision encoder and a pretrained language model, preserves the language model’s reasoning capabilities while optimizing cross-modal alignment efficiently.

Empirical Results and Analysis

Skywork R1V2 demonstrates robust performance across a range of reasoning and multimodal benchmarks. On text reasoning tasks, the model achieves 78.9% on AIME2024, 63.6% on LiveCodeBench, 73.2% on LiveBench, 82.9% on IFEVAL, and 66.3% on BFCL. These results represent significant improvements over Skywork R1V1 and are competitive with substantially larger models, such as Deepseek R1 (671B parameters).

In multimodal evaluation, R1V2 achieves 73.6% on MMMU, 74.0% on MathVista, 62.6% on OlympiadBench, 49.0% on MathVision, and 52.0% on MMMU-Pro. The model consistently outperforms open-source baselines of comparable or larger size, including Qwen2.5-VL-72B and QvQ-Preview-72B, particularly excelling in tasks that require structured problem-solving across visual and textual inputs.

When compared against proprietary models, R1V2 demonstrates narrowing performance gaps. It surpasses Claude 3.5 Sonnet and Gemini 2 Flash on critical multimodal benchmarks such as MMMU and MathVista. Importantly, hallucination rates were substantially reduced to 8.7% through calibrated reinforcement strategies, maintaining factual integrity alongside complex reasoning.

Qualitative assessments further illustrate R1V2’s systematic problem-solving approach, with the model demonstrating methodical decomposition and verification behaviors in complex scientific and mathematical tasks, reinforcing its alignment with reflective cognitive patterns.

Conclusion

Skywork R1V2 advances the state of multimodal reasoning through a carefully designed hybrid reinforcement learning framework. By addressing the vanishing advantages problem with the Selective Sample Buffer and balancing optimization signals through Mixed Preference Optimization, the model achieves notable improvements in both specialized reasoning tasks and general multimodal understanding.

With benchmark-leading performances such as 62.6% on OlympiadBench and 73.6% on MMMU, Skywork R1V2 establishes a strong open-source baseline. Its design principles and training methodology offer a pragmatic approach toward developing robust, efficient multimodal AI systems. Future directions for Skywork AI include enhancing general visual understanding capabilities while preserving the sophisticated reasoning foundations laid by R1V2.

Check out the Paper and Model on HuggingFace. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning appeared first on MarkTechPost.

Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models

Sana Hassan — Fri, 25 Apr 2025 06:23:31 +0000

Integrating long-context capabilities with visual understanding significantly enhances the potential of VLMs, particularly in domains such as robotics, autonomous driving, and healthcare. Expanding the context size enables VLMs to process extended video and text sequences, thereby enhancing temporal resolution and performance in complex tasks, such as video comprehension. However, one major limitation is the quadratic complexity of attention mechanisms during the pre-fill phase, which results in high latency before autoregressive decoding begins. This delay, known as Time-to-First-Token, makes real-world deployment of long-context VLMs challenging. Various sparse attention methods, such as Sparse Transformer, Swin Transformer, and StreamingLLM, overlook the specific sparse patterns found in VLMs with mixed modalities, thereby limiting their efficiency and effectiveness.

Unlike text-only inputs, visual and video data in VLMs demonstrate unique spatiotemporal attention structures, forming grid-like patterns due to local correlations. In mixed-modality scenarios, clear boundaries exist between different modalities, leading to distinct attention behaviors that general sparse methods fail to capture. Recent advancements, such as MInference and dynamic sparse attention approaches, aim to improve inference efficiency by adapting attention patterns online. Yet, these techniques often fall short in handling the intricacies of mixed-modality inputs. While vision token compression and RNN-Transformer hybrids have been explored to reduce computational load, most of these methods focus on long-video and short-text pairings, neglecting the more complex dynamics of multiturn, mixed-modality interactions, which are increasingly important in practical applications.

Researchers from the University of Surrey and Microsoft have introduced MMInference, a dynamic, sparse attention method designed to accelerate the pre-filling stage of long-context VLMs. By identifying grid-like sparsity patterns in video inputs and distinct modality boundaries, MMInference applies permutation-based strategies to optimize attention computation. It dynamically constructs sparse distributions for each input and utilizes custom GPU kernels for enhanced efficiency, all without requiring modifications to existing models. Tested on benchmarks like Video QA, Captioning, and Vision-NIAH, MMInference achieved up to 8.3× speedup at 1M tokens, outperforming previous methods while maintaining high accuracy across multiple state-of-the-art VLMs.

MMInference is a framework designed to speed up the pre-filling phase of long-context vision-language models by leveraging modality-aware sparse attention. It integrates three key components: (1) intra-modality sparse patterns like Grid, A-shape, and Vertical-Slash attention; (2) cross-modality patterns such as Q-Boundary and 2D-Boundary; and (3) a modality-aware sparse attention search algorithm. Instead of dense computation, it uses dynamic sparse attention with optimized GPU kernels and efficient tensor handling. The framework dynamically identifies attention patterns and permutes tensors based on modality, enabling efficient handling of multi-modal inputs and reducing computational overhead while maintaining strong performance.

The study evaluates MMInference’s performance and efficiency on long-video tasks, including captioning, question answering, and retrieval in both unimodal and mixed-modality settings. Experiments were conducted using state-of-the-art models, such as Llava-Video and LongVILA, with comparisons against several sparse attention baselines. Results show that MMInference achieves near full-attention performance while being more computationally efficient. It performs particularly well in the newly introduced Mixed-Modality Needle in a Haystack (MM-NIAH) task by leveraging inter-modality sparse patterns. Additionally, MMInference demonstrates significant speedups in end-to-end latency and maintains robustness across varying context lengths and input types.

In conclusion, MMInference is a modality-aware sparse attention technique designed to accelerate long-context VLMs without compromising accuracy. It employs a permutation-based grid attention pattern tailored for the spatial-temporal locality of video inputs, along with specialized handling for mixed-modality boundaries. A search algorithm identifies optimal sparse patterns per attention head, dynamically adapting to the input. The method integrates directly into current VLM pipelines without requiring model changes or fine-tuning. With optimized GPU kernels, MMInference achieves up to 8.3× acceleration during the pre-filling stage at 1M tokens across various tasks, including video QA, captioning, and mixed-modality benchmarks, while retaining full-attention performance.

Check out the Paper and Code. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Microsoft Research Introduces MMInference to Accelerate Pre-filling for Long-Context Vision-Language Models appeared first on MarkTechPost.

NVIDIA AI Releases Describe Anything 3B: A Multimodal LLM for Fine-Grained Image and Video Captioning

Asif Razzaq — Wed, 23 Apr 2025 16:51:26 +0000

Challenges in Localized Captioning for Vision-Language Models

Describing specific regions within images or videos remains a persistent challenge in vision-language modeling. While general-purpose vision-language models (VLMs) perform well at generating global captions, they often fall short in producing detailed, region-specific descriptions. These limitations are amplified in video data, where models must account for temporal dynamics. Primary obstacles include a loss of fine-grained detail during visual feature extraction, insufficient annotated datasets tailored for regional description, and evaluation benchmarks that penalize accurate outputs due to incomplete reference captions.

Describe Anything 3B—A Model Tailored for Localized Descriptions

This AI work from NVIDIA presents Describe Anything 3B (DAM-3B), a multimodal large language model purpose-built for detailed, localized captioning across images and videos. Accompanied by DAM-3B-Video, the system accepts inputs specifying regions via points, bounding boxes, scribbles, or masks and generates contextually grounded, descriptive text. It is compatible with both static imagery and dynamic video inputs, and the models are publicly available via Hugging Face.

Core Architectural Components and Model Design

DAM-3B incorporates two principal innovations: a focal prompt and a localized vision backbone enhanced with gated cross-attention. The focal prompt fuses a full image with a high-resolution crop of the target region, retaining both regional detail and broader context. This dual-view input is processed by the localized vision backbone, which embeds the image and mask inputs and applies cross-attention to blend global and focal features before passing them to a large language model. These mechanisms are integrated without inflating token length, preserving computational efficiency.

DAM-3B-Video extends this architecture to temporal sequences by encoding frame-wise region masks and integrating them across time. This allows region-specific descriptions to be generated for videos, even in the presence of occlusion or motion.

Training Data Strategy and Evaluation Benchmarks

To overcome data scarcity, NVIDIA develops the DLC-SDP pipeline—a semi-supervised data generation strategy. This two-stage process utilizes segmentation datasets and unlabeled web-scale images to curate a training corpus of 1.5 million localized examples. Region descriptions are refined using a self-training approach, producing high-quality captions.

For evaluation, the team introduces DLC-Bench, which assesses description quality based on attribute-level correctness rather than rigid comparisons with reference captions. DAM-3B achieves leading performance across seven benchmarks, surpassing baselines like GPT-4o and VideoRefer. It demonstrates strong results in keyword-level (LVIS, PACO), phrase-level (Flickr30k Entities), and multi-sentence localized captioning (Ref-L4, HC-STVG). On DLC-Bench, DAM-3B achieves an average accuracy of 67.3%, outperforming other models in both detail and precision.

Conclusion

Describe Anything 3B addresses longstanding limitations in region-specific captioning by combining a context-aware architecture with a scalable, high-quality data pipeline. The model’s ability to describe localized content in both images and videos has broad applicability across domains such as accessibility tools, robotics, and video content analysis. With this release, NVIDIA provides a robust and reproducible benchmark for future research and sets a refined technical direction for the next generation of multimodal AI systems.

Check out the Paper, Model on Hugging Face and Project Page. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post NVIDIA AI Releases Describe Anything 3B: A Multimodal LLM for Fine-Grained Image and Video Captioning appeared first on MarkTechPost.

Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

Sana Hassan — Tue, 22 Apr 2025 22:56:44 +0000

Diffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which helps the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers apply the transformer architecture, which has proven effective after sufficient training. However, their training process is slow and computationally intensive. A key limitation lies in their architecture: during each denoising step, the model must balance encoding low-frequency semantic information while simultaneously decoding high-frequency details using the same modules—this creates an optimization conflict between the two tasks.

To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through various strategies. These include utilizing optimized attention mechanisms, such as linear and sparse attention, to reduce computational costs, and introducing more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize the learning process. Additionally, methods like REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to advanced areas such as text-to-image and text-to-video generation.

Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance.

The DDT introduces a condition encoder and a velocity decoder to handle low- and high-frequency components in image generation separately. The encoder extracts semantic features (zt) from noisy inputs, timesteps, and class labels, which are then used by the decoder to estimate the velocity field. To ensure consistency of zt across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing zt at certain timesteps. A dynamic programming approach identifies the optimal timesteps for recomputing zt, minimizing performance loss while accelerating the sampling process.

The researchers trained their models on 256×256 ImageNet using a batch size of 256 without gradient clipping or warm-up. Using VAE-ft-EMA and Euler sampling, they evaluated performance using FID, sFID, IS, Precision, and Recall. They built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm sampling. Their DDT models consistently outperformed prior baselines, particularly in larger sizes, and converged significantly faster than REPA. Further gains were achieved through encoder sharing strategies and careful tuning of the encoder-decoder ratio, resulting in state-of-the-art FID scores on both 256×256 and 512×512 ImageNet.

In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load.

Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing appeared first on MarkTechPost.

Long-Context Multimodal Understanding No Longer Requires Massive Models: NVIDIA AI Introduces Eagle 2.5, a Generalist Vision-Language Model that Matches GPT-4o on Video Tasks Using Just 8B Parameters

Asif Razzaq — Tue, 22 Apr 2025 06:36:37 +0000

In recent years, vision-language models (VLMs) have advanced significantly in bridging image, video, and textual modalities. Yet, a persistent limitation remains: the inability to effectively process long-context multimodal data such as high-resolution imagery or extended video sequences. Many existing VLMs are optimized for short-context scenarios and struggle with performance degradation, inefficient memory usage, or loss of semantic detail when scaled to handle longer inputs. Addressing these limitations requires not only architectural flexibility but also dedicated strategies for data sampling, training, and evaluation.

Eagle 2.5: A Generalist Framework for Long-Context Learning

NVIDIA introduces Eagle 2.5, a family of vision-language models designed for long-context multimodal learning. Unlike models that simply accommodate more input tokens, Eagle 2.5 demonstrates measurable and consistent performance improvements as input length increases. The system is developed with a focus on both video and image understanding at scale, targeting tasks where the richness of long-form content is critical.

Eagle 2.5 operates with a relatively compact 8B parameter count and yet achieves strong results across established benchmarks. On Video-MME (with 512-frame input), the model scores 72.4%, approaching or matching results from significantly larger models such as Qwen2.5-VL-72B and InternVL2.5-78B. Notably, these gains are achieved without relying on task-specific compression modules, reflecting the model’s generalist design philosophy.

Training Strategy: Context-Aware Optimization

The effectiveness of Eagle 2.5 stems from two complementary training strategies: information-first sampling and progressive post-training.

Information-First Sampling prioritizes retention of critical visual and semantic content. It introduces Image Area Preservation (IAP), a tiling scheme that maintains over 60% of the original image area while minimizing aspect ratio distortion. Additionally, Automatic Degradation Sampling (ADS) dynamically balances visual and textual inputs based on context length constraints, preserving full textual sequences and adaptively optimizing visual granularity.
Progressive Post-Training incrementally increases the model’s context window—moving through 32K, 64K, and 128K token stages. This gradual exposure allows the model to develop consistent capabilities across input lengths. The method avoids overfitting to any single context range and helps maintain stable performance in diverse inference scenarios.

These approaches are underpinned by an architecture based on SigLIP for vision encoding and MLP projection layers for alignment with the language model backbone. The system forgoes domain-specific compression components to retain flexibility across varied task types.

Eagle-Video-110K: Structured Data for Extended Video Comprehension

A key component of Eagle 2.5 is its training data pipeline, which integrates both open-source resources and a custom-curated dataset: Eagle-Video-110K. This dataset is constructed to support long-form video understanding and adopts a dual annotation scheme:

A top-down approach introduces story-level segmentation using human-annotated chapter metadata and GPT-4-generated dense captions and question-answer pairs.
A bottom-up method generates QA pairs for short clips using GPT-4o, augmented with time and textual context anchors to capture spatiotemporal detail.

The dataset collection emphasizes diversity over redundancy. A cosine similarity-based selection process filters novel content from sources such as InternVid, Shot2Story, and VidChapters. This results in a corpus with both narrative coherence and granular annotations, enabling models to capture hierarchical information across time.

Performance and Benchmarking

Eagle 2.5-8B exhibits robust performance across multiple video and image understanding tasks. On video benchmarks, it scores 74.8 on MVBench, 77.6 on MLVU, and 66.4 on LongVideoBench. On image benchmarks, the model attains 94.1 on DocVQA, 87.5 on ChartQA, and 80.4 on InfoVQA, among others.

Ablation studies confirm the importance of Eagle’s sampling strategies. Removal of IAP leads to performance degradation in high-resolution benchmarks, while omitting ADS reduces effectiveness in tasks requiring dense supervision. The model also benefits from progressive training: sequentially increasing context lengths provides more stable gains compared to one-shot long-context training. Importantly, the addition of Eagle-Video-110K notably enhances performance at higher frame counts (≥128 frames), underscoring the value of dedicated long-form datasets.

Conclusion

Eagle 2.5 presents a technically grounded approach to long-context vision-language modeling. Its emphasis on preserving contextual integrity, gradual training adaptation, and dataset diversity enables it to achieve strong performance while maintaining architectural generality. Without relying on model scaling alone, Eagle 2.5 demonstrates that careful training strategies and data design can yield competitive, efficient systems for complex multimodal understanding tasks. This positions Eagle 2.5 as a valuable step forward in building more context-aware AI systems suited for real-world multimedia applications.

Check out the Paper, GitHub Page and Project Page. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Long-Context Multimodal Understanding No Longer Requires Massive Models: NVIDIA AI Introduces Eagle 2.5, a Generalist Vision-Language Model that Matches GPT-4o on Video Tasks Using Just 8B Parameters appeared first on MarkTechPost.

Stanford Researchers Propose FramePack: A Compression-based AI Framework to Tackle Drifting and Forgetting in Long-Sequence Video Generation Using Efficient Context Management and Sampling

Sana Hassan — Mon, 21 Apr 2025 16:46:43 +0000

Video generation, a branch of computer vision and machine learning, focuses on creating sequences of images that simulate motion and visual realism over time. It requires models to maintain coherence across frames, capture temporal dynamics, and generate new visuals conditioned on prior frames or inputs. This domain has seen rapid advances, especially with the integration of DL techniques such as diffusion models and transformers. These models have empowered systems to produce increasingly longer and higher-quality video sequences. However, generating coherent frames across extended sequences remains computationally intensive and prone to degradation in quality due to issues like memory limitations and accumulated prediction errors.

A major challenge in video generation is maintaining visual consistency while minimizing computational overhead. As frames are generated sequentially, any error in earlier frames tends to propagate, leading to noticeable visual drift in longer sequences. Simultaneously, models struggle to retain memory of initial frames, causing inconsistencies in motion and structure, often referred to as the forgetting problem. Efforts to address one issue tend to worsen the other. Increasing memory depth enhances temporal coherence but also accelerates the spread of errors. Reducing dependence on prior frames helps curb error accumulation but increases the likelihood of inconsistency. Balancing these conflicting requirements is a fundamental obstacle in next-frame prediction tasks.

Various techniques have emerged to mitigate forgetting and drifting. Noise scheduling and augmentation methods adjust the input conditions to modulate the influence of past frames, as seen in frameworks like DiffusionForcing and RollingDiffusion. Anchor-based planning methods and guidance using history frames have also been tested. Also, a range of architectures aim to improve efficiency, linear and sparse attention mechanisms, low-bit computations, and distillation approaches help reduce resource demands. Long video generation frameworks like Phenaki, NUWA-XL, and StreamingT2V introduce structural changes or novel generation paradigms to extend temporal coherence. Despite these innovations, the field still lacks a unified and computationally efficient approach that can reliably balance memory and error control.

Researchers at Stanford University introduced a new architecture called FramePack to address these interlinked challenges. This structure hierarchically compresses input frames based on their temporal importance, ensuring that recent frames receive higher fidelity representation while older ones are progressively downsampled. By doing so, the method maintains a fixed transformer context length regardless of the video’s duration. This effectively removes the context length bottleneck and allows for efficient scaling without exponential growth in computation. In parallel, FramePack incorporates anti-drifting sampling techniques that utilize bi-directional context by generating anchor frames first, particularly the beginning and end of a sequence, before interpolating the in-between content. Another variant even reverses the generation order, starting from the last known high-quality frame and working backward. This inverted sampling proves particularly effective in scenarios such as image-to-video generation, where a static image is used to generate a full motion sequence.

The FramePack design is built around a prioritized compression system that limits the transformer’s total context length. In standard video diffusion models like Hunyuan or Wan, each 480p frame generates approximately 1560 tokens of context. When predicting the next frame using a Diffusion Transformer (DiT), the total context length increases linearly with the number of input and output frames. For example, with 100 input frames and one predicted frame, the context length could exceed 157,000 tokens, which becomes computationally impractical.

FramePack addresses this by applying a progressive compression schedule based on frame importance. More recent frames are considered more relevant and are allocated higher resolution, while older frames are increasingly downsampled. The compression follows a geometric progression controlled by a parameter, typically set to 2, which reduces the context length for each earlier frame by half. For instance, the most recent frame may use full resolution, the next one half, then a quarter, and so on. This design ensures that the total context length stays within a fixed limit, no matter how many frames are input.

Compression is implemented using 3D patchifying kernels, such as (2, 4, 4), (4, 8, 8), and (8, 16, 16), which control how frames are broken into smaller patches before processing. These kernels are trained with independent parameters to stabilize learning. For cases where the input sequence is extremely long, low-importance tail frames are either dropped, minimally included, or globally pooled to avoid unnecessary overhead. This allows FramePack to manage videos of arbitrary length efficiently while maintaining high model performance.

Performance metrics confirm the practical value of FramePack. When integrated into pretrained diffusion models like HunyuanVideo and Wan, FramePack reduced the memory usage per step while enabling larger batch sizes, up to the scale commonly used in image diffusion training. The anti-drifting techniques substantially improved visual quality. By reducing the diffusion scheduler’s aggressiveness and balancing the shift timesteps, the models showed fewer artifacts and greater frame-to-frame coherence. The inverted sampling approach, particularly, resulted in better approximation of known frames, enabling high-fidelity generation when a target image is known. These improvements occurred without additional training from scratch, demonstrating the adaptability of the FramePack module as a plug-in enhancement to existing architectures.

This research thoroughly examines and addresses the core difficulties of next-frame video generation. The researchers developed FramePack, an approach that applies progressive input compression and modified sampling strategies to ensure scalable, high-quality video generation. Through fixed context lengths, adaptive patchifying, and innovative sampling order, FramePack succeeds in preserving both memory and visual clarity over long sequences. Its modular integration into pretrained models highlights its practical utility and future potential across varied video generation applications.

Several Key Takeaways from the Research on Framepack include:

FramePack ensures a fixed transformer context length, allowing models to scale to longer video sequences without increased computational cost.
Uses a geometric progression (λ = 2) to compress earlier frames, significantly reducing the context length even for large numbers of input frames.
Applies 3D patchify kernels like (2, 4, 4), (4, 8, 8), and (8, 16, 16), each trained with independent parameters to ensure stable learning.
Anti-drifting sampling methods leverage bi-directional context and early endpoint generation, improving overall video quality.
Inverted temporal sampling excels in image-to-video generation tasks by anchoring on high-quality user input frames.
Enables image-diffusion scale batch sizes in training, leading to efficient learning and higher throughput.
Integrates with existing models like HunyuanVideo and Wan without requiring full retraining.
Provides multiple tail-handling strategies (e.g., global pooling, minimal inclusion), showing negligible impact on visual fidelity.

Check out the Paper and GitHub Page. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

The post Stanford Researchers Propose FramePack: A Compression-based AI Framework to Tackle Drifting and Forgetting in Long-Sequence Video Generation Using Efficient Context Management and Sampling appeared first on MarkTechPost.