Divyesh Vitthal Jawkhede, Author at MarkTechPost
An Artificial Intelligence News Platform

STORM (Spatiotemporal TOken Reduction for Multimodal LLMs): A Novel AI Architecture Incorporating a Dedicated Temporal Encoder between the Image Encoder and the LLM
https://www.marktechpost.com/2025/03/11/storm-spatiotemporal-token-reduction-for-multimodal-llms-a-novel-ai-architecture-incorporating-a-dedicated-temporal-encoder-between-the-image-encoder-and-the-llm/
Tue, 11 Mar 2025

Understanding videos with AI requires handling sequences of images efficiently. A major challenge for current video-based AI models is that they do not process videos as a continuous flow, so important motion details are missed and continuity is disrupted. This lack of temporal modeling makes it hard to trace how things change, leaving events and interactions only partially understood. Long videos compound the problem: computational costs grow quickly, and workarounds such as frame skipping discard valuable information and reduce accuracy. Redundant content shared across neighboring frames is also not compressed effectively, wasting computation on repeated information.

Currently, video-language models treat videos as static frame sequences processed by image encoders and vision-language projectors, which makes motion and continuity hard to represent. The language model is left to infer temporal relations on its own, resulting in partial comprehension. Frame subsampling reduces the computational load but removes useful details and hurts accuracy. Token reduction methods such as recursive KV cache compression and frame selection add complexity without yielding much improvement. Advanced video encoders and pooling methods help, but they remain inefficient and poorly scalable, keeping long-video processing computationally intensive.

To address these challenges, researchers from NVIDIA, Rutgers University, UC Berkeley, MIT, Nanjing University, and KAIST proposed STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a Mamba-based temporal projector architecture for efficient processing of long videos. Unlike traditional methods, which encode each frame independently and leave the language model to infer temporal relations, STORM injects temporal information directly into the video tokens, eliminating redundant computation and improving efficiency. The model enriches video representations with a bidirectional spatiotemporal scanning mechanism while offloading much of the temporal-reasoning burden from the LLM.

The framework uses Mamba layers to enhance temporal modeling, incorporating a bidirectional scanning module that captures dependencies across spatial and temporal dimensions. The temporal encoder processes image and video inputs differently: for images it acts as a spatial scanner that integrates global spatial context, and for videos it acts as a spatiotemporal scanner that captures temporal dynamics. During training, token compression techniques improve computational efficiency while preserving essential information, allowing inference on a single GPU. At test time, training-free token subsampling further reduces the computational burden while retaining important temporal details. Together, these techniques enable efficient processing of long videos without specialized hardware or extensive model modifications.
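
To make the projector idea concrete, below is a minimal, hypothetical PyTorch sketch of the two mechanisms described above: a bidirectional scan over per-frame tokens (approximated here with generic recurrent layers rather than actual Mamba blocks) followed by temporal average pooling that compresses the token sequence before it reaches the LLM. The tensor shapes, module choices, and pooling factor are illustrative assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class BidirectionalTemporalProjector(nn.Module):
    """Toy stand-in for a Mamba-based temporal projector.

    Input:  (batch, frames, tokens_per_frame, dim) visual tokens from an image encoder.
    Output: (batch, frames // pool, tokens_per_frame, dim) temporally enriched, compressed tokens.
    """

    def __init__(self, dim: int, pool: int = 4):
        super().__init__()
        # GRUs are used purely as placeholder sequence models scanning over time;
        # the described method uses Mamba (state-space) layers instead.
        self.fwd = nn.GRU(dim, dim, batch_first=True)
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape
        # Scan each spatial token position across time, in both directions.
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)            # (b*n, t, d)
        fwd_out, _ = self.fwd(seq)
        bwd_out, _ = self.bwd(torch.flip(seq, dims=[1]))
        bwd_out = torch.flip(bwd_out, dims=[1])
        fused = self.proj(torch.cat([fwd_out, bwd_out], dim=-1))    # (b*n, t, d)
        fused = fused.reshape(b, n, t, d).permute(0, 2, 1, 3)       # (b, t, n, d)
        # Temporal pooling: average groups of `pool` consecutive frames into one.
        t_keep = (t // self.pool) * self.pool
        fused = fused[:, :t_keep].reshape(b, t_keep // self.pool, self.pool, n, d)
        return fused.mean(dim=2)                                    # (b, t/pool, n, d)

# Example: 32 frames of 16 tokens each are compressed to 8 frame groups.
tokens = torch.randn(2, 32, 16, 256)
projector = BidirectionalTemporalProjector(dim=256, pool=4)
print(projector(tokens).shape)  # torch.Size([2, 8, 16, 256])
```

Swapping the placeholder recurrent layers for Mamba-style state-space layers and a learned compression module would bring this sketch closer to the design the article describes.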

Experiments were conducted to evaluate the STORM model for video understanding. Training was performed using pre-trained SigLIP models, with a randomly initialized temporal projector. The process involved two stages: an alignment stage, where the image encoder and LLM were frozen while only the temporal projector was trained on image-text pairs, and a supervised fine-tuning (SFT) stage with a diverse dataset of 12.5 million samples, including text, image-text, and video-text data. Token compression methods, including temporal and spatial pooling, decreased the computational burden. The final model was evaluated on long-video benchmarks such as EgoSchema, MVBench, MLVU, LongVideoBench, and VideoMME, and its performance was compared with other video LLMs.

Upon evaluation, STORM outperformed existing models, achieving state-of-the-art results on these benchmarks. The Mamba module improved efficiency by compressing visual tokens while retaining key information, reducing inference time by up to 65.5%. Temporal pooling worked best on long videos, optimizing performance with fewer tokens. STORM also performed substantially better than the baseline VILA model, particularly on tasks that involved understanding the global context. The results confirmed the value of Mamba for optimized token compression, with performance gains growing as video length increased from 8 to 128 frames.

In summary, the proposed STORM model improved long-video understanding using a Mamba-based temporal encoder and efficient token reduction. It enabled strong compression without losing key temporal information, recording state-of-the-art performance on long-video benchmarks while keeping computation low. The method can act as a baseline for future research, facilitating innovation in token compression, multimodal alignment, and real-world deployment to improve video-language model accuracy and efficiency.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Revolutionizing Code Generation: µCODE’s Single-Step Approach to Multi-Turn Feedback
https://www.marktechpost.com/2025/03/10/revolutionizing-code-generation-%c2%b5codes-single-step-approach-to-multi-turn-feedback/
Mon, 10 Mar 2025

Generating code with execution feedback is difficult because errors often require multiple rounds of correction, and fixing them in a structured way is not simple. Training models to learn from execution feedback is necessary, but existing approaches face challenges. Some methods attempt to correct errors in a single step and fail when multiple refinements are needed. Others use complex learning techniques to optimize long-term improvements, yet these struggle with weak learning signals, making training slow and inefficient. The lack of an effective method for handling iterative corrections results in unstable learning and poor performance.

Currently, prompting-based systems try to solve multi-step tasks using self-debugging, test generation, and reflection, but they improve only slightly. Some methods train reward models, like CodeRL for fixing errors and ARCHER for structured decision-making, while others use Monte Carlo Tree Search (MCTS) but require too much computation. Verifier-based approaches, like “Let’s Verify Step by Step” and AlphaCode, help find mistakes or create test cases, but some models rely only on syntax checks, which are not enough for proper training. SCoRe limits the number of training steps, and RISE relies on complex corrections, making learning inefficient. Fine-tuned agents like FireAct and LEAP, and feedback-based models like RL4VLM and GLAM, also try to improve performance. However, current techniques either fail to refine code properly over multiple steps or are too unstable and inefficient.

To mitigate these issues, researchers proposed µCODE, a multi-turn code generation method that improves code using execution feedback. Existing approaches struggle with execution errors and the complexity of reinforcement learning, but µCODE overcomes these by following an expert-iteration framework with a local search expert. A verifier assesses code quality, while a generator learns from the best solutions, refining its output over multiple iterations. During inference, a Best-of-N search strategy generates and improves code based on execution results, ensuring better performance.

The framework first trains a verifier through supervised learning to score code snippets, making evaluations more reliable. Binary Cross-Entropy predicts correctness, while Bradley-Terry ranks solutions for better selection. The generator then learns iteratively by relabeling past outputs with expert-selected solutions, improving accuracy. Multiple solutions are produced at inference, and the verifier selects the best, refining outputs until all tests pass. By treating code generation as an imitation learning problem, µCODE eliminates complex exploration and enables efficient optimization.
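
To make the verifier-guided loop concrete, here is a minimal, hypothetical Python sketch of Best-of-N selection with iterative refinement. The `generate_candidates`, `verifier_score`, and `run_tests` functions are placeholders for the generator model, the learned verifier, and the execution environment; they are assumptions for illustration, not µCODE’s actual API.

```python
def best_of_n_refine(prompt, generate_candidates, verifier_score, run_tests,
                     n=8, max_turns=4):
    """Best-of-N search with a learned verifier over multiple feedback turns."""
    feedback = ""
    best_code = None
    for turn in range(max_turns):
        # Sample N candidate solutions conditioned on the prompt and prior feedback.
        candidates = generate_candidates(prompt, feedback, n=n)
        # Rank candidates by the learned verifier and keep the highest-scoring one.
        best_code = max(candidates, key=verifier_score)
        passed, error_log = run_tests(best_code)
        if passed:                       # stop as soon as all tests pass
            return best_code
        feedback = error_log             # execution feedback conditions the next turn
    return best_code                     # return the last best attempt if tests never pass
```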

Researchers evaluated µCODE’s effectiveness by comparing it with state-of-the-art methods, analyzing the impact of the learned verifier during training and inference, and assessing different loss functions for verifier training. The generator was initialized from Llama models, and experiments were conducted on the MBPP and HumanEval datasets. Training was performed on MBPP’s training set, with evaluations on its test set and on HumanEval. Comparisons included single-turn and multi-turn baselines such as STaR and Multi-STaR, where fine-tuning was based on correctly generated solutions. Performance was measured using Best-of-N (BoN) accuracy, with the verifier ranking candidate solutions at each turn.

Results indicated that multi-turn approaches performed better than single-turn methods, highlighting the benefits of execution feedback. µCODE outperformed Multi-STaR, achieving a 1.9% improvement on HumanEval with a 1B model. BoN search further enhanced performance, with µCODE showing a 12.8% gain over greedy decoding. The learned verifier (LV) improved training outcomes, surpassing the use of an oracle verifier (OV) alone. Further analysis showed that the learned verifier helped select better solutions during inference, particularly in the absence of public tests. Inference-time scaling revealed diminishing performance gains beyond a certain number of candidate solutions. A hierarchical verification strategy (PT+LV) integrating public test results with learned verifier scores provided the highest performance, showing the verifier’s effectiveness in eliminating erroneous solutions and guiding iterative predictions.

In conclusion, the proposed µCODE framework provides a scalable approach to multi-turn code generation using single-step rewards and a learned verifier for iterative improvement. Results indicate µCODE performs better than oracle-based approaches, producing more precise code. Though constrained by model size, dataset size, and Python focus, it can be a solid baseline for future work. Expanding training data, scaling to larger models, and applying it to multiple programming languages can further enhance its effectiveness.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Researchers from AMLab and CuspAI Introduced Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems
https://www.marktechpost.com/2025/03/07/researchers-from-amlab-and-cuspai-introduced-erwin-a-tree-based-hierarchical-transformer-for-large-scale-physical-systems/
Fri, 07 Mar 2025

Deep learning faces difficulties when applied to large physical systems on irregular grids, especially when interactions occur over long distances or at multiple scales, and handling these complexities becomes harder as the number of nodes grows. Existing techniques struggle at this scale, incurring high computational costs and inefficiency. The major issues are capturing long-range effects, handling multi-scale dependencies, and computing efficiently with minimal resources. These issues make it difficult to apply deep learning effectively to fields like molecular simulations, weather prediction, and particle mechanics, where large datasets and complex interactions are common.

Currently, deep learning methods struggle to scale attention mechanisms to large physical systems. Traditional self-attention computes interactions between all pairs of points, leading to extremely high computational costs. Some methods apply attention within small patches, like the Swin Transformer for images, but irregular data needs extra steps to structure it. Techniques like PointTransformer use space-filling curves, which can break spatial relationships. Hierarchical methods, such as H-transformer and OctFormer, group data at different levels but rely on costly operations. Cluster-attention methods reduce complexity by aggregating points, but this process loses fine detail and struggles with multi-scale interactions.

To address these problems, researchers from AMLab (University of Amsterdam) and CuspAI introduced Erwin, a hierarchical transformer that improves efficiency through ball tree partitioning. The data is partitioned hierarchically into balls, and attention is computed in parallel within each ball, which structures the computation and minimizes complexity without sacrificing accuracy, bridging the gap between the efficiency of tree-based methods and the generality of attention. Erwin applies self-attention in localized regions, with positional encoding and a distance-based attention bias to capture geometric structure. Cross-ball connections allow different regions to communicate, and tree coarsening and refinement mechanisms balance global and local interactions. This organized process delivers scalability and expressivity at minimal computational expense.
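
The core idea, splitting points into fixed-size balls and restricting attention to within each ball, can be sketched in a few lines. The following hypothetical NumPy/PyTorch example builds a simple ball-tree-style partition by recursive median splits and then applies standard multi-head attention independently within each group; the split rule, leaf size, and attention settings are illustrative assumptions rather than Erwin’s exact algorithm.

```python
import numpy as np
import torch
import torch.nn as nn

def ball_partition(points: np.ndarray, leaf_size: int = 64):
    """Recursively split point indices by the median along the widest axis."""
    def split(idx):
        if len(idx) <= leaf_size:
            return [idx]
        pts = points[idx]
        axis = np.argmax(pts.max(axis=0) - pts.min(axis=0))   # widest spatial extent
        order = idx[np.argsort(pts[:, axis])]
        mid = len(order) // 2
        return split(order[:mid]) + split(order[mid:])
    return split(np.arange(len(points)))

# Attention is computed only within each ball, so cost grows with ball size, not total points.
points = np.random.rand(1000, 3).astype(np.float32)
features = torch.randn(1000, 32)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

outputs = torch.empty_like(features)
for ball in ball_partition(points, leaf_size=64):
    idx = torch.as_tensor(ball, dtype=torch.long)
    x = features[idx].unsqueeze(0)               # (1, ball_size, dim)
    out, _ = attn(x, x, x)                        # local self-attention inside the ball
    outputs[idx] = out.squeeze(0)
```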

Researchers conducted experiments to evaluate Erwin. It outperformed equivariant and non-equivariant baselines in cosmological simulations, capturing long-range interactions and improving with larger training datasets. For molecular dynamics, it accelerated simulations by 1.7–2.5 times without compromising accuracy, surpassing MPNN and PointNet++ in runtime while maintaining competitive test loss. Erwin outperformed MeshGraphNet, GAT, DilResNet, and EAGLE in turbulent fluid dynamics, excelling in pressure prediction while being three times faster and using eight times less memory than EAGLE. Larger ball sizes in cosmology enhanced performance by retaining long-range dependencies but increased the computational runtime, and applying MPNN at the embedding step improved the local interactions in molecular dynamics.

In conclusion, the proposed hierarchical transformer effectively handles large-scale physical systems via ball tree partitioning and obtains state-of-the-art results in cosmology and molecular dynamics. Although its optimized structure trades expressivity against runtime, it still carries computational overhead from padding and high memory requirements. Future work can investigate learnable pooling and other geometric encoding strategies to enhance efficiency. Erwin’s performance and scalability across domains make it a reference point for further developments in modeling large particle systems, computational chemistry, and molecular dynamics.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Beyond Monte Carlo Tree Search: Unleashing Implicit Chess Strategies with Discrete Diffusion
https://www.marktechpost.com/2025/03/04/beyond-monte-carlo-tree-search-unleashing-implicit-chess-strategies-with-discrete-diffusion/
Wed, 05 Mar 2025

Large language models (LLMs) generate text step by step, which limits their ability to plan for tasks requiring multiple reasoning steps, such as structured writing or problem-solving. This lack of long-term planning affects their coherence and decision-making in complex scenarios. Some approaches evaluate various alternatives before making a choice, which improves prediction precision. However, they come with higher computational costs and are prone to errors when their forecasts of the future turn out to be incorrect.

Explicit search algorithms like Monte Carlo Tree Search (MCTS) and beam search are widely used in AI planning and decision-making but have inherent limitations. They rely on repeated simulations of the future, which raises computation costs and renders them unsuitable for real-time systems. They also depend on a value model to estimate every state; if that model is inaccurate, the error propagates through the search. Since longer predictions create more errors, these errors accumulate and degrade decision accuracy. This is particularly problematic in complex tasks requiring long-term planning, where maintaining accurate foresight becomes challenging and outcomes suffer.

To mitigate these issues, researchers from The University of Hong Kong, Shanghai Jiaotong University, Huawei Noah’s Ark Lab, and Shanghai AI Laboratory proposed DIFFUSEARCH. This discrete diffusion-based framework eliminates explicit search algorithms like MCTS. Instead of relying on costly search processes, DIFFUSEARCH trains the policy to directly predict and utilize future representations, refining predictions iteratively using diffusion models. Integrating the world model and policy into a single framework reduces computational overhead while improving efficiency and accuracy in long-term planning.

The framework trains the model using supervised learning, leveraging Stockfish as an oracle to label board states from chess games. Different future representations are examined, with the action-state (s-asa) method selected for simplicity and efficiency. Rather than directly predicting future sequences, the model utilizes discrete diffusion modeling, applying self-attention and iterative denoising to improve action predictions gradually. DIFFUSEARCH avoids costly marginalization over future states during inference by directly sampling from the trained model. An easy-first decoding strategy prioritizes more predictable tokens for denoising, enhancing accuracy. 
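
The easy-first decoding idea can be illustrated with a small, hypothetical sketch: starting from a fully masked future, the model repeatedly fills in the positions it is most confident about and re-predicts the rest. The `model` call signature, mask token id, and step schedule below are illustrative assumptions, not the paper’s implementation.

```python
import torch

def easy_first_decode(model, prompt_ids, future_len=16, mask_id=0, steps=8):
    """Iteratively unmask the most confident future tokens (easy-first order).

    Assumes model(prompt_ids, future) returns logits of shape (future_len, vocab).
    """
    future = torch.full((future_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(prompt_ids, future)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                 # confidence and argmax per position
        still_masked = future == mask_id
        if not still_masked.any():
            break
        # Reveal the k most confident positions among those still masked.
        k = max(1, int(still_masked.sum()) // (steps - step))
        conf = conf.masked_fill(~still_masked, -1.0)   # ignore already-decoded positions
        reveal = conf.topk(k).indices
        future[reveal] = pred[reveal]
    return future
```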

Researchers evaluated DIFFUSEARCH against three transformer-based baselines: State-Action (S-A), State-Value (S-V), and Action-Value (SA-V) models trained using behavioral cloning, value-based decision-making, and legal action comparison, respectively. Using a dataset of 100k chess games, with states encoded in FEN format and actions in UCI notation, they implemented GPT-2-based models with an Adam optimizer, a 3e-4 learning rate, a batch size of 1024, an 8-layer architecture (7M parameters), a horizon of 4, and diffusion timesteps set to 20. Evaluations included action accuracy, puzzle accuracy, and Elo ratings from a 6000-game internal tournament. DIFFUSEARCH outperformed S-A by 653 Elo and 19% in action accuracy and exceeded SA-V despite using 20 times fewer data records. Discrete diffusion with linear λt achieved the highest accuracy (41.31%), surpassing autoregressive and Gaussian methods. DIFFUSEARCH retained predictive ability in future moves, though accuracy declined over steps, and performance improved with more attention layers and refined decoding. Positioned as an implicit search method, it demonstrated competitiveness with explicit MCTS-based approaches.

In summary, the proposed model established that implicit search via discrete diffusion can effectively replace explicit search and improve chess decision-making. The model surpassed searchless and explicit policies and showed potential for learning future-aware strategies. Although it relies on an external oracle and a limited dataset, it points to future improvements through self-play and long-context modeling. More generally, the method could be applied to improve next-token prediction in language models, and it forms a basis for further investigation of implicit search in AI planning and decision-making.


Check out the Paper, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Accelerating AI: How Distilled Reasoners Scale Inference Compute for Faster, Smarter LLMs
https://www.marktechpost.com/2025/03/03/accelerating-ai-how-distilled-reasoners-scale-inference-compute-for-faster-smarter-llms/
Tue, 04 Mar 2025

Improving how large language models (LLMs) handle complex reasoning tasks while keeping computational costs low is a challenge. Generating multiple reasoning chains and selecting the best answer increases accuracy, but the process demands substantial memory and compute. Long reasoning chains or large batches are expensive to process and slow models down, making them inefficient under bounded computational resources. Alternative architectures process information faster and with less memory, but their capability on reasoning tasks is unclear. Understanding whether such models can match or exceed existing ones under limited resources is important for making LLMs more efficient.

Currently, methods to improve reasoning in large language models rely on generating multiple reasoning chains and selecting the best answer with techniques like majority voting and trained reward models. These methods improve accuracy but require heavy computation, making them ill-suited for large-scale use. The compute and memory demands of Transformer models also slow down inference. Recurrent models and linear attention methods are faster but have unproven effectiveness on reasoning. Knowledge distillation can transfer knowledge from large models to smaller ones, but whether strong reasoning abilities transfer across different architectures is unclear.

To mitigate these issues, researchers from University of Geneva, Together AI, Cornell University, EPFL, Carnegie Mellon University, Cartesia.ai, META and Princeton University proposed a distillation method to create subquadratic models with strong reasoning skills, improving efficiency while preserving reasoning capabilities. The distilled models outperformed their Transformer teachers on MATH and GSM8K tasks, achieving similar accuracy with 2.5× lower inference time. This demonstrated that reasoning and mathematical skills could transfer across architectures while reducing computational costs.

The framework included two model types: pure Mamba models (Llamba) and hybrid models (MambaInLlama). Llamba used the MOHAWK distillation method, aligning matrices, matching hidden states, and transferring weights while training on an 8B-token dataset. MambaInLlama retained Transformer attention layers but replaced others with Mamba layers, using reverse KL divergence for distillation. Experiments demonstrated dataset choice had a large effect on performance, with certain datasets lowering Llamba-1B accuracy by 10% and showing a poor correlation between general benchmarks and mathematical reasoning, emphasizing the importance of improved training data.
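
As a rough illustration of the distillation objective mentioned above, the snippet below computes a reverse KL divergence between a student's and a teacher's next-token distributions, the direction in which the student's distribution sits inside the expectation. It is a generic, hypothetical sketch of reverse-KL distillation, not the MambaInLlama training code, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits):
    """Reverse KL: KL(student || teacher), averaged over batch and positions.

    Both inputs have shape (batch, seq_len, vocab_size).
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_probs = student_log_probs.exp()
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1)
    return kl.mean()

# Example with random logits standing in for model outputs.
student = torch.randn(2, 8, 100, requires_grad=True)
teacher = torch.randn(2, 8, 100)
loss = reverse_kl_distillation_loss(student, teacher.detach())
loss.backward()
```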

Researchers evaluated the distilled models on generating multiple chains of thought (CoTs) for math problem-solving, focusing on how well instruction-following was retained. They measured coverage using pass@k, which estimates the probability of finding a correct solution among k samples, and assessed accuracy through majority voting and Best-of-N selection with a Llama-3.1 8B-based reward model. Benchmarks showed the distilled models ran up to 4.2× faster than Llama models while maintaining comparable coverage, generating more completions within fixed compute budgets, and outperforming smaller Transformer baselines in speed and accuracy. Furthermore, supervised fine-tuning (SFT) after distillation enhanced performance, validating their effectiveness on structured reasoning tasks such as coding and formal proofs.
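
The coverage metric mentioned here, pass@k, is commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn and c of them are correct. A small sketch of that calculation is shown below; it is the standard formulation rather than anything specific to this paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    randomly chosen samples (out of n generated, c correct) is correct."""
    if n - c < k:
        return 1.0          # cannot pick k samples that are all incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples generated, 12 correct.
print(pass_at_k(n=100, c=12, k=10))  # ≈ 0.74
```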

In summary, the proposed Distilled Mamba models enhanced reasoning efficiency by retaining accuracy while cutting inference time and memory consumption. When computational budgets were fixed, the models outperformed Transformers; hence, they are suitable for scalable inference. This method can serve as a basis for future research in training good reasoning models, improving distillation methods, and building reward models. Inference scaling advancements would further enhance their application in AI systems that demand faster and more effective reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Unveiling Hidden PII Risks: How Dynamic Language Model Training Triggers Privacy Ripple Effects
https://www.marktechpost.com/2025/03/02/unveiling-hidden-pii-risks-how-dynamic-language-model-training-triggers-privacy-ripple-effects/
Mon, 03 Mar 2025

Handling personally identifiable information (PII) in large language models (LLMs) poses a particularly difficult privacy challenge. These models are trained on enormous datasets that contain sensitive data, creating risks of memorization and accidental disclosure. Managing PII is complex because datasets are constantly updated with new information and some users may request that their data be removed. In fields like healthcare, eliminating PII is not always feasible. Fine-tuning models for specific tasks further increases the risk of retaining sensitive data. Even after training, residual information can remain, requiring specialized deletion techniques and making privacy protection an ongoing challenge.

Currently, methods to reduce PII memorization rely on filtering sensitive data and machine unlearning, where models are retrained without certain information. These approaches face major issues, especially with constantly changing datasets. Fine-tuning increases the risk of memorization, and unlearning may unintentionally expose data instead of removing it completely. Membership inference attacks, which attempt to determine if specific data was used in training, remain a serious concern. Even when models appear to forget certain data over time, they retain hidden patterns that can be extracted. Existing techniques are built without a full understanding of how memorization happens during training, making privacy risks harder to control.

To address these challenges, researchers from Northeastern University, Google DeepMind, and the University of Washington proposed “assisted memorization,” analyzing how personal data is retained in LLMs over time. Unlike existing work focused solely on whether memorization occurs, this approach examines when and why it happens. The researchers categorized PII memorization into four types (immediate, retained, forgotten, and assisted) to understand these risks better. Results indicated that PII is not necessarily memorized instantly but may become extractable later, especially when new training data overlaps with earlier information. This undermines current data-deletion strategies that ignore long-term memorization effects.

The framework exhaustively tracked PII memorization over the course of continual training, using experiments on diverse models and datasets. It analyzed how different training approaches affect memorization and extraction risks, demonstrating that adding new data could raise the likelihood of PII extraction. Efforts to reduce memorization for one individual sometimes inadvertently heightened risks for others. Researchers evaluated fine-tuning, retraining, and unlearning techniques using GPT-2-XL, Llama 3 8B, and Gemma 2B models trained on modified WikiText-2 and Pile of Law datasets containing unique emails. Extraction tests assessed memorization, revealing that assisted memorization occurred in 35.7% of cases, indicating it was influenced by training dynamics rather than being inevitable.
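
A simplified, hypothetical sketch of the kind of extraction test described here is shown below: the model is prompted with the context that preceded a piece of PII in training, and the output is checked for the target string. The prompt construction and the `generate` placeholder are assumptions for illustration, not the authors' evaluation harness.

```python
import re

def extraction_test(generate, training_examples):
    """Check which target emails a model reproduces when prompted with their context.

    `generate(prompt)` is a placeholder for sampling text from the trained model;
    each example is a (context_before_pii, target_email) pair.
    """
    email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    extracted = []
    for context, target_email in training_examples:
        completion = generate(context)
        if target_email in set(email_pattern.findall(completion)):
            extracted.append(target_email)
    # Fraction of targeted PII the model emits verbatim.
    rate = len(extracted) / max(1, len(training_examples))
    return extracted, rate
```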

Further experiments examined how increasing PII in fine-tuning datasets affected extraction risks by training ten models on datasets with varying PII percentages. Results confirmed that higher PII content led to greater extraction risks, with a superlinear increase in extraction under top-k sampling. Additionally, iterative unlearning introduced the “Onion Effect,” where removing extracted PII caused previously unmemorized PII to become extractable. This confirmed that the effect results from systematic exposure of borderline-memorized information rather than random variation. The findings highlight the challenges of adding and removing PII, showing the complexities of privacy protection in language models.

In conclusion, the proposed method highlighted privacy risks in large language models, showing how fine-tuning, retraining, and unlearning can unintentionally expose personally identifiable information (PII). Assisted memorization was identified, where PII that was not initially extracted could later become accessible. Increased PII in training data raised risks of extraction, and removal of specific PII at times unveiled other information. These findings lay a basis for improving privacy-preserving techniques and unlearning methods, providing stronger protection for data in machine learning models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Convergence AI Releases WebGames: A Comprehensive Benchmark Suite Designed to Evaluate General-Purpose Web-Browsing AI Agents
https://www.marktechpost.com/2025/02/27/convergence-ai-releases-webgames-a-comprehensive-benchmark-suite-designed-to-evaluate-general-purpose-web-browsing-ai-agents/
Fri, 28 Feb 2025

AI agents are becoming more advanced and capable of handling complex tasks across different platforms. Websites and desktop applications, however, are designed for human use, which demands understanding of visual layouts, interactive components, and time-based behavior. Operating such systems requires performing user actions ranging from simple clicks to sophisticated drag-and-drop manipulations. These challenges remain difficult for AI, which still falls short of human capability on web tasks. A broader evaluation system is necessary to measure and improve AI agents for web browsing.

Existing benchmarks evaluate AI performance in specific web tasks like online shopping and flight booking but fail to capture the complexity of modern web interactions. Models such as GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL struggle with navigation and task execution. Initially based on reinforcement learning, traditional evaluation frameworks expanded to web tasks but remained limited to short-context scenarios, leading to quick saturation and incomplete assessments. Modern web interaction requires advanced skills like tool usage, planning, and environmental reasoning, which are not fully tested. While multi-agent interactions are gaining attention, current methods do not effectively evaluate collaboration and competition between AI systems.

To address the limitations of current AI benchmarks for web interaction, researchers from Convergence Labs Ltd. and Clusterfudge Ltd. proposed WebGames, a framework designed to evaluate web-browsing AI agents through over 50 interactive challenges. These challenges span basic browser usage, complex input handling, cognitive reasoning, workflow automation, and interactive entertainment. Compared with prior benchmarks, WebGames aims to measure capabilities precisely by isolating individual interaction skills and giving the evaluated agent full control. Its client-side design avoids dependencies on external resources, providing uniform and reproducible tests.

WebGames is modular in design. It specifies tasks in a standardized JSONL format, making it easy to integrate with automated test frameworks and to extend with additional tasks. Every task follows a deterministic verification scheme, so completion can be checked unambiguously. This structure allows AI performance to be examined systematically through web interactions, quantifying navigation, decision-making, and adaptability in dynamic environments.
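
As a rough illustration of what such a setup could look like, the snippet below defines a hypothetical JSONL task record and a deterministic check of an agent's result. The field names and verification rule are invented for illustration and are not WebGames' actual schema.

```python
import json

# One hypothetical task record, as it might appear on a line of a JSONL file.
task_line = json.dumps({
    "id": "button-megastar",
    "description": "Click the highlighted button to reveal the completion code.",
    "url": "/tasks/button-megastar",
    "expected_answer": "BUTTON_MEGASTAR_OK",
})

def verify(task: dict, agent_answer: str) -> bool:
    """Deterministic check: the agent must return the exact completion code."""
    return agent_answer.strip() == task["expected_answer"]

task = json.loads(task_line)
print(verify(task, "BUTTON_MEGASTAR_OK"))   # True
print(verify(task, "something else"))        # False
```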

Researchers evaluated leading vision-language foundation models, including GPT-4o, Claude Computer-Use (Sonnet 3.5), Gemini-1.5-Pro, Qwen2-VL, and a Proxy assistant, using WebGames to assess their web-interaction capabilities. Since most models were not designed for web interaction, they required scaffolding through a Chromium browser driven by Playwright. Except for Claude, the models lacked sufficient GUI grounding to determine exact pixel locations, so a Set-of-Marks (SoM) approach was used to highlight relevant elements. The models operated within a partially observed Markov decision process (POMDP), receiving JPEG screenshots and text-based SoM elements while executing tool-based actions through a ReAct-style prompting method. The evaluation showed that Claude scored lower than GPT-4o despite having more precise web control, likely because Anthropic’s training restrictions prevent actions that resemble human behavior. Human participants from Prolific completed the tasks easily, averaging 80 minutes and earning £18, with some achieving 100% scores. The findings revealed a wide capability gap between humans and AI, much like the ARC challenge. Some activities, such as “Slider Symphony,” demand exacting drag-and-drop control that proved difficult for models, exposing limitations in AI’s ability to interact with real-world websites.

In summary, the proposed benchmark found a significant gap between human and AI performance on web-interaction tasks. The best-performing AI model, GPT-4o, achieved only 41.2% success, whereas humans achieved 95.7%. The results revealed that current AI systems struggle with intuitive web interaction, and constraints on models like Claude Computer-Use still impede task success. This benchmark can serve as a reference point for further research, directing improvements in AI flexibility, reasoning, and web-interaction efficiency.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Simplifying Self-Supervised Vision: How Coding Rate Regularization Transforms DINO & DINOv2
https://www.marktechpost.com/2025/02/27/simplifying-self-supervised-vision-how-coding-rate-regularization-transforms-dino-dinov2/
Thu, 27 Feb 2025

Learning useful features from large amounts of unlabeled images is important, and models like DINO and DINOv2 are designed for this. These models work well for tasks like image classification and segmentation, but their training process is difficult. A key challenge is avoiding representation collapse, where the model produces the same output for different images. Many settings must be carefully adjusted to prevent this, making training unstable and hard to manage. DINOv2 tries to solve this by directly using negative samples, but the training setup remains complex. Because of this, improving these models or using them in new areas is difficult, even though their learned features are very effective.

Currently, methods for learning image features rely on complex and unstable training setups. Techniques like SimCLR, SimSiam, VICReg, MoCo, and BYOL attempt to discover useful representations but face various challenges. SimCLR and MoCo require large batch sizes and explicit negative samples, making them computationally expensive. SimSiam and BYOL try to avoid collapse by modifying the gradient structure, which requires careful tuning. VICReg penalizes feature alignment and covariance but does not address feature variance effectively. Techniques like I-JEPA and C-JEPA focus on patch-based learning but add more complexity. These methods struggle to preserve simplicity, stability, and efficiency, complicating training and limiting flexibility.

To solve DINO’s complexities, researchers from UC Berkeley, TranscEngram, Microsoft Research, and HKU proposed SimDINO and SimDINOv2. These models simplify training by incorporating a coding rate regularization term into the loss function, which prevents representation collapse and removes the need for heavy post-processing and hyperparameter tuning. By removing unnecessary design choices, SimDINO improves training stability and efficiency. SimDINOv2 enhances performance by handling small and large regions of an image without applying high-dimensional transformations and by eliminating the teacher-student paradigm, rendering the method more robust and efficient than existing approaches.

The framework steers feature representations directly during training, keeping them informative without intricate adaptations. The coding rate term encourages structured, informative features, leading to better generalization and downstream task performance. This simplifies the training pipeline and removes the teacher-student paradigm. SimDINO reduces computational overhead while maintaining high-quality results, making it a more efficient alternative for self-supervised learning in vision tasks.
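
For readers who want a feel for what a coding rate regularizer looks like, the sketch below computes the coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z^T Z) from the maximal coding rate reduction literature and subtracts it from a placeholder alignment loss. This is a generic, hypothetical illustration of coding-rate regularization, with an assumed weight and epsilon, not the released SimDINO loss.

```python
import torch
import torch.nn.functional as F

def coding_rate(z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z).

    z: (n, d) batch of L2-normalized feature vectors.
    Larger values mean the features span more directions (less collapse).
    """
    n, d = z.shape
    cov = z.T @ z * (d / (n * eps ** 2))
    return 0.5 * torch.logdet(torch.eye(d, device=z.device) + cov)

def regularized_loss(student_feats, teacher_feats, gamma: float = 1.0):
    # Placeholder alignment term: pull student features toward teacher features.
    align = (student_feats - teacher_feats).pow(2).sum(dim=-1).mean()
    # Subtracting the coding rate rewards diverse, non-collapsed representations.
    return align - gamma * coding_rate(student_feats)

student = F.normalize(torch.randn(256, 64), dim=-1)
teacher = F.normalize(torch.randn(256, 64), dim=-1)
print(regularized_loss(student, teacher))
```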

Researchers evaluated SimDINO and SimDINOv2 against DINO and DINOv2 on ImageNet1K, COCO val2017, ADE20K, and DAVIS-2017 using ViT architectures with a patch size of 16. SimDINO achieved higher k-NN and linear accuracy while maintaining stable training, unlike DINO, which showed performance drops. SimDINO outperformed DINO on COCO val2017 using MaskCut for object detection and segmentation. For semantic segmentation on ADE20K, SimDINOv2 improved on DINOv2 by 4.4 mIoU with ViT-B. On DAVIS-2017, the SimDINO variants performed better, though DINOv2 and SimDINOv2 underperformed their predecessors due to evaluation sensitivity. Stability tests showed that DINO was more sensitive to hyperparameters and dataset variations, diverging on ViT-L, while SimDINO remained robust, significantly outperforming DINO when trained on COCO train2017.

In conclusion, the proposed SimDINO and SimDINOv2 models simplified the complex design choices of DINO and DINOv2 by introducing a coding-rate-related regularization term, making training pipelines more stable and robust while improving performance on downstream tasks. By eliminating unnecessary complexities, the models offer a Pareto improvement over their predecessors, showing the advantages of directly handling trade-offs in self-supervised vision learning. The simplified framework establishes a foundation for analyzing the geometric structure of self-supervised learning losses and for optimizing models without self-distillation. These ideas can also be applied to other self-supervised learning models to make training more stable and efficient, making SimDINO a strong starting point for developing better deep-learning models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

SongGen: A Fully Open-Source Single-Stage Auto-Regressive Transformer Designed for Controllable Song Generation
https://www.marktechpost.com/2025/02/26/songgen-a-fully-open-source-single-stage-auto-regressive-transformer-designed-for-controllable-song-generation/
Thu, 27 Feb 2025

Creating songs from text is difficult because it involves generating vocals and instrumental music together. Songs are unique in that they combine lyrics and melodies to express emotion, making the task more complex than generating speech or instrumental music alone. The challenge is intensified by the scarcity of quality open-source data, which constrains research and development in the area. Some approaches use multiple stages, generating vocals first and the accompaniment separately; such pipelines complicate training and inference and reduce control over the final song. An open question is whether a single-stage model can simplify this process while maintaining quality and flexibility.

Currently, text-to-music generation models use descriptive text to create music, but most methods struggle to generate realistic vocals. Transformer-based models process audio as discrete tokens and diffusion models produce high-quality instrumental music, but both approaches face issues with vocal generation. Song generation, which combines vocals with instrumental music, relies on multi-stage methods like Jukebox, Melodist, and MelodyLM. These methods produce vocals and accompaniment independently, so the process is complicated and hard to manage. Without a unified strategy, flexibility is restricted and training and inference become less efficient.

To generate a song from text descriptions, lyrics, and an optional reference voice, researchers proposed SongGen, an auto-regressive transformer decoder with an integrated neural audio codec. The model predicts audio token sequences, which are synthesized into songs. SongGen supports two generation modes: Mixed Mode and Dual-Track Mode. In Mixed Mode, X-Codec encodes raw audio into discrete tokens, with the training loss emphasizing earlier codebooks to improve vocal clarity. A variant, Mixed Pro, introduces an auxiliary loss for vocals to enhance their quality. Dual-Track Mode generates vocals and accompaniment separately, synchronizing them through Parallel or Interleaving patterns. Parallel mode aligns tokens frame by frame, while Interleaving mode enhances interaction between vocals and accompaniment across layers.
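
The difference between the two dual-track token arrangements can be illustrated with a toy sketch: given per-frame vocal and accompaniment tokens, the parallel pattern pairs them frame by frame, while the interleaving pattern alternates them in a single sequence. The token values and sequence layout below are illustrative assumptions, not SongGen's actual codec output.

```python
def parallel_pattern(vocal_tokens, acc_tokens):
    """Frame-aligned pairs: each step carries one vocal and one accompaniment token."""
    return list(zip(vocal_tokens, acc_tokens))

def interleaving_pattern(vocal_tokens, acc_tokens):
    """Single flat sequence alternating vocal and accompaniment tokens per frame."""
    seq = []
    for v, a in zip(vocal_tokens, acc_tokens):
        seq.extend([v, a])
    return seq

vocals = ["v0", "v1", "v2"]
accomp = ["a0", "a1", "a2"]
print(parallel_pattern(vocals, accomp))      # [('v0', 'a0'), ('v1', 'a1'), ('v2', 'a2')]
print(interleaving_pattern(vocals, accomp))  # ['v0', 'a0', 'v1', 'a1', 'v2', 'a2']
```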

For conditioning, lyrics are processed using a VoiceBPE tokenizer, voice features are extracted via a frozen MERT encoder, and text attributes are encoded using FLAN-T5. These embeddings guide song generation via cross-attention. Due to the lack of public text-to-song datasets, an automated pipeline processes 8,000 hours of audio from multiple sources, ensuring quality data through filtering strategies.

Researchers compared SongGen with Stable Audio Open, MusicGen, Parler-tts, and Suno for text-to-song generation. MusicGen produced only instrumental music, Stable Audio Open generated unclear vocal sounds, and fine-tuning Parler-tts for singing proved ineffective. Despite using only 2,000 hours of labeled data, SongGen outperformed these models in text relevance and vocal control. Among its modes, the “Mixed Pro” approach improved vocal quality (VQ) and phoneme error rate (PER), while the “Interleaving (A-V)” dual-track pattern excelled in vocal quality but scored slightly lower on harmony (HAM). Attention analysis revealed that SongGen effectively captured musical structure. The model maintained coherence with only minor performance drops even without a reference voice. Ablation studies confirmed that high-quality fine-tuning (HQFT), curriculum learning (CL), and VoiceBPE-based lyric tokenization improved stability and accuracy.

In conclusion, the proposed model simplified text-to-song generation by introducing a single-stage, auto-regressive transformer that supports mixed and dual-track modes, demonstrating strong performance. Its open-source release makes it more accessible, letting both beginners and experts produce music with precise control over vocal and instrumental components. However, the model’s ability to mimic voices raises ethical concerns and calls for safeguards against misuse. As a foundational work in controllable text-to-song generation, SongGen can serve as a baseline for future research, guiding improvements in audio quality, lyric alignment, and expressive singing synthesis while addressing ethical and legal challenges.


Check out the Technical Details and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Optimizing Imitation Learning: How X‑IL is Shaping the Future of Robotics
https://www.marktechpost.com/2025/02/25/optimizing-imitation-learning-how-x%e2%80%91il-is-shaping-the-future-of-robotics/
Wed, 26 Feb 2025

Designing imitation learning (IL) policies involves many choices, such as selecting features, architectures, and policy representations. The field is advancing quickly, introducing many new techniques and increasing complexity, which makes it difficult to explore all possible designs and understand their impact. IL enables agents to learn from demonstrations rather than reward-based approaches. The growing number of machine-learning breakthroughs across domains makes assessing them and integrating them into IL challenging. As a result, the IL design space remains underexplored, and building effective, robust IL policies is difficult.

Currently, imitation learning relies on state-based and image-based methods, but both have practical limitations. State-based methods can be inaccurate, while image-based methods cannot represent 3D structure and leave goals vaguely specified. Natural language has been added for flexibility, but it is hard to incorporate properly. Sequence models like RNNs suffer from vanishing gradients, making training inefficient, while Transformers offer better scalability; state-space models (SSMs) are more efficient still but remain underutilized. Existing IL libraries do not support modern techniques like diffusion models, and tools such as CleanDiffuser are restricted to simple tasks, limiting overall progress in imitation learning.

To mitigate these issues, researchers from the Karlsruhe Institute of Technology, Meta, and the University of Liverpool proposed X-IL, an open-source framework for imitation learning that allows flexible experimentation with modern techniques. Unlike existing approaches that struggle to integrate novel architectures, X-IL systematically divides the IL process into four key modules: observation representations, backbones, architectures, and policy representations. This modular design makes component swapping straightforward and makes it possible to test alternative learning strategies. Unlike conventional IL frameworks built entirely on state-based or image-based inputs, X-IL supports multi-modal learning, using RGB images, point clouds, and language for more comprehensive representation learning. It also integrates advanced sequence modeling techniques like Mamba and xLSTM, which improve efficiency over Transformers and RNNs.

The framework consists of interchangeable modules that allow customization at every stage of the IL pipeline. The observation module supports multiple input modalities, the backbone module provides different sequence-modeling approaches, and the architecture module covers both decoder-only and encoder-decoder models with flexible policy design. X-IL also optimizes policy learning by adopting diffusion-based and flow-based models, improving generalization. By accommodating recent breakthroughs and enabling systematic assessment, X-IL offers a scalable approach to building effective IL models.
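
To make the modular idea concrete, here is a small, hypothetical sketch of how such a four-module pipeline could be assembled from interchangeable parts. The registry names and the builder function are invented for illustration and do not reflect X-IL's actual API.

```python
from dataclasses import dataclass

# Hypothetical registries mapping names to constructors for each module type.
OBSERVATIONS = {"rgb": lambda: "ResNet encoder", "pointcloud": lambda: "PointNet encoder"}
BACKBONES = {"transformer": lambda: "Transformer", "xlstm": lambda: "xLSTM", "mamba": lambda: "Mamba"}
ARCHITECTURES = {"decoder_only": lambda: "Decoder-only", "encoder_decoder": lambda: "Encoder-decoder"}
POLICIES = {"diffusion": lambda: "Diffusion head", "flow": lambda: "Flow-matching head"}

@dataclass
class ILPolicy:
    observation: str
    backbone: str
    architecture: str
    policy_head: str

def build_policy(obs: str, backbone: str, arch: str, head: str) -> ILPolicy:
    """Assemble an IL policy by picking one component from each registry."""
    return ILPolicy(
        observation=OBSERVATIONS[obs](),
        backbone=BACKBONES[backbone](),
        architecture=ARCHITECTURES[arch](),
        policy_head=POLICIES[head](),
    )

# Swapping a single module (e.g., the backbone) leaves the rest of the pipeline unchanged.
print(build_policy("pointcloud", "xlstm", "encoder_decoder", "flow"))
```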

Researchers evaluated imitation learning architectures for robotic tasks using the LIBERO and RoboCasa benchmarks. In LIBERO, models were trained on four task suites with 10 and 50 trajectories, where xLSTM achieved the highest success rates of 74.5% with 20% of the data and 92.3% with full data, indicating its effectiveness in learning from limited demonstrations. RoboCasa presented more challenges due to diverse environments, where xLSTM outperformed BC-Transformer with a 53.6% success rate, demonstrating its adaptability. Results indicated that combining RGB and point cloud inputs improved performance, with xLSTM achieving a 60.9% success rate. Encoder-decoder architectures outperformed decoder-only models, and fine-tuned ResNet encoders performed better than frozen CLIP models, highlighting the importance of strong feature extraction. Flow matching methods like BESO and RF demonstrated inference efficiency comparable to DDPM.

In summary, the proposed framework provides a modular approach for exploring imitation learning policies across architectures, policy representations, and modalities. Support for state-of-the-art encoders and efficient sequential models improves data efficiency and representation learning, achieving strong performance on LIBERO and RoboCasa. This framework can serve as a baseline for future research, enabling policy-design comparisons and advancing scalable imitation learning. Future work can refine encoders, integrate adaptive learning strategies, and enhance real-world generalization for diverse robotic tasks.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.
