Adeeba Alam Ansari, Author at MarkTechPost https://www.marktechpost.com/author/adeeba-alam-ansari/ An Artificial Intelligence News Platform

All You Need to Know about Vision Language Models VLMs: A Survey Article https://www.marktechpost.com/2025/02/18/all-you-need-to-know-about-vision-language-models-vlms-a-survey-article/ Tue, 18 Feb 2025

Vision Language Models (VLMs) are a revolutionary milestone in the development of language models, overcoming the shortcomings of predecessor pre-trained LLMs such as LLaMA and GPT. Vision Language Models explore new territory beyond a single modality, combining inputs from text, images, and video. VLMs thus bestow a better understanding of visual-spatial relationships by expanding the representational boundaries of the input, supporting a richer worldview. With new opportunities come new challenges, and that is the case with VLMs: researchers across the globe are encountering and solving new challenges to make these models better, one at a time. Based on a survey by researchers from the University of Maryland and the University of Southern California, this article gives an intricate glimpse of what is going on in this field and what we can expect in the future of vision language models.

This article discusses a structured examination of VLMs developed over the past five years, encompassing architectures, training methodologies, benchmarks, applications, and the challenges inherent in the field. To begin with, let's familiarize ourselves with some of the SOTA models in VLM and where they come from: CLIP by OpenAI, BLIP by Salesforce, Flamingo by DeepMind, and Gemini by Google. These are the big fish in a domain that is expanding rapidly to support multimodal user interaction.

When we dissect a VLM to understand its structure, we find that certain blocks are fundamental to these models, irrespective of their features or capabilities: a Vision Encoder, a Text Encoder, and a Text Decoder. A cross-attention mechanism that integrates information across modalities is present in fewer models. The architecture of VLMs is also evolving, as developers now use pre-trained large language models as the backbone instead of training from scratch. When training from scratch, self-supervised methodologies such as masked image modeling and contrastive learning have been prevalent. When building on a pre-trained LLM backbone, the most common ways to align visual features with the pre-trained text features are a projector, joint training, and freezing training stages.
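
To make the projector-based design concrete, here is a minimal sketch of how visual tokens can be aligned with a frozen LLM backbone. The class names, dimensions, and single-linear-layer projector are illustrative assumptions, not a reconstruction of any specific model from the survey.

```python
import torch
import torch.nn as nn

class MiniVLM(nn.Module):
    """Toy VLM: frozen vision encoder -> linear projector -> frozen LLM."""
    def __init__(self, vision_encoder, llm, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT, kept frozen
        self.llm = llm                         # pre-trained LLM, kept frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        # Only the projector is trained: it maps patch embeddings into
        # the LLM's token-embedding space so images become "soft tokens".
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, pixel_values, text_embeds):
        patch_embeds = self.vision_encoder(pixel_values)   # (B, N, vis_dim)
        visual_tokens = self.projector(patch_embeds)       # (B, N, llm_dim)
        # Prepend visual tokens to the text sequence and let the LLM
        # attend over both modalities in one pass.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```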

Another interesting development is how the latest models treat visual features as tokens. Transfusion, for instance, processes discrete text tokens and continuous image vectors in parallel by introducing strategic breakpoints between the two representations.

Now, we discuss the major categories of benchmarks in the domain that evaluate the various capabilities of a VLM. Most of the datasets are created via synthetic generation or human annotation. These benchmarks test capabilities including visual text understanding, text-to-image generation, and multimodal general intelligence; there are also benchmarks that probe failure modes such as hallucination. Answer matching, multiple-choice questions, and image/text similarity scores have emerged as common evaluation techniques; the last of these is sketched below.
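
As a concrete illustration of image/text similarity scoring, the snippet below computes CLIP similarity between one image and a set of candidate captions using the Hugging Face transformers API. The checkpoint is the public CLIP release; treating the softmax over the similarity logits as a caption "score" is a common convention rather than a benchmark-specific rule.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
captions = ["a dog on a beach", "a city skyline at night"]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image: (1, num_captions) cosine similarities scaled by
# CLIP's learned temperature; softmax turns them into caption scores.
scores = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, scores[0].tolist())))
```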

VLMs are adapted to a variety of tasks, from virtual-world applications such as embodied agents to real-world applications such as robotics and autonomous driving. Embodied agents, AI models with virtual or physical bodies that can interact with their environment, are an interesting use case that relies heavily on VLM development. VLMs improve their user interaction and support systems by enabling features like Visual Question Answering. Generative vision models, such as GANs, also produce visual content like memes. In robotics, VLMs find use cases in manipulation, navigation, human-robot interaction, and autonomous driving.

While VLMs have shown tremendous potential over their text-only counterparts, researchers must overcome multiple limitations and challenges. There are considerable trade-offs between the flexibility and generalizability of models, and issues such as visual hallucination raise concerns about reliability. Biases in training data impose additional constraints on fairness and safety. On the technical side, we are yet to see an efficient training and fine-tuning paradigm for settings where high-quality datasets are scarce, and contextual deviations or misalignments between modalities further lower output quality.

Conclusion: The paper provides an overview of the ins and outs of Vision Language Models, a field of research that integrates content from multiple modalities. It surveys the architectures, innovations, and present-day challenges of these models.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


The post All You Need to Know about Vision Language Models VLMs: A Survey Article appeared first on MarkTechPost.

LIMO: The AI Model that Proves Quality Training Beats Quantity https://www.marktechpost.com/2025/02/12/limo-the-ai-model-that-proves-quality-training-beats-quantity/ Thu, 13 Feb 2025

Reasoning tasks remain a major challenge for most language models. Instilling reasoning aptitude in models, particularly for programming and mathematical applications that require solid sequential reasoning, still seems far off. This difficulty can be attributed to the inherent complexity of these tasks, which require multi-step logical deduction, planned with domain knowledge, to find a structured solution path.

LLMs are therefore typically supervised on massive amounts of data, with hundreds of thousands of examples. This training rests on two assumptions: first, that such a cognitive skill can only be learned from many supervised examples; and second, that this training inevitably leads to memorization rather than generalization. The approach also brings high computational costs and the burden of data collection. This article discusses research that leverages advances in pre-trained knowledge foundations and inference-time computation to do away with these enormous data requirements.

Researchers from Shanghai Jiao Tong University present the Less-Is-More (LIMO) hypothesis, which says that in foundation models whose domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can be instilled through minimal but precise demonstrations of cognitive processes. The hypothesis stems from recent developments in the LLM space, where developers incorporate unprecedented amounts of mathematical content during pre-training, enriching models with mathematical and programming logic before deployment. The emergence of techniques that scale up longer reasoning chains at inference time has also motivated this research significantly.

According to the LIMO hypothesis, the elicitation threshold for complex reasoning is determined by two key factors:

  1. The latent presence of prerequisite knowledge within the model’s parameter space (the domain knowledge instilled during pre-training)
  2. The effectiveness of minimal exemplars in demonstrating systematic problem-solving processes (post-training examples that act as cognitive prompts for solving reasoning tasks with the available knowledge)

Thus, LIMO leverages the rich knowledge embedded during pre-training and provides detailed reasoning chains through minimal but well-structured examples. The proposed method prioritizes the quality and structure of prompts over their quantity, pushing the model to “think” with the help of past lessons rather than simply recalling them. In this way, the pipeline challenges the notion that supervised fine-tuning only makes models memorize. The authors further investigated the relationship between reasoning and data and identified critical factors, including the synergy between pre-trained knowledge foundations and test-time computation scaling.
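
In practice, the recipe amounts to ordinary supervised fine-tuning, just on a few hundred carefully curated reasoning chains instead of hundreds of thousands of examples. The sketch below shows what that could look like with the Hugging Face Trainer; the backbone checkpoint, data format, and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "Qwen/Qwen2.5-7B-Instruct"  # assumed backbone with strong pre-training

# A few hundred curated (problem, reasoning chain, answer) demonstrations;
# quality and structure matter far more than count here.
curated = [
    {"text": "Problem: ...\nReasoning: step 1 ... step n\nAnswer: ..."},
    # ... ~800 such samples in total
]

tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

ds = Dataset.from_list(curated).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=4096),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="limo-sft", num_train_epochs=3,
                           per_device_train_batch_size=1,
                           learning_rate=1e-5, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```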

The authors released a comprehensive open-source suite to ensure reproducibility, including their fine-tuned models, evaluation pipelines, training code, and carefully curated datasets with varying quality levels.

In their experiments, the authors attempted to teach models reasoning with just hundreds of examples instead of the customary hundreds of thousands. They evaluated LIMO across 10 benchmarks to assess its out-of-distribution generalization capabilities, with impressive and promising results. Notably, with only 817 curated training samples, LIMO achieved 57.1% accuracy on the highly challenging American Invitational Mathematics Examination (AIME) benchmark and 94.8% on the MATH dataset, superseding SFT methods that reached 6.5% and 59.2% on the respective benchmarks. LIMO thus achieved a 40.5% absolute improvement over models trained on 100 times more data, refuting the first assumption that instilling reasoning requires massive supervised training.

Conclusion: The researchers offered an insightful hypothesis about the reasoning training regime of LLMs through the LIMO model, challenging the underlying assumptions of SFT for instilling reasoning. LIMO demonstrates that less can be more, showing commendable performance on challenging datasets and superseding SFT with skillfully orchestrated cognitive templates.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post LIMO: The AI Model that Proves Quality Training Beats Quantity appeared first on MarkTechPost.

Researchers from ETH Zurich and TUM Share Everything You Need to Know About Multimodal AI Adaptation and Generalization https://www.marktechpost.com/2025/02/06/researchers-from-eth-zurich-and-tum-share-everything-you-need-to-know-about-multimodal-ai-adaptation-and-generalization/ Fri, 07 Feb 2025

There is no gainsaying that artificial intelligence has developed tremendously in various fields. However, an accurate evaluation of its progress would be incomplete without considering the generalizability and adaptability of AI models to specific domains. Domain Adaptation (DA) and Domain Generalization (DG) have therefore garnered ample attention from researchers across the globe. Given that training is an exhaustive process and that the world has realized the scarcity of “good” data, it is imperative for models trained on limited source domains to perform well in novel areas.

A considerable amount of research has been conducted in DA and DG. However, most of it is based on unimodal data, such as images or time series. With the emergence of large-scale multimodal datasets, researchers are now striving to address multimodal domain adaptation (MMDA) and generalization (MMDG) across multiple modalities, where the challenges become even more profound due to differences in modality characteristics. This article provides a comprehensive overview of the recent advances in MMDA and MMDG, from traditional vanilla approaches to the use of foundation models and beyond.

Researchers from ETH Zurich and TUM (Germany), along with others, presented a comprehensive and exhaustive survey on advances in multimodal adaptation and generalization. The survey covers, in great detail, the problem statement, challenges, datasets, applications, work done to date, and future directions for the following five topics:

(1) Multimodal domain adaptation: The objective is to improve cross-domain knowledge transfer, i.e., train a model on a labeled source domain while ensuring it adapts effectively to an unlabeled target domain, despite distribution shifts. Researchers have struggled with the distinct characteristics of the various modalities and with ways to combine them; moreover, inputs from some modalities are often missing.

To combat this issue, researchers have worked on various aspects, such as adversarial learning, contrastive learning, and cross-modal interaction techniques. Some significant works in this area are the MM-SADA and xMUDA frameworks.

(2) Multimodal test-time adaptation: Unlike MMDA, which adapts models before deployment, Multimodal Test-Time Adaptation (MMTTA) focuses on the model’s ability to self-adjust dynamically during inference, without labeled data. The major obstacle here is the scarcity of source-domain data; additionally, continual distribution shifts cannot be handled if the model requires retraining every time. Researchers have used self-supervised learning and uncertainty estimation techniques to solve this problem. Notable contributions in this field are READ (Reliability-Aware Attention Distribution) and Adaptive Entropy Optimization (AEO).

(3) Multimodal domain generalization: Multimodal Domain Generalization (MMDG) aims to train AI models that can generalize to entirely new domains without prior exposure. As in the previous two settings, the absence of target-domain data during training creates problems, and inconsistencies in feature distributions make it difficult for models to learn domain-invariant representations. Work in this field has focused on feature disentanglement and cross-modal knowledge transfer, with algorithms like SimMMDG and MOOSA.

(4) Domain adaptation and generalization with the help of multimodal foundation models: This section discusses the ascent of foundation models like CLIP in improving DA and DG. Foundation models are pre-trained with a rich understanding of diverse modalities, which makes them suitable candidates. While these models seem like the perfect solution to the problems above, their usage remains challenging due to high computational demands and adaptability constraints. To combat this, researchers have proposed methods like feature-space augmentation, knowledge distillation, and synthetic data generation, through contributions such as CLIP-based feature augmentation and diffusion-driven synthetic data generation.

(5) Adaptation of multimodal foundation models: This subtopic deals with fine-tuning foundation models for adaptation purposes. Researchers have proposed techniques like prompt-based learning and adapter-based tuning to combat the computational expense and dearth of domain data. Recent and noteworthy works are CoOp and CoCoOp for the former technique, and CLIP-Adapter and Tip-Adapter for the latter.
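
To give a feel for the adapter-based route, here is a minimal CLIP-Adapter-style sketch: a small residual MLP placed on top of frozen CLIP image features, so only a few thousand parameters are tuned per domain. The dimensions, residual ratio, and random stand-in tensors are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Small residual adapter over frozen CLIP features (CLIP-Adapter style)."""
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha  # how much the adapted features override the original
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, clip_features):
        adapted = self.mlp(clip_features)
        # Residual blend keeps the zero-shot knowledge mostly intact.
        return self.alpha * adapted + (1 - self.alpha) * clip_features

adapter = FeatureAdapter()
feats = torch.randn(8, 512)            # stand-in for frozen CLIP image features
text_embeds = torch.randn(512, 10)     # stand-in for class text embeddings
logits = adapter(feats) @ text_embeds  # (8, 10) class scores
```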

Conclusion: This article discussed the problem of generalizability and adaptability in multimodal applications. We saw the numerous subdomains of this research area and a range of works, from naive augmentation approaches to foundation models, that address its challenges. The survey presents all the pertinent information and highlights future directions toward more efficient, robust frameworks and self-learning models.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


The post Researchers from ETH Zurich and TUM Share Everything You Need to Know About Multimodal AI Adaptation and Generalization appeared first on MarkTechPost.

University of Bath Researchers Developed an Efficient and Stable Machine Learning Training Method for Neural ODEs with O(1) Memory Footprint https://www.marktechpost.com/2025/02/03/university-of-bath-researchers-developed-an-efficient-and-stable-machine-learning-training-method-for-neural-odes-with-o1-memory-footprint/ Tue, 04 Feb 2025

Neural Ordinary Differential Equations (Neural ODEs) are significant in scientific modeling and time-series analysis, where data changes continuously. This neural-network-inspired framework models continuous-time dynamics with a transformation layer governed by differential equations, which sets it apart from vanilla neural nets. While Neural ODEs handle continuously evolving series elegantly, cost-effective gradient calculation for backpropagation remains a big challenge that limits their utility.

Until now, the standard method for Neural ODEs has been recursive checkpointing, which finds a middle ground between memory usage and computation. However, this method often introduces inefficiencies, increasing both memory and processing time. This article discusses recent research that tackles the problem with a class of algebraically reversible ODE solvers.

Researchers from the University of Bath introduce a novel machine learning framework to address the backpropagation problem in state-of-the-art recursive checkpointing methods for Neural ODE solvers. They introduce a class of algebraically reversible solvers that allows exact reconstruction of the solver state at any time step without storing intermediate numerical operations. This leads to a significant improvement in overall efficiency, with reduced memory consumption and computational overhead. The standout feature of the approach is its complexity: while conventional recursive checkpointing requires O(n log n) operations, the proposed solvers run in O(n) operations with O(1) memory consumption.

The proposed solver framework allows any single-step numerical solver to be made reversible by enabling dynamic recomputation of the forward solve during backpropagation, ensuring exact gradient calculation while achieving high-order convergence and improved numerical stability. In more detail: instead of storing every intermediate state during the forward pass, the algorithm mathematically reconstructs those states in reverse order during the backward pass. By introducing a coupling parameter, λ, the solver maintains numerical stability while accurately tracing the computational path backward. The coupling retains information from both the current and previous states in a compact form, enabling exact gradient calculation without the overhead of traditional storage requirements.
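
The snippet below illustrates the core idea with the classic two-step leapfrog scheme, which happens to be algebraically reversible: the backward pass reconstructs every state exactly (up to floating-point round-off) from only the final pair of states, so memory stays O(1). This is a simplified stand-in for the paper's λ-coupled construction, and the vector field here is a toy function rather than a neural network.

```python
import numpy as np

def f(y, t):
    # Toy vector field (a damped oscillator); in a Neural ODE this
    # would be a neural network.
    return np.array([y[1], -y[0] - 0.1 * y[1]])

def forward(y0, t0, h, n_steps):
    # Leapfrog: y_{n+1} = y_{n-1} + 2h f(y_n, t_n).
    # Only the current pair of states is kept -> O(1) memory.
    y_prev, y_curr, t = y0, y0 + h * f(y0, t0), t0 + h
    for _ in range(n_steps - 1):
        y_prev, y_curr = y_curr, y_prev + 2 * h * f(y_curr, t)
        t += h
    return y_prev, y_curr, t

def backward(y_prev, y_curr, t, h, n_steps):
    # Exact algebraic inversion: y_{n-1} = y_{n+1} - 2h f(y_n, t_n).
    # The whole trajectory is reconstructed without ever having been
    # stored; gradient accumulation would ride along these states.
    for _ in range(n_steps - 1):
        t -= h
        y_prev, y_curr = y_curr - 2 * h * f(y_prev, t), y_prev
    return y_prev

y0 = np.array([1.0, 0.0])
state = forward(y0, t0=0.0, h=0.01, n_steps=1000)
print(np.allclose(backward(*state, h=0.01, n_steps=1000), y0))  # True
```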

The research team conducted a series of experiments to validate these claims. They performed three experiments, focusing on scientific modeling and latent dynamics discovery from data, to compare the accuracy, runtime, and memory cost of reversible solvers against recursive checkpointing. The solvers were tested on the following three setups:

  • Discovery of Generated data from Chandrasekhar’s White Dwarf Equation
  • Approximation of fundamental data dynamics from a coupled oscillator system through a neural ODE.
  • Identification of chaotic nonlinear dynamics using a chaotic double pendulum dataset

The results of the above experiments testified to the proposed solvers’ efficiency. Across all tests, the reversible solvers demonstrated superior performance, achieving up to 2.9 times faster training and using up to 22 times less memory than traditional methods.

Moreover, the accuracy of the final model remained consistent with the state of the art. The reversible solvers reduced memory usage dramatically and slashed runtime, proving their utility in large-scale, data-intensive applications. The authors also found that adding weight decay to the neural network vector field parameters improved numerical stability for both the reversible method and recursive checkpointing.

Conclusion: The paper introduced a new class of algebraically reversible solvers that addresses computational efficiency and gradient accuracy, with O(n) operation complexity and O(1) memory usage. This breakthrough in ODE solvers paves the way for more scalable and robust models of time series and dynamic data.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post University of Bath Researchers Developed an Efficient and Stable Machine Learning Training Method for Neural ODEs with O(1) Memory Footprint appeared first on MarkTechPost.

Baidu Research Introduces EICopilot: An Intelligent Agent-based Chatbot to Retrieve and Interpret Enterprise Information from Massive Graph Databases https://www.marktechpost.com/2025/01/30/baidu-research-introduces-eicopilot-an-intelligent-agent-based-chatbot-to-retrieve-and-interpret-enterprise-information-from-massive-graph-databases/ Fri, 31 Jan 2025

Knowledge graphs have lately seen tremendous use in enterprise settings, with applications spanning multiple data forms, from legal persons to registered capital and shareholder details. Although these graphs have high utility, they have been criticized for requiring intricate text-based queries and manual exploration, which obstruct the extraction of pertinent information.

With the massive strides in natural language processing and generative intelligence in recent years, LLMs have been used to perform complex queries and summarization, drawing on their language comprehension and exploration skill set. This article discusses the latest research that uses language models to streamline information extraction from graph databases.

Researchers from Baidu presented EICopilot, an agent-based solution that streamlines search, exploration, and summarization of corporate data stored in knowledge graph databases, yielding valuable insights about enterprises efficiently. To appreciate the work, consider the scale of data EICopilot handles: a typical graph dataset of this nature comprises hundreds of millions of nodes, tens of billions of edges, hundreds of billions of attributes, and millions of subgraphs as company communities, representing a country’s registered corporations, organizations, and companies.

EICopilot is an LLM-based chatbot that utilizes a novel data preprocessing pipeline to optimize database queries. To achieve this, the authors first gather real-world company-related queries from general-purpose search engines. After collection, some representative queries are reserved as seed datasets, and a search script is written for every query in the Gremlin graph query language. Finally, the authors systematically annotate and augment these queries and scripts to form a vector database that enhances search accuracy. EICopilot uses this vector database to generate search spaces in real time for effective retrieval and exploration of the graph.
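
A minimal sketch of that retrieval step might look like the following: embed the annotated (query, Gremlin script) exemplars once, then at run time embed the user's question and pull the nearest exemplars into the prompt. The embedding model, exemplar data, and Gremlin snippets are assumptions for illustration; the paper's actual pipeline is more elaborate.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical annotated exemplars: natural-language query -> Gremlin script.
EXEMPLARS = [
    ("Who are the shareholders of [ENT]?",
     "g.V().has('company','name',ENT).in('holds_share').values('name')"),
    ("What is the registered capital of [ENT]?",
     "g.V().has('company','name',ENT).values('registered_capital')"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
exemplar_vecs = encoder.encode([q for q, _ in EXEMPLARS],
                               normalize_embeddings=True)

def retrieve_scripts(user_query: str, k: int = 1):
    # Cosine similarity reduces to a dot product on normalized vectors.
    qv = encoder.encode([user_query], normalize_embeddings=True)[0]
    order = np.argsort(exemplar_vecs @ qv)[::-1][:k]
    return [EXEMPLARS[i] for i in order]

print(retrieve_scripts("list every shareholder of [ENT]"))
```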

In addition to the data preprocessing pipeline, EICopilot employs a comprehensive reasoning pipeline that uses Chain-of-Thought (CoT) prompting and In-Context Learning (ICL) to produce more accurate and precise query responses.

The authors also observe that vector-based query matching tends to latch onto the entity name in a query rather than its intent. To combat this, they propose a novel query masking strategy that masks entity names in queries before matching. EICopilot thereby ensures that queries are understood in their full complexity and executed with greater precision and relevance to user intent.
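
Here is a toy version of the masking idea, with a hard-coded entity list standing in for whatever entity recognizer the real system uses; the placeholder token and company names are invented for illustration.

```python
import re

KNOWN_ENTITIES = ["Acme Corp", "Globex Ltd"]  # stand-in for an NER step

def mask_entities(query: str, placeholder: str = "[ENT]") -> str:
    # Replace recognized company names so the embedding reflects the
    # question's intent rather than the specific entity mentioned.
    for name in KNOWN_ENTITIES:
        query = re.sub(re.escape(name), placeholder, query)
    return query

print(mask_entities("Who are the shareholders of Acme Corp?"))
# -> Who are the shareholders of [ENT]?
```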

The authors provide extensive empirical analysis and real-world experimentation validating the utility of the proposed framework. They obtained data from Baidu’s internal data platform and processed it rigorously to construct a dataset of paired natural-language queries and graph database queries. They introduce a length-based complexity score derived from the traversal length of a query, which categorizes queries as simple, moderate, or complex. To assess EICopilot’s performance, the authors considered the syntax error rate and execution correctness of the generated Gremlin scripts. For the LLMs, EICopilot utilized the ErnieBot, ErnieBot-Speed, and Llama3-8b models.

The empirical results from these experiments proved EICopilot’s superior performance over baselines, especially in speed and accuracy; notably, the Full Mask variant reduced the syntax error rate to as low as 10.00% and achieved execution correctness of up to 82.14%. These results highlight the critical role of the method’s components in enhancing query and summarization processes.

Conclusion: This paper introduced EICopilot, an agent-based chatbot that enhances querying and summarization over massive corporate knowledge graph databases. The authors proposed a series of innovations in script generation, data pre-processing, and masking. The method surpassed baselines in speed and accuracy, advancing large-scale knowledge graph exploration.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Baidu Research Introduces EICopilot: An Intelligent Agent-based Chatbot to Retrieve and Interpret Enterprise Information from Massive Graph Databases appeared first on MarkTechPost.

Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy https://www.marktechpost.com/2025/01/27/test-time-preference-optimization-a-novel-ai-framework-that-optimizes-llm-outputs-during-inference-with-an-iterative-textual-reward-policy/ Tue, 28 Jan 2025

Large Language Models (LLMs) have become an indispensable part of contemporary life, shaping the future of nearly every conceivable domain. They are widely acknowledged for their impressive performance across tasks of varying complexity. However, instances have arisen where LLMs have been criticized for generating unexpected and unsafe responses. Consequently, ongoing research aims to align LLMs more closely with human preferences while fully leveraging their extensive training data.

Methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have proven effective. However, they still require iterative training, which is often impractical. Researchers are therefore focusing on modifying inference approaches to match the performance of training-based optimization methods. This article explores the latest research that enhances human preference alignment during inference time.

Researchers from Shanghai AI Laboratory have introduced Test-Time Preference Optimization (TPO), a novel framework designed to align LLM outputs with human preferences during inference. This framework can be conceptualized as an online, on-policy learning paradigm, where the policy model continuously interacts with a novel reward model to refine its outputs.

TPO leverages interpretable textual feedback for preference optimization instead of conventional numerical scoring. To achieve this, the authors translate reward signals into textual rewards through critiques. The model then generates suggestions guided by the transformed rewards and updates its outputs to align with the signals at test time.

At test time, the newly generated responses are scored at each inference-time optimization step, and the extreme ends of response quality are labeled as “chosen” and “rejected” outputs. The model learns from the strengths of the chosen outputs and the shortfalls of the rejected responses to compile a “textual loss”, then generates suggestions, or “textual gradients”, for the next iteration. TPO thus improves the output iteratively through interactions with text rewards.
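
Schematically, one TPO iteration could be written as below. Every helper here (`generate`, `score`, `critique`, `suggest`, `revise`) is a hypothetical interface standing in for LLM and reward-model calls; the point is the shape of the loop, with no weight updates anywhere.

```python
def tpo_iteration(policy, reward_model, prompt, responses):
    # Score the current candidate responses with the reward model.
    scored = sorted(responses, key=lambda r: reward_model.score(prompt, r))
    rejected, chosen = scored[0], scored[-1]
    # "Textual loss": a natural-language critique contrasting the best
    # and worst responses instead of a numeric loss value.
    textual_loss = policy.critique(prompt, chosen, rejected)
    # "Textual gradient": concrete editing suggestions derived from it.
    textual_gradient = policy.suggest(prompt, textual_loss)
    # Update step: regenerate responses conditioned on the suggestions.
    return [policy.revise(prompt, r, textual_gradient) for r in responses]

# responses = policy.generate(prompt, n=5)
# for _ in range(num_steps):
#     responses = tpo_iteration(policy, reward_model, prompt, responses)
```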

The authors used aligned and unaligned policy models to validate the concept and to test whether prior preference optimization during training matters. Two key models in the study were Llama-3.1-70B-SFT, an unaligned model with no preference optimization during training, and Llama-3.1-70B-Instruct, an aligned model trained with preference optimization. Experiments spanned many datasets evaluating instruction following, preference alignment, safety, and mathematical reasoning.

Results from these experiments confirmed that a few TPO optimization steps significantly improved performance in both aligned and unaligned models. Comparing TPO-based inference optimization with traditional training-time optimization, the researchers found that the unaligned Llama-3.1-70B-SFT model outperformed its aligned counterpart, Llama-3.1-70B-Instruct, after undergoing TPO steps. Furthermore, applying TPO to an aligned model with as few as 22 billion parameters achieved a length-controlled (LC) win-rate score of 53.4% and a win-rate (WR) score of 72.2%.

Conclusion: The research team introduced TPO, an online, on-policy learning framework that aligns LLM outputs with human preferences. The framework optimizes responses at inference time, eliminating the hassle of retraining and weight updates. TPO also offers high scalability and flexibility, making it a promising approach for future LLM work.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


The post Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy appeared first on MarkTechPost.

SlideGar: A Novel AI Approach to Use LLMs in Retrieval Reranking, Solving the Challenge of Bound Recall https://www.marktechpost.com/2025/01/21/slidegar-a-novel-ai-approach-to-use-llms-in-retrieval-reranking-solving-the-challenge-of-bound-recall/ Wed, 22 Jan 2025

Out of the various methods employed in document search systems, “retrieve and rank” has gained quite some popularity. Using this method, the results of a retrieval model are re-ordered according to a re-ranker. Additionally, in the wake of advancements in generative AI and the development of Large Language Models (LLMs), rankers are now capable of performing listwise reranking tasks after analyzing complex patterns in language. However, a crucial problem exists that appears trivial but limits the overall effectiveness of these cascading systems.

The challenge is the bounded recall problem: if a document is not retrieved in the initial phase, it is irrevocably excluded from the final ranked list, losing potentially high-value information. To solve this, researchers devised adaptive retrieval. Adaptive Retrieval (AR) differs from previous work by leveraging the ranker’s assessments to expand the retrieval set dynamically, applying the clustering hypothesis to group similar documents that may be relevant to a query. AR can be understood as a pseudo-relevance feedback mechanism that raises the likelihood of including pertinent documents omitted during the initial retrieval.

Although AR serves as a robust solution in cascading systems, contemporary work in this vertical operates under the assumption that the relevance score depends only on the document and query, implying that one document’s score is computed independently of others. On the other hand, LLM-based ranking methods use signals from the entire ranked list to determine relevance. This article discusses the latest research that merges the benefits of LLMs with AR.

Researchers from the L3S Research Center, Germany, and the University of Glasgow have put forth SlideGar: Sliding Window-based Adaptive Retrieval to integrate AR with LLMs while accounting for the fundamental differences between their pointwise and listwise approaches. SlideGar modifies AR such that the resulting ranking function outputs a ranked order of documents rather than discrete relevance scores. The proposed algorithm merges results from the initial ranking with feedback documents provided by the most relevant documents identified up to that point.

The SlideGar algorithm utilizes AR methods such as graph-based adaptive retrieval (Gar) and query affinity modeling-based adaptive retrieval (Quam) to find document neighbors in constant time. For LLM ranking, the authors employ a sliding window to overcome the input-context constraint: SlideGar processes the initial pool of documents returned by the retriever for a query and, for a predefined window length and step size, ranks the top w documents from left to right using a listwise ranker. Ranked documents are then removed from the pool, and the reciprocal of a document’s rank serves as its pseudo-score.
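
A stripped-down sketch of this loop might look as follows. Here `listwise_rank` (the LLM re-ranker) and `neighbors` (the Gar/Quam corpus-graph lookup) are assumed interfaces, and details such as tie-breaking and budget handling are simplified relative to the actual algorithm.

```python
def slidegar(query, pool, listwise_rank, neighbors, budget=100, w=20, step=10):
    # Sliding-window listwise reranking with adaptive pool expansion.
    ranked, seen = [], set(pool)
    while pool and len(ranked) < budget:
        window, pool = pool[:w], pool[w:]
        ordered = listwise_rank(query, window)  # LLM orders the window
        ranked.extend(ordered[:step])           # commit the top `step` docs
        pool = ordered[step:] + pool            # rest slides into next window
        for doc in ordered[:step]:              # adaptive retrieval: pull in
            for nb in neighbors(doc):           # unseen neighbors of newly
                if nb not in seen:              # promoted documents
                    seen.add(nb)
                    pool.append(nb)
    # Pseudo-score: reciprocal rank of each committed document.
    return {doc: 1.0 / (i + 1) for i, doc in enumerate(ranked[:budget])}
```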

The authors employed the MSMARCO corpus and evaluated performance on the TREC Deep Learning 2019 and 2020 query sets, using the latest versions of these datasets, de-duplicated to remove redundancies. A variety of sparse and dense retrievers were utilized. For rankers, the authors employed different listwise rankers, including both zero-shot and fine-tuned models, applied via the open-source Python library rerankers.

After conducting an extensive set of experiments across diverse LLM re-rankers, first-stage retrievers, and feedback documents, the authors found that SlideGar improved the nDCG@10 score by up to 13% and recall by up to 28%, with a constant number of LLM inferences, over the SOTA listwise rankers. Regarding computation, the proposed method adds negligible latency (a mere 0.02%).

Conclusion: In this research paper, the authors propose a new algorithm, SlideGar, that allows LLM re-rankers to address the challenge of bounded recall in retrieval. SlideGar merges the functionalities of AR and LLM re-rankers to complement each other. This work paves the way for researchers to further explore and adapt LLMs for ranking purposes.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post SlideGar: A Novel AI Approach to Use LLMs in Retrieval Reranking, Solving the Challenge of Bound Recall appeared first on MarkTechPost.

Researchers from China Develop Advanced Compression and Learning Techniques to Process Long-Context Videos at 100 Times Less Compute https://www.marktechpost.com/2025/01/19/researchers-from-china-develop-advanced-compression-and-learning-techniques-to-process-long-context-videos-at-100-times-less-compute/ Mon, 20 Jan 2025

One of the most significant and advanced capabilities of a multimodal large language model is long-context video modeling, which allows models to handle movies, documentaries, and live streams spanning multiple hours. However, despite the commendable advancements made in video comprehension in LLMs, including caption generation and question answering, many obstructions remain in processing extremely long videos. The most crucial of these is understanding the context brought by long videos.

Although much work has already been done in this domain, ranging from training on massive text and frame corpora to building an effective training system with long-context parallelism and data packing, these super-long multimodal contexts have significantly reduced models’ training and inference efficiency. Moreover, the redundancy introduced by frames further complicates model learning. An interesting direction in this field is the compression of video tokens, which shows great potential but suffers from a trade-off in detailed representations. This article presents the latest research on a new compression method for long-context multimodal modeling.

Researchers from the Shenzhen Institutes of Advanced Technology propose a hierarchical video token compression method (HiCo), together with a practical context modeling system, VideoChat-Flash, tailored for processing long-context videos. HiCo addresses the visual redundancy in video by compressing extended contexts from the clip level to the video level, minimizing computation while preserving all critical data. VideoChat-Flash features a multi-stage short-to-long learning scheme along with a rich dataset of real-world long videos; it is a capable long-video-understanding MLLM, backed by a training infrastructure that supports a high degree of sequence parallelism.

HiCo compresses tokens hierarchically to obtain high-density token representations and widen the context window. The authors sequentially segment long videos into shorter clips and feed them into the MLLM, with compression based on spatiotemporal redundancies. HiCo further links the compressed tokens to the user query and exploits semantic correlations between clips and real-world embeddings to reduce the token count.
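
As a rough illustration of clip-level compression, the sketch below average-pools visual tokens over groups of consecutive frames, exploiting temporal redundancy. This is a crude, hand-rolled stand-in for HiCo's learned hierarchical compression; the shapes and pooling ratio are arbitrary assumptions.

```python
import torch

def compress_clip_tokens(tokens: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """tokens: (frames, patches, dim) visual tokens for one clip.

    Average-pools every `ratio` consecutive frames, cutting the token
    count by `ratio` on the assumption that adjacent frames are largely
    redundant.
    """
    f, p, d = tokens.shape
    f_trim = (f // ratio) * ratio          # drop a ragged tail, if any
    pooled = tokens[:f_trim].reshape(f_trim // ratio, ratio, p, d).mean(dim=1)
    return pooled.flatten(0, 1)            # (f_trim // ratio * patches, dim)

clip = torch.randn(16, 196, 1024)          # 16 frames of ViT patch tokens
print(compress_clip_tokens(clip).shape)    # torch.Size([784, 1024])
```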

Next, VideoChat-Flash employs a multi-stage short-to-long learning scheme with a corresponding data recipe. The authors begin supervised fine-tuning with short videos and associated captions and QAs, gradually shift to long videos, and ultimately train on a mixed-length corpus. Short videos prove highly effective for enhancing basic visual perception and for concisely expressing long videos. The authors provide a massive fine-tuning dataset encompassing 300,000 hours of videos with annotations spanning 2 billion words.

Another innovation proposed in the paper is a modified “Needle in a Haystack” (NIAH) task for multi-hop video configurations. Conventionally, the NIAH task evaluates a model by requiring it to locate an indicated image, find a target word, or answer a question in a video. Here, a target image is typically inserted into video frames, which the model can identify through visual distinction without understanding the context. To address this loophole, the authors proposed a new benchmark, “multi-hop needle in a video haystack,” which requires the model to locate a sequence of interconnected indicative images, where subsequent images can only be found using clues from the first image.

The proposed method achieved a computational reduction of up to two orders of magnitude in experiments. VideoChat-Flash, in particular, demonstrated remarkable performance on both mainstream short and long video benchmarks at 2B and 7B scales. The authors surpassed all other methods for the 7B scale model, proclaiming it as the new state-of-the-art in short video understanding. Even in long-video comprehension, their model outperformed previous open-source MLLMs, achieving SOTA in several benchmarks. The proposed model also exhibited strong temporal grounding capabilities, with zero-shot performance exceeding many renowned MLLMs. Additionally, VideoChat-Flash achieved an astounding accuracy of 99.1% on over 10,000 frames in NIAH.

Conclusion: The authors introduced a hierarchical compression technique, HiCo, and VideoChat-Flash, an MLLM trained using an innovative multi-stage scheme. This method advanced compression techniques to reduce computations for long-context videos while surpassing the accuracies of current SOTA models.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Researchers from China Develop Advanced Compression and Learning Techniques to process  Long-Context Videos at 100 Times Less Compute appeared first on MarkTechPost.

This AI Study Saves Researchers from Metadata Chaos with a Comparative Analysis of Extraction Techniques for Scholarly Documents https://www.marktechpost.com/2025/01/15/this-ai-study-saves-researchers-from-metadata-chaos-with-a-comparative-analysis-of-extraction-techniques-for-scholarly-documents/ Wed, 15 Jan 2025

Scientific metadata in research literature holds immense significance, as highlighted by flourishing research in scientometrics, a discipline dedicated to analyzing scholarly literature. Metadata improves the findability and accessibility of scientific documents by indexing and linking papers in a massive graph. Today, the research community has realized the importance of metadata; in the past, however, awareness was negligible, especially in non-technical disciplines such as the social sciences, making their publications less discoverable. Over time, many standards have been established to ensure uniformity, and metadata automation has progressed significantly, aided by advanced natural language processing (NLP) and computer vision techniques, with NLP leading metadata extraction. Still, a significant issue hinders its application in small and mid-sized publications, which often use a variety of templates and layouts. This article discusses the latest research comparing methods for metadata extraction from scholarly documents.

Researchers from the Fraunhofer Institute for Applied Information Technology took on this challenge and explored various feature learning and classification approaches for scientific PDFs. The authors employed techniques across domains, from classical methods to the latest innovations. They utilized techniques such as Conditional Random Fields, BiLSTM with BERT representations, and innovative multimodal and TextMap methods. The approaches chosen by the authors overcome the limitations of generative LLMs, which require data in a specified structure, making them incompatible with diverse publication formats. The authors leveraged the strengths of BERT and other architectures to address the uniqueness and variability of different documents, including embedded multimodal content.

The research team also curated two challenging labeled datasets to address the lack of ground truth for training DNN-based tools. For the first dataset, SSOAR-MVD, they synthesized 50,000 samples using predefined templates and available data. The second dataset, S-PMRD, was derived from the Semantic Scholar Open Research Corpus.

In the paper, the research team assumed that metadata typically appears only on the first page of a PDF and that its availability may vary across documents. They first employed Conditional Random Fields (CRFs), dividing the task into two sub-goals: identification and extraction. Identification relied on analyzing font changes, including color, size, style, and alignment variations; the identified lines then served as input to the CRF extraction layer. The authors subsequently used a BiLSTM with BERT embeddings and a BiLSTM-CRF approach, also with BERT embeddings. They also experimented with Grobid, a machine-learning library designed to parse document sections such as headers, titles, author information, and other metadata into XML/TEI format.
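
A line-level CRF of this kind might be set up roughly as follows with sklearn-crfsuite, where each first-page line is described by layout features (font size, boldness, position) and labeled with a metadata class. The feature set, labels, and hyperparameters are illustrative guesses, not the study's exact configuration.

```python
import sklearn_crfsuite

def line_features(line: dict) -> dict:
    # Layout cues a PDF parser could extract for one text line.
    return {
        "font_size": line["font_size"],
        "is_bold": line["is_bold"],
        "y_position": round(line["y"], 1),   # vertical position on the page
        "n_tokens": len(line["text"].split()),
        "has_digits": any(c.isdigit() for c in line["text"]),
    }

# One training document: a sequence of first-page lines with labels.
doc = [
    {"text": "A Study of Foo", "font_size": 18.0, "is_bold": True, "y": 0.9},
    {"text": "Jane Doe, John Roe", "font_size": 11.0, "is_bold": False, "y": 0.8},
    {"text": "Abstract. We study ...", "font_size": 9.5, "is_bold": False, "y": 0.6},
]
X = [[line_features(l) for l in doc]]
y = [["TITLE", "AUTHOR", "ABSTRACT"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```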

Furthermore, they employed Fast RCNN and vision-language-based models. Lastly, the authors conducted experiments using the TextMap approach, which applies a two-phase processing method to handle spatial representation and semantic mapping. They innovatively integrated spatial and semantic components through a carefully designed interpolation process.

The results from the above experiments were noteworthy. The first model, CRF, performed remarkably well for attributes with structured and predictable formats, such as dates, with an F1 score reaching 0.73. However, as data patterns diminished and complexity and variability increased, such as in the case of titles or authors’ names, its performance dwindled. BiLSTM demonstrated robustness in capturing the sequence and context of data, with an F1 score reaching as high as 0.9 for abstracts and dates. The BiLSTM-CRF performed moderately, as the capabilities of LSTM supported CRF, but it could not surpass the performance of BiLSTM alone. Grobid, despite its simple design, exceeded previous scores, achieving the highest F1-score of 0.96 in author extraction. Fast RCNN demonstrated high precision and recall across various metadata categories, achieving higher accuracies in recognizing titles, abstracts, and journals. In the TextMap method, the best output was obtained with Word2Vec embeddings, where performance reached 0.9 in F1-score.

Conclusion: The authors compared various classical and advanced machine-learning tools for accurate metadata extraction. The paper highlighted the strengths and shortcomings of each method, enabling users to select the most suitable approach based on dataset content, desired accuracy, and physical constraints.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post This AI Study Saves Researchers from Metadata Chaos with a Comparative Analysis of Extraction Techniques for Scholarly Documents appeared first on MarkTechPost.

ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Use Scenarios https://www.marktechpost.com/2025/01/11/toolhop-a-novel-dataset-designed-to-evaluate-llms-in-multi-hop-tool-use-scenarios/ Sat, 11 Jan 2025

Multi-hop queries have always given LLM agents a hard time, necessitating multiple reasoning steps and information from different sources. They are crucial for analyzing a model’s comprehension, reasoning, and function-calling capabilities. At a time when new large models appear every other day with claims of unparalleled capabilities, multi-hop tool use assesses them realistically: the model is presented with a complex query that it must decompose into atomic parts and solve iteratively by invoking and utilizing appropriate tools. Multi-hop tool evaluation has thus emerged as pivotal for advancing models toward generalized intelligence.

Existing work in this field falls short of offering a reliable evaluation method. Methods proposed so far rely on tool-driven data construction, where queries are simulated for a given collection of tools. This makes it hard to ensure that the collected tools are genuinely interdependent or that multi-hop reasoning is actually being assessed. Additionally, the absence of verifiable answers introduces model bias and evaluation errors. This article discusses the latest research presenting a reliable method to honestly assess the multi-hop capabilities of a large language model.

Fudan University and ByteDance researchers presented ToolHop, a dataset designed explicitly for multi-hop tool evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop aims to solve the aforementioned problems through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction approach that expands a single multi-hop query into a comprehensive multi-hop tool-use test case.

The proposed novel scheme comprises three key stages: tool creation, document refinement, and code generation.

Tool Creation: A preliminary set of tool documents is created for the user-provided multi-hop query. The documents are kept interdependent and relevant by resolving the query into atomic parts and handling each individually. Each document thus captures the essence of the query and is structured so that similar queries can be generated, ensuring modularity and cohesion.

Document Refinement: The prepared tool documents undergo comprehensive filtering to support the evaluation of models in complex multi-hop scenarios. New features like result filtering and customizable formats are introduced to expand functionality while maintaining originality; in parallel, the number of parameters is increased and their types are optimized.

Code Generation: At this stage, locally executable functions are generated from the prepared tool documents. Through these functions, tools can be invoked externally, enabling seamless multi-turn interactions between the model and the tools.
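
To make the interdependence concrete, here is a toy pair of locally executable tools of the kind such a benchmark might contain: the second tool can only be called meaningfully with the output of the first. The names, signatures, and data are invented for illustration and are not drawn from ToolHop itself.

```python
# Hypothetical lookup tables standing in for the tools' backing data.
BIRTH_YEARS = {"Ada Lovelace": 1815}
EVENTS_BY_YEAR = {1815: ["Battle of Waterloo", "eruption of Mount Tambora"]}

def get_birth_year(person: str) -> int:
    """Atomic hop 1: resolve a person's name to a birth year."""
    return BIRTH_YEARS[person]

def get_events_in_year(year: int) -> list[str]:
    """Atomic hop 2: consumes the output of hop 1."""
    return EVENTS_BY_YEAR[year]

# Multi-hop query: "Which events happened in the year Ada Lovelace was born?"
# A capable model must decompose the query and chain the calls:
answer = get_events_in_year(get_birth_year("Ada Lovelace"))
print(answer)
```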

The research team implemented the approach with queries drawn from the MoreHopQA dataset and carried out a rigorous five-dimensional analysis to validate ToolHop. ToolHop was then used to evaluate fourteen LLMs from five families, including open- and closed-source models, with the evaluation designed to ensure answer correctness and minimize invocation errors. The authors observed that using tools increased the models’ performance by up to 12% on average, and by up to 23% for GPT models. Even after this increase, the best-performing model achieved only 49.04% answer correctness, and models still hallucinated around 10% of the time despite using tools in response to multi-hop queries.

Conclusion: This paper presents a comprehensive dataset for evaluating multi-hop query solving through specially designed queries and tools. The main finding from the experiments is that while LLMs have significantly improved at solving complex multi-hop queries with the help of tools, their multi-hop tool-use capabilities still leave considerable room for improvement.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Use Scenarios appeared first on MarkTechPost.
