Vineet Kumar, Author at MarkTechPost
https://www.marktechpost.com/author/vineet1897/

Project Alexandria: Democratizing Scientific Knowledge Through Structured Fact Extraction with LLMs (Tue, 04 Mar 2025)

Scientific publishing has expanded significantly in recent decades, yet access to crucial research remains restricted for many, particularly in developing countries, independent researchers, and small academic institutions. The rising costs of journal subscriptions exacerbate this disparity, limiting the availability of knowledge even in well-funded universities. Despite the push for Open Access (OA), barriers persist, as demonstrated by large-scale access losses in Germany and the U.S. due to price disputes with publishers. This limitation hinders scientific progress, leading researchers to explore alternative methods for making scientific knowledge more accessible while navigating copyright constraints.

Current methods of accessing scientific content primarily involve direct subscriptions, institutional access, or reliance on legally ambiguous repositories. These approaches are either financially unsustainable or legally contentious. While OA publishing helps, it does not fully resolve the accessibility crisis. Large Language Models (LLMs) offer a new avenue for extracting and summarizing knowledge from scholarly texts, but their use raises copyright concerns. The challenge lies in separating factual content from the creative expressions protected under copyright law.

To address this, the research team proposes Project Alexandria, which introduces Knowledge Units (KUs) as a structured format for extracting factual information while omitting stylistic elements. KUs encode key scientific insights—such as definitions, relationships, and methodological details—in a structured database, ensuring that only non-copyrightable factual content is preserved. This framework aligns with legal principles like the idea-expression dichotomy, which states that facts cannot be copyrighted, only their specific phrasing and presentation.

Reference: https://arxiv.org/pdf/2502.19413

Knowledge Units are generated through an LLM pipeline that processes scholarly texts in paragraph-sized segments, extracting core concepts and their relationships. Each KU contains:

  • Entities: Core scientific concepts identified in the text.
  • Relationships: Connections between entities, including causal or definitional links.
  • Attributes: Specific details related to entities.
  • Context summary: A brief summary ensuring coherence across multiple KUs.
  • Sentence MinHash: A fingerprint to track the source text without storing the original phrasing.

This structured approach balances knowledge retention with legal defensibility. Paragraph-level segmentation ensures optimal granularity—too small, and information is scattered; too large, and LLM performance degrades.
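
To make the format concrete, here is a minimal sketch of what a single Knowledge Unit could look like as a Python data structure. The field names follow the list above, but the exact schema and the example values are illustrative assumptions, not the paper's released format.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KnowledgeUnit:
    entities: List[str]                      # core scientific concepts found in the paragraph
    relationships: List[Dict[str, str]]      # e.g., {"source": ..., "relation": ..., "target": ...}
    attributes: Dict[str, str]               # specific details attached to entities
    context_summary: str                     # brief summary that keeps coherence across KUs
    sentence_minhash: List[int] = field(default_factory=list)  # fingerprint of the source sentences, not their wording

# Illustrative example for one paragraph of a hypothetical biology paper.
ku = KnowledgeUnit(
    entities=["CRISPR-Cas9", "double-strand break"],
    relationships=[{"source": "CRISPR-Cas9", "relation": "induces", "target": "double-strand break"}],
    attributes={"CRISPR-Cas9": "RNA-guided endonuclease"},
    context_summary="The paragraph describes how CRISPR-Cas9 cuts DNA at a targeted site.",
)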

From a legal standpoint, the framework complies with both German and U.S. copyright laws. German law explicitly excludes facts from copyright protection and allows data mining under specific exemptions. Similarly, the U.S. Fair Use doctrine permits transformative uses like text and data mining, provided they do not harm the market value of the original work. The research team demonstrates that KUs satisfy these legal conditions by excluding expressive elements while preserving factual content.

To evaluate the effectiveness of KUs, the team conducted multiple-choice question (MCQ) tests using abstracts and full-text articles from biology, physics, mathematics, and computer science. The results show that LLMs using KUs achieve nearly the same accuracy as those given the original texts. This suggests that the vast majority of relevant information is retained despite the removal of expressive elements. Furthermore, plagiarism detection tools confirm minimal overlap between KUs and the original texts, reinforcing the method’s legal viability.

Beyond legal considerations, the research explores the limitations of existing alternatives. Text embeddings, commonly used for knowledge representation, fail to capture precise factual details, making them unsuitable for scientific knowledge extraction. Direct paraphrasing methods risk maintaining too much similarity to the original text, potentially violating copyright laws. In contrast, KUs provide a more structured and legally sound approach.

The study also addresses common criticisms. While some argue that citation dilution could result from extracting knowledge into databases, traceable attribution systems can mitigate this concern. Others worry that nuances in scientific research may be lost, but the team highlights that most complex elements—like mathematical proofs—are not copyrightable to begin with. Concerns about potential legal risks and hallucination propagation are acknowledged, with recommendations for hybrid human-AI validation systems to enhance reliability.

The broader impact of freely accessible scientific knowledge extends across multiple sectors. Researchers can collaborate more effectively across disciplines, healthcare professionals can access critical medical research more efficiently, and educators can develop high-quality curricula without cost barriers. Additionally, open scientific knowledge promotes public trust and transparency, reducing misinformation and enabling informed decision-making.

Moving forward, the team identifies several research directions, including refining factual accuracy through cross-referencing, developing educational applications for KU-based knowledge dissemination, and establishing interoperability standards for knowledge graphs. They also propose integrating KUs into a broader semantic web for scientific discovery, leveraging AI to automate and validate extracted knowledge at scale.

In summary, Project Alexandria presents a promising framework for making scientific knowledge more accessible while respecting copyright constraints. By systematically extracting factual content from scholarly texts and structuring it into Knowledge Units, this approach provides a legally viable and technically effective solution to the accessibility crisis in scientific publishing. Extensive testing demonstrates its potential for preserving critical information without violating copyright laws, positioning it as a significant step toward democratizing access to knowledge in the scientific community.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.
NeoBERT: Modernizing Encoder Models for Enhanced Language Understanding (Mon, 03 Mar 2025)

Encoder models like BERT and RoBERTa have long been cornerstones of natural language processing (NLP), powering tasks such as text classification, retrieval, and toxicity detection. However, while decoder-based large language models (LLMs) like GPT and LLaMA have evolved rapidly—incorporating architectural innovations, larger datasets, and extended context windows—encoders have stagnated. Despite their critical role in embedding-dependent applications, BERT-family models rely on outdated architectures, limited training data, and short context lengths, leading to suboptimal performance on modern benchmarks. In this paper, the researchers have presented NeoBERT to revitalize encoder design by integrating advancements from decoder models while addressing inherent limitations of existing encoders.

Traditional encoders like BERT and RoBERTa use absolute positional embeddings, Gaussian Error Linear Unit (GELU) activations, and a fixed 512-token context window. While newer models like GTE and CDE improved fine-tuning strategies for tasks like retrieval, they rely on outdated backbone architectures inherited from BERT. These backbones suffer from inefficiencies:

  1. Architectural Rigidity: Fixed depth-to-width ratios and positional encoding methods limit adaptability to longer sequences.
  2. Data Scarcity: Pre-training on small datasets (e.g., Wikipedia + BookCorpus) restricts knowledge diversity.
  3. Context Constraints: Short sequence lengths (512–2,048 tokens) hinder applications requiring long-context understanding.

Recent fine-tuning advancements masked these issues but failed to modernize the core models. For example, GTE’s contrastive learning boosts retrieval performance but cannot compensate for BERT’s obsolete embeddings. NeoBERT addresses these gaps through architectural overhauls, data scaling, and optimized training:

  1. Architectural Modernization (a minimal sketch of these components follows this list):
    1. Rotary Position Embeddings (RoPE): Replaces absolute positional embeddings with relative positioning, enabling better generalization to longer sequences. RoPE integrates positional information directly into attention mechanisms, reducing degradation on out-of-distribution lengths.
    2. Depth-to-Width Optimization: Adjusts layer depth (28 layers) and width (768 dimensions) to balance parameter efficiency and performance, avoiding the “width-inefficiency” of smaller models.
    3. RMSNorm and SwiGLU: Replaces LayerNorm with RMSNorm for faster computation and adopts SwiGLU activations, enhancing nonlinear modeling while maintaining parameter count.
  2. Data and Training:
    1. RefinedWeb Dataset: Trains on 600B tokens (18× larger than RoBERTa’s data), exposing the model to diverse, real-world text.
    2. Two-Stage Context Extension: First pre-trains on 1,024-token sequences, then fine-tunes on 4,096-token batches using a mix of standard and long-context data. This phased approach mitigates distribution shifts while expanding usable context.
    3. Efficiency Optimizations:
      1. FlashAttention and xFormers: Reduces memory overhead for longer sequences.
      2. AdamW with Cosine Decay: Balances training stability and regularization.
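
To make the three architectural components above concrete, here is a minimal, self-contained PyTorch sketch of RoPE, RMSNorm, and SwiGLU. It is illustrative only: the dimensions, rotation convention, and module names are assumptions, not NeoBERT's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    # x: (batch, seq_len, n_heads, head_dim); head_dim must be even.
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq_len, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each feature pair by a position-dependent angle; relative positions fall out of the attention dot product.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Normalize by the root mean square only (no mean subtraction), which is cheaper than LayerNorm.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        # SiLU-gated feed-forward: silu(x W_gate) * (x W_up), projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))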

Performance and Evaluation

NeoBERT’s improvements are validated across the following benchmarks:

  1. GLUE: Scores 89.0%, matching RoBERTa-large’s performance despite having 100M fewer parameters. Key drivers include the RefinedWeb dataset (+3.6% gain) and scaled model size (+2.9%).
  2. MTEB: Outperforms GTE, CDE, and jina-embeddings by +4.5% under standardized contrastive fine-tuning, demonstrating superior embedding quality. The evaluation isolates pre-training benefits by applying identical fine-tuning protocols to all models.
  3. Context Length: NeoBERT4096 achieves stable perplexity on 4,096-token sequences after 50k additional training steps, whereas BERT struggles beyond 512 tokens. Efficiency tests show NeoBERT processes 4,096-token batches 46.7% faster than ModernBERT, despite larger size.

In conclusion, NeoBERT represents a paradigm shift for encoder models, bridging the gap between stagnant architectures and modern LLM advancements. By rethinking depth-to-width ratios, positional encoding, and data scaling, it achieves state-of-the-art performance on GLUE and MTEB while supporting context windows eight times longer than BERT. Its efficiency and open-source availability make it a practical choice for retrieval, classification, and real-world applications requiring robust embeddings. However, reliance on web-scale data introduces biases, necessitating ongoing updates as cleaner datasets emerge. NeoBERT’s success underscores the untapped potential of encoder modernization, setting a roadmap for future research in efficient, scalable language understanding.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.
LEAPS: A Neural Sampling Algorithm for Discrete Distributions via Continuous-Time Markov Chains (‘Discrete Diffusion’) (Fri, 28 Feb 2025)

Sampling from probability distributions with known density functions (up to normalization) is a fundamental challenge across various scientific domains. From Bayesian uncertainty quantification to molecular dynamics and quantum physics, the ability to efficiently generate representative samples is crucial. While Markov chain Monte Carlo (MCMC) methods have long been the dominant approach, they often suffer from slow convergence, especially when dealing with multimodal distributions.

Traditional MCMC methods frequently struggle with convergence to equilibrium, leading researchers to combine them with non-equilibrium dynamics through techniques like annealed importance sampling (AIS) or sequential Monte Carlo (SMC). However, these methods can still exhibit high variance in their importance weights, resulting in inefficient sampling. The integration of deep learning with sampling algorithms has shown promise in continuous domains, but there remains a significant gap in effective sampling approaches for discrete distributions – despite their prevalence in applications ranging from statistical physics to genomic data and language modeling.

The research team addresses this gap with LEAPS (Locally Equivariant discrete Annealed Proactive Sampler), a novel sampling method that leverages continuous-time Markov chains (CTMCs) to efficiently sample from discrete distributions. LEAPS combines the theoretical foundation of non-equilibrium dynamics with neural network-based learning to create a powerful sampling approach.

LEAPS works by constructing a time-dependent probability path (ρt) that begins with an easy-to-sample distribution (ρ0) and gradually transforms it into the target distribution (ρ1). The central innovation lies in designing a CTMC whose evolution follows this prescribed path, enabling efficient sampling through a combination of:

  1. Proactive Importance Sampling: The researchers developed a novel importance sampling scheme that anticipates where the CTMC will jump next, accumulating weights that reflect the deviation from the true distribution.
  2. Locally Equivariant Neural Networks: A key computational breakthrough that allows efficient calculation of importance weights without the prohibitive costs associated with evaluating all neighboring states.
  3. PINN Objective: A physics-informed neural network objective that trains the CTMC rate matrix by minimizing the variance of importance sampling weights.

Traditional approaches would require evaluating the neural network for each neighbor of a state, making the computation of importance weights prohibitively expensive for high-dimensional spaces. LEAPS introduces the concept of “local equivariance” – an inductive bias that enables computing these weights in a single forward pass of the neural network.

A locally equivariant neural network ensures that the “flux of probability” from a state to its neighbor is exactly negative of the flux from the neighbor back to the state. This property allows the model to efficiently capture the dynamics of the system without redundant calculations.
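
As a concrete illustration of this antisymmetry property, here is a minimal PyTorch sketch of one locally equivariant layer for binary spins in {-1, +1}. The zero-diagonal weight matrix is one simple way to guarantee the property in a single forward pass; it is an assumption for illustration, not the paper's exact parameterization.

import torch
import torch.nn as nn

class LocallyEquivariantLinear(nn.Module):
    """G_i(x) = x_i * f_i(x_{-i}): because f_i never sees x_i (zero diagonal),
    flipping spin i exactly negates output i -- the antisymmetric "flux" property."""
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # x: (batch, dim) with entries in {-1, +1}
        d = self.weight.shape[0]
        w = self.weight * (1.0 - torch.eye(d, device=x.device))  # zero the diagonal
        f = x @ w.T + self.bias                                   # f_i is independent of x_i
        return x * f                                              # so G_i(flip_i(x)) = -G_i(x)

# Quick check of the property on one spin configuration.
layer = LocallyEquivariantLinear(dim=8)
x = torch.randint(0, 2, (1, 8)).float() * 2 - 1
x_flipped = x.clone(); x_flipped[0, 3] *= -1
print(layer(x)[0, 3], layer(x_flipped)[0, 3])  # equal magnitude, opposite sign

A single forward pass produces the value for every candidate flip at once, which is the efficiency point made above.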

The research team demonstrates how to construct locally equivariant versions of popular neural network architectures:

  • Multilayer Perceptrons (MLPs) with specifically constrained weight matrices
  • Locally-Equivariant Attention (LEA) layers that maintain the equivariance property
  • Locally-Equivariant Convolutional (LEC) networks that can be stacked into deep architectures

LEAPS is not just computationally efficient but also theoretically sound. The researchers prove that their proactive importance sampling scheme provides unbiased estimates and that the locally equivariant parameterization of rate matrices is universally expressive – meaning it can represent any valid CTMC for the sampling problem.

A noteworthy theoretical result is that LEAPS generalizes both AIS and SMC methods. When the neural network component is set to zero, LEAPS recovers these classical approaches, making it a strict superset of these well-established sampling techniques.

To demonstrate LEAPS in action, the researchers applied it to sampling from a 2D Ising model – a classic challenge in statistical physics. Working with a 15×15 lattice (a 225-dimensional discrete space), they compared different neural architectures implementing their method against ground truth samples generated by long-run Glauber dynamics.

The results are impressive:

  • Convolutional architectures outperformed attention-based models, with deeper networks yielding better results
  • LEAPS accurately captured the magnetization distribution and two-point correlation functions
  • The method achieved high effective sample size (ESS), indicating efficient sampling with low-variance importance weights
  • LEAPS significantly outperformed pure MCMC approaches with the same number of sampling steps

What makes LEAPS particularly valuable is its ability to handle high-dimensional discrete spaces, which are ubiquitous in real-world applications but notoriously challenging for sampling algorithms. The method combines the statistical guarantees of traditional approaches with the representational power of deep learning. Additionally, LEAPS can be integrated with existing MCMC schemes, effectively combining learned transport with traditional random walks to achieve better mixing properties. This hybrid approach provides a practical pathway for researchers to enhance their existing sampling methods.

In conclusion, LEAPS represents a significant advancement in sampling from discrete distributions, especially in high-dimensional settings. By leveraging locally equivariant neural networks and proactive importance sampling, it offers a computationally efficient approach with strong theoretical guarantees. The research team suggests several promising directions for future work, including extending LEAPS to sample from entire families of distributions simultaneously and applying the locally equivariant neural network architecture to other probabilistic modeling tasks. The connection between LEAPS and guidance or reward fine-tuning of generative CTMC models also presents an exciting avenue for further exploration.


Check out the Paper. All credit for this research goes to the researchers of this project.
Building an Ideation Agent System with AutoGen: Create AI Agents that Brainstorm and Debate Ideas (Thu, 20 Feb 2025)

Ideation processes often require time-consuming analysis and debate. What if we made two LLMs come up with ideas and then had them debate those ideas? Sounds interesting, right? This tutorial shows exactly how to build such an AI-powered solution using two LLM agents that collaborate through structured conversation. To achieve this, we will use AutoGen to build the agents and an OpenAI model (GPT-4o-mini) as the LLM behind them.

1. Setup and Installation  

First install required packages:

pip install -U autogen-agentchat
pip install "autogen-ext[openai]"

2. Core Components  

Let’s explore the key components of AutoGen that make this ideation system work. Understanding these components will help you customize and extend the system for your specific needs.

1. RoundRobinGroupChat

  • Manages a team of agents in a turn-based manner.
  • Agents take turns responding, and all messages are shared for context.
  • Ensures structured and fair interaction.

2. TextMentionTermination

  • Stops the conversation when a specific keyword (e.g., “FINALIZE”) is detected.
  • Useful for ending discussions when agents reach consensus or complete a task.

3. AssistantAgent

  • Represents an LLM-powered team member with a specific role.
  • Each agent is defined by a system message that guides its behavior.
  • Agents use the conversation history to generate context-aware responses.

These components work together to create a structured, collaborative system where agents brainstorm, debate, and reach decisions efficiently.

 3. Building the Agent Team  

Create two specialized agents with distinct roles:

import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.base import TaskResult
from autogen_agentchat.conditions import ExternalTermination, TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.ui import Console
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient

# The OpenAI key is kept in a local apikey.py file that defines API_KEY.
from apikey import API_KEY

# Create an OpenAI model client.
model_client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
    api_key=API_KEY,
)

# Create the primary agent.
primary_agent = AssistantAgent(
    "participant1",
    model_client=model_client,
    system_message="You are a participant in an ideation and feedback session. You will be provided with a problem statement and asked to generate ideas. Your ideas will be\
    reviwed by another participant and then you together will narrow down ideas by debating over them. Respond with 'FINALIZE' when you have a final idea.",
)

# Create the critic agent.
critic_agent = AssistantAgent(
    "participant2",
    model_client=model_client,
    system_message="You are a participant in an ideation and feedback session. Your teammate will be provide some ideas that you need to review with your \
        teammate and narrow down ideas by debating over them. Respond with 'FINALIZE' when you have a final idea.",
)

# Define a termination condition that stops the task if the critic approves.
text_termination = TextMentionTermination("FINALIZE")

# Create a team with the primary and critic agents.
team = RoundRobinGroupChat([primary_agent, critic_agent], termination_condition=text_termination)

4. Running the Team  

Execute with asynchronous processing:

result = await team.run(task="Generate ideas for applications of AI in healthcare.")
print(result)
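
Note that a bare await like this only works in a notebook (or another environment with a running event loop). In a plain Python script, wrap the call in an async main function, for example:

import asyncio

async def main():
    result = await team.run(task="Generate ideas for applications of AI in healthcare.")
    print(result)

asyncio.run(main())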

5. Monitoring Interactions  

You can also track the debate in real-time:

# When running inside a script, use a async main function and call it from `asyncio.run(...)`.
await team.reset()  # Reset the team for a new task.
async for message in team.run_stream(task="Generate ideas for applications of AI in healthcare."):  # type: ignore
    if isinstance(message, TaskResult):
        print("Stop Reason:", message.stop_reason)
    else:
        print(message)

AutoGen also provides a Console helper to render these interactions in a more readable way:

await team.reset()  # Reset the team for a new task.
await Console(team.run_stream(task="Generate ideas for applications of AI in healthcare."))  # Stream the messages to the console.

Now the system is complete, but there is still a lot to play around with, and I will leave that to you. Here are a few ideas to enhance your system:

  • Adding domain-specific agents (medical experts, technical validators)
  • Implementing custom termination conditions (see the sketch after this list)
  • Making a simple UI using streamlit
  • Adding more players to the team
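
As an example of the custom termination idea above, the sketch below combines the keyword-based stop with a message cap. It assumes the MaxMessageTermination condition and the | operator for combining conditions available in recent autogen-agentchat releases:

from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination

# Stop when an agent says "FINALIZE" or after 20 messages, whichever happens first.
termination = TextMentionTermination("FINALIZE") | MaxMessageTermination(max_messages=20)
team = RoundRobinGroupChat([primary_agent, critic_agent], termination_condition=termination)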

Breaking the Autoregressive Mold: LLaDA Proves Diffusion Models can Rival Traditional Language Architectures (Thu, 20 Feb 2025)

The field of large language models has long been dominated by autoregressive methods that predict text sequentially from left to right. While these approaches power today’s most capable AI systems, they face fundamental limitations in computational efficiency and bidirectional reasoning. A research team from China has now challenged the assumption that autoregressive modeling is the only path to achieving human-like language capabilities, introducing an innovative diffusion-based architecture called LLaDA that reimagines how language models process information.  

Current language models operate through next-word prediction, requiring increasingly complex computations as context windows grow. This sequential nature creates bottlenecks in processing speed and limits effectiveness on tasks requiring reverse reasoning. For instance, traditional autoregressive models suffer from the reversal curse—a phenomenon where models trained to predict the next token struggle with backward logical tasks. Consider poetry completion:  

  • Forward Task (Autoregressive Strength): Given the prompt “Roses are red,” models easily continue with “violets are blue.”  
  • Reversal Task (Autoregressive Weakness): Given “violets are blue,” the same models often fail to recall “Roses are red” as the preceding line.  

This directional bias stems from their training to predict text strictly left-to-right. While masked language models (like BERT) exist, they traditionally use fixed masking ratios, limiting their generative capabilities. The researchers propose LLaDA (Large Language Diffusion with mAsking), which implements a dynamic masking strategy across diffusion steps to overcome these constraints (Illustrated in Fig. 2). Unlike autoregressive models, LLaDA processes tokens in parallel through a bidirectional framework, learning contextual relationships in all directions simultaneously.  

LLaDA’s architecture employs a transformer without causal masking, trained through two phases:  

  1. Pre-training: The model learns to reconstruct randomly masked text segments across 2.3 trillion tokens. Imagine repairing a damaged manuscript where words vanish unpredictably—LLaDA practices filling gaps in any order (a minimal sketch of this masking objective follows this list). For example:
  • Start with a masked sentence: “[MASK] are red, [MASK] are blue.”
  • Predict “violets” for the second blank first, then “Roses” for the first.
  • Repeated masking/unmasking cycles eliminate directional bias.
  2. Supervised Fine-Tuning: The model adapts to instruction-response pairs by masking only the response portion, enabling task-specific refinement while retaining bidirectional understanding.
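
Here is a minimal sketch of the dynamic masking objective described above, assuming a generic bidirectional model that returns per-position logits; the paper's exact loss weighting (which rescales by the masking ratio) is simplified here.

import torch
import torch.nn.functional as F

def dynamic_masking_loss(model, tokens, mask_id):
    # tokens: (batch, seq_len) integer token ids
    b, s = tokens.shape
    t = torch.rand(b, 1)                                   # masking ratio t ~ U(0, 1), unlike BERT's fixed 15%
    mask = torch.rand(b, s) < t                            # each position masked independently with prob t
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                              # (batch, seq_len, vocab); all positions predicted in parallel
    return F.cross_entropy(logits[mask], tokens[mask])     # loss only on the masked positions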

During generation, LLaDA starts with fully masked output fields and iteratively refines predictions through confidence-based remasking:  

  1. At each diffusion step, the model predicts all masked tokens simultaneously.  
  2. Low-confidence predictions (e.g., uncertain words in a poem’s opening line) are remasked for re-evaluation.  
  3. This “semantic annealing” process repeats until coherent text emerges (a minimal sketch of the loop follows the reference below).

Reference: https://arxiv.org/pdf/2502.09992
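
The sketch below illustrates this generation loop; the number of steps, the remasking schedule, and the use of argmax confidence are illustrative assumptions rather than the authors' exact recipe.

import torch

def diffusion_generate(model, prompt_ids, answer_len, mask_id, steps=32):
    # Start with the whole answer region masked.
    x = torch.cat([prompt_ids, torch.full((answer_len,), mask_id, dtype=prompt_ids.dtype)])
    ans = slice(len(prompt_ids), len(prompt_ids) + answer_len)
    for step in range(steps):
        logits = model(x.unsqueeze(0))[0]                           # predict every position in parallel
        conf, pred = logits.softmax(-1).max(-1)
        x[ans] = torch.where(x[ans] == mask_id, pred[ans], x[ans])  # fill every masked slot with its prediction
        n_remask = int(answer_len * (1 - (step + 1) / steps))
        if n_remask > 0:
            # "Semantic annealing": send the least confident answer tokens back to [MASK] for the next step.
            worst = conf[ans].argsort()[:n_remask]
            x[ans.start + worst] = mask_id
    return x[ans]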

Performance evaluations reveal surprising capabilities. When scaled to 8 billion parameters, LLaDA matches or exceeds equivalent-sized autoregressive models like LLaMA2-7B across 15 benchmarks, excelling in mathematical reasoning (GSM8K) and Chinese tasks. Crucially, it overcomes the reversal curse:  

  • Achieved 42% accuracy on backward poem completion tasks vs. GPT-4’s 32%, while maintaining parity in forward generation.  
  • Demonstrated consistent performance on reversal QA tasks (e.g., “Who is Tom Cruise’s mother?” vs. “Who is Mary Lee Pfeiffer’s son?”), where autoregressive models often fail.  

The model also shows efficient scaling—computational costs grow comparably to traditional architectures despite its novel approach. Notably, in tasks such as MMLU and GSM8K, LLaDA exhibits even stronger scalability. 

In summary, this breakthrough suggests key language capabilities emerge from fundamental generative principles, not autoregressive designs alone. While current implementations lag slightly in tasks like MMLU (likely due to data quality variances), LLaDA establishes diffusion models as viable alternatives. The research opens doors to parallel generation and bidirectional reasoning, though challenges remain in inference optimization and alignment with human preferences. As the field explores these alternatives, we may be witnessing the early stages of a paradigm shift in how machines process language—one where models “think holistically” rather than being constrained to linear prediction.  


    Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.
    ReasonFlux: Elevating LLM Reasoning with Hierarchical Template Scaling (Sat, 15 Feb 2025)

    Large language models (LLMs) have demonstrated exceptional problem-solving abilities, yet complex reasoning tasks—such as competition-level mathematics or intricate code generation—remain challenging. These tasks demand precise navigation through vast solution spaces and meticulous step-by-step deliberation. Existing methods, while improving accuracy, often suffer from high computational costs, rigid search strategies, and difficulty generalizing across diverse problems. In this paper researchers introduced a new framework, ReasonFlux that addresses these limitations by reimagining how LLMs plan and execute reasoning steps using hierarchical, template-guided strategies.  

    Recent approaches to enhance LLM reasoning fall into two categories: deliberate search and reward-guided methods. Techniques like Tree of Thoughts (ToT) enable LLMs to explore multiple reasoning paths, while Monte Carlo Tree Search (MCTS) decomposes problems into steps guided by process reward models (PRMs). Though effective, these methods scale poorly due to excessive sampling and manual search design. For instance, MCTS requires iterating through thousands of potential steps, making it computationally prohibitive for real-world applications. Meanwhile, retrieval-augmented generation (RAG) methods like Buffer of Thought (BoT) leverage stored problem-solving templates but struggle to integrate multiple templates adaptively, limiting their utility in complex scenarios.  

    ReasonFlux introduces a structured framework that combines a curated library of high-level thought templates with hierarchical reinforcement learning (HRL) to dynamically plan and refine reasoning paths. Instead of optimizing individual steps, it focuses on configuring optimal template trajectories—sequences of abstract problem-solving strategies retrieved from a structured knowledge base. This approach simplifies the search space and enables efficient adaptation to sub-problems. The framework consists of three main components:

    1. Structured Template Library: The research team constructed a library of 500 thought templates, each encapsulating a problem-solving strategy (e.g., “Trigonometric Substitution for Integral Optimization”). Templates include metadata—names, tags, descriptions, and application steps—enabling efficient retrieval (a minimal template entry is sketched after this list). For example, a template tagged “Irrational Function Optimization” might guide an LLM to apply specific algebraic substitutions.
    2. Hierarchical Reinforcement Learning:
      1. Structure-Based Fine-Tuning: A base LLM (e.g., Qwen2.5-32B) is fine-tuned to associate template metadata with their functional descriptions, ensuring it understands when and how to apply each template.
      2. Template Trajectory Optimization: Using preference learning, the model learns to rank template sequences by their effectiveness. For a given problem, multiple trajectories are sampled, and their success rates on similar problems determine rewards. This trains the model to prioritize high-reward sequences, refining its planning capability.
    3. Adaptive Inference Scaling: During inference, ReasonFlux acts as a “navigator,” analyzing the problem to retrieve relevant templates and dynamically adjusting the trajectory based on intermediate results. For instance, if a step involving “Polynomial Factorization” yields unexpected constraints, the system might pivot to a “Constraint Propagation” template. This iterative interplay between planning and execution mirrors human problem-solving, where partial solutions inform subsequent steps.
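
    To make the template library concrete, here is a minimal sketch of what one template entry and a naive tag-based retrieval step could look like. The field names, the example entry, and the keyword-overlap retrieval are illustrative assumptions; in the paper, retrieval and trajectory planning are handled by the fine-tuned navigator model.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ThoughtTemplate:
        name: str
        tags: List[str]
        description: str
        application_steps: List[str] = field(default_factory=list)

    library = [
        ThoughtTemplate(
            name="Trigonometric Substitution for Integral Optimization",
            tags=["integral", "optimization", "trigonometric substitution"],
            description="Substitute x = a*sin(t) (or a tan/sec variant) to remove radicals before optimizing.",
            application_steps=["Identify the radical pattern", "Choose a substitution", "Simplify", "Optimize in t and map back to x"],
        ),
    ]

    def retrieve(problem_tags, library, top_k=3):
        # Rank templates by tag overlap with the problem description (a stand-in for the navigator LLM).
        return sorted(library, key=lambda t: -len(set(t.tags) & set(problem_tags)))[:top_k]

    print([t.name for t in retrieve(["integral", "optimization"], library)])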

    ReasonFlux was evaluated on competition-level benchmarks like MATH, AIME, and OlympiadBench, outperforming both frontier models (GPT-4o, Claude) and specialized open-source models (DeepSeek-V3, Mathstral). Key results include:  

    • 91.2% accuracy on MATH, surpassing OpenAI’s o1-preview by 6.7%.  
    • 56.7% on AIME 2024, exceeding DeepSeek-V3 by 45% and matching o1-mini.  
    • 63.3% on OlympiadBench, a 14% improvement over prior methods.  

    Moreover, the structured template library demonstrated strong generalization: when applied to variant problems, it boosted smaller models (e.g., 7B parameters) to outperform larger counterparts using direct reasoning. Additionally, ReasonFlux achieved a superior exploration-exploitation balance, requiring 40% fewer computational steps than MCTS and Best-of-N on complex tasks (Figure 5).  

    In summary, ReasonFlux redefines how LLMs approach complex reasoning by decoupling high-level strategy from step-by-step execution. Its hierarchical template system reduces computational overhead while improving accuracy and adaptability, addressing critical gaps in existing methods. By leveraging structured knowledge and dynamic planning, the framework sets a new standard for efficient, scalable reasoning—proving that smaller, well-guided models can rival even the largest frontier systems. This innovation opens avenues for deploying advanced reasoning in resource-constrained environments, from education to automated code generation.  


    Check out the Paper. All credit for this research goes to the researchers of this project.
    Step by Step Guide on How to Build an AI News Summarizer Using Streamlit, Groq and Tavily (Fri, 14 Feb 2025)

    Introduction

    In this tutorial, we will build an advanced AI-powered news agent that can search the web for the latest news on a given topic and summarize the results. This agent follows a structured workflow:

    1. Browsing: Generates relevant search queries and collects information from the web.
    2. Writing: Extracts and compiles news summaries from the collected information.
    3. Reflection: Critiques the summaries by checking for factual correctness and suggests improvements.
    4. Refinement: Improves the summaries based on the critique.
    5. Headline Generation: Generates appropriate headlines for each news summary.

    To enhance usability, we will also create a simple GUI using Streamlit. Similar to previous tutorials, we will use Groq for LLM-based processing and Tavily for web browsing. You can generate free API keys from their respective websites.

    Setting Up the Environment

    We begin by setting up environment variables, installing the required libraries, and importing necessary dependencies:

    Install Required Libraries

    pip install langgraph==0.2.53 langgraph-checkpoint==2.0.6 langgraph-sdk==0.1.36 langchain-groq langchain-community langgraph-checkpoint-sqlite==2.0.1 tavily-python streamlit

    Import Libraries and Set API Keys

    import os
    import sqlite3
    from langgraph.graph import StateGraph
    from langchain_core.messages import SystemMessage, HumanMessage
    from langchain_groq import ChatGroq
    from tavily import TavilyClient
    from langgraph.checkpoint.sqlite import SqliteSaver
    from typing import TypedDict, List
    from pydantic import BaseModel
    import streamlit as st
    
    # Set API Keys
    os.environ['TAVILY_API_KEY'] = "your_tavily_key"
    os.environ['GROQ_API_KEY'] = "your_groq_key"
    
    # Initialize Database for Checkpointing
    sqlite_conn = sqlite3.connect("checkpoints.sqlite", check_same_thread=False)
    memory = SqliteSaver(sqlite_conn)
    
    # Initialize Model and Tavily Client
    model = ChatGroq(model="Llama-3.1-8b-instant")
    tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

    Defining the Agent State

    The agent maintains state information throughout its workflow:

    1. Topic: The topic on which the user wants the latest news.
    2. Drafts: The first drafts of the news summaries.
    3. Content: The research content extracted from Tavily's search results.
    4. Critique: The critique and recommendations generated for each draft in the reflection state.
    5. Refined Summaries: Updated news summaries after incorporating the suggestions from the critique.
    6. Headings: Headlines generated for each news article.

    class AgentState(TypedDict):
        topic: str
        drafts: List[str]
        content: List[str]
        critiques: List[str]
        refined_summaries: List[str]
        headings: List[str]

    Defining Prompts

    We define system prompts for each phase of the agent’s workflow:

    BROWSING_PROMPT = """You are an AI news researcher tasked with finding the latest news articles on given topics. Generate up to 3 relevant search queries."""
    
    WRITER_PROMPT = """You are an AI news summarizer. Write a detailed summary (1 to 2 paragraphs) based on the given content, ensuring factual correctness, clarity, and coherence."""
    
    CRITIQUE_PROMPT = """You are a teacher reviewing draft summaries against the source content. Ensure factual correctness, identify missing or incorrect details, and suggest improvements.
    ----------
    Content: {content}
    ----------"""
    
    REFINE_PROMPT = """You are an AI news editor. Given a summary and critique, refine the summary accordingly.
    -----------
    Summary: {summary}"""
    
    HEADING_GENERATION_PROMPT = """You are an AI news summarizer. Generate a short, descriptive headline for each news summary."""

    Structuring Queries and News

    We use Pydantic to define the structure of the queries and the news articles. Pydantic lets us constrain the structure of the LLM's output, which matters here because the search queries must be a list of strings, and the content extracted from the web contains multiple news articles, so it is also a list of strings.

    from pydantic import BaseModel
    
    class Queries(BaseModel):
        queries: List[str]
    
    class News(BaseModel):
        news: List[str]
    

    Implementing the AI Agents

    1. Browsing Node

    This node generates search queries and retrieves relevant content from the web.

    def browsing_node(state: AgentState):
        queries = model.with_structured_output(Queries).invoke([
            SystemMessage(content=BROWSING_PROMPT),
            HumanMessage(content=state['topic'])
        ])
        content = state.get('content', [])
        for q in queries.queries:
            response = tavily.search(query=q, max_results=2)
            for r in response['results']:
                content.append(r['content'])
        return {"content": content}

    2. Writing Node

    Extracts news summaries from the retrieved content.

    def writing_node(state: AgentState):
        content = "\n\n".join(state['content'])
        news = model.with_structured_output(News).invoke([
            SystemMessage(content=WRITER_PROMPT),
            HumanMessage(content=content)
        ])
        return {"drafts": news.news}

    3. Reflection Node

    Critiques the generated summaries against the content.

    def reflection_node(state: AgentState):
        content = "\n\n".join(state['content'])
        critiques = []
        for draft in state['drafts']:
            response = model.invoke([
                SystemMessage(content=CRITIQUE_PROMPT.format(content=content)),
                HumanMessage(content="draft: " + draft)
            ])
            critiques.append(response.content)
        return {"critiques": critiques}

    4. Refinement Node

    Improves the summaries based on critique.

    def refine_node(state: AgentState):
        refined_summaries = []
        for summary, critique in zip(state['drafts'], state['critiques']):
            response = model.invoke([
                SystemMessage(content=REFINE_PROMPT.format(summary=summary)),
                HumanMessage(content="Critique: " + critique)
            ])
            refined_summaries.append(response.content)
        return {"refined_summaries": refined_summaries}

    5. Headlines Generation Node

    Generates a short headline for each news summary.

    def heading_node(state: AgentState):
        headings = []
        for summary in state['refined_summaries']:
            response = model.invoke([
                SystemMessage(content=HEADING_GENERATION_PROMPT),
                HumanMessage(content=summary)
            ])
            headings.append(response.content)
        return {"headings": headings}

    Building the UI with Streamlit

    # Define Streamlit app
    st.title("News Summarization Chatbot")
    
    # Initialize session state
    if "messages" not in st.session_state:
        st.session_state["messages"] = []
    
    # Display past messages
    for message in st.session_state["messages"]:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])
    
    # Input field for user
    user_input = st.chat_input("Ask about the latest news...")
    
    thread = 1
    if user_input:
        st.session_state["messages"].append({"role": "user", "content": user_input})
        with st.chat_message("assistant"):
            loading_text = st.empty()
            loading_text.markdown("*Thinking...*")
    
            builder = StateGraph(AgentState)
            builder.add_node("browser", browsing_node)
            builder.add_node("writer", writing_node)
            builder.add_node("reflect", reflection_node)
            builder.add_node("refine", refine_node)
            builder.add_node("heading", heading_node)
            builder.set_entry_point("browser")
            builder.add_edge("browser", "writer")
            builder.add_edge("writer", "reflect")
            builder.add_edge("reflect", "refine")
            builder.add_edge("refine", "heading")
            graph = builder.compile(checkpointer=memory)
    
            config = {"configurable": {"thread_id": f"{thread}"}}
            for s in graph.stream({"topic": user_input}, config):
                # loading_text.markdown(f"*{st.session_state['loading_message']}*")
                print(s)
            
            s = graph.get_state(config).values
            refined_summaries = s['refined_summaries']
            headings = s['headings']
            thread+=1
            # Display final response
            loading_text.empty()
            response_text = "\n\n".join([f"{h}\n{s}" for h, s in zip(headings, refined_summaries)])
            st.markdown(response_text)
            st.session_state["messages"].append({"role": "assistant", "content": response_text})
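
    To launch the app, save the script (for example as news_agent.py; the file name here is just a placeholder) and start Streamlit from the terminal:

    streamlit run news_agent.py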

    Conclusion

    This tutorial covered the entire process of building an AI-powered news summarization agent with a simple Streamlit UI. Now you can play around with this and make some further improvements like:

    • A better GUI for enhanced user interaction.
    • Incorporating iterative refinement to make sure the summaries are accurate and appropriate.
    • Maintaining context so the conversation about a particular news item can continue.

    Happy coding!


    Open O1: Revolutionizing Open-Source AI with Cutting-Edge Reasoning and Performance (Fri, 14 Feb 2025)

    The Open O1 project is a groundbreaking initiative aimed at matching the powerful capabilities of proprietary models, particularly OpenAI’s O1, through an open-source approach. By leveraging advanced training methodologies and community-driven development, Open O1 seeks to democratize access to state-of-the-art AI models.

    Proprietary AI models like OpenAI’s O1 have demonstrated exceptional capabilities in reasoning, tool use, and mathematical problem-solving. However, these models are closed-source, limiting accessibility and customization for researchers and developers. Existing open-source alternatives often lag behind in performance due to limitations in data quality, training techniques, and computational efficiency.

    The Open O1 project seeks to bridge this gap by curating high-quality Supervised Fine-Tuning (SFT) data for Chain-of-Thought (CoT) Activation, which enhances logical reasoning and problem-solving abilities in smaller models. This innovative approach enables models like LLaMA and Qwen to achieve long-context reasoning capabilities that were previously limited to proprietary systems.

    To achieve performance parity with OpenAI’s O1, the Open O1 team follows a multi-stage approach. First, a specialized O1-style dataset is used to train the models, ensuring high-quality reasoning and contextual understanding. Next, models such as OpenO1-LLaMA-8B and OpenO1-Qwen-7B undergo rigorous Supervised Fine-Tuning (SFT) with optimized hyperparameters for enhanced CoT reasoning. The models incorporate adaptive scaling techniques to maximize efficiency at inference time, allowing for better generalization across tasks. Finally, Open O1 also provides multiple deployment options, including quantized versions for Hugging Face and local infrastructure support.
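
    As a rough sketch of the Hugging Face route, the checkpoints can be loaded with the standard transformers API; the repository id below is a placeholder, so check the project's Hugging Face page for the exact model name.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "OpenO1/OpenO1-LLaMA-8B"  # placeholder id, not verified; use the name listed on the project page
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # device_map requires accelerate

    prompt = "Solve step by step: what is 17 * 24?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))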

    Open O1’s performance has been extensively evaluated against industry benchmarks, demonstrating significant improvements over previous open-source models. The team reports a comparison of LLaMA3.1-8B-Instruct and OpenO1-LLaMA-8B across multiple benchmarks (the comparison table is not reproduced here).

    These results highlight Open O1’s superior performance in mathematical reasoning (MATH), general knowledge understanding (MMLU), and complex reasoning tasks (BBH). Although it slightly trails in Hellaswag, the model’s overall performance demonstrates its potential as a powerful open-source alternative.

    The Open O1 team is committed to continuous innovation and expanding the model’s capabilities. Their plans include enhanced reward model development, introducing a reinforcement learning framework to refine model outputs and reasoning processes, optimizing training pipelines for better scalability and efficiency, and establishing a competitive chatbot arena to benchmark Open O1 against leading models in real-world tasks. Additionally, research into O1-style scaling laws for both training and inference efficiency is underway.

    Built on the principles of transparency, collaboration, and accessibility, Open O1 ensures that AI advancements are not limited to a select few but are available to researchers, developers, and businesses worldwide. And the best part? It is completely open source. With community-driven innovation, rigorous benchmarking, and a commitment to ethical AI, Open O1 is poised to redefine the landscape of large language models. As the project continues to evolve, it promises to bring powerful, accessible, and high-performance AI tools to the global community, ensuring that the future of AI remains open and inclusive.


    Check out the GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.

    The post Open O1: Revolutionizing Open-Source AI with Cutting-Edge Reasoning and Performance appeared first on MarkTechPost.

    ]]>
    https://www.marktechpost.com/2025/02/13/open-o1-revolutionizing-open-source-ai-with-cutting-edge-reasoning-and-performance/feed/ 0 68908
    Building an AI Research Agent for Essay Writing https://www.marktechpost.com/2025/02/11/building-an-ai-research-agent-for-essay-writing/ https://www.marktechpost.com/2025/02/11/building-an-ai-research-agent-for-essay-writing/#respond Tue, 11 Feb 2025 23:52:46 +0000 https://www.marktechpost.com/?p=68830 In this tutorial, we will build an advanced AI-powered research agent that can write essays on given topics. This agent follows a structured workflow: Iterative Refinement: Conducts further research based on critique and revises the essay. The agent will iterate through the reflection and revision process until a set number of improvements are made. Let’s […]

    The post Building an AI Research Agent for Essay Writing appeared first on MarkTechPost.

    ]]>
    In this tutorial, we will build an advanced AI-powered research agent that can write essays on given topics. This agent follows a structured workflow:

    1. Planning: Generates an outline for the essay.
    2. Research: Retrieves relevant documents using Tavily.
    3. Writing: Uses the research to generate the first draft.
    4. Reflection: Critiques the draft for improvements.
    5. Iterative Refinement: Conducts further research based on critique and revises the essay.

    The agent iterates through the reflection and revision process until the configured number of revisions is reached. Let’s dive into the implementation.

    Setting Up the Environment

    We start by installing the required libraries, setting the environment variables, and importing the necessary modules:

    pip install langgraph==0.2.53 langgraph-checkpoint==2.0.6 langgraph-sdk==0.1.36 langchain-groq langchain-community langgraph-checkpoint-sqlite==2.0.1 tavily-python

    import os

    # API keys for Tavily (web search) and Groq (LLM inference)
    os.environ['TAVILY_API_KEY'] = "your_tavily_key"
    os.environ['GROQ_API_KEY'] = "your_groq_key"
    
    from langgraph.graph import StateGraph, END
    from typing import TypedDict, List
    from langchain_core.messages import SystemMessage, HumanMessage
    
    # SQLite-backed checkpointer so the agent's state persists between steps
    from langgraph.checkpoint.sqlite import SqliteSaver
    import sqlite3
    
    sqlite_conn = sqlite3.connect("checkpoints.sqlite", check_same_thread=False)
    memory = SqliteSaver(sqlite_conn)

    Defining the Agent State

    The agent maintains state information, including:

    • Task: The topic of the essay
    • Plan: The generated outline of the essay
    • Draft: The latest draft of the essay
    • Critique: The critique and recommendations generated for the draft in the reflection step
    • Content: The research content extracted from Tavily search results
    • Revision Number: The number of revisions made so far
    • Max Revisions: The maximum number of revisions before the loop ends

    class AgentState(TypedDict):
        task: str
        plan: str
        draft: str
        critique: str
        content: List[str]
        revision_number: int
        max_revisions: int

    Initializing the Language Model

    We use the free Llama model API provided by Groq to generate plans, drafts, critiques, and research queries.

    from langchain_groq import ChatGroq
    
    model = ChatGroq(model="Llama-3.3-70b-Specdec")

    Defining the Prompts

    We define system prompts for each phase of the agent’s workflow (you can play around with these if you want):

    PLAN_PROMPT = """You are an expert writer tasked with creating an outline for an essay.
    Generate a structured outline with key sections and relevant notes."""
    
    WRITER_PROMPT = """You are an AI essay writer. Write a well-structured essay based on the given research.
    Ensure clarity, coherence, and proper argumentation.
    
    ------
    
    {content}"""
    
    REFLECTION_PROMPT = """You are a teacher reviewing an essay draft.
    Provide detailed critique and suggestions for improvement."""
    
    RESEARCH_PLAN_PROMPT = """You are an AI researcher tasked with finding supporting information for an essay topic.
    Generate up to 3 relevant search queries."""
    
    RESEARCH_CRITIQUE_PROMPT = """You are an AI researcher refining an essay based on critique.
    Generate up to 3 search queries to address identified weaknesses."""

    Structuring Research Queries

    We use Pydantic to define the structure of the research queries. This lets us constrain the LLM’s output to a typed schema: a simple list of query strings.

    from pydantic import BaseModel
    
    class Queries(BaseModel):
        queries: List[str]

    Integrating Tavily for Research

    As before, we use Tavily to fetch relevant documents for research-based essay writing.

    from tavily import TavilyClient
    import os
    
    tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

    Implementing the AI Agents

    1. Planning Node

    Generates an essay outline based on the provided topic.

    def plan_node(state: AgentState):
        messages = [
            SystemMessage(content=PLAN_PROMPT),
            HumanMessage(content=state['task'])
        ]
        response = model.invoke(messages)
        return {"plan": response.content}

    2. Research Plan Node

    Generates search queries and retrieves relevant documents.

    def research_plan_node(state: AgentState):
        queries = model.with_structured_output(Queries).invoke([
            SystemMessage(content=RESEARCH_PLAN_PROMPT),
            HumanMessage(content=state['task'])
        ])
        content = state['content'] if 'content' in state else []
        for q in queries.queries:
            response = tavily.search(query=q, max_results=2)
            for r in response['results']:
                content.append(r['content'])
        return {"content": content}

    3. Writing Node

    Uses research content to generate the first essay draft.

    def generation_node(state: AgentState):
        content = "\n\n".join(state['content'] or [])
        user_message = HumanMessage(content=f"{state['task']}\n\nHere is my plan:\n\n{state['plan']}")
        messages = [
            SystemMessage(content=WRITER_PROMPT.format(content=content)),
            user_message
        ]
        response = model.invoke(messages)
        return {"draft": response.content, "revision_number": state.get("revision_number", 1) + 1}

    4. Reflection Node

    Generates a critique of the current draft.

    def reflection_node(state: AgentState):
        messages = [
            SystemMessage(content=REFLECTION_PROMPT),
            HumanMessage(content=state['draft'])
        ]
        response = model.invoke(messages)
        return {"critique": response.content}

    5. Research Critique Node

    Generates additional research queries based on critique.

    def research_critique_node(state: AgentState):
        queries = model.with_structured_output(Queries).invoke([
            SystemMessage(content=RESEARCH_CRITIQUE_PROMPT),
            HumanMessage(content=state['critique'])
        ])
        content = state['content'] or []
        for q in queries.queries:
            response = tavily.search(query=q, max_results=2)
            for r in response['results']:
                content.append(r['content'])
        return {"content": content}

    Defining the Iteration Condition

    We use the revision count to decide whether to keep revising or end the loop, so the agent continues improving the essay until the maximum number of revisions is reached.

    def should_continue(state):
        if state["revision_number"] > state["max_revisions"]:
            return END
        return "reflect"

    Building the Workflow

    We define a state graph to connect the different nodes in the workflow.

    builder = StateGraph(AgentState)
    
    builder.add_node("planner", plan_node)
    builder.add_node("generate", generation_node)
    builder.add_node("reflect", reflection_node)
    builder.add_node("research_plan", research_plan_node)
    builder.add_node("research_critique", research_critique_node)
    
    builder.set_entry_point("planner")
    
    builder.add_conditional_edges("generate", should_continue, {END: END, "reflect": "reflect"})
    
    builder.add_edge("planner", "research_plan")
    builder.add_edge("research_plan", "generate")
    builder.add_edge("reflect", "research_critique")
    builder.add_edge("research_critique", "generate")
    
    graph = builder.compile(checkpointer=memory)

    We can also visualize the graph using:

    #from IPython.display import Image
    #Image(graph.get_graph().draw_mermaid_png())

    Running the AI Essay Writer

    thread = {"configurable": {"thread_id": "1"}}
    for s in graph.stream({
        'task': "What is the difference between LangChain and LangSmith",
        "max_revisions": 2,
        "revision_number": 1,
    }, thread):
        print(s)
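
    After the stream completes, we can optionally read back the final checkpointed state and print the finished essay. The snippet below is a small sketch using LangGraph’s get_state API; the "draft" key matches the AgentState defined earlier.

    # Retrieve the final state from the checkpointer and print the latest draft
    final_state = graph.get_state(thread)
    print(final_state.values["draft"])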

    And we are done. Now go ahead and test it out with different queries and play around with it. In this tutorial, we covered the entire process of creating an AI-powered research and writing agent. You can now experiment with different prompts, research sources, and optimization strategies to enhance performance. Here are some future improvements you can try:

    1. Build a GUI to better visualize how the agent works
    2. Improve the end condition: instead of revising a fixed number of times, stop when the output is satisfactory (e.g., by adding another LLM node to judge the draft, or by putting a human in the loop)
    3. Add support for writing the final essay directly to a PDF

    References:

    1. DeepLearning.ai: https://learn.deeplearning.ai/courses/ai-agents-in-langgraph

    The post Building an AI Research Agent for Essay Writing appeared first on MarkTechPost.

    ]]>
    https://www.marktechpost.com/2025/02/11/building-an-ai-research-agent-for-essay-writing/feed/ 0 68830
    Efficient Alignment of Large Language Models Using Token-Level Reward Guidance with GenARM https://www.marktechpost.com/2025/02/10/efficient-alignment-of-large-language-models-using-token-level-reward-guidance-with-genarm/ https://www.marktechpost.com/2025/02/10/efficient-alignment-of-large-language-models-using-token-level-reward-guidance-with-genarm/#respond Mon, 10 Feb 2025 19:46:30 +0000 https://www.marktechpost.com/?p=68789 Large language models (LLMs) must align with human preferences like helpfulness and harmlessness, but traditional alignment methods require costly retraining and struggle with dynamic or conflicting preferences. Test-time alignment approaches using reward models (RMs) avoid retraining but face inefficiencies due to reliance on trajectory-level rewards, which evaluate full responses rather than guiding token-by-token generation.   Existing […]

    The post Efficient Alignment of Large Language Models Using Token-Level Reward Guidance with GenARM appeared first on MarkTechPost.

    ]]>
    Large language models (LLMs) must align with human preferences like helpfulness and harmlessness, but traditional alignment methods require costly retraining and struggle with dynamic or conflicting preferences. Test-time alignment approaches using reward models (RMs) avoid retraining but face inefficiencies due to reliance on trajectory-level rewards, which evaluate full responses rather than guiding token-by-token generation.  

    Existing alignment techniques fall into two categories: training-time methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), which fine-tune LLMs on preference datasets but demand significant computational resources and lack flexibility for new preferences. Test-time methods use RMs to guide frozen LLMs but rely on trajectory-level RMs that assign a single reward to complete responses. This creates a mismatch during autoregressive generation, where next-token decisions require partial response evaluations. For instance, ARGS approximates token-level rewards by applying trajectory RMs to incomplete responses, leading to inaccuracies since these RMs are trained only on full responses. Other methods like Transfer-Q generate multiple full responses per token candidate, multiplying inference costs. These inefficiencies limit scalability and real-time adaptability.  

    Reference: https://arxiv.org/pdf/2410.08193

    To address these issues, researchers from the University of Maryland, College Park and JPMorgan AI Research propose GenARM (Reward Guided Generation with Autoregressive Reward Model), a test-time alignment framework combining a novel autoregressive RM with guided decoding. The key innovation is the Autoregressive Reward Model, which decomposes trajectory-level rewards into token-level components. Instead of assigning a single reward to a full response, it predicts a reward for each token conditioned on the prior tokens. This provides dense, step-by-step guidance, so rewards directly influence each token choice without the inaccuracy of scoring partial responses.

    During generation, GenARM integrates the autoregressive RM’s token-level rewards with the base LLM’s logits. The next token is sampled from a modified distribution. Unlike prior methods, this requires only one forward pass through the base and reward models per token, avoiding costly candidate expansions.  
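
    To make the decoding rule concrete, here is a small, self-contained sketch (not the authors’ code) of how token-level rewards can reshape the base model’s next-token distribution. The vocabulary, scores, and beta value are illustrative; the combination follows the paper’s KL-regularized formulation, in which the guided distribution is proportional to the base probability times the exponentiated token-level reward.

    import numpy as np

    def softmax(x):
        z = x - x.max()
        e = np.exp(z)
        return e / e.sum()

    # Hypothetical scores over a tiny vocabulary for a single decoding step
    base_logits = np.array([2.0, 1.0, 0.5, -1.0])       # frozen base LLM logits
    token_rewards = np.array([-0.2, -1.5, -0.1, -3.0])  # autoregressive RM token-level (log) rewards
    beta = 1.0                                           # KL-regularization strength (assumed)

    # GenARM-style guided decoding sketch: add scaled token-level rewards to the base logits,
    # re-normalize, and sample the next token (one forward pass of each model per token)
    guided_probs = softmax(base_logits + (1.0 / beta) * token_rewards)
    next_token = int(np.random.choice(len(guided_probs), p=guided_probs))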

    Experiments demonstrate GenARM’s advantages across three scenarios:  

    1. General Human Preference Alignment: On the HH-RLHF dataset, GenARM outperforms test-time baselines like ARGS and Transfer-Q in helpfulness and harmlessness, matching the performance of training-time methods like DPO based on evaluations using GPT-4.

    2. Weak-to-Strong Guidance: A 7B autoregressive RM effectively guides larger base models (13B, 70B) without fine-tuning them. It surpasses DPO at the 7B scale and nearly matches DPO at the 13B scale. At the 70B scale, GenARM recovers more than 70% of the performance gap in both raw and LC win rates between Tulu2-70B and Tulu2-DPO-70B, all without the need to train the 70B LLM, demonstrating that smaller RMs can steer larger LLMs efficiently.  

    3. Multi-Objective Alignment: GenARM balances conflicting preferences (e.g., helpfulness vs. harmlessness) by combining rewards from multiple autoregressive RMs. On the PKU-SafeRLHF-10K dataset, it achieves a Pareto frontier superior to Rewarded Soups and matches multi-objective RL without retraining.
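
    A hedged sketch of this multi-objective combination: token-level rewards from separate RMs (for example, helpfulness and harmlessness) are mixed with user-chosen weights before guiding the base model. All numbers and weights below are illustrative, not values from the paper.

    import numpy as np

    base_logits = np.array([1.5, 0.3, -0.5])
    helpful_rm = np.array([-0.1, -2.0, -0.4])    # token-level rewards from a helpfulness RM (illustrative)
    harmless_rm = np.array([-1.2, -0.2, -0.3])   # token-level rewards from a harmlessness RM (illustrative)
    w_help, w_harm, beta = 0.6, 0.4, 1.0         # user-chosen trade-off weights (assumed)

    # A weighted sum of per-token rewards steers decoding toward the chosen preference trade-off
    combined_logits = base_logits + (1.0 / beta) * (w_help * helpful_rm + w_harm * harmless_rm)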

    The autoregressive RM’s design ensures it can express any reward function achievable by traditional RMs within the KL-regularized reinforcement learning framework. This theoretical guarantee, combined with token-level factorization, makes GenARM both expressive and efficient. Unlike trajectory-level RMs, which struggle with partial contexts, autoregressive RMs provide accurate, incremental feedback, preventing reward hacking or incoherent outputs during long generations.  

    In summary, GenARM bridges the gap between training-time and test-time alignment by introducing autoregressive reward models that enable precise, token-level guidance. It eliminates the need for costly LLM retraining, supports dynamic adaptation to diverse preferences, and efficiently scales to larger models. By addressing the inefficiencies of trajectory-level rewards and enabling weak-to-strong guidance, GenARM offers a practical solution for aligning LLMs in resource-constrained scenarios. Future work could extend this approach to tasks like mathematical reasoning or code generation, where token-level rewards might enhance performance without additional fine-tuning.  


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Efficient Alignment of Large Language Models Using Token-Level Reward Guidance with GenARM appeared first on MarkTechPost.

    ]]>
    https://www.marktechpost.com/2025/02/10/efficient-alignment-of-large-language-models-using-token-level-reward-guidance-with-genarm/feed/ 0 68789