Scaling Reinforcement Learning Beyond Math: Researchers from NVIDIA AI and CMU Propose Nemotron-CrossThink for Multi-Domain Reasoning with Verifiable Reward Modeling

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities across diverse tasks, with Reinforcement Learning (RL) serving as a crucial mechanism for refining their deep thinking abilities. While RL techniques have shown particular success in mathematical reasoning and coding domains with well-defined rules and verifiable correctness criteria, extending these approaches to broader reasoning contexts presents significant challenges, including limited training data and difficulties in ensuring cross-domain generalisation.

Evolution of Reasoning in LLMs

The development of Chain-of-Thought (CoT) methodology marked a significant advancement in LLM reasoning capabilities. CoT has demonstrated substantial improvements across mathematics, science, and programming domains by incorporating multi-step intermediate reasoning processes before reaching conclusions. This approach allows models to break down complex problems into manageable steps, mirroring human problem-solving processes.

While mathematical reasoning has dominated recent research due to its verifiable nature, the expansion of RL training to diverse domains remains largely unexplored. Prior research works suggest that blending mathematical content with other verifiable domains can improve performance on broad reasoning benchmarks. However, systematic investigation into how non-mathematical reasoning data, such as legal analysis, social science, or historical interpretation, impacts RL training effectiveness still represents a significant research gap.

Challenges in Diversifying Reasoning Domains

Recent research has explored methods for diversifying RL training datasets, yet questions about optimal data-blending strategies and the relative importance of various sources remain unanswered. A fundamental challenge in applying RL to general reasoning tasks is developing verifiable reward models for domains lacking deterministic solutions. Domain-specific reasoning processes—whether rule-based and symbolic in mathematics or contextual and heuristic in fields like law and history—require different cognitive approaches. In addition to that, question formats (open-ended versus multiple-choice) demand distinct reasoning strategies, suggesting that incorporating diverse reasoning domains could significantly enhance LLMs’ broad cognitive capabilities.

Nemotron-CrossThink: A Multi-Domain Approach

Researchers from NVIDIA, Carnegie Mellon University, and Boston University introduce Nemotron-CrossThink, representing a systematic framework for incorporating multi-domain corpora into RL training to enhance cross-task generalisation. The methodology follows a comprehensive pipeline that curates diverse data sources, including synthetic data from CommonCrawl and open-source question-answer pairs across STEM, humanities, law, and social sciences. By applying templated formats (MCQ/Open-Ended) to constrain answer spaces, filtering samples for verifiable rewards, and implementing strategic data-blending recipes, the framework enables effective self-learning through RL across diverse reasoning domains.

Key Results and Innovations

Nemotron-CrossThink significantly enhances LLM reasoning capabilities by integrating multi-domain data with different question formats. Models trained with this approach demonstrate not only higher accuracy but also dynamic response strategies—generating concise answers for general-purpose questions while providing detailed responses for mathematical problems—thereby optimising inference costs while maintaining task-specific precision.

The framework addresses the challenge of verifiable rewards in non-deterministic domains through templated data curation that limits answer space diversity. It also provides an efficient filtering approach that ranks general-purpose reasoning data by complexity, showing that training with more challenging samples amplifies RL impact across all domains. These innovations have led to substantial performance gains in both mathematical benchmarks (MATH-500: +30.1%, AMC23: +27.5%) and non-mathematical tasks (MMLU-PRO: +12.8%, GPQA-DIAMOND: +11.3%).

Comprehensive Data Curation

Nemotron-CrossThink begins with meticulous data curation from multiple sources to ensure diversity. The training dataset combines synthetically generated data from CommonCrawl and publicly available open-source QA datasets, encompassing both general-purpose reasoning and mathematical content. General-purpose reasoning data includes MMLU, Natural Reasoning, and synthesised QA pairs spanning STEM fields, economics, social sciences, and humanities, while mathematical reasoning incorporates datasets like MATH and Numina-Math alongside synthetically generated problems.

Template Application and Data Filtering

To address the challenge of verifiable rewards in non-mathematical domains, the framework applies specific templates to structure question-answer formats: Multiple Choice Questions (MCQ) and Open-Ended questions. This approach exposes the model to diverse answer formats and reasoning pathways while limiting answer space variability to enable effective reward modeling. Rigorous filtering removes samples that are infeasible to evaluate with rule-based reward functions, discarding MCQs where correct answers aren’t among the choices and open-ended responses exceeding ten words.
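
To make these filtering rules concrete, the short Python sketch below illustrates this kind of rule-based check. The field names (q_type, question, choices, answer) are assumptions made for illustration only and do not reflect the paper's released code or dataset schema.

# Illustrative sketch of rule-based sample filtering for verifiable rewards.
# Field names are hypothetical, not the actual Nemotron-CrossThink schema.

def is_verifiable(sample: dict) -> bool:
    """Keep only samples whose answers a rule-based reward function can check."""
    if sample["q_type"] == "mcq":
        # Discard MCQs whose gold answer is not among the listed choices.
        return sample["answer"] in sample["choices"]
    if sample["q_type"] == "open_ended":
        # Discard open-ended answers longer than ten words.
        return len(sample["answer"].split()) <= 10
    return False

raw_samples = [
    {"q_type": "mcq", "question": "Which gas do plants absorb?",
     "choices": ["Oxygen", "Carbon dioxide", "Nitrogen"], "answer": "Carbon dioxide"},
    {"q_type": "mcq", "question": "Capital of France?",
     "choices": ["Berlin", "Madrid"], "answer": "Paris"},           # dropped: answer not a choice
    {"q_type": "open_ended", "question": "Name the largest planet.",
     "answer": "Jupiter"},                                          # kept: short and checkable
    {"q_type": "open_ended", "question": "Explain photosynthesis.",
     "answer": " ".join(["word"] * 40)},                            # dropped: too long to verify
]

filtered = [s for s in raw_samples if is_verifiable(s)]
print(f"Kept {len(filtered)} of {len(raw_samples)} samples")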

Strategic Data Blending and Reinforcement Learning

Nemotron-CrossThink employs Group Relative Policy Optimisation (GRPO) for reinforcement learning, which improves efficiency by estimating baselines from group scores rather than using a separate critic model. The methodology investigates the impact of diverse data sources, question types, and data usefulness through six distinct blending recipes. This systematic approach enables detailed analysis of how general-purpose reasoning data complements mathematical reasoning, ultimately producing more adaptable and generalizable language models.
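
For readers unfamiliar with GRPO, the toy Python sketch below shows the group-relative advantage computation at its core, where group-normalized rewards replace a learned critic as the baseline. The numbers are illustrative, and this is not the authors' training code.

import numpy as np

# Toy illustration of GRPO's group-relative baseline: sample a group of responses for one
# prompt, score each with the verifiable reward, and normalize within the group.
def group_relative_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, eight sampled responses scored by a rule-based reward (1 = correct).
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
# Correct responses receive positive advantages, incorrect ones negative; each response's
# tokens are then weighted by its advantage in the policy-gradient update.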

Technical Contributions

The research demonstrates several key technical advances in multi-domain reasoning through reinforcement learning:

  1. Templated question-answer formats provide more stable reward modeling, with unified open-ended question formats improving performance by 1.21% over mixed formats, and short-form answer templates outperforming long-form ones by 1.20%.
  2. Strategic data-blending proves essential, with multi-domain corpora boosting average reasoning accuracy by 1.61% compared to math-only training while reducing token usage by 28%.
  3. Model-driven filtering techniques effectively select challenging samples by removing those solvable by smaller models, yielding an additional 2.15% accuracy gain for Qwen-2.5-32B.

These findings represent significant progress in developing LLMs with robust reasoning capabilities across diverse domains, moving beyond the traditional focus on mathematical reasoning to encompass the full spectrum of human knowledge and inference patterns.

Experiments and Results

Experimental results demonstrate that different datasets significantly impact model performance across reasoning benchmarks. NuminaMath produced the highest overall average, outperforming the baseline by 8.30%, with particular strength in mathematical tasks while also generalizing well across diverse domains. Synthetic question-answering data improved performance by approximately 1.0%, showing strong accuracy in MMLU-PRO, AGIEVAL, and MATH-500 tasks, confirming that synthetically generated instruction-style data can effectively generalize when aligned with benchmark distributions.

The Nemotron-CrossThink approach consistently outperformed the base model across various blending strategies. The general-purpose reasoning blend (Bgpr↑) achieved the highest overall average, exceeding OPEN-REASONER-ZERO by approximately 5% on average and showing substantial gains on reasoning-focused benchmarks (+12.82% on MMLU-PRO, +15.12% on AGIEVAL). Though Bonly_math performed slightly better on strictly mathematical tasks, it lagged on non-mathematical reasoning benchmarks, demonstrating Bgpr↑’s superior versatility through strong cross-domain transfer.

Further analysis revealed that open-ended question formats (Bopen↑) yielded stronger results on mathematical benchmarks than multiple-choice formats (Bmcq↑), suggesting alignment with the inherently open-ended structure of mathematical problems. Mathematical reasoning data showed transferability to structured reasoning tasks, while general-purpose data proved less effective in isolation. This counterintuitive finding confirms that optimal general-purpose reasoning performance requires including mathematical problems in training blends.

Conclusion

Nemotron-CrossThink introduces a scalable framework that enhances LLM generalization through reinforcement learning with multi-domain corpora. By strategically blending diverse reasoning data with a 2:1 ratio of general-purpose to mathematical content, the approach achieves a remarkable 13.36% average improvement over baselines. The research demonstrates that data diversity, not merely volume, drives broader reasoning capabilities. Through difficulty-based filtering and thoughtful template design, Nemotron-CrossThink establishes a practical methodology for developing more generalizable, efficient, and reliable LLMs that extend self-learning beyond mathematical reasoning.


Check out the Paper and Project Page.

Vision Foundation Models: Implementation and Business Applications

In this tutorial, we’ll explore implementing various vision foundation models for business applications. We’ll focus on practical code implementation, technical details, and business use cases rather than theoretical aspects.

Setup and Environment Configuration

First, let’s set up our environment and install the necessary libraries:

!pip install torch torchvision transformers timm pillow matplotlib opencv-python tensorflow-hub tensorflow
!pip install huggingface_hub sentence-transformers ftfy regex tqdm
!pip install accelerate

# Verify CUDA availability for GPU acceleration

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
   print(f"CUDA device: {torch.cuda.get_device_name(0)}")

1. CLIP: Contrastive Language-Image Pre-training

CLIP by OpenAI excels at connecting images with natural language, making it powerful for zero-shot image classification and retrieval tasks.

Business Applications:

  • Product image search and recommendation
  • Content moderation
  • Visual brand monitoring
  • Cross-modal retrieval systems
import torch
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
import matplotlib.pyplot as plt
import numpy as np


# Load model and processor
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)


# Function to get image embeddings
def get_clip_image_embedding(image_path):
   image = Image.open(image_path) if isinstance(image_path, str) else image_path
   inputs = processor(images=image, return_tensors="pt")
   with torch.no_grad():
       image_features = model.get_image_features(**inputs)
   return image_features


# Function to perform zero-shot classification
def classify_image_with_clip(image_path, categories):
   image = Image.open(image_path) if isinstance(image_path, str) else image_path
   inputs = processor(
       text=categories,
       images=image,
       return_tensors="pt",
       padding=True
   )


   with torch.no_grad():
       outputs = model(**inputs)
       logits_per_image = outputs.logits_per_image
       probs = logits_per_image.softmax(dim=1)


   # Return dict of categories and probabilities
   return {categories[i]: probs[0][i].item() for i in range(len(categories))}


# Example: Product categorization
url = "https://images.unsplash.com/photo-1542291026-7eec264c27ff?q=80&w=1470&auto=format&fit=crop"
image = Image.open(requests.get(url, stream=True).raw)


product_categories = [
   "sneakers", "formal shoes", "sandals", "boots",
   "sports equipment", "casual wear", "luxury item"
]


results = classify_image_with_clip(image, product_categories)


# Sort results by probability
sorted_results = dict(sorted(results.items(), key=lambda x: x[1], reverse=True))


# Display the image and classification results
plt.figure(figsize=(12, 6))


# Plot the image on the left
plt.subplot(1, 2, 1)
plt.imshow(np.array(image))
plt.title("Input Image")
plt.axis("off")


# Plot the classification results on the right
plt.subplot(1, 2, 2)
categories = list(sorted_results.keys())
scores = list(sorted_results.values())


y_pos = np.arange(len(categories))
plt.barh(y_pos, scores, align="center")
plt.yticks(y_pos, categories)
plt.xlabel("Probability")
plt.title("CLIP Classification Results")


plt.tight_layout()
plt.show()


# Also print results to console
print("Classification Results:")
for category, score in sorted_results.items():
   print(f"{category}: {score:.4f}")
Output

2. DINO v2: Self-supervised Vision Transformer

DINO v2 by Meta AI Research provides powerful visual features without requiring labeled data, making it excellent for various downstream tasks.

Business Applications:

  • Visual similarity search
  • Anomaly detection
  • Product clustering
  • Image feature extraction for downstream ML tasks
import torch
import torchvision.transforms as T
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from torch.nn import functional as F
import requests
from io import BytesIO


# Load DINOv2 model
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vits14.eval()


# Preprocess images for DINOv2
transform = T.Compose([
   T.Resize(256),
   T.CenterCrop(224),
   T.ToTensor(),
   T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


# Function to extract features
def extract_dinov2_features(image_path):
   image = Image.open(image_path).convert('RGB') if isinstance(image_path, str) else image_path
   img_tensor = transform(image).unsqueeze(0)


   with torch.no_grad():
       features = dinov2_vits14(img_tensor)


   return features


# Function to compute similarity between images
def compute_similarity(img1_path, img2_path):
   feat1 = extract_dinov2_features(img1_path)
   feat2 = extract_dinov2_features(img2_path)


   # Normalize features
   feat1 = F.normalize(feat1, dim=1)
   feat2 = F.normalize(feat2, dim=1)


   # Compute cosine similarity
   similarity = torch.mm(feat1, feat2.transpose(0, 1)).item()
   return similarity


# Function to download image from URL
def download_image(url):
   response = requests.get(url, stream=True)
   return Image.open(BytesIO(response.content)).convert('RGB')


# Function to visualize image pair with similarity score
def visualize_similarity(img1_path, img2_path, title=None):
   # Load images
   if img1_path.startswith(('http://', 'https://')):
       img1 = download_image(img1_path)
   else:
       img1 = Image.open(img1_path).convert('RGB')


   if img2_path.startswith(('http://', 'https://')):
       img2 = download_image(img2_path)
   else:
       img2 = Image.open(img2_path).convert('RGB')


   # Compute similarity
   similarity = compute_similarity(img1, img2)


   # Create figure for visualization
   fig, axes = plt.subplots(1, 2, figsize=(12, 6))


   # Display images
   axes[0].imshow(np.array(img1))
   axes[0].set_title("Image 1")
   axes[0].axis("off")


   axes[1].imshow(np.array(img2))
   axes[1].set_title("Image 2")
   axes[1].axis("off")


   # Add similarity score as figure title
   fig_title = f"Similarity Score: {similarity:.4f}"
   if title:
       fig_title = f"{title}\n{fig_title}"
   fig.suptitle(fig_title, fontsize=16)


   plt.tight_layout()
   plt.show()


   return similarity


# Example: Use direct URLs instead of downloading files first
# Sample sneaker images from Unsplash
url1 = "https://images.unsplash.com/photo-1560769629-975ec94e6a86?w=500"  # Red sneaker
url2 = "https://images.unsplash.com/photo-1600185365926-3a2ce3cdb9eb?w=500"  # White sneaker
url3 = "https://images.unsplash.com/photo-1491553895911-0055eca6402d?w=500"  # Another sneaker


# Visualize pairs with similarity scores
print("Comparing Product 1 and Product 2:")
similarity_1_2 = visualize_similarity(url1, url2, "Red Sneaker vs White Sneaker")


print("nComparing Product 1 and Product 3:")
similarity_1_3 = visualize_similarity(url1, url3, "Red Sneaker vs Another Sneaker")


print("nComparing Product 2 and Product 3:")
similarity_2_3 = visualize_similarity(url2, url3, "White Sneaker vs Another Sneaker")


# Print summary of all similarities
print("nSummary of Similarity Scores:")
print(f"Similarity between product 1 and 2: {similarity_1_2:.4f}")
print(f"Similarity between product 1 and 3: {similarity_1_3:.4f}")
print(f"Similarity between product 2 and 3: {similarity_2_3:.4f}")
Output

3. Segment Anything Model (SAM): Advanced Image Segmentation

SAM by Meta AI provides powerful zero-shot segmentation capabilities for various business applications.

Business Applications:

  • Automated image cataloging
  • Precise product measurement in retail
  • Medical image analysis
  • Agricultural crop monitoring
  • Content creation and editing

# Install required libraries for SAM
!pip install git+https://github.com/facebookresearch/segment-anything.git


import torch
import numpy as np
import matplotlib.pyplot as plt
from segment_anything import sam_model_registry, SamPredictor
import cv2
from PIL import Image
import requests


# Download SAM checkpoint
!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth


# Load SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
device = "cuda" if torch.cuda.is_available() else "cpu"
sam.to(device)
predictor = SamPredictor(sam)


# Function to perform automatic segmentation
def segment_image(image_path):
   # Load image
   image = cv2.imread(image_path)
   image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)


   # Set image for SAM
   predictor.set_image(image_rgb)


   # Generate automatic masks
   masks, scores, logits = predictor.predict(
       point_coords=None,
       point_labels=None,
       multimask_output=True,
       box=None
   )


   return image_rgb, masks, scores


# Function to visualize segmentation results
def visualize_segmentation(image, masks, scores, limit=5):
   plt.figure(figsize=(15, 10))


   # Display original image
   plt.subplot(1, limit+1, 1)
   plt.imshow(image)
   plt.title("Original Image")
   plt.axis('off')


   # Display top masks
   top_indices = np.argsort(scores)[-limit:][::-1]
   for i, idx in enumerate(top_indices):
       plt.subplot(1, limit+1, i+2)
       plt.imshow(image)
       plt.imshow(masks[idx], alpha=0.7, cmap='jet')
       plt.title(f"Mask {i+1}nScore: {scores[idx]:.3f}")
       plt.axis('off')


   plt.tight_layout()
   plt.show()


# Example: Product segmentation for e-commerce
!wget -q -O product_image.jpg "https://images.unsplash.com/photo-1525966222134-fcfa99b8ae77?w=800"


image_rgb, masks, scores = segment_image("product_image.jpg")
visualize_segmentation(image_rgb, masks, scores)


# Business application: Calculate precise product measurements
def calculate_object_dimensions(mask):
   # Find contours in the mask
   contours, _ = cv2.findContours((mask * 255).astype(np.uint8),
                                  cv2.RETR_EXTERNAL,
                                  cv2.CHAIN_APPROX_SIMPLE)


   if not contours:
       return None


   # Get the largest contour
   largest_contour = max(contours, key=cv2.contourArea)


   # Get bounding rectangle
   x, y, w, h = cv2.boundingRect(largest_contour)


   # Calculate aspect ratio
   aspect_ratio = w / h


   # Calculate area in pixels
   area_pixels = cv2.contourArea(largest_contour)


   return {
       'width': w,
       'height': h,
       'aspect_ratio': aspect_ratio,
       'area_pixels': area_pixels
   }


# Apply to the highest scoring mask
best_mask_idx = np.argmax(scores)
dimensions = calculate_object_dimensions(masks[best_mask_idx])


print("Product Dimensions:")
print(f"Width: {dimensions['width']} pixels")
print(f"Height: {dimensions['height']} pixels")
print(f"Aspect Ratio: {dimensions['aspect_ratio']:.2f}")
print(f"Area: {dimensions['area_pixels']} square pixels")
Output

4. BLIP-2: Vision-Language Model for Business Intelligence

BLIP-2 provides advanced vision-language capabilities for multimodal business applications.

Business Applications:

  • Automated product description generation
  • Image-based customer service automation
  • Visual content analysis for marketing
  • Social media content understanding
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
import numpy as np
from io import BytesIO


# Load BLIP-2 model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)


if torch.cuda.is_available():
   model = model.to("cuda")


# Function to download image from URL
def download_image(url):
   response = requests.get(url, stream=True)
   return Image.open(BytesIO(response.content)).convert('RGB')


# Function for image captioning
def generate_caption(image_path):
   # Load image from path or URL
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   inputs = processor(images=image, return_tensors="pt")


   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}


   generated_ids = model.generate(**inputs, max_new_tokens=50)
   generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()


   return generated_text


# Function for visual question answering
def visual_qa(image_path, question):
   # Load image from path or URL
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # FIX: Properly format the question for the model
   # BLIP-2 needs a specific prompt format for QA
   prompt = f"Question: {question} Answer:"
   inputs = processor(images=image, text=prompt, return_tensors="pt")


   if torch.cuda.is_available():
       inputs = {k: v.to("cuda") for k, v in inputs.items()}


   generated_ids = model.generate(
       **inputs,
       max_new_tokens=30,
       do_sample=False  # Use greedy decoding for more precise answers
   )


   answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
   # Remove the prompt part from the answer
   answer = answer.replace(prompt, "").strip()


   return answer


# Function to visualize image with caption and QA
def visualize_product_analysis(image_path, questions=None):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Generate caption
   caption = generate_caption(image)


   # Default questions if none provided
   if questions is None:
       questions = [
           "What color is this product?",
           "What material is this product made of?",
           "What is the target demographic for this product?",
           "What is a key feature of this product?"
       ]


   # Get answers
   answers = []
   for question in questions:
       answer = visual_qa(image, question)
       answers.append((question, answer))


   # Create visualization
   plt.figure(figsize=(12, 10))


   # Display image
   plt.subplot(2, 1, 1)
   plt.imshow(np.array(image))
   plt.title("Product Image", fontsize=14)
   plt.axis('off')


   # Display caption and Q&A
   plt.subplot(2, 1, 2)
   plt.axis('off')


   text_content = f"Generated Description: {caption}nn"
   text_content += "Product Analysis:n"
   for q, a in answers:
       text_content += f"Q: {q}nA: {a}nn"


   plt.text(0.01, 0.99, text_content, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top', wrap=True)


   plt.tight_layout()
   plt.show()


   return caption, answers


# Business application: Automated product listing
def create_product_listing(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Get basic caption
   caption = generate_caption(image)


   # Extract product attributes with more specific prompting
   color = visual_qa(image, "What colors are visible in this product?")
   material = visual_qa(image, "What material does this product appear to be made of?")
   use_case = visual_qa(image, "What would be the main use case for this product?")
   unique_features = visual_qa(image, "What are any unique or notable features of this product?")


   # Create structured listing
   listing = {
       "title": caption,
       "attributes": {
           "color": color,
           "material": material,
           "primary_use": use_case,
           "unique_features": unique_features
       }
   }


   # Visualize the listing
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Product Image", fontsize=14)
   plt.axis('off')


   # Display listing details
   plt.subplot(1, 2, 2)
   plt.axis('off')


   listing_text = f"PRODUCT LISTINGnn"
   listing_text += f"Title: {listing['title']}nn"
   listing_text += "Product Attributes:n"
   for attr, value in listing['attributes'].items():
       listing_text += f"{attr.replace('_', ' ').title()}: {value}n"


   plt.text(0.01, 0.99, listing_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return listing


# Function for marketing content analysis
def analyze_marketing_content(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Marketing-specific questions
   marketing_questions = [
       "What emotions does this image evoke?",
       "What brand values are communicated in this image?",
       "What target audience would this image appeal to?",
       "What call to action would pair well with this image?",
       "What marketing channel would this image be most effective on?"
   ]


   # Get answers
   marketing_insights = {}
   for question in marketing_questions:
       answer = visual_qa(image, question)
       key = question.split("?")[0].strip().lower().replace(" ", "_")
       marketing_insights[key] = answer


   # Visualize the analysis
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Marketing Visual", fontsize=14)
   plt.axis('off')


   # Display marketing insights
   plt.subplot(1, 2, 2)
   plt.axis('off')


   insights_text = "MARKETING CONTENT ANALYSISnn"
   for question, key in zip(marketing_questions, marketing_insights.keys()):
       insights_text += f"{question}n{marketing_insights[key]}nn"


   plt.text(0.01, 0.99, insights_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return marketing_insights


# Function for social media understanding
def analyze_social_media_content(image_path):
   # Load image
   if isinstance(image_path, str):
       if image_path.startswith(('http://', 'https://')):
           image = download_image(image_path)
       else:
           image = Image.open(image_path).convert('RGB')
   else:
       image = image_path


   # Generate caption
   caption = generate_caption(image)


   # Social media specific analysis
   engagement_potential = visual_qa(image, "How likely is this image to engage viewers on social media?")
   suggested_hashtags = visual_qa(image, "What hashtags would be appropriate for this image on social media?")
   platform_fit = visual_qa(image, "Which social media platform would this image perform best on?")
   content_type = visual_qa(image, "What type of social media post would this image be suitable for?")


   # Create analysis dict
   social_analysis = {
       "caption": caption,
       "engagement_potential": engagement_potential,
       "suggested_hashtags": suggested_hashtags,
       "platform_fit": platform_fit,
       "content_type": content_type
   }


   # Visualize the analysis
   plt.figure(figsize=(14, 10))


   # Display image
   plt.subplot(1, 2, 1)
   plt.imshow(np.array(image))
   plt.title("Social Media Content", fontsize=14)
   plt.axis('off')


   # Display social media insights
   plt.subplot(1, 2, 2)
   plt.axis('off')


   insights_text = "SOCIAL MEDIA CONTENT ANALYSISnn"
   insights_text += f"Caption: {social_analysis['caption']}nn"
   insights_text += f"Engagement Potential: {social_analysis['engagement_potential']}nn"
   insights_text += f"Suggested Hashtags: {social_analysis['suggested_hashtags']}nn"
   insights_text += f"Best Platform: {social_analysis['platform_fit']}nn"
   insights_text += f"Content Type: {social_analysis['content_type']}n"


   plt.text(0.01, 0.99, insights_text, transform=plt.gca().transAxes,
            fontsize=12, verticalalignment='top')


   plt.tight_layout()
   plt.show()


   return social_analysis


# Example usage
if __name__ == "__main__":
   # Example: E-commerce product analysis
   product_url = "https://images.unsplash.com/photo-1598033129183-c4f50c736f10?w=800"


   print("1. Basic Product Analysis")
   caption, qa_results = visualize_product_analysis(product_url)


   print("n2. Creating Automated Product Listing")
   product_listing = create_product_listing(product_url)


   print("n3. Marketing Content Analysis")
   marketing_url = "https://images.unsplash.com/photo-1581252584837-9f0b1d3bf82c?ixlib=rb-4.0.3&q=80"
   marketing_insights = analyze_marketing_content(marketing_url)


   print("n4. Social Media Content Analysis")
   social_url = "https://images.unsplash.com/photo-1534442072653-dbbf80c5e1ae?ixlib=rb-4.0.3&q=80"
   social_analysis = analyze_social_media_content(social_url)
Output 1
Output 2

Conclusion

This tutorial provides hands-on implementation guidance for deploying four key computer vision foundation models into business applications: CLIP (zero-shot classification), DINO v2 (self-supervised learning), SAM (image segmentation), and BLIP-2 (vision-language tasks). Future experimentation could explore model ensemble techniques, fine-tuning on domain-specific datasets, edge deployment optimization, and integration with business intelligence platforms to maximize ROI on vision AI investments.


Check out the Notebook here.

LLMs Can Now Reason in Parallel: UC Berkeley and UCSF Researchers Introduce Adaptive Parallel Reasoning to Scale Inference Efficiently Without Exceeding Context Windows

Large language models (LLMs) have made significant strides in reasoning capabilities, exemplified by breakthrough systems like OpenAI o1 and DeepSeekR1, which utilize test-time compute for search and reinforcement learning to optimize performance. Despite this progress, current methodologies face critical challenges that impede their effectiveness. Serialized chain-of-thought approaches generate excessively long output sequences, increasing latency and pushing against context window constraints. In contrast, parallel methods such as best-of-N and self-consistency suffer from poor coordination between inference paths and lack end-to-end optimization, resulting in computational inefficiency and limited improvement potential. Also, structured inference-time search techniques like tree-of-thought rely on manually designed search structures, significantly restricting their flexibility and ability to scale across different reasoning tasks and domains.

Several approaches have emerged to address the computational challenges in LLM reasoning. Inference-time scaling methods have improved downstream task performance by increasing test-time computation, but typically generate significantly longer output sequences. This creates higher latency and forces models to fit entire reasoning chains into a single context window, making it difficult to attend to relevant information. Parallelization strategies like ensembling have attempted to mitigate these issues by running multiple independent language model calls simultaneously. However, these methods suffer from poor coordination across parallel threads, leading to redundant computation and inefficient resource utilization. Fixed parallelizable reasoning structures, such as tree-of-thought and multi-agent reasoning systems, have been proposed, but their hand-designed search structures limit flexibility and scalability. Other approaches, like PASTA decompose tasks into parallel sub-tasks but ultimately reintegrate the complete context into the main inference trajectory, failing to reduce context usage effectively. Meanwhile, Hogwild! Inference employs parallel worker threads but relies exclusively on prompting without end-to-end optimization.

Researchers from UC Berkeley and UCSF have proposed Adaptive Parallel Reasoning (APR). This robust approach enables language models to dynamically distribute inference-time computation across both serial and parallel operations. This methodology generalizes existing reasoning approaches—including serialized chain-of-thought reasoning, parallelized inference with self-consistency, and structured search—by training models to determine when and how to parallelize inference operations rather than imposing fixed search structures. APR introduces two key innovations: a parent-child threading mechanism and end-to-end reinforcement learning optimization. The threading mechanism allows parent inference threads to delegate subtasks to multiple child threads through a spawn() operation, enabling parallel exploration of distinct reasoning paths. Child threads then return outcomes to the parent thread via a join() operation, allowing the parent to continue decoding with this new information. Built on the SGLang model serving framework, APR significantly reduces real-time latency by performing inference in child threads simultaneously through batching. The second innovation—fine-tuning via end-to-end reinforcement learning—optimizes for overall task success without requiring predefined reasoning structures. This approach delivers three significant advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and improved performance at equivalent latency compared to traditional methods.

The APR architecture implements a sophisticated multi-threading mechanism that enables language models to dynamically orchestrate parallel inference processes. APR addresses the limitations of serialized reasoning methods by distributing computation across parent and child threads, minimizing latency while improving performance within context constraints. The architecture consists of three key components:

First, the multi-threading inference system allows parent threads to spawn multiple child threads using a spawn(msgs) operation. Each child thread receives a distinct context and executes inference independently, yet simultaneously using the same language model. When a child thread completes its task, it returns results to the parent via a join(msg) operation, selectively communicating only the most relevant information. This approach significantly reduces token usage by keeping intermediate search traces confined to child threads.
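
As a purely schematic illustration of this control flow (not the actual SGLang-based implementation), the Python sketch below mimics spawn() and join() with an ordinary thread pool; the stand-in child_inference function and everything beyond the spawn/join idea are assumptions.

import concurrent.futures

# Schematic sketch of APR's parent-child control flow using a plain thread pool.
# In the real system each child is a batched LLM decode served by SGLang; here the
# "model" is a placeholder function used only to show the spawn/join pattern.
def child_inference(subtask: str) -> str:
    # Stand-in for a child thread decoding its own context independently.
    return f"partial result for '{subtask}'"

def spawn(subtasks):
    # Parent delegates subtasks to children that execute in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(child_inference, t) for t in subtasks]
        return [f.result() for f in futures]   # join(): only selected results flow back

def parent_inference(question: str) -> str:
    # The parent explores several branches, then continues decoding with the joined
    # summaries instead of the children's full search traces, saving context tokens.
    branches = [f"{question} / branch {i}" for i in range(3)]
    joined = spawn(branches)
    return " | ".join(joined)

print(parent_inference("reach 24 from the numbers (3, 5, 7, 11)"))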

Second, the training methodology employs a two-phase approach. Initially, APR utilizes supervised learning with automatically-generated demonstrations that incorporate both depth-first and breadth-first search strategies, creating hybrid search patterns. The symbolic solver creates demonstrations with parallelization, decomposing searches into multiple components that avoid context window bottlenecks during both training and inference.

Finally, the system implements end-to-end reinforcement learning optimization with GRPO (Group Relative Policy Optimization). During this phase, the model learns to strategically determine when and how broadly to invoke child threads, optimizing for computational efficiency and reasoning effectiveness. The model iteratively samples reasoning traces, evaluates their correctness, and adjusts parameters accordingly, ultimately learning to balance parallel exploration against context window constraints for maximum performance.

The evaluation compared Adaptive Parallel Reasoning against serialized chain-of-thought reasoning and self-consistency methods using a standard decoder-only language model with 228M parameters built on the Llama2 architecture and supporting a 4,096-token context window. All models were initialized through supervised learning on 500,000 trajectories from symbolic solvers. For direct compute-accuracy assessment, the team implemented a budget constraint method with context-window conditioning for SoS+ models and thread count conditioning for APR models. The SGLang framework was utilized for inference due to its support for continuous batching and radix attention, enabling efficient APR implementation.

Experimental results demonstrate that APR consistently outperforms serialized methods across multiple dimensions. When scaling with higher compute, APR initially underperforms in low-compute regimes due to parallelism overhead but significantly outpaces SoS+ as compute increases, achieving a 13.5% improvement at 20k tokens and surpassing SoS+ pass@8 performance while using 57.4% less compute. For context window scaling, APR consistently exploits context more efficiently, with 10 threads achieving approximately 20% higher accuracy at the 4k-token limit by distributing reasoning across parallel threads rather than containing entire traces within a single context window.

End-to-end reinforcement learning significantly enhances APR performance, boosting accuracy from 75.5% to 83.4%. The RL-optimized models demonstrate markedly different behaviors, increasing both sequence length (22.1% relative increase) and number of child threads (34.4% relative increase). This reveals that for Countdown tasks, RL-optimized models favor broader search patterns over deeper ones, demonstrating the algorithm’s ability to discover optimal search strategies autonomously.

APR demonstrates superior efficiency in both theoretical and practical evaluations. When measuring sequential token usage, APR significantly boosts accuracy with minimal additional sequential tokens beyond 2,048, rarely exceeding 2,500 tokens, while SoS+ shows only marginal improvements despite approaching 3,000 tokens. Real-world latency testing on an 8-GPU NVIDIA RTX A6000 server reveals APR achieves substantially better accuracy-latency trade-offs, reaching 75% accuracy at 5000ms per sample—an 18% absolute improvement over SoS+’s 57%. These results highlight APR’s effective hardware parallelization and potential for optimized performance in deployment scenarios.

Adaptive Parallel Reasoning represents a significant advancement in language model reasoning capabilities by enabling dynamic distribution of computation across serial and parallel paths through a parent-child threading mechanism. By combining supervised training with end-to-end reinforcement learning, APR eliminates the need for manually designed structures while allowing models to develop optimal parallelization strategies. Experimental results on the Countdown task demonstrate APR’s substantial advantages: higher performance within fixed context windows, superior scaling with increased compute budgets, and significantly improved success rates at equivalent latency constraints. These achievements highlight the potential of reasoning systems that dynamically structure inference processes to achieve enhanced scalability and efficiency in complex problem-solving tasks.


Check out the Paper.

Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning

Large language models (LLMs) face significant challenges when trained as autonomous agents in interactive environments. Unlike static tasks, agent settings require sequential decision-making, cross-turn memory maintenance, and adaptation to stochastic environmental feedback. These capabilities are essential for developing effective planning assistants, robotics applications, and tutoring agents that can self-improve through experience. While reinforcement learning (RL) has been applied to LLMs using rule-based rewards, training self-evolving agents that can reason and adapt remains underexplored. Current approaches suffer from training instability, complex reward signal interpretation, and limited generalisation across varying prompts or changing environments, particularly during multi-turn interactions with unpredictable feedback. The fundamental question emerges: which design elements are crucial for creating LLM agents that learn effectively and maintain stability throughout their evolution?

Through diverse methodologies, RL has significantly advanced LLMs’ reasoning capabilities. PPO maintains training stability by clipping policy updates, while GRPO enhances systematic problem-solving abilities. SAC employs entropy-regularised objectives for robust exploration, and meta tokens facilitate structured thinking. PRM and MCTS-based approaches have further improved systematic reasoning. Simultaneously, chain-of-thought techniques like STaR iteratively utilise small rationale examples alongside larger datasets. At the same time, DAPO, Dr. GRPO, and Open Reasoner Zero demonstrate that minimalist RL techniques with decoupled clipping and simple reward schemes can substantially enhance reasoning performance.

LLM agent architectures have evolved from basic reasoning-action frameworks to structured planning approaches and complex multi-agent systems. Testing environments range from specialised platforms like Sokoban and FrozenLake to general-purpose frameworks like HuggingGPT, enabling applications from web navigation to coding assistance and embodied tasks. Despite these advances, challenges persist in architectural complexity and self-correction, particularly for diverse multi-step reasoning tasks where maintaining coherence across interactions remains problematic.

Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified framework for trajectory-level agent training with flexible control over reasoning processes, reward mechanisms, and prompt structures. Building on this framework, they developed RAGEN, a modular system implementing complete training loops for analysing LLM agent dynamics in multi-turn stochastic environments. To isolate learning factors from confounding variables like pretrained knowledge, evaluation focuses on three controlled gaming environments: Bandit (single-turn, stochastic), Sokoban (multi-turn, deterministic), and Frozen Lake (multi-turn, stochastic). These minimalistic environments require policy learning through interaction rather than relying on pre-existing knowledge. The analysis reveals three critical dimensions of agent learning: gradient stability issues in multi-turn reinforcement learning, the importance of rollout frequency and diversity in shaping agent evolution, and the need for carefully designed reward signals to develop genuine reasoning capabilities rather than shallow action selection or hallucinated thinking processes.

StarPO represents a unique framework designed specifically for optimising multi-turn interaction trajectories in LLM agents. Unlike traditional approaches that treat each action independently, StarPO optimises entire trajectories—including observations, reasoning traces, actions, and feedback—as coherent units. This trajectory-level approach is particularly suited for interactive environments where agents must maintain memory across turns and adapt to stochastic feedback. StarPO’s objective function focuses on maximising expected rewards across complete trajectories rather than individual steps, making it directly compatible with autoregressive LLMs through decomposition into token-level likelihoods. The framework integrates reasoning-guided structured outputs that combine both intermediate thinking processes and executable actions, enabling agents to develop more sophisticated decision-making capabilities while maintaining learning stability in complex environments.
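
The toy PyTorch sketch below illustrates what such a trajectory-level, return-weighted likelihood objective can look like once decomposed into token log-probabilities; the tensors are random placeholders, and this is a simplified sketch rather than the RAGEN implementation.

import torch

# Simplified trajectory-level policy-gradient loss: one scalar return scores the whole
# multi-turn trajectory and weights the summed log-probabilities of the tokens the policy
# itself generated (thoughts and actions), while environment observations are masked out.
def trajectory_loss(token_logprobs, generated_mask, trajectory_return):
    # token_logprobs: (T,) log p(token_t | prefix) under the current policy
    # generated_mask: (T,) 1.0 for policy-generated tokens, 0.0 for observation tokens
    logprob_sum = (token_logprobs * generated_mask).sum()
    return -(trajectory_return * logprob_sum)  # minimize negative return-weighted likelihood

T = 12
token_logprobs = (torch.randn(T) - 2.0).requires_grad_()  # placeholder for model outputs
generated_mask = torch.tensor([1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0], dtype=torch.float32)
loss = trajectory_loss(token_logprobs, generated_mask, trajectory_return=1.0)
loss.backward()  # masked observation tokens receive zero gradient
print(loss.item())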

Experimental results reveal that StarPO-S significantly outperforms vanilla StarPO across multiple agent tasks. By implementing uncertainty-based instance filtering, KL term removal, and asymmetric clipping, StarPO-S effectively delays performance collapse and enhances final task outcomes. The stabilised approach demonstrates particular effectiveness in complex environments like FrozenLake and Sokoban, where retaining only 25-50% of high-variance rollouts dramatically improves training stability while reducing computational requirements by up to 50%.
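
A minimal Python sketch of the uncertainty-based filtering idea (keeping only the prompts whose rollouts disagree the most) appears below; the data is synthetic and the code is an illustration of the concept rather than the released implementation.

import numpy as np

# Illustrative uncertainty-based rollout filtering: prompts whose sampled rollouts all
# succeed or all fail carry little learning signal, so keep only the top fraction of
# prompts ranked by reward variance across their rollouts. Rewards here are synthetic.
rng = np.random.default_rng(0)
num_prompts, rollouts_per_prompt = 8, 6
rewards = rng.integers(0, 2, size=(num_prompts, rollouts_per_prompt)).astype(float)

def keep_high_variance(rewards: np.ndarray, keep_frac: float = 0.25) -> np.ndarray:
    variances = rewards.var(axis=1)                 # per-prompt disagreement across rollouts
    k = max(1, int(round(keep_frac * len(variances))))
    return np.argsort(variances)[::-1][:k]          # indices of the most uncertain prompts

kept = keep_high_variance(rewards, keep_frac=0.25)
print("per-prompt variances:", rewards.var(axis=1).round(2))
print("training on prompts:", kept)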

Task diversity and interaction granularity significantly impact performance. Models trained with higher task diversity and 4-6 actions per turn demonstrate superior generalisation capabilities across novel vocabulary and larger environments. Also, frequent rollout updates prove critical for maintaining alignment between optimisation targets and policy behavior. Agents trained with up-to-date rollouts every 1-10 updates achieve faster convergence and higher success rates compared to those relying on outdated trajectory data.

Symbolic reasoning benefits vary substantially between single-turn and multi-turn tasks. While reasoning traces significantly improve generalisation in single-turn Bandit environments, they provide limited advantage in complex multi-turn settings like Sokoban and FrozenLake. Analysis shows reasoning length consistently declines during training, suggesting models gradually suppress their thought processes when rewards are sparse and delayed. This highlights the need for reward mechanisms that directly reinforce intermediate reasoning steps rather than relying solely on outcome-based feedback.

This research establishes reinforcement learning as a viable approach for training language agents in complex, stochastic environments. StarPO-S represents a significant advancement in stabilising multi-turn agent training through uncertainty-based sampling and exploration encouragement. By transitioning from human supervision to verifiable outcome-based rewards, this framework creates opportunities for developing more capable AI systems across theorem proving, software engineering, and scientific discovery. Future work should focus on multi-modal inputs, enhanced training efficiency, and applications to increasingly complex domains with verifiable objectives.


Check out the Paper and GitHub Page.

The WAVLab Team Releases VERSA: A Comprehensive and Versatile Evaluation Toolkit for Assessing Speech, Audio, and Music Signals

AI models have made remarkable strides in generating speech, music, and other forms of audio content, expanding possibilities across communication, entertainment, and human-computer interaction. The ability to create human-like audio through deep generative models is no longer a futuristic ambition but a tangible reality that is impacting industries today. However, as these models grow more sophisticated, the need for rigorous, scalable, and objective evaluation systems becomes critical. Evaluating the quality of generated audio is complex because it involves not only measuring signal accuracy but also assessing perceptual aspects such as naturalness, emotion, speaker identity, and musical creativity. Traditional evaluation practices, such as human subjective assessments, are time-consuming, expensive, and prone to psychological biases, making automated audio evaluation methods a necessity for advancing research and applications.

One persistent challenge in automated audio evaluation lies in the diversity and inconsistency of existing methods. Human evaluations, despite being a gold standard, suffer from biases such as range-equalizing effects and require significant labor and expert knowledge, particularly in nuanced areas like singing synthesis or emotional expression. Automatic metrics have filled this gap, but they vary widely depending on the application scenario, such as speech enhancement, speech synthesis, or music generation. Moreover, there is no universally adopted set of metrics or standardized framework, leading to scattered efforts and incomparable results across different systems. Without unified evaluation practices, it becomes increasingly difficult to benchmark the performance of audio generative models and track genuine progress in the field.

Existing tools and methods each cover only parts of the problem. Toolkits like ESPnet and SHEET offer evaluation modules, but focus heavily on speech processing, providing limited coverage for music or mixed audio tasks. AudioLDM-Eval, Stable-Audio-Metric, and Sony Audio-Metrics attempt broader audio evaluations but still suffer from fragmented metric support and inflexible configurations. Metrics such as Mean Opinion Score (MOS), PESQ (Perceptual Evaluation of Speech Quality), SI-SNR (Scale-Invariant Signal-to-Noise Ratio), and Fréchet Audio Distance (FAD) are widely used; however, most tools implement only a handful of these measures. Also, reliance on external references, whether matching or non-matching audio, text transcriptions, or visual cues, varies significantly between tools. Centralizing and standardizing these evaluations in a flexible and scalable toolkit has remained an unmet need until now.
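
To make one of these signal-level measures concrete, here is a minimal NumPy sketch of SI-SNR, one of the reference-based metrics mentioned above. It follows the standard zero-mean, scale-invariant formulation and is written for illustration rather than copied from any particular toolkit.

import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB (standard formulation)."""
    est = estimate - np.mean(estimate)
    ref = reference - np.mean(reference)
    # Project the estimate onto the reference to obtain the scaled target.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

# Toy usage: a slightly noisy copy of a 440 Hz sine wave.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * np.random.randn(len(t))
print(f"SI-SNR: {si_snr(noisy, clean):.2f} dB")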

Researchers from Carnegie Mellon University, Microsoft, Indiana University, Nanyang Technological University, the University of Rochester, Renmin University of China, Shanghai Jiaotong University, and Sony AI introduced VERSA, a new evaluation toolkit. VERSA stands out by offering a Python-based, modular toolkit that integrates 65 evaluation metrics, leading to 729 configurable metric variants. It uniquely supports speech, audio, and music evaluation within a single framework, a feature that no prior toolkit has comprehensively achieved. VERSA also emphasizes flexible configuration and strict dependency control, allowing easy adaptation to different evaluation needs without incurring software conflicts. Released publicly via GitHub, VERSA aims to become a foundational tool for benchmarking sound generation tasks, thereby making a significant contribution to the research and engineering communities.

The VERSA system is organized around two core scripts: ‘scorer.py’ and ‘aggregate_result.py’. The ‘scorer.py’ handles the actual computation of metrics, while ‘aggregate_result.py’ consolidates metric outputs into comprehensive evaluation reports. Input and output interfaces are designed to support a range of formats, including PCM, FLAC, MP3, and Kaldi-ARK, accommodating various file organizations from wav.scp mappings to simple directory structures. Metrics are controlled through unified YAML-style configuration files, allowing users to select metrics from a master list (general.yaml) or create specialized setups for individual metrics (e.g., mcd_f0.yaml for Mel Cepstral Distortion evaluation). To further simplify usability, VERSA ensures minimal default dependencies while providing optional installation scripts for metrics that require additional packages. Local forks of external evaluation libraries are incorporated, ensuring flexibility without strict version locking, enhancing both usability and system robustness.

When benchmarked against existing solutions, VERSA outperforms them significantly. It supports 22 independent metrics that do not require reference audio, 25 dependent metrics based on matching references, 11 metrics that rely on non-matching references, and five distributional metrics for evaluating generative models. For instance, independent metrics such as SI-SNR and VAD (Voice Activity Detection) are supported, alongside dependent metrics like PESQ and STOI (Short-Time Objective Intelligibility). The toolkit covers 54 metrics applicable to speech tasks, 22 to general audio, and 22 to music generation, offering unprecedented flexibility. Notably, VERSA supports evaluation using external resources, such as textual captions and visual cues, making it suitable for multimodal generative evaluation scenarios. Compared to other toolkits, such as AudioCraft (which supports only six metrics) or Amphion (15 metrics), VERSA offers unmatched breadth and depth.

The research demonstrates that VERSA enables consistent benchmarking by minimizing subjective variability, improving comparability by providing a unified metric set, and enhancing research efficiency by consolidating diverse evaluation methods into a single platform. By offering more than 700 metric variants simply through configuration adjustments, researchers no longer have to piece together different evaluation methods from multiple fragmented tools. This consistency in evaluation fosters reproducibility and fair comparisons, both of which are critical for tracking advancements in generative sound technologies.

Several Key Takeaways from the Research on VERSA include:

  • VERSA provides 65 metrics and 729 metric variations for evaluating speech, audio, and music.
  • It supports various file formats, including PCM, FLAC, MP3, and Kaldi-ARK.
  • The toolkit covers 54 metrics applicable to speech, 22 to audio, and 22 to music generation tasks.
  • Two core scripts, ‘scorer.py’ and ‘aggregate_result.py’, simplify the evaluation and report generation process.
  • VERSA offers strict but flexible dependency control, minimizing installation conflicts.
  • It supports evaluation using matching and non-matching audio references, text transcriptions, and visual cues.
  • Compared to 16 metrics in ESPnet and 15 in Amphion, VERSA’s 65 metrics represent a major advancement.
  • Released publicly, it aims to become a universal standard for evaluating sound generation.
  • The flexibility to modify configuration files enables users to generate up to 729 distinct evaluation setups.
  • The toolkit addresses biases and inefficiencies in subjective human evaluations through reliable automated assessments.

Check out the Paper, Demo on Hugging Face and GitHub Page.


The post The WAVLab Team Releases VERSA: A Comprehensive and Versatile Evaluation Toolkit for Assessing Speech, Audio, and Music Signals appeared first on MarkTechPost.

Google DeepMind Research Introduces QuestBench: Evaluating LLMs’ Ability to Identify Missing Information in Reasoning Tasks https://www.marktechpost.com/2025/04/25/google-deepmind-research-introduces-questbench-evaluating-llms-ability-to-identify-missing-information-in-reasoning-tasks/ https://www.marktechpost.com/2025/04/25/google-deepmind-research-introduces-questbench-evaluating-llms-ability-to-identify-missing-information-in-reasoning-tasks/#respond Sat, 26 Apr 2025 04:06:48 +0000 https://www.marktechpost.com/?p=70841 Large language models (LLMs) have gained significant traction in reasoning tasks, including mathematics, logic, planning, and coding. However, a critical challenge emerges when applying these models to real-world scenarios. While current implementations typically operate under the assumption that all necessary information is provided upfront in well-specified tasks, reality often presents incomplete or ambiguous situations. Users […]

Large language models (LLMs) have gained significant traction in reasoning tasks, including mathematics, logic, planning, and coding. However, a critical challenge emerges when applying these models to real-world scenarios. While current implementations typically operate under the assumption that all necessary information is provided upfront in well-specified tasks, reality often presents incomplete or ambiguous situations. Users frequently omit crucial details when formulating math problems, and autonomous systems like robots must function in environments with partial observability. This fundamental mismatch between idealised complete-information settings and the incomplete nature of real-world problems necessitates LLMs to develop proactive information-gathering capabilities. Recognising information gaps and generating relevant clarifying questions represents an essential but underdeveloped functionality for LLMs to effectively navigate ambiguous scenarios and provide accurate solutions in practical applications.

Various approaches have attempted to address the challenge of information gathering in ambiguous scenarios. Active learning strategies acquire sequential data through methods like Bayesian optimisation, reinforcement learning, and robot planning with partially observable states. Research on ambiguity in natural language has explored semantic uncertainties, factual question-answering, task-oriented dialogues, and personalised preferences. Question-asking methods for LLMs include direct prompting techniques, information gain computation, and multi-stage clarification frameworks. However, most existing benchmarks focus on subjective tasks where multiple valid clarifying questions exist, making objective evaluation difficult. These approaches address ambiguous or knowledge-based tasks rather than underspecified reasoning problems, where an objectively correct question is determinable.

QuestBench presents a robust approach to evaluating LLMs’ ability to identify and acquire missing information in reasoning tasks. The methodology formalises underspecified problems as Constraint Satisfaction Problems (CSPs) where a target variable cannot be determined without additional information. Unlike semantic ambiguity, where multiple interpretations exist but each yields a solvable answer, underspecification renders problems unsolvable without supplementary data. QuestBench specifically focuses on “1-sufficient CSPs” – problems requiring knowledge of just one unknown variable’s value to solve for the target variable. The benchmark comprises three distinct domains: Logic-Q (logical reasoning tasks), Planning-Q (blocks world planning problems with partially observed initial states), and GSM-Q/GSME-Q (grade-school math problems in verbal and equation forms). The framework strategically categorises problems along four axes of difficulty: number of variables, number of constraints, search depth required, and expected guesses needed by brute-force search. This classification offers insights into LLMs’ reasoning strategies and performance limitations.

QuestBench employs a formal Constraint Satisfaction Problem framework to precisely identify and evaluate information gaps in reasoning tasks. A CSP is defined as a tuple ⟨X, D, C, A, y⟩ where X represents variables, D denotes their domains, C encompasses constraints, A consists of variable assignments, and y is the target variable to solve. The framework introduces the “Known” predicate, indicating when a variable’s value is determinable either through direct assignment or derivation from existing constraints. A CSP is classified as underspecified when the target variable y cannot be determined from available information. The methodology focuses specifically on “1-sufficient CSPs”, where knowing just one additional variable is sufficient to solve for the target.
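
The 1-sufficiency idea can be illustrated with a small brute-force check: starting from the known assignments, reveal each unknown variable in turn and test whether the target then becomes derivable through constraint propagation. The toy constraints and helper names below are invented for illustration; this is not QuestBench code.

def propagate(known, constraints):
    """Repeatedly apply constraints that can derive a new variable from
    already-known ones, until nothing changes."""
    known = dict(known)
    changed = True
    while changed:
        changed = False
        for rule in constraints:
            derived = rule(known)          # returns (variable, value) or None
            if derived and derived[0] not in known:
                known[derived[0]] = derived[1]
                changed = True
    return known

def sufficient_variables(known, candidate_values, constraints, target):
    """Return the unknowns whose value alone would make `target` derivable,
    i.e. the variables worth asking about in a 1-sufficient CSP."""
    return [var for var, guess in candidate_values.items()
            if target in propagate({**known, var: guess}, constraints)]

# Toy CSP: y = a + b and b = 2 * c, with only `a` known.
constraints = [
    lambda k: ('y', k['a'] + k['b']) if 'a' in k and 'b' in k else None,
    lambda k: ('b', 2 * k['c']) if 'c' in k else None,
]
print(sufficient_variables({'a': 3}, {'b': 0, 'c': 0}, constraints, 'y'))   # ['b', 'c']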

The benchmark measures model performance along four difficulty axes that correspond to algorithmic complexity: total number of variables (|X|), total number of constraints (|C|), depth of backwards search tree (d), and expected number of random guesses needed (𝔼BF). These metrics provide quantitative measures of problem complexity and help differentiate between semantic ambiguity (multiple valid interpretations) and underspecification (missing information). For each task, models must identify the single sufficient variable that, when known, enables solving for the target variable, requiring both recognition of information gaps and strategic reasoning about constraint relationships.

Experimental evaluation of QuestBench reveals varying capabilities among leading large language models in information-gathering tasks. GPT-4o, GPT-4-o1 Preview, Claude 3.5 Sonnet, Gemini 1.5 Pro/Flash, Gemini 2.0 Flash Thinking Experimental, and open-sourced Gemma models were tested across zero-shot, chain-of-thought, and four-shot settings. Tests were conducted on representative subsets of 288 GSM-Q and 151 GSME-Q tasks between June 2024 and March 2025. Performance analysis along the difficulty axes demonstrates that models struggle most with problems featuring high search depths and complex constraint relationships. Chain-of-thought prompting generally improved performance across all models, suggesting that explicit reasoning pathways help identify information gaps. Among the evaluated models, Gemini 2.0 Flash Thinking Experimental achieved the highest accuracy, particularly on planning tasks, while open-source models showed competitive performance on logical reasoning tasks but struggled with complex math problems requiring deeper search.

QuestBench provides a unique framework for evaluating LLMs’ ability to identify underspecified information and generate appropriate clarifying questions in reasoning tasks. Current state-of-the-art models demonstrate reasonable performance on simple algebra problems but struggle significantly with complex logic and planning tasks. Performance deteriorates as problem complexity increases along key dimensions like search depth and expected number of brute-force guesses. These findings highlight that while reasoning ability is necessary for effective question-asking, it alone may not be sufficient. Significant advancement opportunities exist in developing LLMs that can better recognize information gaps and request clarification when operating under uncertainty.


Check out the Paper.


The post Google DeepMind Research Introduces QuestBench: Evaluating LLMs’ Ability to Identify Missing Information in Reasoning Tasks appeared first on MarkTechPost.

LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels https://www.marktechpost.com/2025/04/18/llms-can-now-solve-challenging-math-problems-with-minimal-data-researchers-from-uc-berkeley-and-ai2-unveil-a-fine-tuning-recipe-that-unlocks-mathematical-reasoning-across-difficulty-levels/ https://www.marktechpost.com/2025/04/18/llms-can-now-solve-challenging-math-problems-with-minimal-data-researchers-from-uc-berkeley-and-ai2-unveil-a-fine-tuning-recipe-that-unlocks-mathematical-reasoning-across-difficulty-levels/#respond Sat, 19 Apr 2025 05:49:17 +0000 https://www.marktechpost.com/?p=70645 Language models have made significant strides in tackling reasoning tasks, with even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 demonstrating remarkable improvements in mathematical problem-solving capabilities. However, fundamental questions remain about these advancements: Do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? The research […]

Language models have made significant strides in tackling reasoning tasks, with even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 demonstrating remarkable improvements in mathematical problem-solving capabilities. However, fundamental questions remain about these advancements: Do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? The research community faces challenges in understanding which capabilities are enhanced through small-scale SFT and which limitations persist despite these improvements. Despite impressive performance on popular benchmarks, there is an incomplete understanding of these fine-tuned models’ specific strengths and weaknesses, creating a critical gap in knowledge about their true reasoning abilities and practical limitations.

Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new contexts, such as applying coordinate-based techniques in geometry. Existing methods focus on factors like correctness, solution length, and response diversity, which initial studies suggest play significant roles in model improvement through SFT. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable questions become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to establish whether observed improvements reflect deeper learning or simply memorisation of training trajectories, highlighting the need for more sophisticated analysis methods.

The researchers from the University of California, Berkeley and the Allen Institute for AI propose a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. This approach utilises the AIME24 dataset, chosen for its complexity and widespread use in reasoning research, which exhibits a ladder-like structure where models solving higher-tier questions typically succeed on lower-tier ones. By categorising questions into four difficulty tiers, Easy, Medium, Hard, and Exh, the study systematically examines the specific requirements for advancing between tiers. The analysis reveals that progression from Easy to Medium primarily requires adopting an R1 reasoning style with long inference context, while Hard-level questions demand greater computational stability during deep exploration. Exh-level questions present a fundamentally different challenge, requiring unconventional problem-solving strategies that current models uniformly struggle with. The research also identifies four key insights: the performance gap between potential and stability in small-scale SFT models, minimal benefits from careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome through SFT alone.

The methodology employs a comprehensive tiered analysis using the AIME24 dataset as the primary test benchmark. This choice stems from three key attributes: the dataset’s hierarchical difficulty that challenges even state-of-the-art models, its diverse coverage of mathematical domains, and its focus on high school mathematics that isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and inherent cognitive behaviours, including verification, backtracking, and subgoal setting. The fine-tuning data consists of question-response pairs from the Openr1-Math-220k dataset, specifically using CoT trajectories generated by DeepSeek R1 for problems from NuminaMath1.5, with incorrect solutions filtered out. The training configuration mirrors prior studies with a learning rate of 1 × 10^-5, weight decay of 1 × 10^-4, batch size of 32, and 5 epochs. Performance evaluation employs avg@n (average pass rate over multiple attempts) and cov@n metrics, with questions categorised into four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on model performance patterns.
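
For readers unfamiliar with these metrics, avg@n is the mean pass rate over n sampled attempts per question, while cov@n (coverage) is the fraction of questions solved at least once within those n attempts. A minimal sketch, assuming a boolean results matrix of shape [questions, attempts]:

import numpy as np

def avg_at_n(results):
    """results[q, a] is True if attempt `a` on question `q` was correct."""
    return results.mean()                    # average pass rate over all attempts

def cov_at_n(results):
    return results.any(axis=1).mean()        # solved at least once per question

# Toy example: 3 questions, 4 attempts each.
results = np.array([
    [True,  False, True,  False],   # solved half the time
    [False, False, False, False],   # never solved
    [True,  True,  True,  True],    # always solved
])
print(avg_at_n(results))   # 0.5
print(cov_at_n(results))   # ~0.667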

Research results reveal that effective progression from Easy to Medium-level mathematical problem-solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across diverse mathematical categories, dataset size variations (100-1000 examples per category), trajectory length (short, normal, or long), and trajectory style (comparing DeepSeek-R1 with Gemini-flash). Through comprehensive ablation studies, researchers isolated the impact of each dimension on model performance, represented as P = f(C, N, L, S), where C represents category, N represents the number of trajectories, L represents length, and S represents style. The findings demonstrate that achieving performance ≥90% on Medium-level questions minimally requires at least 500 normal or long R1-style trajectories, regardless of the specific mathematical category. Models consistently fail to meet performance thresholds when trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories. This indicates that reasoning trajectory length and quantity represent critical factors in developing mathematical reasoning capabilities, while the specific subject matter of the trajectories proves less important than their structural characteristics.

The research demonstrates that models with small-scale supervised fine-tuning can potentially solve as many questions as more sophisticated models like Deepseek-R1, though significant challenges remain. The primary limitation identified is instability in mathematical reasoning, rather than capability. Experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1’s performance when given multiple attempts, yet their overall accuracy lags by more than 20%. This performance gap stems primarily from instability in deep exploration and computational limitations during complex problem-solving. While increasing the SFT dataset size offers one solution path, performance enhancement follows a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation, revealing that performance across various mathematical categories remains consistent within a narrow range of 55±4%, with only marginal differences between specifically constructed similar datasets and randomly constructed ones. This conclusion suggests that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing robust mathematical reasoning capabilities.


Here is the Paper and GitHub Page.


The post LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels appeared first on MarkTechPost.

LLM Reasoning Benchmarks are Statistically Fragile: New Study Shows Reinforcement Learning RL Gains often Fall within Random Variance https://www.marktechpost.com/2025/04/15/llm-reasoning-benchmarks-are-statistically-fragile-new-study-shows-reinforcement-learning-rl-gains-often-fall-within-random-variance/ https://www.marktechpost.com/2025/04/15/llm-reasoning-benchmarks-are-statistically-fragile-new-study-shows-reinforcement-learning-rl-gains-often-fall-within-random-variance/#respond Tue, 15 Apr 2025 16:44:40 +0000 https://www.marktechpost.com/?p=70538 Reasoning capabilities have become central to advancements in large language models, crucial in leading AI systems developed by major research labs. Despite a surge in research focused on understanding and enhancing LLM reasoning abilities, significant methodological challenges persist in evaluating these capabilities accurately. The field faces growing concerns regarding evaluation rigor as non-reproducible or inconclusive […]

Reasoning capabilities have become central to advancements in large language models, crucial in leading AI systems developed by major research labs. Despite a surge in research focused on understanding and enhancing LLM reasoning abilities, significant methodological challenges persist in evaluating these capabilities accurately. The field faces growing concerns regarding evaluation rigor as non-reproducible or inconclusive assessments risk distorting scientific understanding, misguiding adoption decisions, and skewing future research priorities. In the rapidly evolving landscape of LLM reasoning, where quick publication cycles and benchmarking competitions are commonplace, methodological shortcuts can silently undermine genuine progress. While reproducibility issues in LLM evaluations have been documented, their continued presence—particularly in reasoning tasks—demands heightened scrutiny and more stringent evaluation standards to ensure that reported advances reflect genuine capabilities rather than artifacts of flawed assessment methodologies.

Numerous approaches have emerged to enhance reasoning capabilities in language models, with supervised fine-tuning (SFT) and reinforcement learning (RL) being the primary methods of interest. Recent innovations have expanded upon the DeepSeek-R1 recipe through innovative RL algorithms like LCPO, REINFORCE++, DAPO, and VinePPO. Researchers have also conducted empirical studies exploring RL design spaces, data scaling trends, curricula, and reward mechanisms. Despite these advancements, the field faces significant evaluation challenges. Machine learning progress often lacks rigorous assessment, with many reported gains failing to hold up when tested against well-tuned baselines. RL algorithms are particularly susceptible to variations in implementation details, including random seeds, raising concerns about the reliability of benchmarking practices.

Motivated by inconsistent claims in reasoning research, this study by researchers from Tübingen AI Center, University of Tübingen and  University of Cambridge conducts a rigorous investigation into mathematical reasoning benchmarks, revealing that many recent empirical conclusions fail under careful re-evaluation. The analysis identifies surprising sensitivity in LLM reasoning pipelines to minor design choices, including decoding parameters, prompt formatting, random seeds, and hardware configurations. Small benchmark sizes contribute significantly to this instability, with single questions potentially shifting Pass@1 scores by over 3 percentage points on datasets like AIME’24 and AMC’23. This leads to double-digit performance variations across seeds, undermining published results. The study systematically analyzes these instability sources and proposes best practices for improving reproducibility and rigor in reasoning evaluations, providing a standardized framework for re-evaluating recent techniques under more controlled conditions.

The study explores design factors affecting reasoning performance in language models through a standardized experimental framework. Nine widely used models across 1.5B and 7B parameter classes were evaluated, including DeepSeek-R1-Distill variants, DeepScaleR-1.5B, II-1.5 B-Preview, OpenRS models, S1.1-7B, and OpenThinker7B. Using consistent hardware (A100 GPU, AMD CPU) and software configurations, models were benchmarked on AIME’24, AMC’23, and MATH500 datasets using Pass@1 metrics. The analysis revealed significant performance variance across random seeds, with standard deviations ranging from 5 to 15 percentage points. This instability is particularly pronounced in smaller datasets where a single question can shift performance by 2.5-3.3 percentage points, making single-seed evaluations unreliable.
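
The arithmetic behind this fragility is straightforward: with only 30 AIME'24 problems, flipping a single question moves Pass@1 by 1/30 ≈ 3.3 points, and averaging over k seeds shrinks the seed-level standard deviation only by a factor of √k. The sketch below shows the kind of seed-averaged reporting this implies; the seed scores are made up for illustration.

import numpy as np

# Hypothetical Pass@1 scores (in %) for one model over 10 random seeds.
seed_scores = np.array([43.3, 36.7, 40.0, 46.7, 33.3, 40.0, 43.3, 36.7, 50.0, 40.0])

mean = seed_scores.mean()
std = seed_scores.std(ddof=1)                     # seed-to-seed spread
stderr = std / np.sqrt(len(seed_scores))          # shrinks as sqrt(k) with more seeds

print(f"Pass@1 = {mean:.1f} ± {std:.1f} (std over seeds), ±{stderr:.1f} (std. error)")
print(f"One AIME'24 question is worth {100 / 30:.1f} points of Pass@1")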

Based on rigorous standardized evaluations, the study reveals several key findings about current reasoning methodologies in language models. Most RL-trained variants of the DeepSeek R1-Distill model fail to deliver meaningful performance improvements, with only DeepScaleR demonstrating robust, significant gains across benchmarks. While RL training can substantially improve base model performance when applied to models like Qwen2.5, instruction tuning generally remains superior, with Open Reasoner-Zero-7B being the notable exception. In contrast, SFT consistently outperforms instruction-tuned baselines across all benchmarks and generalizes well to new datasets like AIME’25, highlighting its robustness as a training paradigm. RL-trained models show pronounced performance drops between AIME’24 and the more challenging AIME’25, indicating problematic overfitting to training distributions. Additional phenomena investigated include the correlation between response length and accuracy, with longer responses consistently showing higher error rates across all model types.

This comprehensive analysis reveals that apparent progress in LLM-based reasoning has been built on unstable foundations, with performance metrics susceptible to minor variations in evaluation protocols. The investigation demonstrates that reinforcement learning approaches yield modest improvements at best and frequently exhibit overfitting to specific benchmarks, while supervised fine-tuning consistently delivers robust, generalizable performance gains. To establish more reliable assessment standards, standardized evaluation frameworks with Dockerized environments, seed-averaged metrics, and transparent protocols are essential. These findings highlight the critical need for methodological rigor over leaderboard competition to ensure that claimed advances in reasoning capabilities reflect genuine progress rather than artifacts of inconsistent evaluation practices.


Here is the Paper, GitHub Page and Leaderboard.


The post LLM Reasoning Benchmarks are Statistically Fragile: New Study Shows Reinforcement Learning RL Gains often Fall within Random Variance appeared first on MarkTechPost.

Multimodal Models Don’t Need Late Fusion: Apple Researchers Show Early-Fusion Architectures are more Scalable, Efficient, and Modality-Agnostic https://www.marktechpost.com/2025/04/14/multimodal-models-dont-need-late-fusion-apple-researchers-show-early-fusion-architectures-are-more-scalable-efficient-and-modality-agnostic/ https://www.marktechpost.com/2025/04/14/multimodal-models-dont-need-late-fusion-apple-researchers-show-early-fusion-architectures-are-more-scalable-efficient-and-modality-agnostic/#respond Mon, 14 Apr 2025 22:16:54 +0000 https://www.marktechpost.com/?p=70517 Multimodal artificial intelligence faces fundamental challenges in effectively integrating and processing diverse data types simultaneously. Current methodologies predominantly rely on late-fusion strategies, where separately pre-trained unimodal models are grafted together, such as attaching vision encoders to language models. This approach, while convenient, raises critical questions about optimality for true multimodal understanding. The inherent biases from […]

Multimodal artificial intelligence faces fundamental challenges in effectively integrating and processing diverse data types simultaneously. Current methodologies predominantly rely on late-fusion strategies, where separately pre-trained unimodal models are grafted together, such as attaching vision encoders to language models. This approach, while convenient, raises critical questions about optimality for true multimodal understanding. The inherent biases from unimodal pre-training potentially limit the model’s ability to capture essential cross-modality dependencies. Also, scaling these composite systems introduces significant complexity, as each component brings its hyperparameters, pre-training requirements, and distinct scaling properties. The allocation of computational resources across modalities becomes increasingly difficult with this rigid architectural paradigm, hampering efficient scaling and potentially limiting performance in tasks requiring deep multimodal reasoning and representation learning.

Researchers have explored various approaches to multimodal integration, with late-fusion strategies dominating current implementations. These methods connect pre-trained vision encoders with language models, establishing a well-understood paradigm with established best practices. Early-fusion models, which combine modalities at earlier processing stages, remain comparatively unexplored despite their potential advantages. Native multimodal models trained from scratch on all modalities simultaneously represent another approach. However, some rely on pre-trained image tokenizers to convert visual data into discrete tokens compatible with text vocabularies. Mixture of Experts (MoE) architectures have been extensively studied for language models to enable efficient parameter scaling, but their application to multimodal systems remains limited. While scaling laws have been well-established for unimodal models, predicting performance improvements based on compute resources, few studies have investigated these relationships in truly multimodal systems, particularly those using early-fusion architectures processing raw inputs.

Researchers from Sorbonne University and Apple investigate scaling properties of native multimodal models trained from scratch on multimodal data, challenging conventional wisdom about architectural choices. By comparing early-fusion models, which process raw multimodal inputs directly against traditional late-fusion approaches, researchers demonstrate that late fusion offers no inherent advantage when both architectures are trained from scratch. Contrary to current practices, early-fusion models prove more efficient and easier to scale, following scaling laws similar to language models with slight variations in scaling coefficients across modalities and datasets. Analysis reveals optimal performance occurs when model parameters and training tokens are scaled in roughly equal proportions, with findings generalizing across diverse multimodal training mixtures. Recognizing the heterogeneous nature of multimodal data, the research extends to MoE architectures, enabling dynamic parameter specialization across modalities in a symmetric and parallel manner. This approach yields significant performance improvements and faster convergence compared to standard architectures, with scaling laws indicating training tokens should be prioritized over active parameters, a pattern distinct from dense models due to the higher total parameter count in sparse models.

The architectural investigation reveals several key findings about multimodal model scaling and design. Native early-fusion and late-fusion architectures perform comparably when trained from scratch, with early-fusion models showing slight advantages at lower compute budgets. Scaling laws analysis confirms that compute-optimal models for both architectures perform similarly as compute budgets increase. Importantly, native multimodal models (NMMs) demonstrate scaling properties resembling text-only language models, with scaling exponents varying slightly depending on target data types and training mixtures. Compute-optimal late-fusion models require a higher parameters-to-data ratio compared to their early-fusion counterparts, indicating different resource allocation patterns. Sparse architectures using Mixture of Experts significantly benefit early-fusion NMMs, showing substantial improvements over dense models at equivalent inference costs while implicitly learning modality-specific weights. In addition to this, the compute-optimal sparse models increasingly prioritize scaling training tokens over active parameters as compute budgets grow. Notably, modality-agnostic routing in sparse mixtures consistently outperforms modality-aware routing approaches, challenging intuitions about explicit modality specialization.
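
To make the routing distinction concrete, the sketch below shows a minimal top-1 mixture-of-experts layer with modality-agnostic routing: every token, whether it came from text or image patches, is routed purely from its hidden state, so any modality specialization has to emerge during training. This is a simplified illustration and not the paper's architecture (no load balancing, capacity limits, or parallel expert execution).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAgnosticMoE(nn.Module):
    """Top-1 MoE feed-forward layer; the router sees only hidden states,
    never an explicit modality label."""
    def __init__(self, d_model=256, d_ff=1024, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: [batch, seq, d_model]
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = expert(x[mask]) * top_p[mask].unsqueeze(-1)
        return out

# Text and image tokens share one sequence; the router cannot tell them apart.
tokens = torch.randn(2, 16, 256)
print(ModalityAgnosticMoE()(tokens).shape)       # torch.Size([2, 16, 256])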

The study presents comprehensive scaling experiments with NMMs across various architectural configurations. Researchers trained models ranging from 0.3 billion to 4 billion active parameters, maintaining consistent depth while scaling width to systematically evaluate performance patterns. The training methodology follows a structured approach with variable warm-up periods—1,000 steps for smaller token budgets and 5,000 steps for larger budgets—followed by constant learning rate training and a cooling-down phase using an inverse square root scheduler comprising 20% of the constant learning rate duration. To robustly estimate scaling coefficients in their predictive equations, researchers employed the L-BFGS optimization algorithm paired with Huber loss (using δ = 10^-3), conducting thorough grid searches across initialization ranges. 
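
The fitting recipe described here, a Huber loss on residuals minimized with L-BFGS over a grid of initializations, is common practice for scaling-law estimation. Below is a minimal sketch for a single power law L(C) = a · C^(-b) using SciPy; the synthetic data and the exact parameterization are illustrative, not the paper's.

import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def fit_power_law(compute, loss, delta=1e-3):
    """Fit log L = log a - b * log C by minimizing a Huber loss with L-BFGS,
    keeping the best solution over a small grid of initializations."""
    logC, logL = np.log(compute), np.log(loss)
    def objective(params):
        log_a, b = params
        return huber(logL - (log_a - b * logC), delta).sum()
    best = None
    for log_a0 in np.linspace(0.0, 5.0, 6):
        for b0 in np.linspace(0.01, 0.2, 5):
            res = minimize(objective, x0=[log_a0, b0], method="L-BFGS-B")
            if best is None or res.fun < best.fun:
                best = res
    log_a, b = best.x
    return np.exp(log_a), b

# Synthetic observations roughly following L = 26 * C^(-0.047).
C = np.logspace(18, 22, 12)
L = 26.0 * C ** -0.047 * np.exp(np.random.normal(0, 0.01, size=C.shape))
a_hat, b_hat = fit_power_law(C, L)
print(f"fitted: L ≈ {a_hat:.2f} * C^(-{b_hat:.3f})")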

Comparative analysis reveals significant performance advantages of sparse architectures over dense models for multimodal processing. When compared at equivalent inference costs, MoE models consistently outperform their dense counterparts, with this advantage becoming particularly pronounced for smaller model sizes, suggesting enhanced capability to handle heterogeneous data through modality specialization. As model scale increases, this performance gap gradually narrows. Scaling laws analysis demonstrates that sparse early-fusion models follow similar power law relationships to dense models with comparable scaling exponents (-0.047 vs -0.049), but with a smaller multiplicative constant (26.287 vs 29.574), indicating lower overall loss. 
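
Plugging the reported fits in at a fixed compute budget shows what the smaller multiplicative constant buys; the budget below is an arbitrary value chosen purely to illustrate the comparison.

# Reported fits: sparse L = 26.287 * C^(-0.047), dense L = 29.574 * C^(-0.049).
C = 1e21                                   # arbitrary compute budget, for illustration only
sparse_loss = 26.287 * C ** -0.047
dense_loss = 29.574 * C ** -0.049
print(f"sparse: {sparse_loss:.2f}   dense: {dense_loss:.2f}")   # sparse sits slightly lower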

This research demonstrates that native multimodal models follow scaling patterns similar to language models, challenging conventional architectural assumptions. Early-fusion and late-fusion approaches perform comparably when trained from scratch, with early-fusion showing advantages at lower compute budgets while being more efficient to train. Sparse architectures using Mixture of Experts naturally develop modality-specific specialization, significantly improving performance without increasing inference costs. These findings suggest that unified, early-fusion architectures with dynamic parameter allocation represent a promising direction for efficient multimodal AI systems that can effectively process heterogeneous data.


Check out the Paper.

The post Multimodal Models Don’t Need Late Fusion: Apple Researchers Show Early-Fusion Architectures are more Scalable, Efficient, and Modality-Agnostic appeared first on MarkTechPost.

Step by Step Coding Guide to Build a Neural Collaborative Filtering (NCF) Recommendation System with PyTorch https://www.marktechpost.com/2025/04/11/step-by-step-coding-guide-to-build-a-neural-collaborative-filtering-ncf-recommendation-system-with-pytorch/ https://www.marktechpost.com/2025/04/11/step-by-step-coding-guide-to-build-a-neural-collaborative-filtering-ncf-recommendation-system-with-pytorch/#respond Sat, 12 Apr 2025 03:58:19 +0000 https://www.marktechpost.com/?p=70461 This tutorial will walk you through using PyTorch to implement a Neural Collaborative Filtering (NCF) recommendation system. NCF extends traditional matrix factorisation by using neural networks to model complex user-item interactions. Introduction Neural Collaborative Filtering (NCF) is a state-of-the-art approach for building recommendation systems. Unlike traditional collaborative filtering methods that rely on linear models, NCF […]

This tutorial will walk you through using PyTorch to implement a Neural Collaborative Filtering (NCF) recommendation system. NCF extends traditional matrix factorisation by using neural networks to model complex user-item interactions.

Introduction

Neural Collaborative Filtering (NCF) is a state-of-the-art approach for building recommendation systems. Unlike traditional collaborative filtering methods that rely on linear models, NCF utilizes deep learning to capture non-linear relationships between users and items.

In this tutorial, we’ll:

  1. Prepare and explore the MovieLens dataset
  2. Implement the NCF model architecture
  3. Train the model
  4. Evaluate its performance
  5. Generate recommendations for users

Setup and Environment

First, let’s install the necessary libraries and import them:

!pip install torch numpy pandas matplotlib seaborn scikit-learn tqdm


import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
import random




torch.manual_seed(42)
np.random.seed(42)
random.seed(42)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Data Loading and Preparation

We’ll use the MovieLens 100K dataset, which contains 100,000 movie ratings from users:

!wget -nc https://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -q -n ml-100k.zip


# u.data is tab-separated: user_id, item_id, rating, timestamp
ratings_df = pd.read_csv('ml-100k/u.data', sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])


movies_df = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1',
                       names=['item_id', 'title', 'release_date', 'video_release_date',
                              'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
                              'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                              'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
                              'Thriller', 'War', 'Western'])


print("Ratings data:")
print(ratings_df.head())


print("nMovies data:")
print(movies_df[['item_id', 'title']].head())




print(f"nTotal number of ratings: {len(ratings_df)}")
print(f"Number of unique users: {ratings_df['user_id'].nunique()}")
print(f"Number of unique movies: {ratings_df['item_id'].nunique()}")
print(f"Rating range: {ratings_df['rating'].min()} to {ratings_df['rating'].max()}")
print(f"Average rating: {ratings_df['rating'].mean():.2f}")




plt.figure(figsize=(10, 6))
sns.countplot(x='rating', data=ratings_df)
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


ratings_df['label'] = (ratings_df['rating'] >= 4).astype(np.float32)

Data Preparation for NCF

Now, let’s prepare the data for our NCF model:

train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)


print(f"Training set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")


num_users = ratings_df['user_id'].max()
num_items = ratings_df['item_id'].max()


print(f"Number of users: {num_users}")
print(f"Number of items: {num_items}")


class NCFDataset(Dataset):
   def __init__(self, df):
       self.user_ids = torch.tensor(df['user_id'].values, dtype=torch.long)
       self.item_ids = torch.tensor(df['item_id'].values, dtype=torch.long)
       self.labels = torch.tensor(df['label'].values, dtype=torch.float)
      
   def __len__(self):
       return len(self.user_ids)
  
   def __getitem__(self, idx):
       return {
           'user_id': self.user_ids[idx],
           'item_id': self.item_ids[idx],
           'label': self.labels[idx]
       }


train_dataset = NCFDataset(train_df)
test_dataset = NCFDataset(test_df)


batch_size = 256
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

Model Architecture

Now we’ll implement the Neural Collaborative Filtering (NCF) model, which combines Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP) components:

class NCF(nn.Module):
   def __init__(self, num_users, num_items, embedding_dim=32, mlp_layers=[64, 32, 16]):
       super(NCF, self).__init__() 


       self.user_embedding_gmf = nn.Embedding(num_users + 1, embedding_dim)
       self.item_embedding_gmf = nn.Embedding(num_items + 1, embedding_dim)


       self.user_embedding_mlp = nn.Embedding(num_users + 1, embedding_dim)
       self.item_embedding_mlp = nn.Embedding(num_items + 1, embedding_dim)
      
       mlp_input_dim = 2 * embedding_dim
       self.mlp_layers = nn.ModuleList()
       for idx, layer_size in enumerate(mlp_layers):
           if idx == 0:
               self.mlp_layers.append(nn.Linear(mlp_input_dim, layer_size))
           else:
               self.mlp_layers.append(nn.Linear(mlp_layers[idx-1], layer_size))
           self.mlp_layers.append(nn.ReLU())


       self.output_layer = nn.Linear(embedding_dim + mlp_layers[-1], 1)
       self.sigmoid = nn.Sigmoid()


       self._init_weights()
  
   def _init_weights(self):
       for m in self.modules():
           if isinstance(m, nn.Embedding):
               nn.init.normal_(m.weight, mean=0.0, std=0.01)
           elif isinstance(m, nn.Linear):
               nn.init.kaiming_uniform_(m.weight)
               if m.bias is not None:
                   nn.init.zeros_(m.bias)
  
   def forward(self, user_ids, item_ids):
       user_embedding_gmf = self.user_embedding_gmf(user_ids)
       item_embedding_gmf = self.item_embedding_gmf(item_ids)
       gmf_vector = user_embedding_gmf * item_embedding_gmf
      
       user_embedding_mlp = self.user_embedding_mlp(user_ids)
       item_embedding_mlp = self.item_embedding_mlp(item_ids)
       mlp_vector = torch.cat([user_embedding_mlp, item_embedding_mlp], dim=-1)


       for layer in self.mlp_layers:
           mlp_vector = layer(mlp_vector)


       concat_vector = torch.cat([gmf_vector, mlp_vector], dim=-1)


       prediction = self.sigmoid(self.output_layer(concat_vector)).squeeze()
      
       return prediction


embedding_dim = 32
mlp_layers = [64, 32, 16]
model = NCF(num_users, num_items, embedding_dim, mlp_layers).to(device)


print(model)

Training the Model

Let’s train our NCF model:

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)


def train_epoch(model, data_loader, criterion, optimizer, device):
   model.train()
   total_loss = 0
   for batch in tqdm(data_loader, desc="Training"):
       user_ids = batch['user_id'].to(device)
       item_ids = batch['item_id'].to(device)
       labels = batch['label'].to(device)
      
       optimizer.zero_grad()
       outputs = model(user_ids, item_ids)
       loss = criterion(outputs, labels)
      
       loss.backward()
       optimizer.step()
      
       total_loss += loss.item()
  
   return total_loss / len(data_loader)


def evaluate(model, data_loader, criterion, device):
   model.eval()
   total_loss = 0
   predictions = []
   true_labels = []
  
   with torch.no_grad():
       for batch in tqdm(data_loader, desc="Evaluating"):
           user_ids = batch['user_id'].to(device)
           item_ids = batch['item_id'].to(device)
           labels = batch['label'].to(device)
          
           outputs = model(user_ids, item_ids)
           loss = criterion(outputs, labels)
           total_loss += loss.item()
          
           predictions.extend(outputs.cpu().numpy())
           true_labels.extend(labels.cpu().numpy())
  
   from sklearn.metrics import roc_auc_score, average_precision_score
   auc = roc_auc_score(true_labels, predictions)
   ap = average_precision_score(true_labels, predictions)
  
   return {
       'loss': total_loss / len(data_loader),
       'auc': auc,
       'ap': ap
   }


num_epochs = 10
history = {'train_loss': [], 'val_loss': [], 'val_auc': [], 'val_ap': []}


for epoch in range(num_epochs):
   train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
  
   eval_metrics = evaluate(model, test_loader, criterion, device)
  
   history['train_loss'].append(train_loss)
   history['val_loss'].append(eval_metrics['loss'])
   history['val_auc'].append(eval_metrics['auc'])
   history['val_ap'].append(eval_metrics['ap'])
  
   print(f"Epoch {epoch+1}/{num_epochs} - "
         f"Train Loss: {train_loss:.4f}, "
         f"Val Loss: {eval_metrics['loss']:.4f}, "
         f"AUC: {eval_metrics['auc']:.4f}, "
         f"AP: {eval_metrics['ap']:.4f}")


plt.figure(figsize=(12, 4))


plt.subplot(1, 2, 1)
plt.plot(history['train_loss'], label='Train Loss')
plt.plot(history['val_loss'], label='Validation Loss')
plt.title('Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()


plt.subplot(1, 2, 2)
plt.plot(history['val_auc'], label='AUC')
plt.plot(history['val_ap'], label='Average Precision')
plt.title('Evaluation Metrics')
plt.xlabel('Epoch')
plt.ylabel('Score')
plt.legend()


plt.tight_layout()
plt.show()


torch.save(model.state_dict(), 'ncf_model.pth')
print("Model saved successfully!")

Generating Recommendations

Now let’s create a function to generate recommendations for users:

def generate_recommendations(model, user_id, n=10):
   model.eval()
   user_ids = torch.tensor([user_id] * num_items, dtype=torch.long).to(device)
   item_ids = torch.tensor(range(1, num_items + 1), dtype=torch.long).to(device)
  
   with torch.no_grad():
       predictions = model(user_ids, item_ids).cpu().numpy()
  
   items_df = pd.DataFrame({
       'item_id': range(1, num_items + 1),
       'score': predictions
   })
  
   user_rated_items = set(ratings_df[ratings_df['user_id'] == user_id]['item_id'].values)
  
   items_df = items_df[~items_df['item_id'].isin(user_rated_items)]
  
   top_n_items = items_df.sort_values('score', ascending=False).head(n)
  
   recommendations = pd.merge(top_n_items, movies_df[['item_id', 'title']], on='item_id')
  
   return recommendations[['item_id', 'title', 'score']]


test_users = [1, 42, 100]


for user_id in test_users:
   print(f"nTop 10 recommendations for user {user_id}:")
   recommendations = generate_recommendations(model, user_id, n=10)
   print(recommendations)
  
   print(f"nMovies that user {user_id} has rated highly (4-5 stars):")
   user_liked = ratings_df[(ratings_df['user_id'] == user_id) & (ratings_df['rating'] >= 4)]
   user_liked = pd.merge(user_liked, movies_df[['item_id', 'title']], on='item_id')
   print(user_liked[['item_id', 'title', 'rating']])

Evaluating the Model Further

Let’s evaluate our model further by computing some additional metrics:

def evaluate_model_with_metrics(model, test_loader, device):
   model.eval()
   predictions = []
   true_labels = []
  
   with torch.no_grad():
       for batch in tqdm(test_loader, desc="Evaluating"):
           user_ids = batch['user_id'].to(device)
           item_ids = batch['item_id'].to(device)
           labels = batch['label'].to(device)
          
           outputs = model(user_ids, item_ids)
          
           predictions.extend(outputs.cpu().numpy())
           true_labels.extend(labels.cpu().numpy())
  
   from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve, accuracy_score
  
   binary_preds = [1 if p >= 0.5 else 0 for p in predictions]
  
   auc = roc_auc_score(true_labels, predictions)
   ap = average_precision_score(true_labels, predictions)
   accuracy = accuracy_score(true_labels, binary_preds)
  
   precision, recall, thresholds = precision_recall_curve(true_labels, predictions)
  
   plt.figure(figsize=(10, 6))
   plt.plot(recall, precision, label=f'AP={ap:.3f}')
   plt.xlabel('Recall')
   plt.ylabel('Precision')
   plt.title('Precision-Recall Curve')
   plt.legend()
   plt.grid(True)
   plt.show()
  
   return {
       'auc': auc,
       'ap': ap,
       'accuracy': accuracy
   }


metrics = evaluate_model_with_metrics(model, test_loader, device)
print(f"AUC: {metrics['auc']:.4f}")
print(f"Average Precision: {metrics['ap']:.4f}")
print(f"Accuracy: {metrics['accuracy']:.4f}")

Cold Start Analysis

Let’s analyze how our model performs for new users or users with few ratings (cold start problem):

user_rating_counts = ratings_df.groupby('user_id').size().reset_index(name='count')
user_rating_counts['group'] = pd.cut(user_rating_counts['count'],
                                   bins=[0, 10, 50, 100, float('inf')],
                                   labels=['1-10', '11-50', '51-100', '100+'])


print("Number of users in each rating frequency group:")
print(user_rating_counts['group'].value_counts())


def evaluate_by_user_group(model, ratings_df, user_groups, device):
   results = {}
  
   for group_name, user_ids in user_groups.items():
       group_ratings = ratings_df[ratings_df['user_id'].isin(user_ids)]
      
       group_dataset = NCFDataset(group_ratings)
       group_loader = DataLoader(group_dataset, batch_size=256, shuffle=False)
      
       if len(group_loader) == 0:
           continue
      
       model.eval()
       predictions = []
       true_labels = []
      
       with torch.no_grad():
           for batch in group_loader:
               user_ids = batch['user_id'].to(device)
               item_ids = batch['item_id'].to(device)
               labels = batch['label'].to(device)
              
               outputs = model(user_ids, item_ids)
              
               predictions.extend(outputs.cpu().numpy())
               true_labels.extend(labels.cpu().numpy())
      
       from sklearn.metrics import roc_auc_score
       try:
           auc = roc_auc_score(true_labels, predictions)
           results[group_name] = auc
       except:
           results[group_name] = None
  
   return results


user_groups = {}
for group in user_rating_counts['group'].unique():
   users_in_group = user_rating_counts[user_rating_counts['group'] == group]['user_id'].values
   user_groups[group] = users_in_group


group_performance = evaluate_by_user_group(model, test_df, user_groups, device)


plt.figure(figsize=(10, 6))
groups = []
aucs = []


for group, auc in group_performance.items():
   if auc is not None:
       groups.append(group)
       aucs.append(auc)


plt.bar(groups, aucs)
plt.xlabel('Number of Ratings per User')
plt.ylabel('AUC Score')
plt.title('Model Performance by User Rating Frequency (Cold Start Analysis)')
plt.ylim(0.5, 1.0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


print("AUC scores by user rating frequency:")
for group, auc in group_performance.items():
   if auc is not None:
       print(f"{group}: {auc:.4f}")

Business Insights and Extensions

def analyze_predictions(model, data_loader, device):
   model.eval()
   predictions = []
   true_labels = []
  
   with torch.no_grad():
       for batch in data_loader:
           user_ids = batch['user_id'].to(device)
           item_ids = batch['item_id'].to(device)
           labels = batch['label'].to(device)
          
           outputs = model(user_ids, item_ids)
          
           predictions.extend(outputs.cpu().numpy())
           true_labels.extend(labels.cpu().numpy())
  
   results_df = pd.DataFrame({
       'true_label': true_labels,
       'predicted_score': predictions
   })
  
   plt.figure(figsize=(12, 6))
  
   plt.subplot(1, 2, 1)
   sns.histplot(results_df['predicted_score'], bins=30, kde=True)
   plt.title('Distribution of Predicted Scores')
   plt.xlabel('Predicted Score')
   plt.ylabel('Count')
  
   plt.subplot(1, 2, 2)
   sns.boxplot(x='true_label', y='predicted_score', data=results_df)
   plt.title('Predicted Scores by True Label')
   plt.xlabel('True Label (0=Disliked, 1=Liked)')
   plt.ylabel('Predicted Score')
  
   plt.tight_layout()
   plt.show()
  
   avg_scores = results_df.groupby('true_label')['predicted_score'].mean()
   print("Average prediction scores:")
   print(f"Items user disliked (0): {avg_scores[0]:.4f}")
   print(f"Items user liked (1): {avg_scores[1]:.4f}")


analyze_predictions(model, test_loader, device)

This tutorial demonstrates how to implement Neural Collaborative Filtering, a deep learning recommendation approach that combines matrix factorization with neural networks. Using the MovieLens dataset and PyTorch, we built a model that generates personalized movie recommendations. The implementation addresses key challenges, including the cold start problem, and reports performance metrics such as AUC and precision-recall curves. This foundation can be extended with hybrid approaches, attention mechanisms, or a deployable web application for a range of business recommendation scenarios.


Here is the Colab Notebook.

The post Step by Step Coding Guide to Build a Neural Collaborative Filtering (NCF) Recommendation System with PyTorch appeared first on MarkTechPost.
