Reinforcement Learning Category - MarkTechPost
https://www.marktechpost.com/category/technology/reinforcement-learning/

Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning
https://www.marktechpost.com/2025/05/01/training-llm-agents-just-got-more-stable-researchers-introduce-starpo-s-and-ragen-to-tackle-multi-turn-reasoning-and-collapse-in-reinforcement-learning/
Fri, 02 May 2025

Large language models (LLMs) face significant challenges when trained as autonomous agents in interactive environments. Unlike static tasks, agent settings require sequential decision-making, cross-turn memory maintenance, and adaptation to stochastic environmental feedback. These capabilities are essential for developing effective planning assistants, robotics applications, and tutoring agents that can self-improve through experience. While reinforcement learning (RL) has been applied to LLMs using rule-based rewards, training self-evolving agents that can reason and adapt remains underexplored. Current approaches suffer from training instability, complex reward signal interpretation, and limited generalisation across varying prompts or changing environments, particularly during multi-turn interactions with unpredictable feedback. The fundamental question emerges: which design elements are crucial for creating LLM agents that learn effectively and maintain stability throughout their evolution?

Through diverse methodologies, RL has significantly advanced LLMs’ reasoning capabilities. PPO maintains training stability by clipping policy updates, while GRPO enhances systematic problem-solving abilities. SAC employs entropy-regularised objectives for robust exploration, and meta tokens facilitate structured thinking. Process reward models (PRMs) and MCTS-based approaches have further improved systematic reasoning. Chain-of-thought techniques like STaR iteratively leverage small sets of rationale examples alongside larger datasets, while DAPO, Dr. GRPO, and Open Reasoner Zero demonstrate that minimalist RL techniques with decoupled clipping and simple reward schemes can substantially enhance reasoning performance.
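
For readers unfamiliar with the clipping mechanism mentioned above, the sketch below shows the standard PPO clipped surrogate loss. It is a generic illustration (tensor shapes and the clip value are placeholder choices), not code from any of the papers discussed here.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (Schulman et al., 2017).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the rollout (old) policy; advantages: advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the element-wise minimum; negate to obtain a loss to minimize.
    return -torch.min(unclipped, clipped).mean()
```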

LLM agent architectures have evolved from basic reasoning-action frameworks to structured planning approaches and complex multi-agent systems. Testing environments range from specialised platforms like Sokoban and FrozenLake to general-purpose frameworks like HuggingGPT, enabling applications from web navigation to coding assistance and embodied tasks. Despite these advances, challenges persist in architectural complexity and self-correction, particularly for diverse multi-step reasoning tasks where maintaining coherence across interactions remains problematic.

Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified framework for trajectory-level agent training with flexible control over reasoning processes, reward mechanisms, and prompt structures. Building on this framework, they developed RAGEN, a modular system implementing complete training loops for analysing LLM agent dynamics in multi-turn stochastic environments. To isolate learning factors from confounding variables like pretrained knowledge, evaluation focuses on three controlled gaming environments: Bandit (single-turn, stochastic), Sokoban (multi-turn, deterministic), and Frozen Lake (multi-turn, stochastic). These minimalistic environments require policy learning through interaction rather than relying on pre-existing knowledge. The analysis reveals three critical dimensions of agent learning: gradient stability issues in multi-turn reinforcement learning, the importance of rollout frequency and diversity in shaping agent evolution, and the need for carefully designed reward signals to develop genuine reasoning capabilities rather than shallow action selection or hallucinated thinking processes.

StarPO represents a unique framework designed specifically for optimising multi-turn interaction trajectories in LLM agents. Unlike traditional approaches that treat each action independently, StarPO optimises entire trajectories—including observations, reasoning traces, actions, and feedback—as coherent units. This trajectory-level approach is particularly suited for interactive environments where agents must maintain memory across turns and adapt to stochastic feedback. StarPO’s objective function focuses on maximising expected rewards across complete trajectories rather than individual steps, making it directly compatible with autoregressive LLMs through decomposition into token-level likelihoods. The framework integrates reasoning-guided structured outputs that combine both intermediate thinking processes and executable actions, enabling agents to develop more sophisticated decision-making capabilities while maintaining learning stability in complex environments.
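
As a rough illustration of what optimising entire trajectories rather than individual steps can look like in code, here is a minimal, hedged sketch. The variable names and the simple REINFORCE-style estimator are assumptions for exposition, not the exact StarPO objective.

```python
import torch

def trajectory_level_loss(token_logps, token_mask, trajectory_reward):
    """Sketch of a trajectory-level objective in the spirit of StarPO.

    token_logps: (T,) log-probs of every token in the multi-turn rollout
                 (reasoning traces and actions across all turns).
    token_mask:  (T,) 1.0 for tokens the agent generated, 0.0 for environment text.
    trajectory_reward: scalar return of the whole trajectory.
    """
    traj_logp = (token_logps * token_mask).sum()   # log-likelihood of the full trajectory
    return -(trajectory_reward * traj_logp)        # maximise E[R(traj) * log pi(traj)]
```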

Experimental results reveal that StarPO-S significantly outperforms vanilla StarPO across multiple agent tasks. By implementing uncertainty-based instance filtering, KL term removal, and asymmetric clipping, StarPO-S effectively delays performance collapse and enhances final task outcomes. The stabilised approach demonstrates particular effectiveness in complex environments like FrozenLake and Sokoban, where retaining only 25-50% of high-variance rollouts dramatically improves training stability while reducing computational requirements by up to 50%.
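
A minimal sketch of the uncertainty-based instance filtering idea is shown below: prompts whose rollouts all succeed or all fail carry little gradient signal, so only the highest-variance fraction is retained for the update. The variance metric and keep fraction are illustrative assumptions, not the exact StarPO-S recipe.

```python
import numpy as np

def keep_high_variance_prompts(rollout_rewards, keep_frac=0.25):
    """rollout_rewards: dict mapping prompt_id -> list of rewards from several
    rollouts of that prompt. Returns the prompt ids retained for the policy update."""
    variances = {pid: float(np.var(rewards)) for pid, rewards in rollout_rewards.items()}
    k = max(1, int(len(variances) * keep_frac))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]
```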

Task diversity and interaction granularity significantly impact performance. Models trained with higher task diversity and 4-6 actions per turn demonstrate superior generalisation capabilities across novel vocabulary and larger environments. Also, frequent rollout updates prove critical for maintaining alignment between optimisation targets and policy behavior. Agents trained with up-to-date rollouts every 1-10 updates achieve faster convergence and higher success rates compared to those relying on outdated trajectory data.

Symbolic reasoning benefits vary substantially between single-turn and multi-turn tasks. While reasoning traces significantly improve generalisation in single-turn Bandit environments, they provide limited advantage in complex multi-turn settings like Sokoban and FrozenLake. Analysis shows reasoning length consistently declines during training, suggesting models gradually suppress their thought processes when rewards are sparse and delayed. This highlights the need for reward mechanisms that directly reinforce intermediate reasoning steps rather than relying solely on outcome-based feedback.

This research establishes reinforcement learning as a viable approach for training language agents in complex, stochastic environments. StarPO-S represents a significant advancement in stabilising multi-turn agent training through uncertainty-based sampling and exploration encouragement. By transitioning from human supervision to verifiable outcome-based rewards, this framework creates opportunities for developing more capable AI systems across theorem proving, software engineering, and scientific discovery. Future work should focus on multi-modal inputs, enhanced training efficiency, and applications to increasingly complex domains with verifiable objectives.


π0 Released and Open Sourced: A General-Purpose Robotic Foundation Model that could be Fine-Tuned to a Diverse Range of Tasks
https://www.marktechpost.com/2025/02/06/%cf%800-released-and-open-sourced-a-general-purpose-robotic-foundation-model-that-could-be-fine-tuned-to-a-diverse-range-of-tasks/
Fri, 07 Feb 2025

Robots usually adapt poorly when moved to different tasks and environments. General-purpose robot models are devised to circumvent this problem: they can be fine-tuned for a wide range of robotic tasks. However, it is challenging to maintain the consistency of shared open resources across various platforms. Success in real-world environments is far from guaranteed, and pre-trained models cannot always be relied upon. Though collaboration fosters improvement in robotic intelligence, fully adaptable yet reliable models are still a distant goal.

Currently, robotic control relies on task-specific models, which lack adaptability and struggle to generalize across different tasks and platforms. These methods limit flexibility because a separate model is needed for each task, and integrating them across robotic systems is inefficient. Compatibility across different platforms remains a major challenge because existing approaches often fail to perform consistently in diverse environments. Practical reliability also remains uncertain: many attempts to fine-tune models for new tasks do not succeed, highlighting the limitations of current robotic learning techniques.

To mitigate these issues, researchers proposed π0, a robotic foundation model designed for general-purpose control across different robots and tasks. Unlike task-specific models lacking flexibility, π0 integrates vision, language, and action using a flow-based diffusion approach. The model is trained on over 10,000 hours of robot data and provides pre-trained checkpoints for fine-tuning on specific platforms. π0-FAST, an alternative version, follows language instructions more accurately but requires higher inference time. The open-source release of π0 allows researchers to fine-tune it for their robots, though its performance may vary across platforms.
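
To make the flow-based action generation mentioned above more concrete, here is a minimal, hedged sketch of a flow-matching style training loss for continuous action chunks. The `action_expert` network, the linear interpolation path, and all shapes are illustrative assumptions; this is not the released π0 codebase API.

```python
import torch

def flow_matching_action_loss(action_expert, obs_embedding, actions):
    """actions: (B, H, D) chunk of H future actions per example.
    action_expert: hypothetical network predicting a velocity field from the
    observation embedding, a noisy action chunk, and a time t in [0, 1]."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)              # per-example time
    x_t = t * actions + (1.0 - t) * noise                # linear interpolation path
    target_velocity = actions - noise                    # d x_t / d t along this path
    pred_velocity = action_expert(obs_embedding, x_t, t.view(-1))
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)
```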

The framework consists of pre-trained models and fine-tuning capabilities, enabling adaptation to various robotic tasks like cleaning, folding, and object manipulation. The open repository contains model weights, example code, and fine-tuned checkpoints for the DROID and ALOHA platforms. Fine-tuning typically requires 1 to 20 hours of data, depending on the robot and the task. By making π0 available, the researchers hope to enable broader advances in robotic learning and AI systems that can understand real-world interactions. However, performance is not guaranteed on every platform, and adaptation challenges still exist.

In the end, the open-sourcing of π0 enables general-purpose robotic foundation models to adapt to complex tasks and various platforms. It is not yet universally applicable, but it encourages experimentation and collaboration in robotic learning. As a baseline for future research, π0 can provide insights into AI-driven robotic interaction, leading to better generalization, more efficient fine-tuning, and greater autonomy.


REDA: A Novel AI Approach to Multi-Agent Reinforcement Learning That Makes Complex Sequence-Dependent Assignment Problems Solvable
https://www.marktechpost.com/2025/01/03/reda-a-novel-ai-approach-to-multi-agent-reinforcement-learning-that-makes-complex-sequence-dependent-assignment-problems-solvable/
Sat, 04 Jan 2025

Power distribution systems are often conceptualized as optimization models. While optimizing agents to perform tasks works well for systems with few checkpoints, things get out of hand when heuristics must handle many tasks and agents. Scaling dramatically increases the complexity of assignment problems, which are often NP-hard and nonlinear, and classical optimization methods become white elephants, delivering suboptimal solutions at high resource cost. Another major issue is that the problem setup is dynamic, requiring an iterative, state-based assignment strategy. When one thinks of state in AI, reinforcement learning is the first thing that comes to mind, and given the temporally state-dependent nature of assignment applications, researchers recognized the great potential of sequential decision-making with RL. This article discusses recent research in state-based assignment that optimizes its solution through RL.

Researchers from the University of Washington, Seattle, introduced a novel multi-agent reinforcement learning approach for sequential satellite assignment problems. Multi-agent RL provides solutions for large-scale, realistic scenarios that would be prohibitively complex for other methods. The authors present a meticulously designed and theoretically justified algorithm for solving satellite assignments that ensures well-specified rewards, serves the global objective, and avoids conflicting assignments. The approach integrates existing greedy algorithms into MARL only as a bootstrap, to improve the solution for long-term planning. The authors also offer novel insights into the method's behavior and global convergence properties through simple experiments and comparisons.

What distinguishes the methodology is that agents first learn an expected assignment value; this value then serves as the input to an optimally distributed task-assignment mechanism. This allows agents to execute joint assignments that satisfy assignment constraints while learning a near-optimal joint policy at the system level. The paper follows a generalized formulation based on satellite internet constellations, where satellites act as agents. This satellite assignment problem (SAP) is solved via an RL-enabled Distributed Assignment algorithm (REDA). Here, the authors bootstrap the policy from a non-parameterized greedy policy, acting with it at the beginning of training with probability ε. Additionally, to induce further exploration, the authors add randomly distributed noise to Q. Another aspect of REDA that reduces its complexity is its learning-target specification, which ensures that targets satisfy the constraints.
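
The sketch below gives a minimal analogue of such an assignment step: learned Q-values plus exploration noise feed a conflict-free joint assignment, with an ε-probability fallback to a simple greedy bootstrap policy. The Hungarian solver, noise scale, and greedy loop are illustrative assumptions (and assume at least as many tasks as agents), not the exact REDA algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_step(q_values, eps=0.1, noise_scale=0.05, rng=None):
    """q_values: (n_agents, n_tasks) learned expected assignment values.
    Returns one task index per agent, with no task assigned twice."""
    rng = rng or np.random.default_rng()
    q = q_values + rng.normal(scale=noise_scale, size=q_values.shape)  # exploration noise
    if rng.random() < eps:
        # Greedy bootstrap: agents pick their best remaining task in order.
        assignment, taken = [], set()
        for agent_q in q:
            for task in np.argsort(-agent_q):
                if int(task) not in taken:
                    assignment.append(int(task))
                    taken.add(int(task))
                    break
        return assignment
    rows, cols = linear_sum_assignment(-q)  # maximize total value, one task per agent
    return [int(c) for c in cols]           # rows are returned in agent order
```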

For evaluation, the authors perform experiments on a simple SAP environment, which they later scale to a complex satellite constellation task allocation environment with hundreds of satellites and tasks. The authors steer the experiments to answer some interesting questions, such as whether REDA encourages unselfish behavior and if REDA can be applied to large problems. The authors reported that REDA immediately drove the group to an optimal joint policy, unlike other methods that encouraged selfishness. For the highly complex scaled SAP, REDA yielded low variance and consistently outperformed all other methods. Overall, the authors reported an increase of 20% to 50% over other state-of-the-art methods.

Conclusion: This paper discussed REDA, a novel Multi-Agent Reinforcement Learning approach for solving complex state-dependent assignment problems. The paper addresses satellite assignment problems and teaches agents to act unselfishly while learning efficient solutions, even in large problem settings.


OpenAI Researchers Propose a Multi-Step Reinforcement Learning Approach to Improve LLM Red Teaming
https://www.marktechpost.com/2024/11/23/openai-researchers-propose-a-multi-step-reinforcement-learning-approach-to-improve-llm-red-teaming/
Sun, 24 Nov 2024

As the use of large language models (LLMs) becomes increasingly prevalent across real-world applications, concerns about their vulnerabilities grow accordingly. Despite their capabilities, LLMs are still susceptible to various types of adversarial attacks, including those that generate toxic content, reveal private information, or allow for prompt injections. These vulnerabilities pose significant ethical concerns regarding bias, misinformation, potential privacy violations, and system abuse. The need for an effective strategy to address these issues is pressing. Traditionally, red teaming—a process that involves stress-testing AI systems by simulating attacks—has been effective for vulnerability detection. However, past approaches to automated red teaming have often struggled to balance the diversity of generated attacks and their effectiveness, limiting the robustness of the models.

To address these challenges, OpenAI researchers propose an approach to automated red teaming that incorporates both diversity and effectiveness in the attacks generated. This is achieved by decomposing the red teaming process into two distinct steps. The first step involves generating diverse attacker goals, while the second step trains a reinforcement learning (RL) attacker to effectively meet these goals. The proposed method uses multi-step reinforcement learning (multi-step RL) and automated reward generation. This approach involves leveraging large language models to generate attacker goals and utilizing rule-based rewards (RBRs) and custom diversity measures to guide RL training. By rewarding an RL-based attacker for being both effective and distinct from its past attempts, the method ensures greater diversity and effectiveness of the attacks.

Technical Details

The research team describes the decomposition of the red teaming system into generating goals and training attacks as a means to simplify the process while achieving robust results. For generating goals, the authors utilize both few-shot prompting of a language model and existing datasets of past attacks. These goals serve as a diverse foundation, giving the RL-based attacker specific but varied directions to optimize for. The core of the RL-based attacker training uses a targeted rule-based reward function for each example, ensuring that each attack aligns with a specific adversarial goal. Moreover, to prevent the RL attacker from converging on similar attack strategies, a diversity reward is implemented that focuses on stylistic differences in generated prompts. Multi-step RL allows the attacker to iterate on its own attacks and be rewarded for successfully generating new and varied types of attacks—leading to a more comprehensive red teaming system. This process helps identify the model’s vulnerabilities while ensuring that the diversity of adversarial examples closely mirrors those that could be encountered in real-world situations.
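
The sketch below illustrates one way a rule-based success reward and a diversity bonus could be combined for such an attacker. The embedding-based cosine metric and the linear mixing weight are illustrative assumptions, not OpenAI's exact formulation.

```python
import numpy as np

def combined_red_team_reward(attack_embedding, past_embeddings, rbr_score, div_weight=0.5):
    """rbr_score: e.g. 1.0 if the rule-based grader marks the attack as successful.
    past_embeddings: embeddings of the attacker's previous prompts."""
    if len(past_embeddings) == 0:
        diversity = 1.0
    else:
        sims = [
            float(np.dot(attack_embedding, p)
                  / (np.linalg.norm(attack_embedding) * np.linalg.norm(p)))
            for p in past_embeddings
        ]
        diversity = 1.0 - max(sims)   # reward being far from the closest past attack
    return float(rbr_score) + div_weight * diversity
```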

The significance of this red teaming approach lies in its ability to address both the effectiveness and diversity of attacks, a duality that has been a long-standing challenge in automated adversarial generation. By using multi-step RL and automated rewards, the approach allows the generated attacks to be diverse and relevant. The authors demonstrated their approach on two key applications: prompt injection attacks and “jailbreaking” attacks that elicit unsafe responses. In both scenarios, the multi-step RL-based attacker showed improved effectiveness and diversity of attacks compared to previous methods. Specifically, the indirect prompt injection, which can trick a model into generating unintended behavior, achieved a high attack success rate and was notably more varied in style compared to one-shot prompting methods. Overall, the proposed method was able to generate attacks with an attack success rate of up to 50%, while achieving substantially higher diversity metrics than prior approaches. This combination of automated reward generation and reinforcement learning provides a nuanced mechanism for probing model robustness and ultimately improving the LLM’s defenses against real-world threats.

Conclusion

The proposed red teaming approach offers a direction for automated adversarial testing of LLMs, addressing previous limitations involving trade-offs between attack diversity and effectiveness. By leveraging both automated goal generation and multi-step RL, this methodology allows for a more detailed exploration of the vulnerabilities present in LLMs, ultimately helping to create safer and more robust models. While the results presented are promising, there are still limitations and areas for further research, particularly in refining the automated rewards and optimizing training stability. Nevertheless, the combination of RL with rule-based rewards and diversity-focused training marks an important step in adversarial testing, providing a model that can better respond to the evolving nature of attacks.


Top Reinforcement Learning Courses
https://www.marktechpost.com/2024/09/16/top-reinforcement-learning-courses/
Mon, 16 Sep 2024

Reinforcement learning (RL) enables machines to learn from their actions and make decisions through trial and error, similar to how humans learn. It’s the foundation of AI systems that can solve complex tasks, such as playing games or controlling robots, without being explicitly programmed. Learning RL is valuable because it opens doors to building smarter, autonomous systems and advances our understanding of AI. This article, therefore, lists the top courses on Reinforcement Learning that provide comprehensive knowledge, practical implementation, and hands-on projects, helping learners grasp the core concepts, algorithms, and real-world applications of RL.

Reinforcement Learning Specialization (University of Alberta)

This course series on Reinforcement Learning teaches you how to build adaptive AI systems through trial-and-error interactions. You’ll explore foundational concepts like Markov Decision Processes, value functions, and key RL algorithms like Q-learning and Policy Gradients. By the end, you’ll be able to implement a complete RL solution and apply it to real-world problems such as game development, customer interaction, and more.

Decision Making and Reinforcement Learning (Columbia University)

This course introduces sequential decision-making and reinforcement learning. It starts with utility theory and models simple problems as multi-armed bandit problems. You’ll explore Markov decision processes (MDPs), partial observability, and POMDPs. The course covers key RL methods like Monte Carlo and temporal difference learning, emphasizing algorithms and practical examples.

Deep Learning and Reinforcement Learning (IBM)

This course introduces deep learning and reinforcement learning, two key areas of machine learning. You’ll start with neural networks and deep learning architectures, then explore reinforcement learning, where algorithms learn through rewards.

Reinforcement Learning (RWTHx)

This course introduces you to the world of Reinforcement Learning (RL), where machines learn by interacting with their environment, much like how humans learn through trial and error. You will start by building a solid mathematical foundation of RL concepts, followed by modern deep RL algorithms. Through hands-on exercises and programming examples, you’ll gain a deep understanding of key RL methods like Markov decision processes, dynamic programming, and temporal-difference methods.

Reinforcement Learning from Human Feedback (Deeplearning.ai)

This course provides an introduction to Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. You’ll explore the RLHF process, work with preference and prompt datasets, and use Google Cloud tools to fine-tune the Llama 2 model. Finally, you’ll compare the tuned model with the base LLM using loss curves and the Side-by-Side (SxS) method.

Fundamentals of Deep Reinforcement Learning (LVx)

This course provides an introduction to Reinforcement Learning (RL), starting from fundamental concepts and building up to Q-learning, a key RL algorithm. In Part II, you will implement Q-learning using neural networks, exploring the “Deep” in Deep Reinforcement Learning. The course covers the theoretical foundation of RL, practical implementations in Python, the Bellman Equation, and enhancements to the Q-Learning algorithm.

Reinforcement Learning beginner to master – AI in Python (Udemy)

This course aims to provide a comprehensive understanding of the Reinforcement Learning (RL) paradigm and its ideal applications. You’ll learn to approach and solve cognitive tasks using RL and evaluate various RL methods to choose the most suitable one. The course teaches how to implement RL algorithms from scratch, understand their learning processes, debug and extend them, and explore new RL algorithms from research papers for advanced learning.

Artificial Intelligence 2.0: AI, Python, DRL + ChatGPT Prize (Udemy)

This course focuses on advanced techniques in Deep Reinforcement Learning (DRL). You’ll learn key algorithms such as Q-Learning, Deep Q-Learning, Policy Gradient, Actor-Critic, Deep Deterministic Policy Gradient (DDPG), and Twin-Delayed DDPG (TD3). The course emphasizes foundational DRL techniques and teaches how to implement state-of-the-art AI models that excel in virtual applications.

Reinforcement Learning – YouTube Playlist (YouTube)

This YouTube playlist provides a step-by-step introduction to Q-learning, a key reinforcement learning algorithm. It begins with building a Q-table for managing state-action pairs in environments like OpenAI Gym’s MountainCar. The series covers Q-learning theory and practical Python implementations, then moves on to more advanced topics like Deep Q-learning and Deep Q Networks (DQN). The focus is on explaining the core concepts, using Python to create agents that learn optimal strategies over time.
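
For a flavour of what such a Q-table looks like in practice, here is a minimal tabular Q-learning loop for MountainCar, written against the Gymnasium API; the bin counts and hyperparameters are illustrative choices, not the playlist's exact code.

```python
import gymnasium as gym
import numpy as np

env = gym.make("MountainCar-v0")
bins = 20
low, high = env.observation_space.low, env.observation_space.high
q_table = np.random.uniform(-2, 0, (bins, bins, env.action_space.n))

def discretize(state):
    # Map the continuous (position, velocity) observation onto a 20x20 grid.
    ratios = (state - low) / (high - low)
    return tuple(np.clip((ratios * bins).astype(int), 0, bins - 1))

alpha, gamma, eps = 0.1, 0.95, 0.1
for episode in range(5000):
    state, _ = env.reset()
    s, done = discretize(state), False
    while not done:
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(q_table[s]))
        next_state, reward, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        ns = discretize(next_state)
        # Temporal-difference update toward the Bellman target.
        target = reward + gamma * np.max(q_table[ns]) * (not terminated)
        q_table[s + (a,)] += alpha * (target - q_table[s + (a,)])
        s = ns
```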

Deep Reinforcement Learning (Udacity)

This program focuses on mastering Deep Reinforcement Learning (DRL) techniques. Through courses on value-based, policy-based, and multi-agent RL, students learn classical solution methods like Monte Carlo and temporal difference and apply deep learning architectures to real-world problems. Projects include training agents for tasks like virtual navigation, financial trading, and multi-agent competition. With practical projects, students gain hands-on experience in advanced RL techniques such as Proximal Policy Optimization (PPO) and Actor-Critic methods, preparing them for complex applications in AI.

AWS DeepRacer Course (Udacity)

This course offers a hands-on introduction to Reinforcement Learning (RL) through the exciting application of autonomous driving with AWS DeepRacer. You’ll explore key RL concepts like agents, actions, environments, states, and rewards and see how they come together to train a virtual car. By experimenting with different parameters, hyperparameters, and reward functions, you’ll learn how to optimize your model’s performance. Finally, you’ll deploy your model in real-world settings, bridging the gap between simulations and actual environments.


We make a small profit from purchases made via referral/affiliate links attached to each course mentioned in the above list.

If you want to suggest any course that we missed from this list, then please email us at asif@marktechpost.com

Unraveling Human Reward Learning: A Hybrid Approach Combining Reinforcement Learning with Advanced Memory Architectures
https://www.marktechpost.com/2024/08/10/unraveling-human-reward-learning-a-hybrid-approach-combining-reinforcement-learning-with-advanced-memory-architectures/
Sat, 10 Aug 2024

Human reward-guided learning is often modeled using simple RL algorithms that summarize past experiences into key variables like Q-values, representing expected rewards. However, recent findings suggest that these models oversimplify the complexity of human memory and decision-making. For instance, individual events and global reward statistics can significantly influence behavior, indicating that memory involves more than just summary statistics. ANNs, particularly RNNs, offer a more expressive model by capturing long-term dependencies and intricate learning mechanisms, though they are often less interpretable than traditional RL models.

Researchers from institutions including Google DeepMind, the University of Oxford, Princeton University, and University College London studied human reward-learning behavior using a hybrid approach combining RL models with ANNs. Their findings suggest that human behavior cannot be adequately explained by algorithms that only incrementally update choice variables. Instead, human reward learning relies on a flexible memory system that forms complex representations of past events over multiple timescales. By iteratively replacing components of a classic RL model with ANNs, they uncovered insights into how experiences shape memory and guide decision-making.

A dataset was gathered from a reward-learning task involving 880 participants. In this task, participants repeatedly chose between four actions, each rewarded based on noisy, drifting reward magnitudes. After filtering, the study included 862 participants and 617,871 valid trials. Most participants learned the task by consistently choosing actions with higher rewards. This extensive dataset enabled significant behavioral variance extraction using RNNs and hybrid models, outperforming basic RL models in capturing human decision-making patterns.

The data was initially modeled using a traditional RL model (Best RL) and a flexible Vanilla RNN. Best RL, identified as the most effective among incremental-update models, employed a reward module to update Q-values and an action module for action perseverance. However, its simplicity limited its expressivity. The Vanilla RNN, which processes actions, rewards, and latent states together, predicted choices more accurately (68.3% vs. 58.9%). Further hybrid models like RL-ANN and Context-ANN, while improving upon Best RL, still fell short of Vanilla RNN. Memory-ANN, incorporating recurrent memory representations, matched Vanilla RNN’s performance, suggesting that detailed memory use was key to participants’ learning in the task.
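
To make the distinction concrete, the sketch below implements a minimal incremental-update ("Best RL"-style) choice model with a reward module (delta-rule Q-value updates) and an action module (a perseverance trace). The parameter values and exact parameterization are illustrative assumptions, not the paper's fitted model; the study's hybrid variants replace pieces of such an update with recurrent networks.

```python
import numpy as np

def choice_probabilities(q, persev, beta_q=3.0, beta_p=1.0):
    """Softmax over a mixture of learned Q-values and an action-perseverance trace."""
    logits = beta_q * q + beta_p * persev
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def update(q, persev, action, reward, alpha_q=0.3, alpha_p=0.3):
    """One trial of the incremental-update model: reward learning plus perseverance."""
    q, persev = q.copy(), persev.copy()
    q[action] += alpha_q * (reward - q[action])   # delta-rule update of the chosen Q-value
    persev *= (1.0 - alpha_p)                     # decay the perseverance trace
    persev[action] += alpha_p                     # boost the action just taken
    return q, persev
```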

The study reveals that traditional RL models, which rely solely on incrementally updated decision variables, fall short in predicting human choices compared to a novel model incorporating memory-sensitive decision-making. This new model distinguishes between decision variables that drive choices and memory variables that modulate how these decision variables are updated based on past rewards. Unlike RL models, where decision and learning variables are intertwined, this approach separates them, providing a clearer understanding of how learning influences choices. The model suggests that human learning is influenced by compressed memories of task history, reflecting both short- and long-term reward and action histories, which modulate learning independently of how they are implemented.

Memory-ANN, the proposed modular cognitive architecture, separates reward-based learning from action-based learning, supported by evidence from computational models and neuroscience. The architecture comprises a “surface” level of decision rules that processes observable data and a “deep” level that handles complex, context-rich representations. This dual-layer system allows for flexible, context-driven decision-making, suggesting that human reward learning involves both simple surface-level processes and deeper memory-based mechanisms. These findings support the view that models with rich representations are needed to capture the full spectrum of human behavior, particularly in learning tasks. The insights gained here could have broader applications, extending to various learning tasks and cognitive science.


REBEL: A Reinforcement Learning RL Algorithm that Reduces the Problem of RL to Solving a Sequence of Relative Reward Regression Problems on Iteratively Collected Datasets
https://www.marktechpost.com/2024/04/30/rebel-a-reinforcement-learning-rl-algorithm-that-reduces-the-problem-of-rl-to-solving-a-sequence-of-relative-reward-regression-problems-on-iteratively-collected-datasets/
Tue, 30 Apr 2024

Initially designed for continuous control tasks, Proximal Policy Optimization (PPO) has become widely used in reinforcement learning (RL) applications, including fine-tuning generative models. However, PPO’s effectiveness relies on multiple heuristics for stable convergence, such as value networks and clipping, making its implementation sensitive and complex. Despite this, RL demonstrates remarkable versatility, transitioning from tasks like continuous control to fine-tuning generative models. Yet, adapting PPO, originally meant to optimize two-layer networks, to fine-tune modern generative models with billions of parameters raises concerns. This necessitates storing multiple models in memory simultaneously and raises questions about the suitability of PPO for such tasks. Also, PPO’s performance varies widely due to seemingly trivial implementation details. This raises the question: Are there simpler algorithms that scale to modern RL applications?

Policy Gradient (PG) methods, renowned for their direct, gradient-based policy optimization, are pivotal in RL. Divided into two families, PG methods based on REINFORCE often incorporate variance reduction techniques, while adaptive PG techniques precondition policy gradients to ensure stability and faster convergence. However, computing and inverting the Fisher Information Matrix in adaptive PG methods like TRPO pose computational challenges, leading to coarse approximations like PPO. 

Researchers from Cornell, Princeton, and Carnegie Mellon University introduce REBEL: REgression to RElative REward Based RL. The algorithm reduces policy optimization to regressing the relative reward between two completions of a prompt via direct policy parameterization, enabling a strikingly lightweight implementation. Theoretical analysis reveals REBEL as a foundation for RL algorithms like Natural Policy Gradient, matching the strongest theoretical guarantees for convergence and sample efficiency. REBEL also accommodates offline data and addresses the intransitive preferences that are common in practice.

The researchers adopt the Contextual Bandit formulation for RL, which is particularly relevant for models like LLMs and Diffusion Models due to deterministic transitions. Prompt-response pairs are considered with a reward function to measure response quality. The KL-constrained RL problem is formulated to fine-tune the policy according to rewards while adhering to a baseline policy. A closed-form solution to the relative entropy problem is derived from prior research work, allowing the reward to be expressed as a function of the policy. REBEL iteratively updates the policy based on a square loss objective, utilizing paired samples to approximate the partition function. This core REBEL objective aims to fit the relative rewards between response pairs, ultimately seeking to solve the KL-constrained RL problem.
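
Based on the description above, a minimal sketch of the per-pair square-loss objective might look like the following; the scaling by 1/η and the exact pairing and normalization details are paraphrased from the text and may differ slightly from the paper's formulation.

```python
import torch

def rebel_pair_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                    reward_a, reward_b, eta=1.0):
    """Regress the scaled difference of policy log-ratios for two completions
    (a, b) of the same prompt onto the difference of their rewards."""
    pred = (1.0 / eta) * ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b))
    target = reward_a - reward_b
    return (pred - target).pow(2).mean()
```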

The comparison between REBEL, SFT, PPO, and DPO for models trained with LoRA reveals REBEL’s superior performance in RM score across all model sizes, albeit with a slightly larger KL divergence than PPO. Notably, REBEL achieves the highest win rate under GPT-4 when evaluated against human references, indicating the advantage of regressing relative rewards. There is a trade-off between reward model score and KL divergence: REBEL exhibits higher divergence but achieves larger RM scores than PPO, especially toward the end of training.

In conclusion, this research presents REBEL, a simplified RL algorithm that tackles the RL problem by solving a series of relative reward regression tasks on sequentially gathered datasets. Unlike policy gradient approaches, which often rely on additional networks and heuristics like clipping for optimization stability, REBEL focuses on driving down training error on a least squares problem, making it remarkably straightforward to implement and scale. Theoretically, REBEL aligns with the strongest guarantees available for RL algorithms in agnostic settings. In practice, REBEL demonstrates competitive or superior performance compared to more complex and resource-intensive methods across language modeling and guided image generation tasks.


Emerging Trends in Reinforcement Learning: Applications Beyond Gaming
https://www.marktechpost.com/2024/04/16/emerging-trends-in-reinforcement-learning-applications-beyond-gaming/
Wed, 17 Apr 2024

Reinforcement Learning (RL) is expanding its footprint, finding innovative uses across various industries far beyond its origins in gaming. Let’s explore how RL drives significant advancements in finance, healthcare, robotics, autonomous vehicles, and smart infrastructure.

Finance

In finance, RL algorithms are revolutionizing investment strategies and risk management. They make sequential decisions by observing market states, selecting actions, and adjusting strategies based on rewards. Despite their potential, RL models in finance grapple with the uncertainties of financial markets and ethical concerns regarding automated trading systems.

Key Features in Finance:

  • Portfolio Management: Automating the distribution of assets to maximize returns based on predicted market conditions.
  • Algorithmic Trading: Executing high-speed trades based on learned strategies from vast market data.
  • Risk Assessment: Evaluating potential financial risks in real-time to make informed decisions.

Healthcare

Healthcare has seen promising RL applications, particularly in personalized medicine and patient management. RL models process complex data to optimize treatment plans, predict patient trajectories, and manage resources efficiently, promising to transform patient care with data-driven precision.

Key Features in Healthcare:

  • Personalized Treatment Plans: Tailoring medical treatments based on individual patient data to improve outcomes.
  • Robotic Surgery: Enhancing surgical robots’ precision and adaptability in complex procedures.
  • Medical Diagnostics: Improving diagnostic accuracy through continuous learning from diverse patient data.

Robotics

Robotics leverages RL to develop sophisticated autonomous machines capable of assembly, navigation, and complex manipulation tasks. This includes advanced techniques like model-based RL, imitation learning, and hierarchical RL, which enhance robots’ adaptability and efficiency in dynamic environments.

Key Features in Robotics:

  • Automated Warehousing: Optimizing warehouse logistics through intelligent robotic systems that adapt to changing inventory and demand.
  • Service Robots: Improving interaction and service delivery in retail and hospitality through robots trained to understand and respond to human activities.
  • Advanced Manufacturing: Enabling robots to handle intricate assembly tasks with high precision and minimal human intervention.

Autonomous Vehicles

RL is crucial in the evolution of autonomous vehicles. It empowers self-driving cars with capabilities for dynamic navigation, decision-making, and operational control under varying conditions, enhancing road safety and efficiency.

Key Features in Autonomous Vehicles:

  • Dynamic Navigation Systems: Enabling AVs to navigate complex urban and highway scenarios adaptively.
  • Real-time Decision Making: Optimizing routes and driving decisions based on traffic conditions, weather, and onboard sensor data.
  • Safety Enhancements: Continuously learning and updating safety protocols to handle unexpected road situations.

Smart Cities

In urban planning, RL is used to optimize traffic management systems. Algorithms control traffic signals, reducing congestion based on real-time data regarding traffic flow, peak times, and other urban dynamics, demonstrating a significant impact on city mobility.

Key Features in Smart Cities:

  • Traffic Signal Control: Adapting traffic lights in real-time to reduce congestion and improve flow during varying traffic volumes.
  • Energy Management: Optimizing energy distribution and consumption in urban areas to enhance efficiency and reduce waste.
  • Public Safety Monitoring: Utilizing RL in surveillance systems to enhance public safety through dynamic response strategies.

Customer Interaction

RL has transformed customer service through more responsive, intelligent chatbots and virtual assistants. These systems learn from interactions to improve their understanding and response to customer queries, enhancing the user experience.


Challenges and Possible Future Developments

While RL’s potential is vast, it faces challenges like data dependency, complexity in training, and the need for robust models that can generalize across different environments. Future developments aim to refine these algorithms for better adaptability and reduced reliance on large datasets, enhancing their practicality in real-world applications.

Conclusion

Reinforcement learning is a key driver of innovation across numerous fields, extending well beyond its gaming origins. Its ability to learn and optimize complex decision-making processes makes it invaluable in tackling varied industrial challenges. As RL technology continues to evolve, its integration into more sectors is anticipated, promising further transformative impacts on global industries.


Recall to Imagine (R2I): A New Machine Learning Approach that Enhances Long-Term Memory by Incorporating State Space Models into Model-based Reinforcement Learning (MBRL)
https://www.marktechpost.com/2024/03/28/recall-to-imagine-r2i-a-new-machine-learning-approach-that-enhances-long-term-memory-by-incorporating-state-space-models-into-model-based-reinforcement-learning-mbrl/
Thu, 28 Mar 2024

With the recent advancements in the field of Machine Learning (ML), Reinforcement Learning (RL), which is one of its branches, has become significantly popular. In RL, an agent picks up skills to interact with its surroundings by acting in a way that maximizes the sum of its rewards. 

The incorporation of world models into RL has emerged as a potent paradigm in recent years. Agents may observe, simulate, and plan within the learned dynamics with the help of the world models, which encapsulate the dynamics of the surrounding environment. Model-Based Reinforcement Learning (MBRL) has been made easier by this integration, in which an agent learns a world model from previous experiences in order to forecast the results of its actions and make wise judgments.

One of the major issues in the field of MBRL is managing long-term dependencies. These dependencies describe scenarios in which an agent must recall distant observations to make a decision, or in which there are significant temporal gaps between the agent’s actions and their results. Current MBRL agents frequently struggle in such settings, and as a result they perform poorly on tasks requiring temporal coherence.

To address these issues, a team of researchers has suggested a unique ‘Recall to Imagine’ (R2I) method to tackle this problem and enhance the agents’ capacity to manage long-term dependency. R2I incorporates a set of state space models (SSMs) into the MBRL agent world models. The goal of this integration is to improve the agents’ capacity for long-term memory as well as their capacity for credit assignment.
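
For intuition about why state space models help here, the sketch below shows the basic linear SSM recurrence that S4-style layers build on. Real S4/S5 layers add careful initialization, discretization, and a parallel scan, so this loop is only a conceptual illustration, not R2I's implementation.

```python
import numpy as np

def linear_ssm_scan(x, A_diag, B, C):
    """h_t = A h_{t-1} + B x_t,  y_t = C h_t, with a diagonal (discretized) A.

    x: (T, d_in) input sequence, A_diag: (d_state,), B: (d_state, d_in),
    C: (d_out, d_state). The hidden state h compresses the whole history,
    which is what lets a world model carry information over long horizons.
    """
    h = np.zeros(A_diag.shape[0])
    outputs = np.zeros((x.shape[0], C.shape[0]))
    for t in range(x.shape[0]):
        h = A_diag * h + B @ x[t]
        outputs[t] = C @ h
    return outputs
```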

The team demonstrated the effectiveness of R2I through an extensive evaluation across a wide range of illustrative tasks. First, R2I sets a new state of the art on demanding memory and credit-assignment RL tasks from the POPGym and BSuite environments. R2I also demonstrates superhuman performance on the Memory Maze task, a challenging memory domain, showing its capacity to handle difficult memory-related problems.

R2I has not only performed comparably in standard reinforcement learning tasks like those in the Atari and DeepMind Control (DMC) environments, but it also excelled in memory-intensive tasks. This implies that this approach is both generalizable to different RL scenarios and effective in specific memory domains.

The team illustrated the efficiency of R2I by showing that it converges faster in wall-clock time than DreamerV3, the state-of-the-art MBRL approach. This rapid convergence makes R2I a viable option for real-world applications where time efficiency is critical, allowing it to reach the desired performance more quickly.

The team has summarized their primary contributions as follows: 

  1. R2I builds on DreamerV3 and is an MBRL agent with enhanced memory. R2I uses a modified version of S4 to manage temporal dependencies. It preserves the generality of DreamerV3 and offers up to 9 times faster computation while using fixed world-model hyperparameters across domains.
  2. On POPGym, BSuite, Memory Maze, and other memory-intensive domains, R2I performs better than its competitors. R2I even exceeds human performance in the Memory Maze, a difficult 3D environment that tests long-term memory.
  3. R2I’s performance was evaluated on RL benchmarks such as DMC and Atari. The results highlight R2I’s adaptability by showing that its improved memory capabilities do not degrade its performance across a variety of control tasks.
  4. To evaluate the effects of the design choices made for R2I, the team carried out ablation tests, providing insight into the efficiency of the system’s architecture and its individual parts.

Researchers at the University of Oxford Introduce Craftax: A Machine Learning Benchmark for Open-Ended Reinforcement Learning
https://www.marktechpost.com/2024/03/07/researchers-at-the-university-of-oxford-introduce-craftax-a-machine-learning-benchmark-for-open-ended-reinforcement-learning/
Thu, 07 Mar 2024

Building and using appropriate benchmarks is a major driver of advancement in RL algorithms. For value-based deep RL algorithms, there’s the Arcade Learning Environment; for continuous control, there’s Mujoco; and for multi-agent RL, there’s the StarCraft Multi-Agent Challenge. Benchmarks that demonstrate more open-ended dynamics, such as procedural world generation, skill acquisition and reuse, long-term dependencies, and constant learning, have emerged as part of the move towards more generic agents. Because of this, tools like MiniHack, Crafter, MALMO, and The NetHack Learning Environment have been created. 

Unfortunately, researchers without large-scale compute cannot use them because of their lengthy runtimes, which make them impractical for current methods. At the same time, RL environments written in JAX have boomed as the speed of running an end-to-end compiled RL pipeline has been fully realized. Experiments that used to take days on a huge compute cluster can now be completed in minutes on a single GPU thanks to effective parallelization, compilation, and the elimination of CPU-GPU transfers.
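
The sketch below illustrates the end-to-end compiled pattern this refers to: the whole rollout is expressed as pure JAX functions, vectorized with `vmap`, and jit-compiled, so there is no Python loop over environments and no per-step CPU-GPU transfer. The `policy` and `env_step` functions here are trivial stand-ins, not the actual Craftax API.

```python
import jax
import jax.numpy as jnp

def policy(params, state, key):
    # Stand-in: a random policy over 4 discrete actions.
    return jax.random.randint(key, (), 0, 4)

def env_step(state, action):
    # Stand-in dynamics: a pure function of (state, action); Craftax's real step differs.
    return state + 1.0, jnp.float32(1.0), False

def rollout(rng, env_state, policy_params, num_steps):
    def step(carry, _):
        rng, state = carry
        rng, key = jax.random.split(rng)
        action = policy(policy_params, state, key)
        state, reward, done = env_step(state, action)
        return (rng, state), reward
    (_, _), rewards = jax.lax.scan(step, (rng, env_state), None, length=num_steps)
    return rewards

# Thousands of environments in parallel: vmap over the batch, then jit the result.
batched_rollout = jax.jit(jax.vmap(rollout, in_axes=(0, 0, None, None)),
                          static_argnums=3)
```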

To unite these two schools of thought, a recent study by the University of Oxford and University College London introduces the Craftax benchmark, a JAX-based environment that runs orders of magnitude faster than comparable ones while displaying intricate, open-ended dynamics. One concrete example is Craftax-Classic, a JAX reimplementation of Crafter that runs roughly 250 times faster than the original Python version.

The researchers demonstrate that a basic PPO agent can solve Craftax-Classic (reaching 90% of the maximum return) in 51 minutes, thanks to easy access to far more timesteps. They also offer Craftax, a far more difficult environment that borrows mechanics from NetHack and, more broadly, the roguelike genre; it is designed to be harder while keeping a fast runtime, and it introduces a wide variety of new game mechanics. Since the use of pixels mainly adds a layer of representation learning to the problem, and many of the qualities Crafter examines (exploration, memory) do not depend on the precise form of the observation, they provide Craftax variants with both symbolic and pixel-based observations; the symbolic version is around ten times faster.

The results of their tests reveal that the currently available approaches perform poorly on Craftax. Therefore, the team hopes it allows experimentation with constrained computational resources while posing a substantial challenge for future RL research.

The team hopes that Craftax-Classic will offer a smooth introduction to Craftax for individuals who are already familiar with the Crafter standard. 

