A Survey of Reinforcement Learning for Large Reasoning Models
Paper: arXiv:2509.08827
Authors: Kaiyan Zhang, Yuxin Zuo, Bingxiang He, et al. (2025)
1. Introduction
Large Reasoning Models (LRMs) extend LLMs by focusing on complex problem-solving and multi-step reasoning.
This survey explores how Reinforcement Learning (RL) is reshaping LRMs, what components are essential, what challenges remain, and where the field is heading.
2. Foundational Components of RL for LLMs
Reward Design
- Verifiable rewards
  These rewards are based on tasks where correctness can be objectively checked (see the sketch below).
  - Examples include verifying a math solution, running unit tests for generated code, or checking whether output formatting matches specifications.
  - Strength: reliable and automated.
  - Limitation: not every reasoning task is easily verifiable.
 
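To make this concrete, here is a minimal sketch (not from the survey) of two verifiable rewards: an exact-match check for a math answer and a pass/fail check that runs unit tests on generated code. The function names and the subprocess-based test runner are illustrative choices, not the survey's recipe.

```python
import subprocess
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_code: str, unit_tests: str, timeout_s: int = 10) -> float:
    """Binary reward: 1.0 if the generated code passes the given unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```
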
- Learned reward models
  When ground-truth verification is not possible (e.g., evaluating free-form text quality), a separate reward model is trained (see the sketch below).
  - This model scores outputs based on human feedback or preferences.
  - Useful for subjective or open-ended tasks.
  - Risk: reward models can encode biases or be gamed by the LLM ("reward hacking").
 
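Such reward models are commonly trained with a pairwise Bradley-Terry objective on preference data (a general RLHF practice, not something specific to this survey). Below is a minimal PyTorch sketch, where `reward_model(prompt, response)` is an assumed callable that returns a scalar score.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry pairwise loss: push the chosen response's score above the rejected one's.

    `reward_model(prompt, response)` is assumed to return a scalar tensor score.
    """
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(r_chosen - r_rejected); minimized when the chosen response scores higher
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```
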
- Dense & shaped rewards
  Instead of only rewarding the final outcome, shaped rewards give intermediate feedback at each reasoning step (see the sketch below).
  - Example: partial credit for writing the correct equation even if the final math answer is wrong.
  - Benefit: faster and more stable learning.
  - Challenge: designing meaningful step-level rewards is task-dependent and labor-intensive.
 
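As a purely illustrative shaping scheme (the survey does not prescribe one), the sketch below blends step-level partial credit from an assumed `step_is_valid` verifier with the final outcome.

```python
from typing import Callable, List

def shaped_reward(
    steps: List[str],
    final_correct: bool,
    step_is_valid: Callable[[str], bool],  # assumed step-level verifier
    step_weight: float = 0.3,
) -> float:
    """Mix of process reward (fraction of valid steps) and outcome reward (final answer)."""
    if not steps:
        return 1.0 if final_correct else 0.0
    process = sum(step_is_valid(s) for s in steps) / len(steps)
    outcome = 1.0 if final_correct else 0.0
    return step_weight * process + (1.0 - step_weight) * outcome
```
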
- Unsupervised signals
  Rewards can also come from intrinsic measures, without explicit labels (see the sketch below).
  - Examples: promoting output diversity, penalizing repetition, or rewarding lower entropy (i.e., more confident generations).
  - Advantage: requires no manual labeling.
  - Drawback: weak alignment with actual task objectives.
 
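One concrete intrinsic signal, shown here only as an illustration, is a repetition penalty based on the fraction of distinct n-grams in a generation.

```python
def repetition_penalty_reward(tokens: list[str], n: int = 3) -> float:
    """Intrinsic reward in [0, 1]: fraction of distinct n-grams (less repetition -> higher reward)."""
    if len(tokens) < n:
        return 1.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)
```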
 
Policy Optimization
- Policy gradient methods (REINFORCE, PPO)
  These directly adjust the model's parameters to maximize expected reward (see the sketch below).
  - REINFORCE is simple but noisy.
  - PPO (Proximal Policy Optimization) is widely used for LLMs because it balances improvement with stability by limiting step size.
 
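The core of PPO is its clipped surrogate objective. The PyTorch sketch below shows that clipping on the policy ratio; batching and advantage computation are assumed to happen elsewhere.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO clipped surrogate loss for a batch of (token- or sequence-level) actions.

    logp_new / logp_old: log-probabilities under the current and behavior policies.
    advantages: advantage estimates (e.g., from a critic or a group baseline).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (elementwise minimum) objective, negate to get a loss to minimize
    return -torch.min(unclipped, clipped).mean()
```
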
- Critic-based approaches (actor-critic)
  In this setup, the actor generates outputs while a critic estimates their value (see the sketch below).
  - Advantage: more sample-efficient and stable.
  - Widely used in long reasoning tasks where variance is high.
 
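A typical way the critic is used is generalized advantage estimation (GAE); here is a minimal sketch, assuming per-step rewards and critic value estimates for a single trajectory.

```python
from typing import List

def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    `values` holds the critic's value estimate for each step plus a final bootstrap value
    (len(values) == len(rewards) + 1).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```
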
- Critic-free optimization
  Eliminates the critic component, relying only on policy gradients and heuristic baselines (see the sketch below).
  - Simpler and cheaper, but more prone to instability.
 
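One widely used critic-free baseline, in the spirit of GRPO-style methods, standardizes rewards within a group of rollouts for the same prompt; a minimal sketch follows.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Critic-free advantages: standardize rewards across a group of rollouts for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```
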
- Off-policy methods
  These allow training on previously collected trajectories rather than requiring live interaction (see the sketch below).
  - Example: reusing logs of model generations.
  - Key for scaling, since collecting fresh samples from large models is very expensive.
 
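Training on logged trajectories usually requires correcting for the gap between the behavior policy that produced them and the current policy. The sketch below uses truncated importance weights, one standard correction, offered only as an illustration.

```python
import torch

def off_policy_pg_loss(logp_current, logp_behavior, advantages, max_ratio: float = 5.0):
    """Off-policy policy-gradient loss with truncated importance weights.

    logp_behavior: log-probs recorded when the trajectories were generated (from logs).
    logp_current: log-probs of the same actions under the current policy.
    """
    # Importance weight pi_current / pi_behavior, clipped to limit variance
    weights = torch.exp(logp_current - logp_behavior).clamp(max=max_ratio)
    return -(weights.detach() * logp_current * advantages).mean()
```
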
- Regularization for stability and diversity
  To avoid collapse (e.g., repeating the same safe outputs), additional penalties are introduced (see the sketch below).
  - KL-divergence regularization ensures outputs don't drift too far from the base model.
  - Entropy bonuses encourage exploration and diversity in reasoning.
 
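The sketch below shows one common way to combine these terms, assuming sampled-token log-probs from the current policy and a frozen reference model, plus a precomputed entropy estimate.

```python
import torch

def regularized_loss(policy_loss, logp_policy, logp_reference, entropy,
                     kl_coef: float = 0.05, ent_coef: float = 0.01):
    """Add a KL penalty (toward a frozen reference model) and an entropy bonus to the policy loss.

    logp_policy / logp_reference: log-probs of the sampled tokens under each model.
    entropy: mean per-token entropy of the current policy's distribution.
    """
    # Simple sample-based estimate of KL(policy || reference) on the generated tokens
    kl_penalty = (logp_policy - logp_reference).mean()
    return policy_loss + kl_coef * kl_penalty - ent_coef * entropy
```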
 
Sampling Strategies
- Structured vs. dynamic sampling
  - Structured: pre-defined, fixed rollout numbers and task distributions.
  - Dynamic: adaptive sampling that focuses more on difficult tasks or uncertain cases (see the sketch below).
 
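As one illustrative dynamic-sampling rule (not a specific recipe from the survey), the sketch below gives extra rollout budget to prompts whose recent success rate is neither 0 nor 1, where the gradient signal tends to be strongest.

```python
def rollouts_per_prompt(success_rate: float, base: int = 4, extra: int = 12) -> int:
    """Allocate more rollouts to prompts of intermediate difficulty.

    success_rate: fraction of recent rollouts for this prompt that were correct.
    Prompts that are always solved or never solved give little learning signal,
    so they receive only the base budget.
    """
    uncertainty = 4.0 * success_rate * (1.0 - success_rate)  # peaks at success_rate = 0.5
    return base + round(extra * uncertainty)
```
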
- Balancing exploration vs. exploitation
  - Exploration encourages the model to try novel reasoning paths.
  - Exploitation reinforces what is already working well.
  - Proper balance ensures both creativity and reliability.
 
- Hyperparameter tuning for long reasoning chains
  Parameters like temperature, top-k, and top-p strongly affect output diversity (see the sketch below).
  - Long reasoning chains require careful control of sampling to avoid drifting into irrelevant paths.
  - Rollout length also matters: too short may cut off reasoning, too long may add noise.
 
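For concreteness, here is a self-contained sketch (not tied to any particular inference library) of sampling one token from raw logits with temperature, top-k, and top-p (nucleus) filtering.

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.8,
                 top_k: int = 50, top_p: float = 0.95) -> int:
    """Sample one token id from a 1-D logits vector using temperature, top-k, and top-p."""
    logits = logits / max(temperature, 1e-6)
    # Top-k: keep only the k highest-scoring tokens
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= top_p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[..., 1:] = cutoff[..., :-1].clone()  # shift right so the boundary token is kept
    cutoff[..., 0] = False                      # always keep the most likely token
    sorted_probs[cutoff] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```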
 
3. Foundational Problems
- RL vs. Supervised Fine-Tuning (SFT): RL sharpens reasoning beyond SFT, but SFT often provides stronger priors.
- Process vs. Outcome Rewards: process rewards encourage correct steps; outcome rewards only care about the final answer.
- Model Priors: RL interacts with pretraining biases and can either strengthen or override them.
- Training Recipes: subtle choices (batch size, rollout strategy, reward shaping) drastically affect stability.
4. Training Resources
- Static corpora: benchmark-style datasets, labeled corpora
 - Dynamic environments: interactive tasks, agents, simulators
 - Infrastructure: large-scale GPU/TPU clusters, verifier systems, scalable logging
 
5. Applications
RL for reasoning has been tested across multiple domains:
- Coding (unit-test based verifiable rewards)
 - Agentic tasks (planning, decision-making)
 - Multimodal reasoning (vision + language)
 - Multi-agent coordination
 - Medical reasoning (safe, verifiable responses)
 - Robotics (real-world sequential decision-making)
 
6. Future Directions
The survey highlights promising avenues:
- Continual RL – enabling lifelong learning
 - Memory-based RL – better context and past trajectory use
 - Model-based RL – planning with environment models
 - Latent space reasoning – efficient inference in compressed representations
 - Scientific discovery – leveraging RL for hypothesis testing and reasoning
 - LLM + Architecture Co-Design – aligning model structures with RL algorithms
 
7. Key Takeaways
- RL enhances reasoning but introduces reward design and stability challenges.
 - Verifiable rewards are critical for math, coding, and factual tasks.
 - RL is most effective when combined with supervised pretraining + careful reward shaping.
 - Applications extend beyond NLP into robotics, healthcare, and multi-agent systems.
 - Future research will need to address scalability, cost, and robust evaluation.
 
References
- Zhang, K., Zuo, Y., He, B., et al. A Survey of Reinforcement Learning for Large Reasoning Models. arXiv:2509.08827, 2025. https://arxiv.org/abs/2509.08827