RL’s Razor: Why On-Policy Reinforcement Learning Forgets Less
Paper: arXiv:2509.04259
Introduction
One of the major challenges in machine learning is that models tend to lose previously acquired knowledge when adapting to new tasks. This phenomenon is known as catastrophic forgetting. A recent paper titled "RL’s Razor: Why Online Reinforcement Learning Forgets Less" offers an important insight: on-policy reinforcement learning (RL) methods consistently forget less than supervised fine-tuning (SFT) approaches when achieving the same performance on a new task.
1. Catastrophic Forgetting and How to Measure It
If a model performs well on a new task but deteriorates on older ones, forgetting has occurred. Therefore, measuring forgetting requires looking at both new task performance and retained performance on past tasks.
The paper frames this trade-off as a Pareto frontier: for a fixed new-task score, the higher the old-task score, the less forgetting has happened.
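To make the frontier concrete, here is a minimal sketch (not code from the paper; run names and scores are invented) that picks out the Pareto-optimal runs from hypothetical (new-task score, old-task score) pairs:

```python
# Minimal illustration: find Pareto-optimal fine-tuning runs given
# hypothetical (new_task_score, old_task_score) pairs.

def pareto_frontier(runs):
    """Return runs not dominated by any other run: a run is dominated if
    another run is at least as good on both axes and strictly better on one."""
    frontier = []
    for name, new, old in runs:
        dominated = any(
            (n2 >= new and o2 >= old) and (n2 > new or o2 > old)
            for _, n2, o2 in runs
        )
        if not dominated:
            frontier.append((name, new, old))
    return frontier

runs = [
    ("sft_run_a", 0.72, 0.55),
    ("sft_run_b", 0.70, 0.61),
    ("rl_run",    0.71, 0.78),  # similar new-task score, much higher retention
]
print(pareto_frontier(runs))  # sft_run_a and rl_run survive; sft_run_b is dominated
```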
2. Two Paths: SFT vs. On-Policy RL
- Supervised Fine-Tuning (SFT): The model is trained on externally provided labels or outputs. This often pushes the model away from its base distribution.
- On-Policy Reinforcement Learning (RL): The model generates its own samples, evaluates them with a reward signal, and updates accordingly. This naturally keeps the policy closer to its base distribution (π₀).
 
As a result, RL updates tend to involve smaller deviations, helping preserve prior knowledge.
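As a rough illustration of this difference (a toy sketch, not the paper's setup; the 5-way categorical "policy", rewards, and label choice are all invented), compare the two gradient signals below: SFT pushes probability toward an externally chosen label even when the base policy prefers a different correct answer, whereas an on-policy REINFORCE-style update reinforces the base policy's own rewarded samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy base policy pi_0 over 5 candidate answers. Answers 2 and 3 both
# solve the hypothetical new task, but pi_0 already prefers answer 2.
base_logits = np.array([0.0, 0.0, 2.0, 0.5, 0.0])
pi0 = softmax(base_logits)
correct = {2, 3}

# SFT-style gradient: increase log-probability of an external label (3),
# regardless of which correct answer pi_0 itself favors.
label = 3
sft_grad = np.eye(5)[label] - pi0

# On-policy RL-style (REINFORCE) gradient: sample from the current policy
# and reinforce the model's own rewarded samples.
samples = rng.choice(5, size=5000, p=pi0)
rl_grad = np.zeros(5)
for y in samples:
    reward = float(y in correct)
    rl_grad += reward * (np.eye(5)[y] - pi0)
rl_grad /= len(samples)

print("pi_0:        ", np.round(pi0, 3))
print("SFT gradient:", np.round(sft_grad, 3))  # pulls mass toward label 3
print("RL gradient: ", np.round(rl_grad, 3))   # mostly reinforces answer 2
```

Because the RL update only reinforces outputs the base policy already produces, the resulting policy change stays closer to π₀.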
3. The Key Predictor of Forgetting: KL Divergence
The central quantity in the study is the forward KL divergence (D_KL(π₀‖π)).
Formula:
D_KL(π₀‖π) = E_(x∼τ) [ Σ_y π₀(y|x) log(π₀(y|x) / π(y|x)) ]
Where:
- π₀ = base policy
- π = fine-tuned policy
- τ = new task distribution
 
Findings show that forgetting correlates strongly with forward KL divergence measured on the new task distribution. Other metrics (weight norms, gradient rank, reverse KL) were weaker predictors.
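As a rough sketch of how this quantity can be estimated in practice (assumptions: sample_from_base, logprob_base, and logprob_finetuned are placeholder hooks for whatever inference API is available; they are not from the paper):

```python
def forward_kl_on_task(prompts, sample_from_base, logprob_base,
                       logprob_finetuned, n_samples=64):
    """Monte Carlo estimate of D_KL(pi_0 || pi) averaged over new-task prompts.

    Uses the identity D_KL = E_{y ~ pi_0}[log pi_0(y|x) - log pi(y|x)]:
    sample completions from the base policy and score them under both models.
    """
    total = 0.0
    for x in prompts:
        kl_x = 0.0
        for _ in range(n_samples):
            y = sample_from_base(x)                      # y ~ pi_0(.|x)
            kl_x += logprob_base(x, y) - logprob_finetuned(x, y)
        total += kl_x / n_samples
    return total / len(prompts)
```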
4. RL’s Razor Principle
The authors coin the term "RL’s Razor" for this tendency:
Among all policies that solve a new task, on-policy RL tends to select the one closest (in KL terms) to the base policy.
This KL-minimal tendency explains why RL achieves better retention compared to SFT at the same level of new-task performance.
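Stated compactly in the notation of Section 3 (a paraphrase of the principle, not the paper's exact formalism):
π_RL ≈ argmin_{π ∈ Π*} D_KL(π₀‖π), where Π* is the set of policies that solve the new task.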
5. Oracle SFT: A Validation Experiment
The researchers introduce Oracle SFT, a scenario where labels are constructed to minimize KL divergence from the base policy while still solving the new task. Oracle SFT not only matches RL but can even forget less than RL, confirming that the true advantage lies not in the algorithm itself but in staying close to the base distribution.
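One intuitive way to build labels in this spirit (an illustrative sketch only; the paper's oracle construction may differ, and sample_from_base / is_correct are hypothetical helpers) is to sample candidate outputs from the base model itself and keep only those that solve the task, so the SFT targets stay close to π₀ by construction:

```python
def build_base_anchored_sft_data(prompts, sample_from_base, is_correct,
                                 n_candidates=32):
    """Collect SFT targets drawn from the base policy that still solve the task,
    keeping the label distribution close to pi_0 by construction."""
    dataset = []
    for x in prompts:
        for _ in range(n_candidates):
            y = sample_from_base(x)
            if is_correct(x, y):
                dataset.append((x, y))  # keep a base-policy sample that succeeds
                break
    return dataset
```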
6. Theoretical Foundations: I-Projection and KL-Minimal Solutions
From an information geometry perspective:
- I-projection (argmin_q D_KL(q‖p)) moves to the distribution within a constraint set that is closest, in KL terms, to a reference distribution p.
- Policy gradient updates can be seen as alternating projections that bias learning toward KL-minimal solutions.
 
This theoretical framing supports the empirical observation that on-policy RL forgets less.
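A toy numerical illustration of the I-projection above (illustrative only: the constraint set is taken to be distributions supported on the "correct" outcomes, and the numbers are invented): the projection of π₀ onto that set is just π₀ restricted to the correct outcomes and renormalized, and any other task-solving distribution sits farther away in KL.

```python
import numpy as np

# Toy base distribution pi_0 over 5 outcomes; outcomes 2 and 3 "solve the task".
pi0 = np.array([0.083, 0.083, 0.614, 0.137, 0.083])
correct = np.array([False, False, True, True, False])

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# I-projection of pi_0 onto {q : q puts all mass on correct outcomes}:
# pi_0 restricted to the correct set and renormalized.
projection = np.where(correct, pi0, 0.0)
projection /= projection.sum()

# Any other distribution on the correct set has strictly higher KL.
alternative = np.array([0.0, 0.0, 0.2, 0.8, 0.0])
print("projection:", np.round(projection, 3))
print("KL(projection  || pi_0):", round(kl(projection, pi0), 3))   # ~0.286
print("KL(alternative || pi_0):", round(kl(alternative, pi0), 3))  # ~1.187
```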
7. Empirical Results
The paper evaluates fine-tuning of both large language models (mathematical reasoning, scientific QA, tool use) and robot policies (pick-and-place tasks). Key results:
- On-policy RL consistently achieves better Pareto trade-offs than SFT.
- Forgetting across methods collapses onto the same curve when plotted against forward KL divergence.
 
8. Practical Implications
- Track KL divergence: Monitoring KL on new-task data can serve as an early warning for forgetting (see the sketch after this list).
- Favor on-policy RL: It achieves the same new-task performance with less forgetting.
- Design labels carefully: For SFT, consider strategies that keep outputs closer to the base distribution.
- Adopt Pareto thinking: Evaluate methods by both new-task gains and old-task retention.
- Use hybrid strategies: A short SFT phase, followed by on-policy RL, and then distillation can yield efficient, balanced results.
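As a rough illustration of the first point (reusing an estimator like the one sketched in Section 3; the hook names and threshold are arbitrary placeholders, not from the paper):

```python
KL_WARN_THRESHOLD = 0.05  # arbitrary; calibrate against your own forgetting curves

def check_forgetting_risk(step, new_task_prompts, estimate_forward_kl):
    """Periodic early-warning check: estimate D_KL(pi_0 || pi) on held-out
    new-task prompts and flag when it drifts above the threshold."""
    kl = estimate_forward_kl(new_task_prompts)
    if kl > KL_WARN_THRESHOLD:
        print(f"step {step}: forward KL {kl:.3f} above threshold; "
              "prior-task performance is likely degrading")
    return kl
```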
 
Conclusion
The RL’s Razor study demonstrates that on-policy RL forgets less because it implicitly favors KL-minimal solutions when adapting to new tasks. Both theoretical analysis and empirical evidence confirm this principle. For practitioners, the message is clear: if you want models that adapt while retaining prior capabilities, measure KL divergence, leverage on-policy RL, and design training strategies with distributional closeness in mind.