RL explores mountains it doesn't need to climb
Post-training a large language model for reasoning boils down to navigating a high-dimensional parameter space to find a region where the model reliably solves hard problems. Reinforcement Learning (RL — GRPO, PPO, DAPO) does this well but expensively: it needs many rollouts to discover which directions actually improve reasoning, and it wastes update budget on layers that don't matter.
On-policy distillation (OPD) uses a stronger teacher to generate supervision, and practitioners have long observed that it converges faster than RL. The standard explanation, "denser supervision", is macroscopic and unsatisfying. This paper asks: what actually happens at the parameter level?
Even under the same norm constraint, OPD achieves substantially higher reasoning gains per unit of parameter change than RL. In other words, RL updates carry "dead weight": components that inflate the update norm without improving reasoning.
When you scale an OPD or RL update by a factor α ∈ [0, 1] and measure the resulting accuracy, OPD's curve rises steeply while RL's is shallower: identical norm, different signal density. The mystery is why OPD is so much more precise.
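The scaling experiment above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes parameters are stored as a dict of NumPy arrays, that the scaled model is `base + alpha * delta`, and that `evaluate` is a hypothetical callback returning reasoning accuracy. `normalize_to_norm` is likewise an assumed way of putting two updates on the same global norm before comparing them.

```python
import numpy as np

def apply_scaled_update(base, delta, alpha):
    """Return base + alpha * delta for every named parameter tensor."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {name: base[name] + alpha * delta[name] for name in base}

def normalize_to_norm(delta, target_norm):
    """Rescale a whole update so its global Frobenius norm equals target_norm.

    Assumes the update is not identically zero.
    """
    total = np.sqrt(sum(float(np.sum(d ** 2)) for d in delta.values()))
    return {name: d * (target_norm / total) for name, d in delta.items()}

def sweep(base, delta, evaluate, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Evaluate the model at each interpolation factor (evaluate is hypothetical)."""
    return [(a, evaluate(apply_scaled_update(base, delta, a))) for a in alphas]
```

Running `sweep` once with the normalized OPD update and once with the normalized RL update would give the two accuracy-versus-α curves being compared.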
OPD ignores the layers that don't reason
The first dimension of OPD's foresight operates at the module level. To locate where meaningful updates go, the authors ran a sliding-window intervention: for each block of layers, they inject only that block's RL or OPD update and measure the resulting reasoning accuracy. The picture is clear:
- MLP modules are far more important for reasoning than attention modules — they serve as the primary knowledge carriers.
- Middle layers (layers ~5–10 out of 32 in Qwen3-8B) produce the biggest accuracy jumps. Bottom and top layers contribute little.
- Embedding layer replacement has negligible effect on reasoning — embeddings are basically inert for task performance after pretraining.
Middle layers (shaded region) produce the largest reasoning gains — OPD concentrates its budget there. RL spreads updates more evenly, accumulating large norms in peripheral layers with low marginal utility.
RL and OPD share nearly identical sensitivity patterns — they both "know" which layers matter for reasoning. But RL dumps large updates into the low-sensitivity peripheral layers anyway, generating redundancy. OPD suppresses updates in those layers early, concentrating its entire budget on the middle-layer MLPs where each delta actually moves accuracy.
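The sliding-window intervention described above might look roughly like the following. It is a sketch under stated assumptions: parameters live in a dict of NumPy arrays, parameter names follow a `...layers.<index>...` convention (parsed by the hypothetical helper `layer_index`), and `evaluate` is a hypothetical accuracy callback. The paper's actual window size and evaluation setup are not reproduced here.

```python
import numpy as np

def layer_index(name):
    """Extract the layer number from a name like 'model.layers.7.mlp.up_proj'.

    The naming scheme is an assumption; returns None for non-layer
    parameters such as embeddings.
    """
    parts = name.split(".")
    if "layers" in parts:
        return int(parts[parts.index("layers") + 1])
    return None

def window_patched_params(base, delta, start, width):
    """Apply delta only to parameters whose layer index is in [start, start + width)."""
    patched = {}
    for name, w in base.items():
        layer = layer_index(name)
        in_window = layer is not None and start <= layer < start + width
        patched[name] = w + delta[name] if in_window else w.copy()
    return patched

def sliding_window_scores(base, delta, num_layers, width, evaluate):
    """Slide the window over all layer blocks and score each patched model."""
    scores = []
    for start in range(num_layers - width + 1):
        params = window_patched_params(base, delta, start, width)
        scores.append((start, evaluate(params)))
    return scores
```

Filtering names further (for example keeping only those containing `mlp` or only those containing `attn`) would give the module-type comparison from the list above.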
A critical data point: on the 8B model, the Top-1% subspace of OPD's update captures 94.7% of the update energy, versus 88.5% for RL. More energy in fewer, better-aligned directions means less wasted update budget.
The right direction is found at 10% training progress
The second dimension of foresight is geometric: OPD's parameter updates form a low-rank structure that stabilizes early. To measure this, the authors applied SVD to the update matrix ΔW = W_final − W_base and tracked several metrics across training:
- Effective rank: OPD's update has lower effective rank (2341 vs 2754 for RL on 8B), meaning fewer dimensions carry the action.
- Top-1% subspace norm ratio: OPD concentrates 94.7% of its update energy in the top 1% of directions; RL only 88.5%.
- Subspace alignment: At each training step, cosine similarity of the dominant subspace to the final checkpoint's subspace is computed — OPD reaches high alignment within the first 10% of training. RL alignment grows slowly over the full training run.
By 10% training, OPD's dominant subspace already has cosine similarity ~0.72 with the final direction. RL doesn't reach similar alignment until ~60–70% of training.
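The metrics above can be computed from the singular values of an update matrix. The sketch below makes several assumptions that the summary does not pin down: effective rank is taken as the exponential of the entropy of the normalized singular values, the Top-1% norm ratio is the Frobenius norm of the top 1% of singular directions divided by the full Frobenius norm, and subspace alignment is the mean cosine of the principal angles between the top-k left singular subspaces. The paper may define any of these differently.

```python
import numpy as np

def spectral_metrics(delta_w, top_frac=0.01):
    """Spectral summary of a nonzero update matrix (definitions are assumptions)."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    frob = np.sqrt(np.sum(s ** 2))
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log is defined
    eff_rank = float(np.exp(-np.sum(p * np.log(p))))
    k = max(1, int(np.ceil(top_frac * len(s))))
    return {
        "spec_frob_ratio": float(s[0] / frob),
        "effective_rank": eff_rank,
        "top_norm_ratio": float(np.sqrt(np.sum(s[:k] ** 2)) / frob),
    }

def subspace_alignment(delta_t, delta_final, k):
    """Mean cosine of principal angles between the top-k left singular subspaces."""
    u_t = np.linalg.svd(delta_t, full_matrices=False)[0][:, :k]
    u_f = np.linalg.svd(delta_final, full_matrices=False)[0][:, :k]
    cosines = np.linalg.svd(u_t.T @ u_f, compute_uv=False)
    return float(np.mean(cosines))
```

Calling `subspace_alignment(W_t - W_base, W_final - W_base, k)` at each saved checkpoint would trace an alignment curve of the kind described above, with 1.0 meaning the two subspaces coincide.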
The decisive experiment: take an OPD checkpoint at only 10% training progress, preserve its update direction per module, but rescale each module's Frobenius norm to match the final checkpoint's norm. The resulting model recovers approximately 80% of the final reasoning performance. The direction was already correct — only the magnitude was missing.
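The rescaling step in this experiment is simple to express. The following is a minimal sketch, assuming each module's update is stored as a NumPy array in a dict keyed by module name; the names `delta_early` and `delta_final` are illustrative, not the paper's.

```python
import numpy as np

def rescale_to_final_norms(delta_early, delta_final, eps=1e-12):
    """Keep each module's early update direction but give it the final Frobenius norm."""
    rescaled = {}
    for name, d_early in delta_early.items():
        early_norm = np.linalg.norm(d_early)          # Frobenius norm of the early update
        final_norm = np.linalg.norm(delta_final[name])
        # A module with (near-)zero early update has no direction to preserve.
        scale = final_norm / early_norm if early_norm > eps else 0.0
        rescaled[name] = d_early * scale
    return rescaled
```

Adding the rescaled updates to the base weights gives the model that is then evaluated against the fully trained checkpoint.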
This structurally explains the first property (module-level concentration): because OPD locks into a small set of high-quality directions early, it automatically avoids redundant updates elsewhere, and peripheral modules simply never receive much update energy. RL, lacking this directional stability, keeps exploring new directions throughout training, accumulating redundancy in the process.
If the direction is correct at 10%, why train to 100%?
The two properties above suggest an obvious shortcut: since OPD's update direction stabilizes early and subsequent training just increases magnitude along the same trajectory, we should be able to extrapolate — jump further along the known-good direction instead of grinding out each training step.
EffOPD does exactly this. It runs a lightweight directional extrapolation at exponentially spaced checkpoints (t = 1, 2, 4, 8, 16, …), each time estimating the current update direction and testing whether a larger leap along it produces better validation performance.
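The paper's algorithm listing is not reproduced in this summary, so the following is only a rough sketch of the idea, not EffOPD itself. It assumes the update direction is measured from the base weights, that candidates are `base + s * (current - base)` for a small fixed set of scales `s`, and that a candidate is kept only if it beats the current weights on validation. `train_step`, `validate`, and the scale values are all hypothetical.

```python
import numpy as np

def is_power_of_two(t):
    """True for t = 1, 2, 4, 8, ... (the exponentially spaced checkpoints)."""
    return t >= 1 and (t & (t - 1)) == 0

def try_extrapolate(theta_base, theta_now, validate, scales=(2.0, 4.0, 8.0)):
    """Leap further along theta_now - theta_base; reject leaps that do not help."""
    direction = {k: theta_now[k] - theta_base[k] for k in theta_now}
    best, best_score = theta_now, validate(theta_now)
    for s in scales:
        cand = {k: theta_base[k] + s * direction[k] for k in theta_now}
        score = validate(cand)
        if score > best_score:
            best, best_score = cand, score
    return best  # unchanged theta_now if every candidate was worse

def train_with_extrapolation(theta_base, train_step, validate, total_steps):
    """Ordinary training loop with an extrapolation check at t = 1, 2, 4, ..."""
    theta = theta_base
    for t in range(1, total_steps + 1):
        theta = train_step(theta)
        if is_power_of_two(t):
            theta = try_extrapolate(theta_base, theta, validate)
    return theta
```

The rejection branch is what keeps the sketch from overshooting: when no candidate improves validation, training simply continues from the current weights.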
The validation set Dv serves only as a direction check — it doesn't need to be hard or representative. In ablation studies, easy, medium, and hard validation sets all give consistent signals, confirming that the check is verifying directional quality rather than fine-grained supervision.
The exponential checkpoint schedule means EffOPD runs the extrapolation search only O(log T) times over T total training steps — the overhead is tiny compared to the saved training iterations.
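To make the O(log T) claim concrete, the short helper below lists the steps at which an extrapolation check would fire under a doubling schedule. The function name is illustrative, and the schedule is assumed to be exactly t = 1, 2, 4, … up to T.

```python
def extrapolation_steps(total_steps):
    """Return the exponentially spaced steps 1, 2, 4, ... that are <= total_steps."""
    steps, t = [], 1
    while t <= total_steps:
        steps.append(t)
        t *= 2
    return steps

# For a 35-step run this gives six checks: [1, 2, 4, 8, 16, 32].
print(extrapolation_steps(35))
```

The count is floor(log2(T)) + 1, so even a 1000-step run triggers only ten checks.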
10 steps beats 35 steps of vanilla OPD
Experiments span model scales from 1.5B to 32B parameters (Qwen2.5 and Qwen3 families), trained on Eurus-RL-Code and DeepMath-103K, and evaluated on seven benchmarks: Codeforces, TACO, AIME24, AIME25, AIME26, MINERVA, and GPQA.
Schematic based on Figure 6 of the paper. EffOPD (purple) reaches convergence ~3× earlier than vanilla OPD (cyan). RL (grey) converges later and at a lower ceiling on math tasks.
Geometry of the speedup
The speedup is not uniform across methods. Fixed-extrapolation baselines (AlphaOPD, ExOPD) sometimes overshoot and destabilize. EffOPD's adaptive validation prevents this: if a candidate is too aggressive, it's rejected and training continues normally. On Qwen3-4B-Non-Thinking, EffOPD attains strong reasoning performance by the 4th training step.
Spectral structure across model scales (Table 1 numbers)
| Model | Method | Spec/Frob Ratio ↑ | Eff. Rank ↓ | Top-1% Norm Ratio ↑ |
|---|---|---|---|---|
| 1.5B | RL | 33.2% | 964 | 78.1% |
| 1.5B | OPD | 39.6% | 778 | 92.3% |
| 8B | RL | 32.7% | 2754 | 88.5% |
| 8B | OPD | 36.8% | 2341 | 94.7% |
| 14B | RL | 24.4% | 3174 | 81.2% |
| 14B | OPD | 28.1% | 2937 | 94.5% |
OPD consistently shows higher spectral-to-Frobenius ratio (dominant directions carry more of the energy), lower effective rank (fewer dimensions carry the action), and higher Top-1% norm ratio (energy is concentrated in the top 1% of singular vectors) at every scale.
What this means for LLM post-training
Open questions
- Can the foresight properties be induced in RL directly — e.g., by regularizing the update matrix's effective rank?
- Does the early lock-in break at very large scales (70B+) or when teacher and student are from different architectures?
- Can the validation set Dv be replaced by a learned critic that predicts extrapolation success without sampling?
- Do the same properties appear in RLHF (reward from human preferences), or is foresight specific to verifiable-reward settings?