OPD Has Foresight: Why Distillation Beats RL and How to Exploit It for a 3× Speedup

01 — The Exploration Tax

RL explores mountains it doesn't need to climb

Post-training a large language model for reasoning boils down to navigating a high-dimensional parameter space to find a region where the model reliably solves hard problems. Reinforcement Learning (RL — GRPO, PPO, DAPO) does this well but expensively: it needs many rollouts to discover which directions actually improve reasoning, and it wastes update budget on layers that don't matter.

On-policy distillation (OPD) uses a stronger teacher to generate supervision, and practitioners have long observed it converges faster. The standard explanation — "denser supervision" — is macroscopic and unsatisfying. This paper asks: what actually happens at the parameter level?

Figure: parameter-space navigation, RL vs OPD trajectories

Even under the same norm constraint, OPD achieves substantially higher reasoning gains per unit of parameter change than RL. In other words, RL updates carry "dead weight": components that inflate the update norm without improving reasoning.

When you scale the OPD or RL update by a factor α ∈ [0, 1], that is, evaluate W_base + α·∆W, and measure the resulting accuracy, OPD's curve rises steeply while RL's is shallower: identical norm, different signal density. The mystery is why OPD is so much more precise.
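The scaling probe above can be sketched as a linear interpolation between the base and tuned weights. This is a minimal sketch, assuming weights are stored as a dict of NumPy arrays keyed by module name; `scale_update`, `accuracy_curve`, and the `evaluate` callback are hypothetical names for illustration, not part of the paper's code.

```python
import numpy as np

def scale_update(w_base, w_tuned, alpha):
    """Return W_base + alpha * (W_tuned - W_base) for every module."""
    return {name: w_base[name] + alpha * (w_tuned[name] - w_base[name])
            for name in w_base}

def accuracy_curve(w_base, w_tuned, evaluate,
                   alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Score the interpolated model at each alpha.

    `evaluate` is a hypothetical callback mapping a weight dict to a
    scalar accuracy; it stands in for a real reasoning benchmark.
    """
    return [(a, evaluate(scale_update(w_base, w_tuned, a))) for a in alphas]
```

Running `accuracy_curve` once with the OPD checkpoint and once with the RL checkpoint gives the two curves being compared; a steeper curve means more accuracy per unit of update norm.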

02 — Functional Redundancy Avoidance

OPD ignores the layers that don't reason

The first dimension of OPD's foresight operates at the module level. To locate where meaningful updates go, the authors ran a sliding-window intervention: for each block of layers, they inject only that block's RL or OPD update and measure the resulting reasoning accuracy. The picture is clear:

MLP modules are far more important for reasoning than attention modules — they serve as the primary knowledge carriers.
Middle layers (layers ~5–10 out of 32 in Qwen3-8B) produce the biggest accuracy jumps. Bottom and top layers contribute little.
Embedding layer replacement has negligible effect on reasoning — embeddings are basically inert for task performance after pretraining.
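The sliding-window intervention behind these observations can be sketched as follows. This is an illustrative sketch only: `window_intervention`, `layer_of` (a mapping from module name to layer index, returning None for non-layer modules such as embeddings), and `evaluate` are assumed names, and the window size is a free choice here rather than a value taken from the source.

```python
def window_intervention(w_base, w_tuned, num_layers, window, evaluate, layer_of):
    """Inject the tuned update into one block of layers at a time.

    For each window [start, start + window), modules inside the window take
    their tuned weights (base + update); every other module keeps its base
    weights. Returns a list of (start_layer, score) pairs.
    """
    results = []
    for start in range(num_layers - window + 1):
        block = range(start, start + window)
        patched = {
            name: (w_tuned[name] if layer_of(name) in block else w_base[name])
            for name in w_base
        }
        results.append((start, evaluate(patched)))
    return results
```

Restricting `w_tuned` to only MLP or only attention modules before calling the function would give the module-type comparison in the first bullet.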

Figure: layer-wise update norm on Qwen3-8B, RL vs OPD

Middle layers (shaded region) produce the largest reasoning gains — OPD concentrates its budget there. RL spreads updates more evenly, accumulating large norms in peripheral layers with low marginal utility.

RL and OPD share nearly identical sensitivity patterns — they both "know" which layers matter for reasoning. But RL dumps large updates into the low-sensitivity peripheral layers anyway, generating redundancy. OPD suppresses updates in those layers early, concentrating its entire budget on the middle-layer MLPs where each delta actually moves accuracy.

Property 1 — Functional Redundancy Avoidance: OPD identifies modules with low marginal utility early in training and suppresses their parameter updates, concentrating updates on reasoning-critical modules. RL accumulates large redundant updates in these peripheral regions.

A critical data point: on the 8B model, the Top-1% subspace of OPD's update captures 94.7% of the update energy, versus 88.5% for RL. More energy in fewer, better-aligned directions means less of the update budget is wasted.

03 — Early Low-Rank Lock-in

The right direction is found at 10% training progress

The second dimension of foresight is geometric: OPD's parameter updates form a low-rank structure that stabilizes early. To measure this, the authors applied SVD to the update matrix ∆W = W_final − W_base and tracked several metrics across training:

Effective rank: OPD's update has lower effective rank (2341 vs 2754 for RL on 8B), meaning fewer dimensions carry the action.
Top-1% subspace norm ratio: OPD concentrates 94.7% of its update energy in the top 1% of directions; RL only 88.5%.
Subspace alignment: At each training step, cosine similarity of the dominant subspace to the final checkpoint's subspace is computed — OPD reaches high alignment within the first 10% of training. RL alignment grows slowly over the full training run.
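The subspace alignment metric in the last bullet can be sketched with NumPy. The source does not spell out the exact formula, so this sketch assumes one common choice: take the top-k left singular vectors of each update and average the cosines of the principal angles between the two subspaces. The function names and the choice of k are illustrative assumptions.

```python
import numpy as np

def top_left_subspace(delta_w, k):
    """Orthonormal basis for the top-k left singular directions of delta_w."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]

def subspace_alignment(delta_t, delta_final, k):
    """Mean cosine of the principal angles between two top-k subspaces.

    ASSUMPTION: this definition is one plausible reading of "cosine
    similarity of the dominant subspace"; the paper may use another.
    Returns a value in [0, 1]; 1 means the subspaces coincide.
    """
    u_t = top_left_subspace(delta_t, k)
    u_f = top_left_subspace(delta_final, k)
    # Singular values of U_t^T U_f are the cosines of the principal angles.
    cosines = np.linalg.svd(u_t.T @ u_f, compute_uv=False)
    return float(cosines.mean())
```

Here `delta_t` would be W_t − W_base for an intermediate checkpoint and `delta_final` would be W_final − W_base, computed per weight matrix.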

Figure: subspace alignment to final direction vs training progress. OPD locks in early; RL explores the whole way.

By 10% training, OPD's dominant subspace already has cosine similarity ~0.72 with the final direction. RL doesn't reach similar alignment until ~60–70% of training.

The decisive experiment: take an OPD checkpoint at only 10% training progress, preserve its update direction per module, but rescale each module's Frobenius norm to match the final checkpoint's norm. The resulting model recovers approximately 80% of the final reasoning performance. The direction was already correct — only the magnitude was missing.

Magnitude scaling experiment (per module): W_scaled = W_base + (∆W_10% / ‖∆W_10%‖_F) × ‖∆W_final‖_F, which recovers ~80% of final reasoning accuracy at only 10% training progress.
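The rescaling experiment can be sketched in a few lines. This is a minimal sketch, assuming each checkpoint is a dict of NumPy arrays keyed by module name; `rescale_to_final_norm` and the small `eps` guard against division by zero are illustrative additions, not details from the source.

```python
import numpy as np

def rescale_to_final_norm(w_base, w_early, w_final, eps=1e-12):
    """Keep each module's early update direction, borrow the final norm.

    For every module: W_base + (dW_early / ||dW_early||_F) * ||dW_final||_F.
    """
    rescaled = {}
    for name in w_base:
        d_early = w_early[name] - w_base[name]
        d_final = w_final[name] - w_base[name]
        # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
        scale = np.linalg.norm(d_final) / (np.linalg.norm(d_early) + eps)
        rescaled[name] = w_base[name] + scale * d_early
    return rescaled
```

Because only the norm of the final update is used, not its direction, any accuracy the rescaled model recovers is attributable to the direction found at the early checkpoint.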

Property 2 — Early Low-Rank Lock-in: OPD stabilizes its dominant update subspaces early, with high alignment to the final solution from the first 10% of training. Subsequent training primarily amplifies magnitude along these locked-in directions rather than changing direction.

This structurally explains Property 1: because OPD locks into a small set of high-quality directions early, it automatically avoids redundant updates elsewhere — peripheral modules simply never receive much gradient energy. RL, lacking this directional stability, keeps exploring new directions throughout training, accumulating redundancy in the process.

04 — EffOPD: Extrapolating the Foresight

If the direction is correct at 10%, why train to 100%?

The two properties above suggest an obvious shortcut: since OPD's update direction stabilizes early and subsequent training just increases magnitude along the same trajectory, we should be able to extrapolate — jump further along the known-good direction instead of grinding out each training step.

EffOPD does exactly this. It runs a lightweight directional extrapolation at exponentially spaced checkpoints (t = 1, 2, 4, 8, 16, …), each time estimating the current update direction and testing whether a larger leap produces better validation performance:

EffOPD algorithm, step by step

Run OPD normally until checkpoint t = 2ⁿ

Standard on-policy distillation — student generates samples, teacher provides token-level KL supervision. Checkpoints are at exponentially spaced steps: 1, 2, 4, 8, …

t ∈ {1, 2, 4, 8, 16, …} # exponential schedule

Estimate the local update direction Δ_n

For the first checkpoint (n=0), use W₁ − W₀ as the direction. For n ≥ 1, use the displacement between adjacent exponential checkpoints — this captures the accumulated update trend.

Δ_n = W_{2^n} − W_{2^{n−1}} for n ≥ 1; Δ_0 = W₁ − W₀

Generate 5 extrapolation candidates

Project forward along Δ_n with increasing step sizes 2¹, 2², …, 2⁵. Each candidate answers the question "what if we jumped 2^k times further along the current direction?"

W_cand_k = W_{2^n} + 2^k · Δ_n # k = 1..5

Validate on 50 held-out examples

Evaluate each candidate on a tiny held-out set D_v (50 samples). Accept greedily: if candidate k scores at least as well as the running best, accept it and update the running best. Stop at the first failure; there is no exhaustive search.

if score(W_cand_k) >= best_score: W_acc, best_score = W_cand_k, score(W_cand_k) # else: stop at first failure

Resume training from the accepted point

Replace the current model with the best accepted candidate. Continue OPD training. If no candidate improved (k=1 failed), fall back to vanilla OPD — EffOPD degrades gracefully.

W_current = W_acc # jump if beneficial, else stay
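The candidate generation, greedy validation, and fallback steps above can be put together in one short routine. This is a sketch under stated assumptions, not the authors' implementation: weights are a dict of floats or NumPy arrays, `score` is a hypothetical callback that evaluates a weight dict on D_v, and the running best is assumed to start at the score of the unextrapolated checkpoint.

```python
def extrapolate_step(w_curr, w_prev, score, num_candidates=5):
    """One EffOPD-style extrapolation search at an exponential checkpoint.

    w_curr is W_{2^n}; w_prev is W_{2^{n-1}} (or W_0 when n = 0).
    Returns (accepted_weights, accepted_score). If the first candidate
    fails, w_curr is returned unchanged, i.e. plain OPD continues.
    """
    delta = {name: w_curr[name] - w_prev[name] for name in w_curr}
    # ASSUMPTION: the running best is initialised with the current model.
    best_w, best_score = w_curr, score(w_curr)
    for k in range(1, num_candidates + 1):
        cand = {name: w_curr[name] + (2 ** k) * delta[name] for name in w_curr}
        cand_score = score(cand)
        if cand_score >= best_score:
            best_w, best_score = cand, cand_score
        else:
            break  # stop at the first failure, no exhaustive search
    return best_w, best_score
```

Training would then resume from the returned weights. Stopping at the first rejected candidate bounds the search at five validation passes per checkpoint and usually fewer.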


The validation set D_v serves only as a direction check — it doesn't need to be hard or representative. In ablation studies, easy, medium, and hard validation sets all give consistent signals, confirming that the check is verifying directional quality rather than fine-grained supervision.

The exponential checkpoint schedule means EffOPD runs the extrapolation search only O(log T) times over T total training steps — the overhead is tiny compared to the saved training iterations.
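To make the O(log T) overhead concrete, the following sketch counts the exponential checkpoints for a given training length and bounds the number of candidate evaluations. The helper names are illustrative; the defaults of 5 candidates and 50 validation samples come from the description above, and the bound ignores the single baseline evaluation of the current checkpoint.

```python
def checkpoint_steps(total_steps):
    """Exponentially spaced checkpoints 1, 2, 4, ... not exceeding total_steps."""
    return [2 ** n for n in range(total_steps.bit_length())]

def max_candidate_samples(total_steps, candidates=5, val_size=50):
    """Upper bound on validation samples scored across all candidates."""
    return len(checkpoint_steps(total_steps)) * candidates * val_size

print(checkpoint_steps(1000))       # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(max_candidate_samples(1000))  # 2500
```

For a 1000-step run this means 10 extrapolation searches and at most 2500 scored validation samples in total, since the greedy search often stops before all five candidates are tried.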

05 — Results: 3× Training Speedup

10 steps beat 35 steps of vanilla OPD

Experiments span model scales from 1.5B to 32B parameters (Qwen2.5 and Qwen3 families), trained on Eurus-RL-Code and DeepMath-103K, and evaluated on seven benchmarks: Codeforces, TACO, AIME24, AIME25, AIME26, MINERVA, and GPQA.

Figure: convergence curve, accuracy vs training steps (math reasoning). EffOPD reaches 80% accuracy at step 10 vs step 35 for vanilla OPD.

Schematic based on Figure 6 of the paper. EffOPD (purple) reaches convergence ~3× earlier than vanilla OPD (cyan). RL (grey) converges later and at a lower ceiling on math tasks.

Geometry of the speedup

The speedup is not uniform across methods. Fixed-extrapolation baselines (AlphaOPD, ExOPD) sometimes overshoot and destabilize. EffOPD's adaptive validation prevents this: if a candidate is too aggressive, it's rejected and training continues normally. On Qwen3-4B-Non-Thinking, EffOPD attains strong reasoning performance by the 4th training step.

Spectral structure across model scales (Table 1 numbers)

Model	Method	Spec/Frob Ratio ↑	Eff. Rank ↓	Top-1% Norm Ratio ↑
1.5B	RL	33.2%	964	78.1%
1.5B	OPD	39.6%	778	92.3%
8B	RL	32.7%	2754	88.5%
8B	OPD	36.8%	2341	94.7%
14B	RL	24.4%	3174	81.2%
14B	OPD	28.1%	2937	94.5%

OPD consistently shows higher spectral-to-Frobenius ratio (dominant directions carry more of the energy), lower effective rank (fewer dimensions carry the action), and higher Top-1% norm ratio (energy is concentrated in the top 1% of singular vectors) at every scale.
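The three table columns can be computed from the singular values of an update matrix. The source does not give exact formulas, so the definitions below are assumptions: the spectral-to-Frobenius ratio is taken as σ₁ / ‖∆W‖_F, the effective rank as the exponential of the entropy of the normalized singular values, and the Top-1% norm ratio as the Frobenius norm of the top 1% of singular directions divided by the full Frobenius norm. The paper's definitions may differ, so the numbers in the table should not be expected to match this sketch exactly.

```python
import math
import numpy as np

def spectral_metrics(delta_w, top_frac=0.01):
    """Spectral summary of a nonzero update matrix (assumed definitions)."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # sorted in descending order
    frob = float(np.sqrt(np.sum(s ** 2)))
    # Effective rank: exp of the Shannon entropy of normalized singular values.
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    eff_rank = float(np.exp(-np.sum(p * np.log(p))))
    # Number of directions in the top fraction, at least one.
    k = max(1, math.ceil(top_frac * len(s)))
    top_ratio = float(np.sqrt(np.sum(s[:k] ** 2)) / frob)
    return {
        "spec_frob_ratio": float(s[0] / frob),
        "effective_rank": eff_rank,
        "top_norm_ratio": top_ratio,
    }
```

Under these definitions a rank-1 update gives a spectral-to-Frobenius ratio of 1 and an effective rank of 1, while an update with equal singular values gives the maximum effective rank.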

06 — Takeaways

What this means for LLM post-training

For practitioners: EffOPD is plug-and-play: it requires no architectural changes, no new hyperparameters, and no extra trainable modules. If you're already running OPD, you can add EffOPD's extrapolation loop and expect ~3× fewer training steps for the same final performance. The only cost is validating up to five candidates on 50 samples at each exponential checkpoint.

For researchers: The two foresight properties (Functional Redundancy Avoidance + Early Low-Rank Lock-in) are measurable throughout training. They provide a diagnostic lens: if your fine-tuning method doesn't exhibit strong early subspace alignment, it likely has headroom. The cosine-similarity-to-final-checkpoint metric is a useful signal to track during training.

Broader implication: "Denser supervision" is the right intuition, but the mechanism is geometric. A teacher that constrains the student's update to a low-rank subspace aligned with the final solution creates a natural momentum that RL has to discover on its own. This is why distillation is structurally easier to optimize — not just statistically, but in the geometry of the loss landscape.

Open questions

Can the foresight properties be induced in RL directly — e.g., by regularizing the update matrix's effective rank?
Does the early lock-in break at very large scales (70B+) or when teacher and student are from different architectures?
Can the validation set D_v be replaced by a learned critic that predicts extrapolation success without sampling?
Do the same properties appear in RLHF (reward from human preferences), or is foresight specific to verifiable-reward settings?