
OPD Has Foresight: Why Distillation Beats RL and How to Exploit It for a 3× Speedup

On-policy distillation (OPD) isn't just "denser supervision" — at the parameter level it locks onto the correct update direction within the first 10% of training, far earlier than RL ever does. This foresight manifests as two measurable properties: functional redundancy avoidance (OPD ignores low-utility layers early) and early low-rank lock-in (its dominant subspaces stabilize fast). EffOPD exploits both to achieve a 3× training speedup with no extra modules.

June 1, 2026 · ~14 min read · Paper: arXiv:2605.11739
01 — The Exploration Tax

RL explores mountains it doesn't need to climb

Post-training a large language model for reasoning boils down to navigating a high-dimensional parameter space to find a region where the model reliably solves hard problems. Reinforcement Learning (RL — GRPO, PPO, DAPO) does this well but expensively: it needs many rollouts to discover which directions actually improve reasoning, and it wastes update budget on layers that don't matter.

On-policy distillation (OPD) uses a stronger teacher to generate supervision, and practitioners have long observed it converges faster. The standard explanation — "denser supervision" — is macroscopic and unsatisfying. This paper asks: what actually happens at the parameter level?

Figure: parameter-space navigation, RL vs OPD (animated trajectories).

Even under the same norm constraint, OPD achieves substantially higher reasoning gains per unit of parameter change than RL. In other words, RL updates carry "dead weight": components that inflate the update norm without improving reasoning.

When you scale the OPD or RL update by a factor α ∈ [0, 1] and measure the resulting accuracy, OPD's curve rises steeply while RL's is shallower: identical norm, different signal density. The mystery is why OPD is so much more precise.
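The α-scaling probe can be sketched in a few lines. This is a minimal illustration, assuming each checkpoint is a plain dict of NumPy arrays keyed by parameter name; the helper name `scale_update` and the toy weights are hypothetical, not taken from the paper.

```python
import numpy as np

def scale_update(w_base, w_tuned, alpha):
    """Return w_base + alpha * (w_tuned - w_base) for every parameter.

    alpha = 0 gives the base model, alpha = 1 the fully trained one.
    """
    return {
        name: w_base[name] + alpha * (w_tuned[name] - w_base[name])
        for name in w_base
    }

# Toy checkpoints (hypothetical values, for illustration only).
w_base = {"mlp.weight": np.zeros((2, 2))}
w_tuned = {"mlp.weight": np.array([[2.0, 0.0], [0.0, 4.0]])}

# Half of the update applied on top of the base weights.
half = scale_update(w_base, w_tuned, alpha=0.5)
```

Sweeping `alpha` over a grid and evaluating each interpolated model yields an accuracy-versus-α curve for whichever method produced `w_tuned`.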

02 — Functional Redundancy Avoidance

OPD ignores the layers that don't reason

The first dimension of OPD's foresight operates at the module level. To locate where meaningful updates go, the authors ran a sliding-window intervention: for each block of layers, they inject only that block's RL or OPD update and measure the resulting reasoning accuracy. The picture is clear:

  • MLP modules are far more important for reasoning than attention modules — they serve as the primary knowledge carriers.
  • Middle layers (layers ~5–10 out of 32 in Qwen3-8B) produce the biggest accuracy jumps. Bottom and top layers contribute little.
  • Embedding layer replacement has negligible effect on reasoning — embeddings are basically inert for task performance after pretraining.
Figure: layer-wise update norm on Qwen3-8B, RL vs OPD.

Middle layers (shaded region) produce the largest reasoning gains — OPD concentrates its budget there. RL spreads updates more evenly, accumulating large norms in peripheral layers with low marginal utility.

RL and OPD share nearly identical sensitivity patterns — they both "know" which layers matter for reasoning. But RL dumps large updates into the low-sensitivity peripheral layers anyway, generating redundancy. OPD suppresses updates in those layers early, concentrating its entire budget on the middle-layer MLPs where each delta actually moves accuracy.
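The sliding-window intervention described above can be sketched as follows. This is an illustration under stated assumptions: checkpoints are dicts of NumPy arrays, and parameter names follow a hypothetical `layers.<index>.<module>` scheme. The paper's actual tooling is not shown in this article.

```python
import numpy as np

def layer_of(name):
    """Extract the layer index from a name like 'layers.3.mlp' (hypothetical scheme)."""
    parts = name.split(".")
    return int(parts[1]) if parts[0] == "layers" else None

def inject_window(w_base, w_tuned, start, width):
    """Use tuned weights only for layers in [start, start + width); keep base elsewhere."""
    patched = {}
    for name, base in w_base.items():
        layer = layer_of(name)
        in_window = layer is not None and start <= layer < start + width
        patched[name] = w_tuned[name] if in_window else base
    return patched

# Toy 4-layer model: base weights are 0, tuned weights are 1.
names = ["embed.weight"] + [f"layers.{i}.mlp" for i in range(4)]
w_base = {n: np.zeros(2) for n in names}
w_tuned = {n: np.ones(2) for n in names}

# Inject only the update of layers 1 and 2.
patched = inject_window(w_base, w_tuned, start=1, width=2)
```

Sliding `start` across the depth of the model and evaluating each patched model gives a per-block sensitivity profile.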

Property 1 — Functional Redundancy Avoidance: OPD identifies modules with low marginal utility early in training and suppresses their parameter updates, concentrating updates on reasoning-critical modules. RL accumulates large redundant updates in these peripheral regions.

A critical data point: on the 8B model, the top-1% subspace of OPD's update captures 94.7% of the update energy, vs 88.5% for RL. More energy in fewer, better-aligned directions means less wasted compute.

03 — Early Low-Rank Lock-in

The right direction is found at 10% training progress

The second dimension of foresight is geometric: OPD's parameter updates form a low-rank structure that stabilizes early. To measure this, the authors applied SVD to the update matrix ∆W = W_final − W_base and tracked several metrics across training:

  • Effective rank: OPD's update has lower effective rank (2341 vs 2754 for RL on 8B), meaning fewer dimensions carry the action.
  • Top-1% subspace norm ratio: OPD concentrates 94.7% of its update energy in the top 1% of directions; RL only 88.5%.
  • Subspace alignment: At each training step, cosine similarity of the dominant subspace to the final checkpoint's subspace is computed — OPD reaches high alignment within the first 10% of training. RL alignment grows slowly over the full training run.
Figure: subspace alignment to the final direction vs training progress. OPD locks in early; RL explores the whole way.

By 10% training, OPD's dominant subspace already has cosine similarity ~0.72 with the final direction. RL doesn't reach similar alignment until ~60–70% of training.
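The metrics above can be computed from a single update matrix with NumPy. The definitions here are assumptions, since the article does not spell out the paper's exact formulas: effective rank is taken as the exponential of the entropy of the normalized singular values, the top-1% ratio as a share of squared singular values, and subspace alignment as the mean squared cosine of the principal angles between top-k left singular subspaces.

```python
import numpy as np

def spectral_metrics(delta_w, top_frac=0.01):
    """Spectral summary of a nonzero update matrix (assumed definitions)."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    energy = s ** 2
    k = max(1, int(np.ceil(top_frac * len(s))))   # size of the "top" subspace
    p = s / s.sum()
    p = p[p > 0]                                  # avoid log(0)
    return {
        "spec_frob_ratio": float(s[0] / np.sqrt(energy.sum())),
        "effective_rank": float(np.exp(-(p * np.log(p)).sum())),
        "top_energy_ratio": float(energy[:k].sum() / energy.sum()),
    }

def subspace_alignment(delta_a, delta_b, k):
    """Mean squared cosine of principal angles between top-k left subspaces."""
    ua = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    return float(np.linalg.norm(ua.T @ ub, "fro") ** 2 / k)

# A 4x4 identity update spreads energy evenly over 4 directions.
metrics = spectral_metrics(np.eye(4))
```

Tracking `subspace_alignment(delta_t, delta_final, k)` over checkpoints `t` gives an alignment-versus-progress curve; the choice of `k` is left open here.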

The decisive experiment: take an OPD checkpoint at only 10% training progress, preserve its update direction per module, but rescale each module's Frobenius norm to match the final checkpoint's norm. The resulting model recovers approximately 80% of the final reasoning performance. The direction was already correct — only the magnitude was missing.

magnitude scaling experiment
W_scaled = W_base + (∆W_10% / ‖∆W_10%‖_F) × ‖∆W_final‖_F
→ recovers ~80% of final reasoning accuracy at only 10% training progress
Property 2 — Early Low-Rank Lock-in: OPD stabilizes its dominant update subspaces early, with high alignment to the final solution from the first 10% of training. Subsequent training primarily amplifies magnitude along these locked-in directions rather than changing direction.
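The magnitude-scaling experiment amounts to a per-module rescale. The sketch below assumes checkpoints stored as dicts of NumPy arrays; the function name and the small `eps` guard against a zero early update are additions for illustration.

```python
import numpy as np

def rescale_to_final_norm(w_base, w_early, w_final, eps=1e-12):
    """Keep each module's early update direction, but give it the final update's norm."""
    out = {}
    for name, base in w_base.items():
        d_early = w_early[name] - base   # update at the early checkpoint
        d_final = w_final[name] - base   # update at the final checkpoint
        # np.linalg.norm defaults to the Frobenius norm for 2-D arrays.
        scale = np.linalg.norm(d_final) / (np.linalg.norm(d_early) + eps)
        out[name] = base + scale * d_early
    return out

# Toy module: early update has norm 1, final update has norm 5.
w_base = {"mlp": np.zeros((2, 2))}
w_early = {"mlp": np.array([[1.0, 0.0], [0.0, 0.0]])}
w_final = {"mlp": np.array([[0.0, 3.0], [4.0, 0.0]])}

w_scaled = rescale_to_final_norm(w_base, w_early, w_final)
```

In the toy case the rescaled update points along the early direction with the final magnitude, which is exactly the separation of direction and magnitude the experiment relies on.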

This structurally explains Property 1: because OPD locks into a small set of high-quality directions early, it automatically avoids redundant updates elsewhere — peripheral modules simply never receive much gradient energy. RL, lacking this directional stability, keeps exploring new directions throughout training, accumulating redundancy in the process.

04 — EffOPD: Extrapolating the Foresight

If the direction is correct at 10%, why train to 100%?

The two properties above suggest an obvious shortcut: since OPD's update direction stabilizes early and subsequent training just increases magnitude along the same trajectory, we should be able to extrapolate — jump further along the known-good direction instead of grinding out each training step.

EffOPD does exactly this. It runs a lightweight directional extrapolation at exponentially spaced checkpoints (t = 1, 2, 4, 8, 16, …), each time estimating the current update direction and testing whether a larger leap produces better validation performance:

EffOPD algorithm, step by step

1. Run OPD normally until checkpoint t = 2^n
Standard on-policy distillation: the student generates samples and the teacher provides token-level KL supervision. Checkpoints fall at exponentially spaced steps: 1, 2, 4, 8, …
t ∈ {1, 2, 4, 8, 16, …} # exponential schedule

2. Estimate the local update direction Δ_n
For the first checkpoint (n = 0), use W₁ − W₀ as the direction. For n ≥ 1, use the displacement between adjacent exponential checkpoints, which captures the accumulated update trend.
Δ_n = W_{2^n} − W_{2^{n−1}}

3. Generate 5 extrapolation candidates
Project forward along Δ_n with increasing step sizes 2¹, 2², …, 2⁵. Each candidate answers the question "what if we jumped 2^k times further along the current direction?"
W_cand_k = W_{2^n} + 2^k · Δ_n # k = 1..5

4. Validate on 50 training examples
Evaluate each candidate on a tiny held-out set Dv (50 samples). Accept greedily: if candidate k scores at least as well as the running best, accept it. Stop at the first failure, with no exhaustive search.
if score(W_cand_k) >= best_score: W_acc = W_cand_k

5. Resume training from the accepted point
Replace the current model with the best accepted candidate and continue OPD training. If no candidate was accepted (k = 1 failed), fall back to vanilla OPD, so EffOPD degrades gracefully.
W_current = W_acc # jump if beneficial, else stay

The validation set Dv serves only as a direction check — it doesn't need to be hard or representative. In ablation studies, easy, medium, and hard validation sets all give consistent signals, confirming that the check is verifying directional quality rather than fine-grained supervision.
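Steps 2 to 5 above can be sketched as a single greedy search. This is an illustration, not the authors' implementation: weights are flattened into one NumPy vector, `score` stands in for accuracy on Dv, and initializing the running best with the current model's score is an assumption.

```python
import numpy as np

def extrapolate(w_now, w_prev, score, num_candidates=5):
    """Greedy search along delta = w_now - w_prev with step sizes 2^1 .. 2^num_candidates."""
    delta = w_now - w_prev
    best_w, best_score = w_now, score(w_now)   # assumed baseline: the current model
    for k in range(1, num_candidates + 1):
        cand = w_now + (2 ** k) * delta
        cand_score = score(cand)
        if cand_score < best_score:
            break                              # stop at the first failure
        best_w, best_score = cand, cand_score  # accept and keep going
    return best_w, best_score

# Toy objective (hypothetical): higher score means closer to a target point.
target = np.array([10.0, 0.0])
toy_score = lambda w: -float(np.linalg.norm(w - target))

w_prev = np.array([0.0, 0.0])
w_now = np.array([1.0, 0.0])
w_acc, acc_score = extrapolate(w_now, w_prev, toy_score)
```

In the toy run the candidates land at x = 3, 5, 9, and 17. The first three are accepted and the fourth overshoots, so the search returns x = 9. If even the first candidate scores worse, the function returns `w_now` unchanged, which corresponds to the vanilla-OPD fallback.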

The exponential checkpoint schedule means EffOPD runs the extrapolation search only O(log T) times over T total training steps — the overhead is tiny compared to the saved training iterations.
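A quick count makes the O(log T) claim concrete. The sketch below lists the exponential checkpoints up to a step budget and bounds the validation work using the figures quoted above (5 candidates, 50 samples each); the function name is hypothetical.

```python
def checkpoint_steps(total_steps):
    """Exponentially spaced checkpoints 1, 2, 4, ... not exceeding total_steps."""
    steps, t = [], 1
    while t <= total_steps:
        steps.append(t)
        t *= 2
    return steps

steps_35 = checkpoint_steps(35)

# Upper bound on candidate evaluations: 5 candidates x 50 samples per checkpoint.
max_candidate_evals = len(steps_35) * 5 * 50
```

For a 35-step run this gives six extrapolation points, and because the search stops at the first rejected candidate, the actual number of evaluations is usually lower than the bound.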

05 — Results: 3× Training Speedup

10 steps beats 35 steps of vanilla OPD

Experiments span model scales from 1.5B to 32B parameters (Qwen2.5 and Qwen3 families), trained on Eurus-RL-Code and DeepMath-103K, and evaluated on seven benchmarks: Codeforces, TACO, AIME24, AIME25, AIME26, MINERVA, and GPQA.

Figure: convergence curve, accuracy vs training steps (math reasoning). EffOPD reaches 80% accuracy at step 10 vs step 35 for vanilla OPD.

Schematic based on Figure 6 of the paper. EffOPD (purple) reaches convergence ~3× earlier than vanilla OPD (cyan). RL (grey) converges later and at a lower ceiling on math tasks.

Geometry of the speedup

The speedup is not uniform across methods. Fixed-extrapolation baselines (AlphaOPD, ExOPD) sometimes overshoot and destabilize. EffOPD's adaptive validation prevents this: if a candidate is too aggressive, it's rejected and training continues normally. On Qwen3-4B-Non-Thinking, EffOPD attains strong reasoning performance by the 4th training step.

Spectral structure across model scales (Table 1 numbers)

Model | Method | Spec/Frob Ratio ↑ | Eff. Rank ↓ | Top-1% Norm Ratio ↑
1.5B | RL | 33.2% | 964 | 78.1%
1.5B | OPD | 39.6% | 778 | 92.3%
8B | RL | 32.7% | 2754 | 88.5%
8B | OPD | 36.8% | 2341 | 94.7%
14B | RL | 24.4% | 3174 | 81.2%
14B | OPD | 28.1% | 2937 | 94.5%

OPD consistently shows higher spectral-to-Frobenius ratio (dominant directions carry more of the energy), lower effective rank (fewer dimensions carry the action), and higher Top-1% norm ratio (energy is concentrated in the top 1% of singular vectors) at every scale.

06 — Takeaways

What this means for LLM post-training

For practitioners: EffOPD is plug-and-play — it requires no architectural changes, no new hyperparameters, and no extra trainable modules. If you're already running OPD, you can add EffOPD's extrapolation loop and expect ~3× fewer training steps for the same final performance. The only cost is validating 50 samples per exponential checkpoint.
For researchers: The two foresight properties (Functional Redundancy Avoidance + Early Low-Rank Lock-in) are measurable throughout training. They provide a diagnostic lens: if your fine-tuning method doesn't exhibit strong early subspace alignment, it likely has headroom. The cosine-similarity-to-final-checkpoint metric is a useful signal to track during training.
Broader implication: "Denser supervision" is the right intuition, but the mechanism is geometric. A teacher that constrains the student's update to a low-rank subspace aligned with the final solution creates a natural momentum that RL has to discover on its own. This is why distillation is structurally easier to optimize — not just statistically, but in the geometry of the loss landscape.

Open questions

  • Can the foresight properties be induced in RL directly — e.g., by regularizing the update matrix's effective rank?
  • Does the early lock-in break at very large scales (70B+) or when teacher and student are from different architectures?
  • Can the validation set Dv be replaced by a learned critic that predicts extrapolation success without sampling?
  • Do the same properties appear in RLHF (reward from human preferences), or is foresight specific to verifiable-reward settings?