On-Policy Distillation Done Right

01 — The One-Token Trap

A Fragile Learning Signal

On-policy distillation (OPD) trains a student LLM to imitate a teacher by having the student generate rollouts, then comparing the student's choices against the teacher's distribution at each token position. The idea is elegant: the student explores its own manifold, and the teacher provides dense correction everywhere the student errs.

In virtually every production pipeline—Qwen3, MiMo-V2-Flash, GLM-5, Thinking Machines Lab—the correction boils down to a single number per token position:

Sampled-token update at position t:

$$r_t = \log q(y_t \mid c_t) - \log \pi_\theta(y_t \mid c_t)$$

Here r_t is the teacher log-prob minus the student log-prob for the ONE sampled token y_t.

This is cheap and simple, but the entire training signal for a given prefix is concentrated in a single token lottery. If that token happens to be a filler ("the", "and", "is") where the student already matches the teacher, the update is near zero. If it's a token where the student is overconfident, the update is strongly negative. The distribution of these rewards is wildly skewed—and the paper shows most tokens receive negative rewards in practice.
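A minimal sketch of the sampled-token reward at a single prefix. The probabilities and the helper name `sampled_token_reward` are hypothetical, chosen only to show the three cases described above (matched filler, overconfident student, underconfident student):

```python
import math

def sampled_token_reward(teacher_probs, student_probs, token):
    """r_t = log q(y_t | c_t) - log pi_theta(y_t | c_t) for one sampled token."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

# Hypothetical next-token distributions at one prefix (illustrative numbers only).
teacher = {"the": 0.50, "1": 0.20, "4": 0.30}
student = {"the": 0.50, "1": 0.45, "4": 0.05}

filler_reward = sampled_token_reward(teacher, student, "the")    # student matches teacher
overconf_reward = sampled_token_reward(teacher, student, "1")    # student overconfident
underconf_reward = sampled_token_reward(teacher, student, "4")   # student underweights

print(filler_reward, overconf_reward, underconf_reward)
```

Which of these three numbers drives the update depends entirely on which token the student happened to sample.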

[Interactive figure: opd_signal.py, token probability comparison. Selecting K expands the comparison set; at K=1 the signal width is 1 token and stability is labeled "fragile".]

Context: "The answer is \boxed{". The teacher (purple bars) peaks at digits; the student (teal) is overconfident on "1". The highlighted region marks the tokens contributing to the gradient update.

The visualization above makes the problem visceral: with K=1 (sampled-token OPD), the gradient update depends on whichever token the student happened to sample. Switch to K=32 and the signal is computed over a rich local distribution—much harder to game with a single noisy draw.

Core insight: Sampled-token OPD is a one-sample Monte Carlo approximation of the full reverse-KL. The authors replace it with a truncated reverse-KL over the teacher's top-K token support—retaining the efficiency of token-level updates while capturing the distributional shape at each prefix.
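The "one-sample Monte Carlo" reading can be checked numerically: when the token is drawn from the student, the expected sampled-token reward equals the negative reverse-KL at that prefix. The sketch below uses hypothetical distributions and helper names; it is an illustration, not the paper's code.

```python
import math

def reverse_kl(student, teacher):
    """KL(pi || q) = sum_v pi(v) * (log pi(v) - log q(v))."""
    return sum(p * (math.log(p) - math.log(teacher[v]))
               for v, p in student.items() if p > 0)

def expected_sampled_reward(student, teacher):
    """Exact expectation of r = log q(y) - log pi(y) when y is drawn from pi."""
    return sum(p * (math.log(teacher[v]) - math.log(p))
               for v, p in student.items() if p > 0)

# Hypothetical distributions over three candidate tokens at one prefix.
teacher = {"1": 0.20, "4": 0.30, "7": 0.50}
student = {"1": 0.70, "4": 0.20, "7": 0.10}

kl = reverse_kl(student, teacher)
mean_reward = expected_sampled_reward(student, teacher)

# Individual draws scatter widely around the mean: this spread is the "token lottery".
per_token = {v: math.log(teacher[v]) - math.log(student[v]) for v in student}
print(kl, mean_reward, per_token)
```

The mean matches the reverse-KL up to sign, but any single draw can land far from it, which is why a one-sample estimate is noisy.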

02 — Variance vs Bias

Why Token-Level Beats Sequence-Level

Before diagnosing the implementation problems, the paper settles a theoretical question: should OPD even be token-level? The sequence-level reverse-KL assigns each token update a reward that sums future log-ratios ("reward-to-go"), which is unbiased—but expensive in variance.

Discounted estimator (interpolates between token level and sequence level):

$$\hat{g}_\gamma = \sum_t s_t \cdot \sum_{t' \ge t} \gamma^{\,t'-t} \, r_{t'}, \qquad \gamma \in [0, 1]$$

γ=0 gives the token-level estimator (biased, low variance); γ=1 gives the sequence-level estimator (unbiased, high variance).
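As a sketch of the inner sum only, the discounted reward-to-go that multiplies s_t can be computed in one backward pass. The function name and the reward values are hypothetical; the factor s_t is not defined in this section and is left out.

```python
def discounted_reward_to_go(rewards, gamma):
    """Return [sum over t' >= t of gamma**(t' - t) * r_{t'}] for every position t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [1.0, -2.0, 0.5]  # illustrative per-token rewards r_t

token_level = discounted_reward_to_go(rewards, 0.0)     # each position sees only its own reward
sequence_level = discounted_reward_to_go(rewards, 1.0)  # each position sees the full future sum
in_between = discounted_reward_to_go(rewards, 0.5)

print(token_level, sequence_level, in_between)
```

With γ=0 the weights reduce to the per-token rewards themselves; with γ=1 every early position inherits the noise of all later rewards, which is where the extra variance comes from.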

Under bounded rewards, the worst-case variance of the token-level estimator (γ=0) scales as O(T²), while that of the sequence-level estimator (γ=1) scales as O(T⁴). The two bounds differ by a factor of T², which for a 16K-token context is roughly 8 orders of magnitude. The paper validates this empirically with a two-task toy experiment.
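The order-of-magnitude figure follows from simple arithmetic. The sketch below reads "16K" as 16,384 tokens (an assumption) and ignores the constants hidden in the big-O bounds:

```python
import math

T = 16 * 1024  # "16K" read as 16,384 tokens; an assumption about the intended value

token_level_bound = T ** 2      # O(T^2) worst-case scaling, constants dropped
sequence_level_bound = T ** 4   # O(T^4) worst-case scaling, constants dropped

ratio = sequence_level_bound // token_level_bound
orders_of_magnitude = math.log10(ratio)

print(ratio, orders_of_magnitude)
```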

[Interactive figure: gradient_variance.py, gradient variance vs. training iteration for several γ values, log-scale Y axis.] Token-level (γ=0, green) stays orders of magnitude below sequence-level (γ=1, red) throughout training. Approximated from Figure 1a of the paper.

This result motivates keeping supervision token-local. The authors then turn to the follow-up question: if token-level updates are the right framing, why does the standard sampled-token implementation still fail in practice?

03 — Three Failure Modes

How Sampled-Token OPD Breaks

The authors ran standard sampled-token OPD on math reasoning (Qwen2.5-7B-Instruct student, OpenThinker3-7B teacher) and diagnosed three recurring failure modes. Each points to a different structural weakness of the one-token signal.

Failure mode 1: Imbalanced Supervision Signal (objective-level)

The sampled-token reward is log q(y_t|c_t) − log π_θ(y_t|c_t). Because the student generates these tokens, it tends to assign them higher probability than the teacher—making most rewards negative. Optimization becomes dominated by a small set of locally positive tokens (often short fillers), whose favorable scores may have little to do with reasoning quality.

Most sampled tokens fall below y=x — student is already overconfident on its own outputs. Positive rewards are rare, so training is driven by a noisy minority.
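To make the imbalance concrete, the sketch below measures how much of the student's sampling mass lands on tokens with negative versus positive reward. The distributions are hypothetical and deliberately constructed with an overconfident student; the helper name is not from the paper.

```python
def reward_sign_mass(student, teacher):
    """Student probability mass on tokens whose sampled-token reward is negative / positive."""
    negative = sum(p for v, p in student.items() if teacher[v] < p)  # log q - log pi < 0
    positive = sum(p for v, p in student.items() if teacher[v] > p)  # log q - log pi > 0
    return negative, positive

# Hypothetical prefix where the student concentrates on one token.
teacher = {"1": 0.20, "4": 0.30, "7": 0.50}
student = {"1": 0.70, "4": 0.20, "7": 0.10}

negative_mass, positive_mass = reward_sign_mass(student, teacher)
print(negative_mass, positive_mass)
```

In this toy case most rollouts sample the overconfident token and receive a negative reward, while the positive rewards come from the rarely sampled minority.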

Failure mode 2: Unreliable Teacher on Drifted Prefixes (objective-level)

As long rollouts unfold, the student drifts into regions the teacher has never visited. In these off-distribution prefixes, the teacher's top token can still have high probability—but only because both models share surface patterns like repetition. The teacher then rewards the student for repeating the same token in a loop, since that token also has high teacher probability on any similar context.

Example — student falls into repetition; teacher still aligns locally:

Find the product of all real values of r such that ## (q=0.82 ✓) ## (q=0.79 ✓) ## (q=0.75 ✓) ## ··· (loop)

The repeated "##" tokens are the student's repetition loop. Each q value with a check mark is the teacher's probability for that repeated token, which stays high because the teacher is locally aligned with the repetition pattern. Sampled-token OPD cannot distinguish this from genuine reasoning.

Failure mode 3: Tokenizer & Special-Token Mismatch (implementation-level)

Sampled-token OPD compares the exact student-generated token against the teacher distribution. When student and teacher use different tokenizers, the same raw text is segmented differently—so a student-generated token may not be a natural unit under the teacher. The result is spurious negative rewards on semantically correct outputs.

Raw text: <think>

Student tokenization: "<" (π=0.95) · "think" (π=0.99) · ">" (π=0.98)

Teacher tokenization: "<th" (q≈0.5) · "ink" (q≈0.5) · ">" (q≈0.5)

Student generates "<" as a single token. Teacher assigns it q("<"| c_t) ≈ 0.02 because it expects "<th" instead. Sampled-token OPD penalizes the student for correct output.
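The segmentation mismatch can be reproduced with a toy longest-match tokenizer. Both vocabularies and the function `greedy_tokenize` are hypothetical stand-ins; real tokenizers (BPE and similar) are more involved, but the effect is the same.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = max((tok for tok in vocab if text.startswith(tok, i)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers position {i} of {text!r}")
        tokens.append(match)
        i += len(match)
    return tokens

# Toy vocabularies: both contain "<", but the teacher also has the merged piece "<th".
student_vocab = {"<", "think", ">"}
teacher_vocab = {"<", "<th", "ink", ">"}

student_tokens = greedy_tokenize("<think>", student_vocab)
teacher_tokens = greedy_tokenize("<think>", teacher_vocab)

print(student_tokens)
print(teacher_tokens)
```

Both segmentations spell the same text, yet the student's first token "<" is not the piece the teacher would naturally emit at that position, so a token-by-token comparison scores it poorly.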


Failure modes 1 and 2 are objective-level: they stem from the one-token signal being too coarse and too local. Failure mode 3 is implementation-level: it stems from treating different tokenizer conventions as compatible. The proposed fix addresses all three, but the paper shows that special-token masking alone (fixing mode 3) already brings sampled-token OPD from 36.4 → 40.7 average score—a substantial gain before the main improvement.

04 — Local Support Matching

Teacher Top-K Local Support Matching

The fix is elegant: at each prefix, instead of comparing teacher and student on one sampled token, compare them over the top-K tokens under the teacher's distribution. Then renormalize both distributions within that support set and compute the reverse-KL.

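Support set at prefix c_{i,t} (the K highest-probability tokens under the teacher q), with student and teacher renormalized over it:

$$S(c_{i,t}) = \operatorname{TopK}_q(c_{i,t})$$

$$\hat{\pi}_\theta(v \mid c) = \frac{\pi_\theta(v \mid c)}{\sum_{u \in S} \pi_\theta(u \mid c)}, \qquad \hat{q}(v \mid c) = \frac{q(v \mid c)}{\sum_{u \in S} q(u \mid c)}$$

LSM objective (averaged over rollouts and positions):

$$\mathcal{L}_{\mathrm{LSM}} = \mathbb{E}_{x,\{o\}} \left[ \sum_{i,t} \sum_{v \in S} \hat{\pi}_\theta(v \mid c_{i,t}) \cdot \log \frac{\hat{\pi}_\theta(v \mid c_{i,t})}{\hat{q}(v \mid c_{i,t})} \right]$$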
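A minimal single-prefix sketch of this objective in plain Python, assuming access to full teacher and student logits at the prefix. All function names and logit values are hypothetical; a real implementation would operate on batched tensors and average over rollouts and positions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_indices(probs, k):
    """Indices of the k highest-probability tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def renormalize(probs, support):
    """Restrict a distribution to the support set and rescale it to sum to 1."""
    z = sum(probs[i] for i in support)
    return {i: probs[i] / z for i in support}

def lsm_loss(student_logits, teacher_logits, k):
    """Reverse KL between renormalized student and teacher over the teacher's top-k tokens."""
    q = softmax(teacher_logits)
    pi = softmax(student_logits)
    support = top_k_indices(q, k)        # support is chosen by the teacher
    q_hat = renormalize(q, support)
    pi_hat = renormalize(pi, support)
    return sum(pi_hat[v] * (math.log(pi_hat[v]) - math.log(q_hat[v])) for v in support)

# Hypothetical logits over a 5-token vocabulary at one prefix.
teacher_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
student_logits = [3.0, 0.0, 0.0, 0.0, 1.0]

loss = lsm_loss(student_logits, teacher_logits, k=3)
self_loss = lsm_loss(teacher_logits, teacher_logits, k=3)  # identical models give zero loss
print(loss, self_loss)
```

Renormalizing probabilities within the support set is the same as applying a softmax to the logits restricted to that set, so an implementation can work directly on `logits[S]`.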

Three practical choices accompany the objective:

Top-K Teacher Support

Use the K highest-probability teacher tokens as the comparison set. This directly counters failure modes 1 and 2: the signal covers a richer slice of the distribution (not one sample), and the teacher's own support anchors the comparison to regions the teacher finds plausible.

K=32 works well; performance is stable for K ∈ {16, 32, 48}

Support-Set Renormalization

Independently renormalize teacher and student over the top-K set before computing KL. Without this, the truncated probability masses are incomparable and optimization collapses rapidly. The ablation shows renormalization is non-negotiable—removing it causes immediate entropy collapse.

softmax(logits[S]) — applied separately to teacher and student

Top-p Rollout Sampling + Special-Token Masking

Generating rollouts with top-p=0.9 keeps trajectories in typical teacher-visible regions, making the top-K comparison more reliable. Special-token masking removes low-probability teacher signals on <EOS>, <think>, and other markers where tokenizer conventions diverge.

top_p=0.9 for sampling; mask {<EOS>, <think>, </think>}
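The third choice can be sketched as two small helpers: a nucleus (top-p) filter applied to the student's sampling distribution, and a loss mask that zeroes out special-token positions. The helper names, the example tokens, and the choice to apply the mask at the loss-averaging step are assumptions made for illustration.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(probs, key=probs.get, reverse=True)
    kept, mass = [], 0.0
    for tok in order:
        kept.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break
    return {tok: probs[tok] / mass for tok in kept}

SPECIAL_TOKENS = {"<EOS>", "<think>", "</think>"}

def loss_mask(tokens, special=SPECIAL_TOKENS):
    """1.0 where a position contributes to the loss, 0.0 on special tokens."""
    return [0.0 if tok in special else 1.0 for tok in tokens]

def masked_mean(per_token_loss, mask):
    """Average the per-token loss over unmasked positions only."""
    count = sum(mask)
    total = sum(l * m for l, m in zip(per_token_loss, mask))
    return total / count if count else 0.0

# Hypothetical student distribution: the low-probability tail token is never sampled.
filtered = top_p_filter({"4": 0.60, "7": 0.25, "1": 0.10, "the": 0.05}, top_p=0.9)

# Hypothetical rollout and per-token losses; special positions carry large spurious losses.
rollout = ["<think>", "2", "+", "2", "</think>", "4", "<EOS>"]
losses = [5.0, 1.0, 2.0, 3.0, 5.0, 2.0, 5.0]
mask = loss_mask(rollout)
avg = masked_mean(losses, mask)
print(filtered, mask, avg)
```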


[Interactive figure: lsm_vs_sampled.py, training-signal comparison between sampled-token OPD (K=1) and teacher top-K LSM (K=32).]

Sampled-Token OPD (K=1)

One red token = entire gradient update. If the student's rollout accidentally sampled a filler token, the update is near-zero. If it sampled an outlier, the update explodes.

Teacher Top-K LSM (K=32)

Green region = teacher's top-K tokens. KL is computed over the full distribution shape within this region. Much harder to fool with a single noisy draw.

Why teacher top-K and not student top-K? The ablation in Table 4 shows both variants are competitive in the single-task setting, but in multi-task training, using the student's support set causes the math scores to degrade sharply (41.7 → 28.4 avg). The teacher defines what "reasonable continuations" look like, not the drifted student.

05 — Results

Math Reasoning and Agentic Tasks

The authors evaluate on two settings: single-task math reasoning and alternating multi-task training (math + ALFWorld agentic task). The student is Qwen2.5-7B-Instruct; math teacher is OpenThinker3-7B; ALFWorld teacher is a finetuned GiGPO checkpoint.

Multi-Task Training (+19.8% on Math)

In the multi-task setting, standard sampled-token OPD achieves a 34.8 average math score. Local support matching raises this to 41.7, a +19.8% relative gain, while also improving the ALFWorld success rate (95.3 without special-token masking, versus 90.6 for sampled-token OPD).
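The relative gain is computed from the two average math scores quoted above:

```python
baseline_avg = 34.8   # sampled-token OPD, multi-task average math score
lsm_avg = 41.7        # local support matching, multi-task average math score

relative_gain = (lsm_avg - baseline_avg) / baseline_avg
print(f"{relative_gain:.1%}")
```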

Math benchmarks: Sampled-Token OPD (baseline) vs Local Support Matching (ours), multi-task

Benchmark	Sampled-token OPD	Local Support Matching
MATH-500	74.8	82.0
AIME 2024	13.3	33.3
OlympiadBench	40.5	44.0
Minerva	32.1	32.7

ALFWorld: agentic task success rate

Method	Success rate
Sampled-token OPD	90.6
Ours (w/o mask)	95.3
Ours (w/ mask)	97.7

Single-Task Math Reasoning

In the single-task setting the gains are similar: sampled-token OPD reaches 36.4 avg, masking alone gets to 40.7, and local support matching (with or without masking) reaches 41.5–41.7.

Method	MATH-500	AIME24	AIME25	Minerva	Olympiad	Avg
Qwen2.5-7B-It (student init)	68.2	13.3	0.0	26.5	32.9	28.2
OpenThinker3-7B (teacher)	92.2	53.3	40.0	39.0	55.6	56.0
Sampled-token OPD	80.0	10.0	16.7	32.4	43.1	36.4
Sampled-token OPD + mask	81.4	26.7	16.7	34.2	44.7	40.7
Ours (teacher top-K LSM, w/o mask)	80.4	23.3	26.7	34.2	43.9	41.7
Ours (teacher top-K LSM, w/ mask)	82.0	23.3	23.3	34.9	43.9	41.5

Ablation: What Actually Matters

The ablation (Table 3 + Figure 6) isolates each design choice. Teacher top-K alone is not sufficient—without top-p sampling, a top-K-only variant scores below sampled-token OPD on AIME24 avg@32 (17.7 vs 20.4). Adding top-p recovers and exceeds it (23.6). Renormalization is absolutely required: removing it causes rapid collapse within ~100 steps.

Method	AIME24 avg@32
Qwen2.5-7B-It (init)	10.0
Sampled-token OPD	20.4
Sampled-token OPD + top-p	21.6
Teacher top-K (no top-p)	17.7
Teacher top-K + top-p (LSM)	23.6

Remaining limitation: The LSM objective uses a truncated KL that ignores tokens outside the top-K support. The gradient for those out-of-support tokens is zero by construction, which introduces its own form of bias. The paper notes this explicitly and discusses augmented support variants (e.g., adding the sampled token to the support set), though the default teacher top-K formulation remains strongest in multi-task settings.
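A small sketch of both points, under hypothetical logits and helper names: the truncated loss does not change when an out-of-support student logit changes (so its gradient there is zero), and adding the sampled token to the support set makes that token visible to the loss. The augmented variant shown here is one reading of the idea, not necessarily the paper's exact construction.

```python
import math

def softmax_over(logits, support):
    """Softmax restricted to the support indices."""
    m = max(logits[i] for i in support)
    exps = {i: math.exp(logits[i] - m) for i in support}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def truncated_reverse_kl(student_logits, teacher_logits, support):
    pi_hat = softmax_over(student_logits, support)
    q_hat = softmax_over(teacher_logits, support)
    return sum(pi_hat[i] * (math.log(pi_hat[i]) - math.log(q_hat[i])) for i in support)

def teacher_top_k(teacher_logits, k):
    return sorted(range(len(teacher_logits)),
                  key=lambda i: teacher_logits[i], reverse=True)[:k]

teacher_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
student_logits = [0.5, 0.0, 0.0, 0.0, 4.0]   # student puts most mass on token 4
support = teacher_top_k(teacher_logits, 3)    # token 4 lies outside the teacher's top-3

base = truncated_reverse_kl(student_logits, teacher_logits, support)

# Pushing the out-of-support logit even higher leaves the truncated loss unchanged.
shifted = list(student_logits)
shifted[4] = 10.0
after = truncated_reverse_kl(shifted, teacher_logits, support)

# Augmented support: teacher top-k plus the token the student actually sampled.
sampled_token = 4
augmented = sorted(set(support) | {sampled_token})
augmented_loss = truncated_reverse_kl(student_logits, teacher_logits, augmented)

print(base, after, augmented_loss)
```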