A Fragile Learning Signal
On-policy distillation (OPD) trains a student LLM to imitate a teacher by having the student generate rollouts, then comparing the student's choices against the teacher's distribution at each token position. The idea is elegant: the student explores its own manifold, and the teacher provides dense correction everywhere the student errs.
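In virtually every production pipeline (Qwen3, MiMo-V2-Flash, GLM-5, Thinking Machines Lab), the correction boils down to a single number per token position: the log-ratio log q(y_t|c_t) − log π_θ(y_t|c_t), evaluated at the token y_t the student sampled, where q is the teacher and π_θ is the student.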
This is cheap and simple, but the entire training signal for a given prefix is concentrated in a single token lottery. If that token happens to be a filler ("the", "and", "is") where the student already matches the teacher, the update is near zero. If it's a token where the student is overconfident, the update is strongly negative. The distribution of these rewards is wildly skewed—and the paper shows most tokens receive negative rewards in practice.
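To make the one-number-per-position signal concrete, here is a minimal NumPy sketch of the sampled-token reward. The function names, array shapes, and toy distributions are illustrative assumptions, not code from the paper or from any of the pipelines named above.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sampled_token_rewards(student_logits, teacher_logits, sampled_ids):
    """Per-position reward log q(y_t|c_t) - log pi_theta(y_t|c_t).

    student_logits, teacher_logits: arrays of shape (T, V)
    sampled_ids: int array of shape (T,), the tokens the student sampled
    """
    log_pi = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    positions = np.arange(len(sampled_ids))
    return log_q[positions, sampled_ids] - log_pi[positions, sampled_ids]

# Toy example (hypothetical numbers): vocabulary of 4 tokens, 2 positions.
teacher = np.log(np.array([[0.25, 0.25, 0.25, 0.25],
                           [0.70, 0.10, 0.10, 0.10]]))
student = np.log(np.array([[0.85, 0.05, 0.05, 0.05],
                           [0.70, 0.10, 0.10, 0.10]]))
rewards = sampled_token_rewards(student, teacher, np.array([0, 0]))
# Position 0: the student is overconfident on token 0, so the reward is negative.
# Position 1: student and teacher agree, so the reward is (numerically) zero.
```

Only the sampled column of each distribution enters the reward; everything else the teacher knows about that position is discarded.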
Figure: next-token distributions for the context "The answer is \boxed{". The teacher (purple bars) peaks at digits; the student (teal) is overconfident on "1". The highlighted region marks the tokens contributing to the gradient update.
The visualization above makes the problem visceral: with K=1 (sampled-token OPD), the gradient update depends on whichever token the student happened to sample. Switch to K=32 and the signal is computed over a rich local distribution—much harder to game with a single noisy draw.
Why Token-Level Beats Sequence-Level
Before diagnosing the implementation problems, the paper settles a theoretical question: should OPD even be token-level? The sequence-level reverse-KL assigns each token update a reward that sums future log-ratios ("reward-to-go"), which is unbiased—but expensive in variance.
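One way to picture the two estimators is as the endpoints of a discounted reward-to-go with factor γ. The discounted form below is an assumption made for illustration (the text only labels the endpoints γ=0 and γ=1), and the function name and toy values are hypothetical.

```python
import numpy as np

def discounted_reward_to_go(log_ratios, gamma):
    """Return R_t = sum_{s >= t} gamma**(s - t) * r_s for each position t.

    gamma = 0.0 keeps only the local log-ratio (token-level signal);
    gamma = 1.0 sums all future log-ratios (sequence-level reward-to-go).
    """
    out = np.zeros(len(log_ratios))
    running = 0.0
    # Accumulate from the last position backwards.
    for t in range(len(log_ratios) - 1, -1, -1):
        running = log_ratios[t] + gamma * running
        out[t] = running
    return out

r = np.array([-0.5, 0.1, -1.0, 0.2])             # toy per-token log-ratios
token_level = discounted_reward_to_go(r, 0.0)     # equals r itself
sequence_level = discounted_reward_to_go(r, 1.0)  # suffix sums of r
```

With γ=1 every position inherits the noise of all later positions, which is the intuition behind the variance gap.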
Under bounded rewards, the worst-case variance of the token-level estimator (γ=0) scales as O(T²), while that of the sequence-level estimator (γ=1) scales as O(T⁴). For a 16K-token context the two bounds differ by a factor of T² ≈ 2.7 × 10⁸, roughly 8 orders of magnitude. The paper validates this empirically with a two-task toy experiment.
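The orders-of-magnitude figure follows directly from the two scaling laws. The short check below assumes "16K" means T = 16 × 1024 tokens and ignores the constants hidden in the O(·) notation.

```python
import math

T = 16 * 1024                      # assumed meaning of "16K tokens"
token_level_bound = T ** 2         # O(T^2) scaling, constants dropped
sequence_level_bound = T ** 4      # O(T^4) scaling, constants dropped

ratio = sequence_level_bound // token_level_bound   # equals T**2
orders_of_magnitude = math.log10(ratio)             # a little over 8
```

This compares worst-case bounds only; the variance observed in practice depends on the actual reward distribution.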
Log-scale Y axis. Token-level (γ=0, green) stays orders of magnitude below sequence-level (γ=1, red) throughout training. Approximated from Figure 1a of the paper.
This result motivates keeping supervision token-local. The authors then turn to the follow-up question: if token-level updates are the right framing, why does the standard sampled-token implementation still fail in practice?
How Sampled-Token OPD Breaks
The authors ran standard sampled-token OPD on math reasoning (Qwen2.5-7B-Instruct student, OpenThinker3-7B teacher) and diagnosed three recurring failure modes. Each points to a different structural weakness of the one-token signal.
The sampled-token reward is log q(y_t|c_t) − log π_θ(y_t|c_t). Because the student generates these tokens itself, it tends to assign them higher probability than the teacher does, which makes most rewards negative. Optimization then becomes dominated by a small set of locally positive tokens (often short fillers), whose favorable scores may have little to do with reasoning quality.
Figure: most sampled tokens fall below the y=x line, meaning the student is already overconfident on its own outputs. Positive rewards are rare, so training is driven by a noisy minority.
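The negative skew has a simple source: when tokens are drawn from the student, the expected sampled-token reward equals minus the reverse KL from student to teacher, which is never positive. The simulation below illustrates this with hypothetical distributions; it is not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.30, 0.30, 0.20, 0.20])    # hypothetical teacher distribution
pi = np.array([0.70, 0.10, 0.10, 0.10])   # hypothetical overconfident student

# On-policy: tokens are sampled from the student, then scored by log q - log pi.
samples = rng.choice(len(pi), size=100_000, p=pi)
rewards = np.log(q[samples]) - np.log(pi[samples])

reverse_kl = float(np.sum(pi * (np.log(pi) - np.log(q))))  # KL(pi || q) >= 0
mean_reward = float(rewards.mean())                        # approx -KL(pi || q)
negative_fraction = float((rewards < 0).mean())            # share of negative rewards
```

In this toy case the token the student favors is also the one it samples most often, so most individual rewards are negative and the few positive ones come from rarely sampled tokens.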
Figure: an example rollout in which the student falls into repetition while the teacher still aligns locally. Red tokens mark the student's repetition loop; a green check marks positions where the teacher still gives high probability because it is locally aligned with the repetition pattern. Sampled-token OPD cannot distinguish this from genuine reasoning.
The student generates "<" as a single token, but the teacher assigns it q("<" | c_t) ≈ 0.02 because its tokenizer expects "<th" instead. Sampled-token OPD therefore penalizes the student for output that is actually correct.
Failure modes 1 and 2 are objective-level: they stem from the one-token signal being too coarse and too local. Failure mode 3 is implementation-level: it stems from treating different tokenizer conventions as compatible. The proposed fix addresses all three, but the paper shows that special-token masking alone (fixing mode 3) already brings sampled-token OPD from 36.4 → 40.7 average score—a substantial gain before the main improvement.
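A rough sketch of what special-token masking might look like: positions whose sampled token belongs to a set of special-token ids are dropped before averaging. The id set, the function name, and the choice to average the surviving rewards are all assumptions for illustration; the paper's exact masking rule is not reproduced here.

```python
import numpy as np

def masked_mean_reward(rewards, token_ids, special_ids):
    """Average per-token rewards, ignoring positions holding special tokens."""
    keep = ~np.isin(token_ids, list(special_ids))
    if not keep.any():
        return 0.0, keep            # nothing left to train on
    return float(rewards[keep].mean()), keep

SPECIAL_IDS = {100, 101}                      # hypothetical special-token ids
token_ids = np.array([5, 100, 7, 8])          # position 1 holds a special token
rewards = np.array([-0.2, -3.9, 0.1, -0.3])   # large penalty at the special token
mean_kept, keep = masked_mean_reward(rewards, token_ids, SPECIAL_IDS)
```

Without the mask, the single −3.9 entry would dominate the average even though it reflects a tokenizer convention rather than a reasoning error.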
Teacher Top-K Local Support Matching
The fix is elegant: at each prefix, instead of comparing teacher and student on one sampled token, compare them over the top-K tokens under the teacher's distribution. Then renormalize both distributions within that support set and compute the reverse-KL.
π̂_θ(v | c) = π_θ(v | c) / Σ_{u ∈ S} π_θ(u | c)   # renormalized student
q̂(v | c) = q(v | c) / Σ_{u ∈ S} q(u | c)   # renormalized teacher
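The objective described above can be sketched for a single prefix as follows: take the teacher's top-K tokens as the support S, renormalize both distributions on S, and compute the reverse KL between them. The function name and toy distributions are hypothetical; a real implementation would most likely work on batched logits in log space.

```python
import numpy as np

def topk_local_support_kl(student_probs, teacher_probs, k):
    """Reverse KL between student and teacher, restricted to the teacher's top-k.

    Both inputs are full-vocabulary probability vectors for one prefix and are
    assumed strictly positive on the support (true for softmax outputs).
    """
    support = np.argsort(teacher_probs)[-k:]                         # S: teacher top-k ids
    pi_hat = student_probs[support] / student_probs[support].sum()   # renormalized student
    q_hat = teacher_probs[support] / teacher_probs[support].sum()    # renormalized teacher
    kl = float(np.sum(pi_hat * (np.log(pi_hat) - np.log(q_hat))))    # KL(pi_hat || q_hat)
    return kl, support

teacher = np.array([0.40, 0.30, 0.20, 0.05, 0.05])
student = np.array([0.80, 0.05, 0.05, 0.05, 0.05])
kl, support = topk_local_support_kl(student, teacher, k=3)
kl_same, _ = topk_local_support_kl(teacher, teacher, k=3)   # identical inputs give 0
```

Because the loss sees the shape of both distributions over all K tokens, it no longer depends on which single token the rollout happened to contain at that position.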
Three practical choices accompany the objective:
Sampled-Token OPD (K=1): One red token = entire gradient update. If the student's rollout accidentally sampled a filler token, the update is near-zero. If it sampled an outlier, the update explodes.
Teacher Top-K LSM (K=32): Green region = teacher's top-K tokens. KL is computed over the full distribution shape within this region. Much harder to fool with a single noisy draw.
Math Reasoning and Agentic Tasks
The authors evaluate on two settings: single-task math reasoning and alternating multi-task training (math + ALFWorld agentic task). The student is Qwen2.5-7B-Instruct; math teacher is OpenThinker3-7B; ALFWorld teacher is a finetuned GiGPO checkpoint.
Multi-Task Training (+19.8% on Math)
In the multi-task setting, standard sampled-token OPD achieves 34.8 average math score. Local support matching raises this to 41.7—a +19.8% relative gain—while maintaining competitive ALFWorld performance (95.3 vs 90.6 for the unmasked baseline).
Single-Task Math Reasoning
In the single-task setting the gains are similar: sampled-token OPD reaches 36.4 avg, masking alone gets to 40.7, and local support matching (with or without masking) reaches 41.5–41.7.
| Method | MATH-500 | AIME24 | AIME25 | Minerva | Olympiad | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-It (student init) | 68.2 | 13.3 | 0.0 | 26.5 | 32.9 | 28.2 |
| OpenThinker3-7B (teacher) | 92.2 | 53.3 | 40.0 | 39.0 | 55.6 | 56.0 |
| Sampled-token OPD | 80.0 | 10.0 | 16.7 | 32.4 | 43.1 | 36.4 |
| Sampled-token OPD + mask | 81.4 | 26.7 | 16.7 | 34.2 | 44.7 | 40.7 |
| Ours (teacher top-K LSM, w/o mask) | 80.4 | 23.3 | 26.7 | 34.2 | 43.9 | 41.7 |
| Ours (teacher top-K LSM, w/ mask) | 82.0 | 23.3 | 23.3 | 34.9 | 43.9 | 41.5 |
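As a sanity check, the Avg column and the quoted relative gain can be recomputed from the reported numbers. The snippet assumes Avg is the unweighted mean of the five benchmarks, which the table's values are consistent with.

```python
# Scores copied from the table: MATH-500, AIME24, AIME25, Minerva, Olympiad.
rows = {
    "student_init": [68.2, 13.3, 0.0, 26.5, 32.9],
    "teacher": [92.2, 53.3, 40.0, 39.0, 55.6],
    "sampled_token": [80.0, 10.0, 16.7, 32.4, 43.1],
    "sampled_token_mask": [81.4, 26.7, 16.7, 34.2, 44.7],
    "lsm_no_mask": [80.4, 23.3, 26.7, 34.2, 43.9],
    "lsm_mask": [82.0, 23.3, 23.3, 34.9, 43.9],
}
averages = {name: round(sum(s) / len(s), 1) for name, s in rows.items()}

# Multi-task math averages quoted in the text: 34.8 (sampled-token) vs 41.7 (LSM).
relative_gain_pct = round((41.7 - 34.8) / 34.8 * 100, 1)
```

The recomputed averages match the table, and the multi-task gain works out to the quoted +19.8%.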
Ablation: What Actually Matters
The ablation (Table 3 + Figure 6) isolates each design choice. Teacher top-K alone is not sufficient—without top-p sampling, a top-K-only variant scores below sampled-token OPD on AIME24 avg@32 (17.7 vs 20.4). Adding top-p recovers and exceeds it (23.6). Renormalization is absolutely required: removing it causes rapid collapse within ~100 steps.
| Method | AIME24 avg@32 |
|---|---|
| Qwen2.5-7B-It (init) | 10.0 |
| Sampled-token OPD | 20.4 |
| Sampled-token OPD + top-p | 21.6 |
| Teacher top-K (no top-p) | 17.7 |
| Teacher top-K + top-p (LSM) | 23.6 |
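For readers unfamiliar with the top-p (nucleus) sampling that appears in the ablation, here is a minimal sketch of the filtering step: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalize. The threshold p=0.9 and the function name are illustrative assumptions; the paper's setting is not stated in this section.

```python
import numpy as np

def top_p_filter(probs, p):
    """Zero out the low-probability tail and renormalize the remaining nucleus."""
    order = np.argsort(probs)[::-1]                  # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the one that pushes the mass to at least p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

rng = np.random.default_rng(0)
probs = np.array([0.50, 0.30, 0.15, 0.05])
nucleus = top_p_filter(probs, p=0.9)                 # drops the 0.05 tail token
next_token = int(rng.choice(len(nucleus), p=nucleus))
```

Trimming the tail keeps rollouts away from very unlikely tokens, which fits the ablation's finding that top-K matching only helps once top-p sampling is added.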