
TIDE: Shrinking Diffusion LLMs 22× Without Losing the Code Superpower

Diffusion language models generate all tokens in parallel, which gives them a structural edge on code, where global coherence matters. The catch: competitive dLLMs need 8–100B parameters. TIDE is a cross-architecture distillation framework for dLLMs, described as the first of its kind, that compresses a 16B MoE teacher into a 0.6B student with 22× less memory and a 17-point HumanEval advantage over a same-size autoregressive model.

May 27, 2026 ~9 min read Paper: arXiv:2604.26951
01 — The Deployment Gap

Great at Code, Stuck at 31 GB

Diffusion language models like LLaDA and Dream have reached competitive performance on reasoning, knowledge, and code benchmarks — but only at 8–100B parameters. The 16B MoE teacher used in this work requires 31.3 GB of GPU memory and produces just 7.8 tokens/second. Nobody deploys that on commodity hardware.

TIDE distills these giants into a 0.6B student that fits in 1.4 GB, a 22× memory reduction, while preserving the dLLM's most striking advantage: a HumanEval score that beats an equivalently sized autoregressive model by about 17 points (49.39 vs 32.3).

Model comparison: memory footprint and HumanEval score

Peak GPU memory (GB, lower is better)
  • LLaDA2-mini (16B MoE teacher): 31.3 GB
  • WeDLM-8B (dense teacher): 15.5 GB
  • AR baseline (Qwen3-0.6B): 1.2 GB
  • TIDE student (BD3LM-0.6B): 1.4 GB

HumanEval pass@1 (higher is better)
  • AR baseline (Qwen3-0.6B): 32.3
  • BD3LM, no distillation: 46.34
  • TIDE-Shared (WeDLM teacher): 48.78
  • TIDE-Cross (LLaDA2 teacher): 49.39

The 0.6B TIDE student needs roughly the same memory as a small AR model (1.4 GB vs 1.2 GB) yet scores 17 points higher on HumanEval, nearly matching the much larger teacher. This asymmetry between memory and capability is the distillation payoff.

02 — Diffusion LLMs

Parallel Decoding: Why All Tokens at Once

An autoregressive model generates tokens left to right: predict token 1, then token 2 given token 1, and so on. Each token only sees the tokens before it. A diffusion language model works differently: start with an entirely masked sequence and simultaneously predict all positions over multiple denoising steps. Because every token attends to every other token (bidirectional attention), the model maintains global consistency throughout generation.

For code this matters a lot. Generating a function body requires keeping the signature, variable names, return types, and syntactic structure consistent across the whole output — something that's easier when you can revise all positions together rather than commit to each token irreversibly.

[Animation: dLLM vs AR denoising of "I don't wanna go". Panels: dLLM (TIDE), parallel and bidirectional; AR model, sequential and left-to-right. The run spans 4 denoising steps, starting from t = 1.00.]

Key observation in the animation: the dLLM unmasks "wan" (token 4) before "don't" (tokens 2–3) is revealed. That is bidirectional context at work: later tokens can inform earlier positions. The AR model must generate left to right, committing to each word before seeing what comes next.
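The reveal order in the animation can be mimicked with a toy scheduler. This is only a sketch: the confidence scores are made up, they stay fixed across steps (a real dLLM would recompute them after every denoising step), and revealing the most confident masked positions first is an assumed unmasking rule, not something the text above specifies.

```python
def ar_order(n_tokens):
    """Autoregressive decoding: one position per step, strictly left to right."""
    return [[i] for i in range(n_tokens)]


def parallel_order(confidence, steps):
    """Toy diffusion-style decoding: each step reveals the most confident
    still-masked positions, spreading the work evenly over `steps`."""
    remaining = set(range(len(confidence)))
    schedule = []
    for step in range(steps):
        steps_left = steps - step
        k = -(-len(remaining) // steps_left)  # ceiling division
        ranked = sorted(remaining, key=lambda i: confidence[i], reverse=True)
        chosen = sorted(ranked[:k])
        schedule.append(chosen)
        remaining -= set(chosen)
    return schedule


tokens = ["I", "don", "'t", "wan", "na", "go"]
# Hypothetical per-position confidences, chosen only for illustration.
confidence = [0.9, 0.3, 0.4, 0.8, 0.5, 0.7]

for step, positions in enumerate(parallel_order(confidence, steps=4), start=1):
    print(f"dLLM step {step}: reveal {[tokens[i] for i in positions]}")
for step, positions in enumerate(ar_order(len(tokens)), start=1):
    print(f"AR step {step}: reveal {[tokens[i] for i in positions]}")
```

With these scores, "wan" is revealed in the first parallel step, before "don" and "'t", whereas the left-to-right schedule only reaches it on its fourth step.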

03 — Three Cross-Architecture Barriers

Why dLLM Distillation Is Harder Than AR Distillation

Standard AR distillation (MiniLLM, GKD, DistiLLM) copies a teacher's token probabilities into a smaller student. For dLLMs with heterogeneous teacher and student architectures, three fundamental challenges make this impossible to do directly.

[Diagram: the TIDE pipeline]
Teachers: LLaDA2-mini (16B MoE, GQA, Ling tokenizer) and WeDLM-8B (8B dense, MHA, Qwen3 tokenizer)
TIDE framework, temporal: TIDAL, dual-axis λ scheduling with λ(t,p) = λ_train(p) × (1−t)
TIDE framework, spatial: CompDemo, complementary mask splitting with 2-pass teacher inference per sample
TIDE framework, vocab: Reverse CALM, cross-tokenizer BCE alignment, BCE(p_s ∥ p_t), bounded gradients
Student: BD3LM-0.6B (Qwen3-0.6B base, block diffusion), 1.4 GB, 41 tok/s, HumanEval 49.39 vs AR 32.3 (+17)

The three barriers are:
  • Temporal reliability. At diffusion timestep t≈1, nearly all tokens are masked and the teacher is essentially guessing. Using noisy teacher signals at high t corrupts distillation. AR models don't have this issue — the teacher always sees the full left context.
  • Spatial scarcity. With 70–80% of tokens masked, the teacher has little context to form reliable predictions. Standard distillation uses the same masked input for teacher and student — wasting the teacher's capacity.
  • Vocabulary mismatch. LLaDA2 uses the Ling tokenizer; BD3LM uses Qwen3's tokenizer. Token-level KL divergence is undefined when vocabularies differ — you can't directly compare probability distributions over different token sets.

TIDE's three components each address one barrier: TIDAL handles the temporal problem, CompDemo the spatial problem, and Reverse CALM the vocabulary problem.

04 — TIDAL

Knowing When to Trust the Teacher

The key insight is that teacher signal quality in dLLMs varies along two independent axes: the current diffusion timestep and the training progress of the student.

At high noise (t≈1, most tokens masked), the teacher's predictions are unreliable — it can barely see anything. At low noise (t≈0, few tokens masked), the teacher sees nearly the entire sequence and predicts confidently. Separately, early in training, the student is too immature to absorb complex teacher distributions without collapsing; later it can handle full supervision.

TIDAL — dual-axis lambda scheduling
λ_train(p) = λ_init + (λ_max − λ_init) × ½(1 − cos(π·p)) ← cosine ramp over training progress p ∈ [0,1]
λ_t(t, p) = λ_train(p) × (1 − t) ← zero out at high noise (t→1)
defaults: λ_init = 0.1, λ_max = 0.9
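The schedule is small enough to write out directly. The following is a minimal sketch of the two formulas above using the stated defaults; the function names `lambda_train` and `tidal_lambda` are illustrative, not taken from the paper's code.

```python
import math


def lambda_train(p, lam_init=0.1, lam_max=0.9):
    """Cosine ramp from lam_init (p = 0) to lam_max (p = 1)."""
    return lam_init + (lam_max - lam_init) * 0.5 * (1.0 - math.cos(math.pi * p))


def tidal_lambda(t, p, lam_init=0.1, lam_max=0.9):
    """Teacher weight at diffusion timestep t and training progress p."""
    return lambda_train(p, lam_init, lam_max) * (1.0 - t)


# Rows are training progress p, columns are timestep t.
for p in (0.0, 0.5, 1.0):
    row = [round(tidal_lambda(t, p), 3) for t in (0.0, 0.5, 1.0)]
    print(f"p={p:.1f}: {row}")
```

The weight peaks at 0.9 only when t = 0 and p = 1, and it is exactly zero at t = 1 regardless of training progress.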

The interactive heatmap below shows how much weight is placed on the teacher signal; brighter means more trust in the teacher. The sweet spot is the corner where low masking (t≈0, teacher is confident) meets late training (p≈1, student is mature enough to learn).

[Interactive heatmap: λ(t, p), how much to trust the teacher at each training moment.]

Compare this to single-axis AR distillation (TAID): it only schedules along the training progress axis — a horizontal slice through this heatmap. TIDAL adds the vertical axis, suppressing the noisy high-t region that AR distillation never had to worry about.

The interpolated target

At each masked position, the actual training target blends the student's own predictions with the teacher's predictions, weighted by λ_t:

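Interpolated target and loss
r_t = softmax( ((1−λ_t)·s + λ_t·t) / T )   (detached: no gradient flows through the target)
L_TIDAL = DKL( r_t ∥ softmax(s/T) ) × T²
Here s and t are the student and teacher logits at a masked position (the teacher logits t are distinct from the timestep t), and T is the softmax temperature.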

When λ_t = 0, which happens at maximum noise (t = 1), the target equals the student's own output: the student trains against itself, which is a stable self-distillation objective. Early training has a similar but weaker effect, since λ_train starts at λ_init = 0.1 rather than 0. When λ_t = 0.9 (t = 0 at the end of training), the target is dominated by the teacher.
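A minimal pure-Python sketch of the interpolated target and loss for a single masked position follows. It assumes the temperature T divides the mixed logits inside the softmax, so that the target and the student distribution are softened the same way. The helper names and the example logits are illustrative, and "detached" is modeled simply by treating the target as a constant.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tidal_loss(student_logits, teacher_logits, lam, temperature=1.0):
    """KL(target || student) * T^2, with the target mixed in logit space."""
    mixed = [(1.0 - lam) * s + lam * z
             for s, z in zip(student_logits, teacher_logits)]
    target = softmax(mixed, temperature)  # constant: no gradient through it
    pred = softmax(student_logits, temperature)
    kl = sum(r * math.log(r / q) for r, q in zip(target, pred) if r > 0.0)
    return kl * temperature ** 2


# Made-up logits over a 3-token vocabulary, for illustration only.
student = [2.0, 0.0, -1.0]
teacher = [0.0, 3.0, 0.0]
for lam in (0.0, 0.1, 0.9):
    print(f"lambda={lam}: loss={tidal_loss(student, teacher, lam, temperature=2.0):.4f}")
```

At λ_t = 0 the loss is zero because the target coincides with the student's own distribution; it grows as λ_t moves the target toward the teacher.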

05 — CompDemo + Reverse CALM

Richer Teacher Context and Cross-Tokenizer Alignment

CompDemo: let the teacher see more

Even with TIDAL suppressing high-noise signals, the teacher's context is still limited by masking. CompDemo exploits the discrete diffusion structure: randomly split the masked positions into two complementary subsets, run the teacher twice — each time revealing one subset as context — and merge the resulting logits.

Step 1. Standard masked input (masking ratio ≈ 50%)
Normal dLLM training input. The teacher sees this and must predict all masked positions, but with limited context.
    I [M] [M] wan [M] go
    ([M] = masked token; all other tokens are visible)

Step 2. Random mask split: M_A = {"don", "na"}, M_B = {"'t"}
The three masked positions are partitioned into two complementary subsets (split ratio ρ = 0.5). M_A will become context in Pass 1; M_B will become context in Pass 2.
    I [don: M_A] ['t: M_B] wan [na: M_A] go

Step 3. Pass 1: reveal M_A, predict M_B
Teacher sees "don" and "na" as demonstration context. Now it can leverage these tokens to better predict "'t" at the M_B position.
    I don ✓ [M ← predict] wan na ✓ go

Step 4. Pass 2: reveal M_B, predict M_A
Teacher sees "'t" as demonstration context. Now it can better predict "don" and "na" at the M_A positions, with information about the middle of the phrase.
    I [M ← predict] 't ✓ wan [M ← predict] go

Step 5. Merge logits: every masked position gets a richer teacher signal
Logits from Pass 1 (M_B positions) and Pass 2 (M_A positions) are merged. Every masked position now has a teacher prediction conditioned on complementary context. Cost: ~1.5× training time (teacher is frozen, no gradients).
    I don ✓✓ 't ✓✓ wan na ✓✓ go
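The five steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: `toy_teacher` is a hypothetical stand-in that reports how many tokens were visible instead of returning logits, and the split uses a seeded shuffle so the example is reproducible.

```python
import random

MASK = "[M]"


def split_masked(masked_positions, rho=0.5, seed=0):
    """Randomly partition the masked positions into complementary sets M_A, M_B."""
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    k = round(rho * len(shuffled))
    if len(shuffled) > 1:
        k = min(max(k, 1), len(shuffled) - 1)  # keep both subsets non-empty
    return sorted(shuffled[:k]), sorted(shuffled[k:])


def compdemo(tokens, masked_positions, teacher, rho=0.5, seed=0):
    """Run the teacher once per subset and merge its per-position outputs."""
    m_a, m_b = split_masked(masked_positions, rho, seed)
    merged = {}
    # Pass 1 predicts M_B with M_A revealed; pass 2 predicts M_A with M_B revealed.
    for predict in (m_b, m_a):
        view = [MASK if i in predict else tok for i, tok in enumerate(tokens)]
        outputs = teacher(view)
        for i in predict:
            merged[i] = outputs[i]
    return merged


def toy_teacher(view):
    """Hypothetical frozen teacher: reports the number of visible tokens
    at prediction time for every position, in place of real logits."""
    visible = sum(tok != MASK for tok in view)
    return [visible] * len(view)


tokens = ["I", "don", "'t", "wan", "na", "go"]
masked = [1, 2, 4]  # the standard input shows only 3 of the 6 tokens
print(compdemo(tokens, masked, toy_teacher))
```

Every masked position ends up predicted from 4 or 5 visible tokens instead of the 3 available in the standard masked input, which is the extra context CompDemo buys with its second teacher pass.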

Reverse CALM: cross-tokenizer alignment without gradient explosion

When teacher and student use different tokenizers (LLaDA2 uses Ling tokens; BD3LM uses Qwen3 tokens), token-level KL divergence is undefined. TIDE aligns them at the chunk level: find the minimal text spans that contain complete tokens from both vocabularies, then compare chunk-level probabilities.
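One way to find such chunks is to line up token boundaries by character offset: a chunk ends wherever both tokenizations end a token at the same position. The sketch below assumes both token lists spell exactly the same text and contain no empty tokens. The two tokenizations are invented for illustration and are not real Ling or Qwen3 output, and how token probabilities are aggregated within a chunk is left out because the text does not specify it.

```python
def boundaries(tokens):
    """Cumulative character offsets at which each token ends."""
    ends, pos = [], 0
    for tok in tokens:
        pos += len(tok)
        ends.append(pos)
    return ends


def align_chunks(tokens_a, tokens_b):
    """Split two tokenizations of the same text into minimal shared chunks."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("both tokenizations must spell the same text")
    ends_a, ends_b = boundaries(tokens_a), boundaries(tokens_b)
    shared = sorted(set(ends_a) & set(ends_b))
    chunks, start_a, start_b = [], 0, 0
    for end in shared:
        stop_a = ends_a.index(end) + 1
        stop_b = ends_b.index(end) + 1
        chunks.append((tokens_a[start_a:stop_a], tokens_b[start_b:stop_b]))
        start_a, start_b = stop_a, stop_b
    return chunks


# Hypothetical tokenizations of the same string.
teacher_tokens = ["I", " don", "'t", " wan", "na"]
student_tokens = ["I", " do", "n't", " wanna"]
for teacher_chunk, student_chunk in align_chunks(teacher_tokens, student_tokens):
    print(teacher_chunk, "<->", student_chunk)
```

Each resulting pair covers the same text span with whole tokens on both sides, so a single probability per chunk can be compared even though the vocabularies differ.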

The natural forward BCE objective has a dangerous failure mode: when the student is uncertain (p_s → 0) but the teacher is confident (p_t → 1), the gradient coefficient p_t/p_s diverges. Reverse CALM fixes this by swapping the arguments:

Forward CALM (unstable) vs Reverse CALM (bounded)
L_Fwd = − [ p_t · log p_s + (1−p_t) · log(1−p_s) ]   (gradient ∝ p_t/p_s, explodes when p_s→0)
L_Rev = − [ p_s · log p_t + (1−p_s) · log(1−p_t) ]   (gradient ∝ log(p_t/(1−p_t)), bounded by the teacher)
Dual-end filtering: p_t≈0.5 (poorly aligned chunk) zeroes the gradient; low p_s suppresses noise via ∂p_s/∂θ.
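The difference is easy to check numerically. The sketch below differentiates both losses with respect to the scalar p_s only (the chain-rule factor ∂p_s/∂θ is left out), using made-up probabilities.

```python
import math


def forward_grad(p_s, p_t):
    """d L_Fwd / d p_s for L_Fwd = -[p_t log p_s + (1 - p_t) log(1 - p_s)]."""
    return -p_t / p_s + (1.0 - p_t) / (1.0 - p_s)


def reverse_grad(p_s, p_t):
    """d L_Rev / d p_s for L_Rev = -[p_s log p_t + (1 - p_s) log(1 - p_t)].
    The result depends on the teacher probability only."""
    return -math.log(p_t / (1.0 - p_t))


p_t = 0.9  # confident teacher
for p_s in (1e-1, 1e-3, 1e-6):
    print(f"p_s={p_s:g}: forward={forward_grad(p_s, p_t):.3e}, "
          f"reverse={reverse_grad(p_s, p_t):.3f}")
```

The forward coefficient grows without bound as p_s shrinks, while the reverse coefficient stays at the teacher's log-odds and vanishes when p_t = 0.5.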

Reverse CALM is the Bernoulli cross-entropy H(p_s, p_t) = KL(p_s ∥ p_t) + H(p_s), so it matches the reverse KL up to the student's own entropy term and is likewise a mode-seeking objective in scalar chunk space. Because TIDAL's interpolated target requires a stable gradient, it is counterproductive with the reverse objective and is not applied in the cross-tokenizer pipeline.

06 — Results

Code Is Where Diffusion Wins

TIDE evaluates across eight benchmarks: reasoning (GSM8K, MATH, BBH), knowledge (MMLU-Pro, MMLU), commonsense (HellaSwag), and code (HumanEval, MBPP). Both distillation pipelines are tested — cross-tokenizer (LLaDA2 → BD3LM) and shared-tokenizer (WeDLM → BD3LM).

HumanEval pass@1 (code generation benchmark)

Baselines
  • Qwen3-0.6B (AR): 32.3
  • BD3LM, no distillation: 46.34
Shared-tokenizer pipeline (WeDLM → BD3LM)
  • KL baseline: 41.46
  • TIDE-Shared (TIDAL + CompDemo): 48.78
Cross-tokenizer pipeline (LLaDA2 → BD3LM)
  • CALM baseline: 43.90
  • TIDE-Cross (Reverse CALM): 49.39

All eight benchmarks

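| Benchmark | AR (0.6B) | BD3LM | KL | TIDE-Shared | CALM | TIDE-Cross |
|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 48.98 | 48.60 | 52.24 |
| MATH | 32.40 | 13.08 | 9.40 | 11.16 | 13.14 | 13.20 |
| BBH | 41.50 | 26.32 | 25.79 | 26.79 | 24.21 | 27.37 |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 14.48 | 13.47 | 14.52 |
| HellaSwag | 47.40 | 39.28 | 39.78 | 40.50 | 40.42 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.92 | 39.42 | 39.59 |
| HumanEval | 32.30 | 46.34 | 41.46 | 48.78 | 43.90 | 49.39 |
| MBPP | 36.60 | 37.80 | 31.20 | 37.80 | 34.80 | 38.40 |
| Average | 40.91 | 32.67 | 30.55 | 33.55 | 32.25 | 34.20 |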

A clear asymmetry emerges: the AR model leads on all six reasoning, knowledge, and commonsense benchmarks, most sharply on reasoning (GSM8K 59.6 vs 52.2, MATH 32.4 vs 13.2), while TIDE-distilled dLLMs win on code (HumanEval 49.4 vs 32.3, MBPP 38.4 vs 36.6). The parallel generation process maintains global coherence across the whole output, and that structural property is exactly what structured code generation needs.

Deployment efficiency

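| Model | Params | Peak Mem | Tokens/s | HumanEval |
|---|---|---|---|---|
| LLaDA2-mini (teacher) | 16.3B | 31.3 GB | 7.8 | |
| WeDLM-8B (teacher) | 8.2B | 15.5 GB | 37.7 | |
| Qwen3-0.6B (AR baseline) | 0.6B | 1.2 GB | 51.3 | 32.3 |
| BD3LM-0.6B, no distill | 0.6B | 1.4 GB | 42.1 | 46.3 |
| TIDE distilled (best) | 0.6B | 1.4 GB | 41.0 | 49.4 |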

Distillation adds only 2.6% throughput overhead over the undistilled BD3LM (41.0 vs 42.1 tok/s), confirming that the quality gains come entirely from training — not from any architectural change that would affect inference.
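The headline ratios follow directly from the reported numbers. The snippet below only re-derives them from the figures quoted in this article; the variable names are illustrative.

```python
# Figures as reported in the article.
teacher_mem_gb, student_mem_gb = 31.3, 1.4
undistilled_tps, tide_tps = 42.1, 41.0
tide_humaneval, ar_humaneval = 49.39, 32.3

memory_reduction = teacher_mem_gb / student_mem_gb
throughput_overhead = (undistilled_tps - tide_tps) / undistilled_tps
humaneval_gap = tide_humaneval - ar_humaneval

print(f"memory reduction: {memory_reduction:.1f}x")
print(f"throughput overhead: {throughput_overhead:.1%}")
print(f"HumanEval gap vs AR: {humaneval_gap:.2f} points")
```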