Great at Code, Stuck at 31 GB
Diffusion language models like LLaDA and Dream have reached competitive performance on reasoning, knowledge, and code benchmarks — but only at 8–100B parameters. The 16B MoE teacher used in this work requires 31.3 GB of GPU memory and produces just 7.8 tokens/second. Nobody deploys that on commodity hardware.
TIDE distills these giants into a 0.6B student that fits in 1.4 GB — a 22× memory reduction — while preserving the dLLM's most striking advantage: code generation scores that beat an equivalently-sized autoregressive model by 16+ points on HumanEval.
The 0.6B TIDE student occupies roughly the same memory as a small AR model (1.4 GB vs 1.2 GB) yet scores 17 points higher on HumanEval, nearly matching the much larger teacher. This asymmetry between memory and capability is the distillation payoff.
Parallel Decoding: Why All Tokens at Once
An autoregressive model generates tokens left to right: predict token 1, then token 2 given token 1, and so on. Each token only sees the tokens before it. A diffusion language model works differently: start with an entirely masked sequence and simultaneously predict all positions over multiple denoising steps. Because every token attends to every other token (bidirectional attention), the model maintains global consistency throughout generation.
For code this matters a lot. Generating a function body requires keeping the signature, variable names, return types, and syntactic structure consistent across the whole output — something that's easier when you can revise all positions together rather than commit to each token irreversibly.
Key observation in the animation: the dLLM unmasked "wan" (token 4) before "don't" (tokens 2–3) were revealed. That's bidirectional context — later tokens can inform earlier positions. The AR model must generate left-to-right, committing to each word before seeing what comes next.
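The difference in commit order can be sketched with a toy decoder. This is only an illustration: `confidence` is a hypothetical stand-in for a real model's per-position confidence, and "unmask the most confident positions first" is an assumed unmasking rule, not one the text specifies.

```python
def ar_order(n):
    """Autoregressive decoding commits positions strictly left to right."""
    return list(range(n))


def diffusion_order(n, steps, confidence):
    """Toy masked-diffusion decoding.

    At each denoising step, reveal the most confident still-masked positions.
    `confidence(pos, revealed)` is a hypothetical scorer, not a real model.
    """
    revealed = set()
    order = []
    per_step = -(-n // steps)  # ceiling division: positions revealed per step
    for _ in range(steps):
        masked = [i for i in range(n) if i not in revealed]
        if not masked:
            break
        ranked = sorted(masked, key=lambda i: confidence(i, revealed), reverse=True)
        for i in ranked[:per_step]:
            revealed.add(i)
            order.append(i)
    return order


# Made-up confidences: position 3 is "easy", positions 1 and 2 are "hard".
toy_scores = [0.9, 0.2, 0.3, 0.8, 0.5]
order = diffusion_order(5, steps=3, confidence=lambda i, revealed: toy_scores[i])
print(ar_order(5))  # [0, 1, 2, 3, 4]
print(order)        # [0, 3, 4, 2, 1]: position 3 is committed before 1 and 2
```

A real dLLM would recompute confidences after every step using the newly revealed tokens; the fixed scores here only make the ordering easy to follow.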
Why dLLM Distillation Is Harder Than AR Distillation
Standard AR distillation (MiniLLM, GKD, DistiLLM) copies a teacher's token probabilities into a smaller student. For dLLMs with heterogeneous teacher and student architectures, three challenges prevent applying that recipe directly:
- Temporal reliability. At diffusion timestep t≈1, nearly all tokens are masked and the teacher is essentially guessing. Using noisy teacher signals at high t corrupts distillation. AR models don't have this issue — the teacher always sees the full left context.
- Spatial scarcity. With 70–80% of tokens masked, the teacher has little context to form reliable predictions. Standard distillation uses the same masked input for teacher and student — wasting the teacher's capacity.
- Vocabulary mismatch. LLaDA2 uses the Ling tokenizer; BD3LM uses Qwen3's tokenizer. Token-level KL divergence is undefined when vocabularies differ — you can't directly compare probability distributions over different token sets.
TIDE's three components address each barrier in sequence: TIDAL handles the temporal problem, CompDemo the spatial problem, and Reverse CALM the vocabulary problem.
Knowing When to Trust the Teacher
The key insight is that teacher signal quality in dLLMs varies along two independent axes: the current diffusion timestep and the training progress of the student.
At high noise (t≈1, most tokens masked), the teacher's predictions are unreliable — it can barely see anything. At low noise (t≈0, few tokens masked), the teacher sees nearly the entire sequence and predicts confidently. Separately, early in training, the student is too immature to absorb complex teacher distributions without collapsing; later it can handle full supervision.
λ_t(t, p) = λ_train(p) × (1 − t), which goes to zero at high noise (t → 1)
Defaults: λ_init = 0.1, λ_max = 0.9
The interactive heatmap below shows how much weight is placed on the teacher signal — brighter = trust teacher more. The sweet spot is the top-left corner: low masking (t≈0, teacher is confident) combined with late training (p≈1, student is mature enough to learn).
Compare this to single-axis AR distillation (TAID): it only schedules along the training progress axis — a horizontal slice through this heatmap. TIDAL adds the vertical axis, suppressing the noisy high-t region that AR distillation never had to worry about.
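A minimal sketch of the dual-axis weight may help. The (1 − t) factor and the defaults come from the formula above; the shape of `lambda_train` is not given in the text, so the linear ramp from λ_init to λ_max used here is an assumption made purely for illustration.

```python
def lambda_train(p, lam_init=0.1, lam_max=0.9):
    """Training-progress schedule for p in [0, 1].

    ASSUMPTION: a linear ramp from lam_init to lam_max. The text only gives
    the two default values, not the shape of the schedule.
    """
    return lam_init + (lam_max - lam_init) * p


def lambda_t(t, p, lam_init=0.1, lam_max=0.9):
    """Dual-axis weight from the text: lambda_train(p) * (1 - t)."""
    return lambda_train(p, lam_init, lam_max) * (1.0 - t)


# A coarse version of the heatmap: one row per timestep, one column per progress value.
for t in (0.0, 0.5, 1.0):
    row = [round(lambda_t(t, p), 2) for p in (0.0, 0.5, 1.0)]
    print(f"t={t}: {row}")
```

Fixing t and sweeping p gives the single-axis schedule; the extra (1 − t) factor is what drives the whole row to zero at t = 1.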
The interpolated target
At each masked position, the actual training target blends the student's own predictions with the teacher's predictions, weighted by λ_t:
L_TIDAL = D_KL( r_t ∥ softmax(s/T) ) × T², where r_t is the interpolated target, s the student logits, and T the distillation temperature
When λ_t = 0 (at t = 1, where the (1 − t) factor vanishes), the target equals the student's own output: the student trains against itself, which is a stable self-distillation objective. Early in training the weight is small rather than exactly zero, since λ_train starts at λ_init = 0.1. When λ_t = 0.9 (low noise, late training), the target is dominated by the teacher.
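The loss can be sketched for a single masked position as follows. The exact form of the blend is not spelled out in the text, so this sketch assumes a convex combination in probability space, r_t = (1 − λ_t) · student + λ_t · teacher, with the student term treated as a fixed target; it also assumes teacher and student share a vocabulary. Function names are illustrative.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, softmax(logits / T), computed stably."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def tidal_loss(student_logits, teacher_logits, lam, T=1.0):
    """KL(r_t || softmax(s/T)) * T^2 for one masked position.

    ASSUMPTION: r_t is a convex blend of student and teacher probabilities.
    In real training the student term inside r_t would be detached from the
    gradient; plain Python has no autograd, so that is only noted here.
    """
    q = softmax(student_logits, T)          # student distribution
    p_teacher = softmax(teacher_logits, T)  # teacher distribution
    r = [(1.0 - lam) * qs + lam * pt for qs, pt in zip(q, p_teacher)]
    kl = sum(ri * math.log(ri / qi) for ri, qi in zip(r, q) if ri > 0)
    return kl * T * T


s = [2.0, 0.5, -1.0]   # made-up student logits
t = [0.1, 3.0, -0.5]   # made-up teacher logits
for lam in (0.0, 0.1, 0.9):
    print(lam, tidal_loss(s, t, lam, T=2.0))
```

With `lam = 0` the target coincides with the student and the loss is zero; raising `lam` pulls the target toward the teacher and the loss grows.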
Richer Teacher Context and Cross-Tokenizer Alignment
CompDemo: let the teacher see more
Even with TIDAL suppressing high-noise signals, the teacher's context is still limited by masking. CompDemo exploits the discrete diffusion structure: randomly split the masked positions into two complementary subsets, run the teacher twice — each time revealing one subset as context — and merge the resulting logits.
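A rough sketch of the two-pass idea, under stated assumptions: `teacher` is a hypothetical callable that maps a token list to per-position logits, and the merge rule (for each masked position, keep the logits from the pass in which that position was still masked) is one plausible reading of "merge the resulting logits", not a confirmed detail.

```python
import random

MASK = None  # sentinel for a masked position


def comp_demo(teacher, clean_tokens, masked_positions, rng):
    """Two complementary teacher passes over the masked positions."""
    positions = list(masked_positions)
    rng.shuffle(positions)
    half = len(positions) // 2
    subset_a, subset_b = set(positions[:half]), set(positions[half:])

    def run(still_masked):
        # Everything outside `still_masked` is shown to the teacher as clean text.
        noisy = list(clean_tokens)
        for i in still_masked:
            noisy[i] = MASK
        return teacher(noisy)

    logits_a_revealed = run(still_masked=subset_b)  # pass 1: subset A is context
    logits_b_revealed = run(still_masked=subset_a)  # pass 2: subset B is context

    # ASSUMED merge rule: each position takes logits from the pass where it was masked.
    merged = {i: logits_a_revealed[i] for i in subset_b}
    merged.update({i: logits_b_revealed[i] for i in subset_a})
    return merged


def toy_teacher(tokens):
    """Hypothetical teacher: its 'logit' is just the number of visible tokens."""
    visible = sum(tok is not MASK for tok in tokens)
    return [[visible] for _ in tokens]


merged = comp_demo(toy_teacher, list("abcdef"), [1, 2, 4, 5], random.Random(0))
print(merged)
```

In this toy run each pass sees 4 of 6 tokens, whereas a single pass with all four positions masked would see only 2, which is the extra context CompDemo is after.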
Reverse CALM: cross-tokenizer alignment without gradient explosion
When teacher and student use different tokenizers (LLaDA2 uses Ling tokens; BD3LM uses Qwen3 tokens), token-level KL divergence is undefined. TIDE aligns them at the chunk level: find the minimal text spans that contain complete tokens from both vocabularies, then compare chunk-level probabilities.
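The chunking step can be sketched with a two-pointer walk over character offsets. This assumes both tokenizations concatenate back to the same string and contain no empty tokens, which real byte-level tokenizers may not guarantee. The chunk score as a sum of token log-probabilities is likewise an assumption; the text does not say how chunk-level probabilities are computed.

```python
import math


def align_chunks(tokens_a, tokens_b):
    """Minimal spans whose boundaries coincide in both tokenizations.

    Returns a list of ((a_start, a_end), (b_start, b_end)) half-open index ranges.
    """
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("tokenizations must cover the same text")
    chunks = []
    i = j = 0              # next token index in each sequence
    end_a = end_b = 0      # characters consumed so far by each sequence
    start_i = start_j = 0  # where the current chunk began
    while i < len(tokens_a) or j < len(tokens_b):
        if end_a <= end_b:
            end_a += len(tokens_a[i])
            i += 1
        else:
            end_b += len(tokens_b[j])
            j += 1
        if end_a == end_b:  # both sequences end a token here: close the chunk
            chunks.append(((start_i, i), (start_j, j)))
            start_i, start_j = i, j
    return chunks


def chunk_logprob(token_logprobs, span):
    """ASSUMPTION: a chunk's log-probability is the sum over its tokens."""
    start, end = span
    return sum(token_logprobs[start:end])


teacher_tokens = ["un", "believ", "able"]  # made-up segmentations
student_tokens = ["unbe", "liev", "able"]
print(align_chunks(teacher_tokens, student_tokens))
# [((0, 2), (0, 2)), ((2, 3), (2, 3))]
```

Here "unbeliev" is the smallest span both vocabularies can cover with whole tokens, so the first two tokens on each side form one chunk and "able" forms the next.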
The natural forward BCE objective has a dangerous failure mode: when the student is uncertain (p_s → 0) but the teacher is confident (p_t → 1), the gradient coefficient p_t/p_s diverges. Reverse CALM fixes this by swapping the arguments:
L_Rev = − [ p_s · log p_t + (1 − p_s) · log(1 − p_t) ]
Gradient ∝ log(p_t/(1 − p_t)), so the coefficient is bounded by the teacher rather than by 1/p_s.
Dual-end filtering: p_t ≈ 0.5 (poorly aligned chunk) zeroes the gradient; low p_s suppresses noise via ∂p_s/∂θ.
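A quick numerical check of the two gradient coefficients, taking derivatives with respect to p_s directly. The probabilities are made-up values chosen to hit the failure mode described above.

```python
import math


def forward_bce_grad(p_s, p_t):
    """d/dp_s of -[p_t*log(p_s) + (1 - p_t)*log(1 - p_s)]."""
    return -p_t / p_s + (1.0 - p_t) / (1.0 - p_s)


def reverse_bce_grad(p_s, p_t):
    """d/dp_s of -[p_s*log(p_t) + (1 - p_s)*log(1 - p_t)].

    The result is the negative teacher log-odds and does not depend on p_s.
    """
    return -math.log(p_t / (1.0 - p_t))


p_t = 0.99  # confident teacher
for p_s in (1e-1, 1e-3, 1e-6):  # increasingly uncertain student
    print(p_s, forward_bce_grad(p_s, p_t), reverse_bce_grad(p_s, p_t))

# A teacher at p_t = 0.5 contributes no gradient under the reverse objective.
print(reverse_bce_grad(1e-6, 0.5))
```

The forward coefficient grows like 1/p_s as the student's chunk probability shrinks, while the reverse coefficient stays at the teacher's log-odds throughout.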
Reverse CALM is the cross-entropy term of the Bernoulli KL, KL(p_s ∥ p_t); the two differ only by the student entropy H(p_s). It acts as a mode-seeking objective in scalar chunk space. Because TIDAL's interpolated target requires a stable gradient, it is counterproductive with the reverse objective and is not applied in the cross-tokenizer pipeline.
Code Is Where Diffusion Wins
TIDE evaluates across eight benchmarks: reasoning (GSM8K, MATH, BBH), knowledge (MMLU-Pro, MMLU), commonsense (HellaSwag), and code (HumanEval, MBPP). Both distillation pipelines are tested — cross-tokenizer (LLaDA2 → BD3LM) and shared-tokenizer (WeDLM → BD3LM).
All eight benchmarks
| Benchmark | AR (0.6B) | BD3LM | KL | TIDE-Shared | CALM | TIDE-Cross |
|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 48.98 | 48.60 | 52.24 |
| MATH | 32.40 | 13.08 | 9.40 | 11.16 | 13.14 | 13.20 |
| BBH | 41.50 | 26.32 | 25.79 | 26.79 | 24.21 | 27.37 |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 14.48 | 13.47 | 14.52 |
| HellaSwag | 47.40 | 39.28 | 39.78 | 40.50 | 40.42 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.92 | 39.42 | 39.59 |
| HumanEval | 32.30 | 46.34 | 41.46 | 48.78 | 43.90 | 49.39 |
| MBPP | 36.60 | 37.80 | 31.20 | 37.80 | 34.80 | 38.40 |
| Average | 40.91 | 32.67 | 30.55 | 33.55 | 32.25 | 34.20 |
A clear asymmetry emerges: the AR model dominates on reasoning tasks (GSM8K 59.6 vs 52.2, MATH 32.4 vs 13.2) and also leads on the knowledge and commonsense benchmarks, while TIDE-distilled dLLMs win on code (HumanEval 49.4 vs 32.3, MBPP 38.4 vs 36.6). The parallel generation process maintains global coherence across the whole output, which is exactly the property code generation depends on.
Deployment efficiency
| Model | Params | Peak Mem | Tokens/s | HumanEval |
|---|---|---|---|---|
| LLaDA2-mini (teacher) | 16.3B | 31.3 GB | 7.8 | — |
| WeDLM-8B (teacher) | 8.2B | 15.5 GB | 37.7 | — |
| Qwen3-0.6B (AR baseline) | 0.6B | 1.2 GB | 51.3 | 32.3 |
| BD3LM-0.6B, no distill | 0.6B | 1.4 GB | 42.1 | 46.3 |
| TIDE distilled (best) | 0.6B | 1.4 GB | 41.0 | 49.4 |
Distillation costs about 2.6% in throughput relative to the undistilled BD3LM (41.0 vs 42.1 tok/s) at identical peak memory, consistent with the quality gains coming entirely from training rather than from any architectural change that would affect inference.
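The headline ratios follow directly from the deployment table; the short script below just redoes the arithmetic with the table's own numbers.

```python
# Values copied from the deployment table above.
teacher_mem_gb = 31.3   # LLaDA2-mini peak memory
student_mem_gb = 1.4    # TIDE distilled peak memory
bd3lm_tps = 42.1        # BD3LM-0.6B, no distill
tide_tps = 41.0         # TIDE distilled (best)
ar_humaneval = 32.3     # Qwen3-0.6B
tide_humaneval = 49.4   # TIDE distilled (best)

mem_ratio = teacher_mem_gb / student_mem_gb
throughput_drop = (bd3lm_tps - tide_tps) / bd3lm_tps
humaneval_gap = tide_humaneval - ar_humaneval

print(f"memory reduction: {mem_ratio:.1f}x")       # 22.4x
print(f"throughput drop:  {throughput_drop:.1%}")  # 2.6%
print(f"HumanEval gap:    {humaneval_gap:.1f}")    # 17.1
```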