
RePlaid: Continuous Diffusion Scales Competitively with Discrete

Continuous diffusion language models were written off after Plaid reported a 64× compute gap versus autoregressive models. RePlaid fixes the comparison: align the architecture with modern discrete DLMs, train with the same protocol, and the gap shrinks to 20× — on par with Duo (22×) and approaching MDLM (14×). The result is a 22.1 PPL on OpenWebText, the best among all continuous DLMs and better than MDLM.

May 27, 2026 ~10 min read Paper: arXiv:2605.18530
01 — The 64× Myth

A 64× Number That Was Never a Fair Fight

When Plaid (2023) reported that continuous diffusion language models needed 64× more compute than autoregressive models to reach the same validation loss, the community largely concluded continuous diffusion was impractical at scale. That number became a citation shorthand for "continuous DLMs don't scale."

The problem: Plaid used a different transformer architecture than the modern discrete DLMs it was being compared against. Different dataset, different optimizer hyperparameters, different attention variant, different normalization — making the 64× figure a measurement of "everything different at once" rather than a measurement of the continuous diffusion inductive bias itself.

RePlaid asks: what if we hold everything else equal? Use the same DiT backbone, same SlimPajama dataset, same AdamW schedule, same FLOP-counting methodology as MDLM and Duo — and just change continuous vs. discrete. The answer: the compute gap falls to 20×, placing continuous diffusion squarely between MDLM (14×) and Duo (22×).

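[Figure: compute gap needed to match AR validation loss; lower is better]

Method | Family | Compute gap vs AR
--- | --- | ---
AR (reference) | Autoregressive | baseline
MDLM (low var.) | Discrete diffusion | 14×
RePlaid (s.c.), ours | Continuous diffusion | 20×
Duo | Discrete diffusion | 22×
RePlaid (no s.c.) | Continuous diffusion | 27×
Original Plaid | Continuous diffusion | 64×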
Key insight: The 64× figure was never an inherent limitation of continuous diffusion — it was an artifact of an apples-to-oranges architectural comparison. With a fair setup, continuous diffusion is competitive with discrete DLMs.
02 — Plaid Background

How Plaid Works: Gaussian Noise on Token Embeddings

Plaid is a Variational Diffusion Model (VDM) for text. Rather than operating on discrete token IDs directly (as MDLM and Duo do), Plaid first embeds each token into a low-dimensional continuous vector space using a learned embedding matrix E ∈ ℝ^{V×d_e} with d_e = 16 (not 768). Gaussian noise is then added to these embeddings, not to one-hot token vectors.

This has a key efficiency benefit: projecting V-dimensional one-hot vectors (V ≈ 32K) into the transformer hidden size h requires an [L, V] × [V, h] matrix multiplication. Plaid reduces this to [L, d_e] × [d_e, h]. For this projection alone the cost ratio is V/d_e, about 2000× at V=32K and d_e=16; the embeddings themselves are about 50× narrower than the hidden size h=768.
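As a rough check on the sizes involved, the sketch below counts FLOPs for the two input projections only, assuming the common convention of 2 FLOPs per multiply-add. The sequence length is an arbitrary illustrative value, and nothing here accounts for the rest of the network.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an [m, k] x [k, n] matmul, counting a multiply-add as 2."""
    return 2 * m * k * n


# V, h and d_e are the values quoted in the text; seq_len is an assumption.
V, h, d_e = 32_000, 768, 16
seq_len = 1024

one_hot_flops = matmul_flops(seq_len, V, h)      # [L, V] x [V, h]
embedded_flops = matmul_flops(seq_len, d_e, h)   # [L, d_e] x [d_e, h]

projection_ratio = one_hot_flops / embedded_flops  # algebraically V / d_e
width_ratio = h / d_e                              # hidden size vs embedding size

print(f"one-hot projection : {one_hot_flops:.3e} FLOPs")
print(f"embedded projection: {embedded_flops:.3e} FLOPs")
print(f"projection ratio   : {projection_ratio:.0f}x")
print(f"h / d_e            : {width_ratio:.0f}x")
```

The sequence length cancels in the ratio, so the comparison depends only on V, d_e and h.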

[Figure: Plaid forward and reverse diffusion, with text represented as continuous embeddings]

Training minimizes the Negative ELBO (NELBO), which has three terms: a prior KL loss, a reconstruction loss at t=0, and a diffusion loss across all timesteps. The denoiser x_θ(z_t, t) outputs a probability distribution over the vocabulary at each position — it's a transformer with bidirectional attention conditioned on the noise level t.

The noise schedule γ(t) controls how much signal remains at time t. Plaid inherits VDM's learnable schedule: the endpoints γ₀ and γ₁ plus a monotone neural net for the interior shape. Crucially, this schedule is learned by minimizing the Monte Carlo variance of the ELBO — a key mechanism explored in §05.

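Plaid NELBO (training objective)

$$
\mathcal{L} \;=\; \underbrace{\mathrm{KL}\big(q(z_1 \mid x)\,\|\,p(z_1)\big)}_{\text{prior loss}} \;+\; \underbrace{\mathbb{E}\big[-\log \langle x_\theta(z_0, 0),\, x \rangle\big]}_{\text{reconstruction at } t=0} \;\underbrace{-\;\tfrac{1}{2}\,\mathbb{E}_{t,\,z_t}\big[\mathrm{SNR}'(t)\,\lVert \hat{e}_\theta(z_t, t) - e \rVert^2\big]}_{\text{diffusion loss}}
$$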
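To make the diffusion-loss term concrete, here is a minimal sketch using the standard VDM variance-preserving parameterization (α² = sigmoid(−γ), σ² = sigmoid(γ), SNR = exp(−γ)). The linear γ(t) and its endpoints are placeholders chosen for illustration; Plaid learns both. The "prediction" passed to the loss is just the noisy input, standing in for a real denoiser.

```python
import math
import random

# Hypothetical endpoints and a linear interior; the real schedule is learned.
GAMMA_0, GAMMA_1 = -6.0, 6.0


def gamma(t: float) -> float:
    return GAMMA_0 + (GAMMA_1 - GAMMA_0) * t


def gamma_prime(t: float) -> float:
    return GAMMA_1 - GAMMA_0


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def alpha_sigma(t: float) -> tuple[float, float]:
    """Signal and noise scales; alpha^2 + sigma^2 = 1 by construction."""
    g = gamma(t)
    return math.sqrt(sigmoid(-g)), math.sqrt(sigmoid(g))


def snr_prime(t: float) -> float:
    """d/dt of SNR(t) = exp(-gamma(t)); negative when gamma is increasing."""
    return -gamma_prime(t) * math.exp(-gamma(t))


def forward_noise(e: list[float], t: float, rng: random.Random) -> list[float]:
    """Sample z_t = alpha_t * e + sigma_t * eps for one token embedding."""
    alpha, sigma = alpha_sigma(t)
    return [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in e]


def diffusion_loss_term(e_hat: list[float], e: list[float], t: float) -> float:
    """Single-sample value of -1/2 * SNR'(t) * ||e_hat - e||^2."""
    sq_err = sum((p - q) ** 2 for p, q in zip(e_hat, e))
    return -0.5 * snr_prime(t) * sq_err


rng = random.Random(0)
e = [rng.gauss(0.0, 1.0) for _ in range(16)]  # one d_e = 16 embedding
t = 0.5
z_t = forward_noise(e, t, rng)
print(diffusion_loss_term(z_t, e, t))  # z_t stands in for a denoiser output
```

Because SNR′(t) is negative for an increasing γ, the leading minus sign makes the term non-negative, and it vanishes when the prediction equals the clean embedding.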
03 — RePlaid Architecture

The Architecture Fix: Aligning with Modern Discrete DLMs

To make a fair comparison with MDLM and Duo, RePlaid adopts the exact same Diffusion Transformer (DiT) backbone: bidirectional attention, RoPE positional embeddings, and AdaLN-Zero modulation. The original Plaid used different choices for each of these, making its 64× gap impossible to attribute to the continuous diffusion training objective alone.

Five concrete changes bring Plaid's architecture in line. Walk through them below — each is small individually, but together they drop the compute gap from 64× to 27× (without self-conditioning) and to 20× (with self-conditioning).

RePlaid architecture alignment: 5 changes from Plaid to RePlaid

1. Bidirectional Attention + RoPE
Plaid used standard learned positional embeddings with unspecified attention. RePlaid switches to RoPE (rotary position embeddings) with bidirectional attention, matching MDLM/Duo exactly.
attn = BidirectionalAttention(RoPE=True)

2. LayerNorm + MLP Biases
Enable biases in all MLP layers and use LayerNorm with learnable scale/shift. These small capacity additions are standard in modern transformers but were not present in original Plaid.
LayerNorm(bias=True); MLP(bias=True)

3. GELU(tanh) Activation
Replace the MLP activation with GELU using the tanh approximation, the variant used by GPT-2 and by MDLM.
act = GELU(approximate='tanh')

4. AdaLN-Zero with Learnable Gating
Inject the noise timestep t via AdaLN-Zero modulation: scale and shift the LayerNorm outputs as a function of t. Learnable zero-initialized gates on the residual branches allow stable training.
y = x + gate * Attn(AdaLN(x, t))

5. Remove FP32 Logit Head
Original Plaid computed the final output prior logits in FP32 (not BF16), a major numerical confounder in FLOP-matched comparisons. RePlaid removes this and runs the full forward pass in BF16.
head = Linear(h, V, dtype=bfloat16)
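Changes 3 and 4 can be illustrated without a deep-learning framework. The sketch below is a dependency-free toy, not RePlaid's code: `adaln_zero_residual` follows the DiT-style modulation `LN(x) * (1 + scale) + shift` with a gate on the residual branch, and takes `shift`, `scale` and `gate` as plain inputs, whereas a real model would produce them from the timestep embedding through a zero-initialized linear layer. An elementwise GELU stands in for the attention or MLP sublayer.

```python
import math


def gelu_tanh(x: float) -> float:
    """GELU with the tanh approximation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gelu_exact(x: float) -> float:
    """Exact GELU via the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """LayerNorm over one feature vector, without learnable affine terms."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_residual(x, shift, scale, gate, sublayer):
    """y = x + gate * sublayer(LN(x) * (1 + scale) + shift), elementwise."""
    modulated = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(modulated)
    return [xi + g * o for xi, g, o in zip(x, gate, out)]


def toy_sublayer(v: list[float]) -> list[float]:
    # Stand-in for attention or an MLP; hypothetical, for illustration only.
    return [gelu_tanh(a) for a in v]


x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)
ones = [1.0] * len(x)

y_init = adaln_zero_residual(x, zeros, zeros, zeros, toy_sublayer)
y_open = adaln_zero_residual(x, zeros, zeros, ones, toy_sublayer)
print(y_init)  # identical to x: a zero gate makes the block an identity map
print(y_open)
```

With all modulation outputs at zero, the block passes its input through unchanged, which is the property that makes zero-initialized gating a gentle starting point for training.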
04 — Embedding Geometry

Why ELBO Beats Cross-Entropy: The Structured Embedding Space

Other continuous DLMs — FLM, LangFlow, CDCD — train with a cross-entropy (CE) loss instead of the ELBO's MSE-style diffusion loss. A natural question: can RePlaid just use CE loss too? The answer is no, and the reason is visible in the embedding geometry.

RePlaid's ELBO objective enforces a low-rank, structured embedding space: after training, 90% of the variance in E is explained by just 6 principal components, and a t-SNE plot shows clear clustering by part-of-speech (nouns cluster together, verbs cluster separately, etc.). Adding an auxiliary CE loss disperses the embeddings — 90% variance now requires 13 principal components — and the PPL bound degrades from 22.1 to 26.1.
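The "components needed for 90% of the variance" statistic is easy to compute once the singular values of the centered embedding matrix are available (for instance from `numpy.linalg.svd(E - E.mean(axis=0), compute_uv=False)`). The sketch below assumes those singular values are already in hand; the two spectra are made up for illustration and are not RePlaid's.

```python
def components_for_variance(singular_values: list[float], threshold: float = 0.90) -> int:
    """Smallest k such that the top-k principal components explain
    at least `threshold` of the total variance."""
    variances = sorted((s * s for s in singular_values), reverse=True)
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Hypothetical 16-dimensional spectra: one concentrated, one dispersed.
concentrated = [10.0, 8.0, 6.0, 5.0, 4.0, 3.0] + [0.5] * 10
dispersed = [1.0] * 16

k_concentrated = components_for_variance(concentrated)
k_dispersed = components_for_variance(dispersed)
print(k_concentrated, k_dispersed)
```

A lower count means the embedding variance is packed into fewer directions, which is the sense in which the text calls the ELBO-trained space low-rank.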

[Figure: learned token embedding geometry (d_e = 16, shown in 2D via t-SNE). Legend: NOUN, VERB, ADJ, DET, PREP, PUNCT]

This low-rank geometry is not a coincidence — it's a consequence of the ELBO's signal-to-noise weighting. The ELBO's diffusion loss pays more attention to denoising at low noise levels (near clean data), which forces the model to create tight, discriminable clusters that are easy to recover from slightly corrupted versions. CE training, by contrast, applies uniform pressure across all token pairs, creating a more dispersed embedding geometry that is harder to denoise.

Component removed | OWT PPL (1M steps) | Delta
--- | --- | ---
RePlaid (s.c.), full model | 22.1 |
w/o output prior logits | 22.5 | +0.4
w/o self-conditioning | 23.6 | +1.5
w/o learnable noise schedule | 24.4 | +2.3
w/o learnable embeddings (frozen random) | 39.4 | +17.3

Freezing token embeddings causes the single largest PPL collapse, from 22.1 to 39.4, which would leave this ablated RePlaid behind every other DLM in the OWT comparison. Embedding geometry is the primary driver of RePlaid's gains.

05 — Noise Schedule

The Schedule Secret: Linear CE Emerges Automatically

A recurring challenge in diffusion language models is distributing denoising difficulty evenly across timesteps. If the model has to do most of its work at a narrow range of t values, training is inefficient. Recent CE-based methods (CDCD, FLM, LangFlow) hand-engineer time reparameterizations to achieve a near-linear per-timestep cross-entropy loss.

RePlaid doesn't need this heuristic. The ELBO's Monte Carlo variance is minimized by making the per-timestep diffusion loss flat (Proposition 1). And a flat diffusion loss, under a near-optimal denoiser, implies a near-linear per-timestep CE loss (Proposition 2) — exactly what CE-based methods manually engineer. The schedule learns it for free.
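The intuition behind Proposition 1 can be seen in a toy Monte Carlo experiment: when t is sampled uniformly, two per-timestep loss profiles with the same integral give the same expected loss, but only the flat one has zero estimator variance. The profiles below are invented for illustration and are not measured from any model.

```python
import random
import statistics


def mc_loss_samples(per_timestep_loss, n: int, rng: random.Random) -> list[float]:
    """Single-sample loss estimates with t drawn uniformly from [0, 1)."""
    return [per_timestep_loss(rng.random()) for _ in range(n)]


def flat_profile(t: float) -> float:
    return 1.0                      # integrates to 1 over [0, 1]


def peaked_profile(t: float) -> float:
    return 3.0 * (1.0 - t) ** 2     # also integrates to 1, but concentrated near t = 0


rng = random.Random(0)
flat_samples = mc_loss_samples(flat_profile, 10_000, rng)
peaked_samples = mc_loss_samples(peaked_profile, 10_000, rng)

flat_var = statistics.pvariance(flat_samples)
peaked_var = statistics.pvariance(peaked_samples)
print(f"flat   : mean={statistics.fmean(flat_samples):.3f}  var={flat_var:.3f}")
print(f"peaked : mean={statistics.fmean(peaked_samples):.3f}  var={peaked_var:.3f}")
```

For the peaked profile the exact variance is 9/5 − 1 = 0.8, so the sample variance should land near that value, while the flat profile's is exactly zero.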

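[Figure: per-timestep CE loss vs t; a linear profile means denoising difficulty is spread evenly]

Proposition 2: Per-Timestep CE Under Optimality

$$
\mathrm{CE}^{*}_{\gamma^{*}}(t) \;=\; \underbrace{H(x) - I(e;\, z_0)}_{\text{constant offset}} \;+\; \underbrace{\kappa \cdot t}_{\text{linear trend}} \;+\; \underbrace{C(x \mid z_t)}_{\text{conditional total correlation}}
$$

Here κ is the average diffusion loss, and the conditional total correlation C(x | z_t) is non-negative and increasing in t.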

The linear trend term κ · t dominates in practice, so the CE loss is approximately linear in t. This means no manual time reparameterization is needed: minimizing the ELBO variance automatically provides what LangFlow engineers by hand.

Practical upshot: A frozen cosine noise schedule (standard for image diffusion) concentrates denoising difficulty in the middle of the trajectory — easy at both ends, hard in the middle. The learned schedule redistributes work evenly, which improves both training efficiency and final likelihood.
06 — Scaling Laws

On Par with Discrete: The First Fair Scaling Comparison

To measure scaling behavior, the authors perform an IsoFLOP analysis: for each of 5 compute budgets (6×10¹⁸ to 10²⁰ FLOPs), they train models of varying sizes and find the compute-optimal (Chinchilla-style) validation loss. These optimal (FLOPs, loss) pairs are then fit to a power law: L* ∝ C^α.
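A power law is a straight line in log-log space, so the fit reduces to ordinary least squares on (log C, log L). The sketch below shows that step on synthetic data; the interior budgets, the prefactor and the exponent are all made up, and only the endpoints of the budget range come from the text.

```python
import math


def fit_power_law(compute: list[float], loss: list[float]) -> tuple[float, float]:
    """Least-squares fit of log L = log a + alpha * log C. Returns (a, alpha)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    a = math.exp(mean_y - alpha * mean_x)
    return a, alpha


# Five budgets spanning 6e18 to 1e20 FLOPs; the middle three are assumptions.
budgets = [6e18, 1e19, 3e19, 6e19, 1e20]
true_a, true_alpha = 30.0, -0.05            # synthetic ground truth
losses = [true_a * c ** true_alpha for c in budgets]

fit_a, fit_alpha = fit_power_law(budgets, losses)
print(f"a = {fit_a:.4f}, alpha = {fit_alpha:.4f}")
```

On real IsoFLOP data the points would carry noise, and the quality of the fit (for example residuals in log space) would be worth reporting alongside the exponent.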

All five methods — AR, MDLM, Duo, RePlaid (s.c.), RePlaid (no s.c.) — exhibit nearly identical power-law exponents. The curves are parallel on a log-log plot; only their horizontal positions differ. RePlaid (s.c.) sits 20× to the right of AR, between MDLM (14×) and Duo (22×).
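When two methods share an exponent, the "compute gap" is just the horizontal shift between their lines on the log-log plot. A minimal sketch, assuming both laws have the form L = a · C^α with the same α; the numeric values are synthetic and chosen so that the gap comes out to 20×.

```python
import math


def compute_gap(a_ref: float, a_other: float, alpha: float) -> float:
    """Multiplier g such that a_other * (g * C)**alpha == a_ref * C**alpha.

    Valid only when both power laws share the exponent alpha.
    """
    return (a_ref / a_other) ** (1.0 / alpha)


alpha = -0.05                       # synthetic shared exponent
a_ar = 30.0                         # synthetic AR prefactor
a_dlm = a_ar * 20.0 ** (-alpha)     # constructed so the gap is 20x

gap = compute_gap(a_ar, a_dlm, alpha)
print(f"compute gap: {gap:.1f}x")

# Sanity check at one budget: the shifted model matches the AR loss.
C = 1e19
loss_ar = a_ar * C ** alpha
loss_dlm_shifted = a_dlm * (gap * C) ** alpha
print(loss_ar, loss_dlm_shifted)
```

If the fitted exponents differed, the gap would depend on the budget C and a single multiplier would no longer summarize the comparison.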

[Figure: compute-optimal scaling laws, validation loss vs non-embedding FLOPs. Curves: AR (baseline), MDLM (14×), RePlaid s.c. (20×), Duo (22×), RePlaid no s.c. (27×)]

One unexpected finding: in the over-trained regime (training a smaller model for much longer than compute-optimal), RePlaid (s.c.) overtakes MDLM. For a 66M parameter model, RePlaid achieves lower loss than MDLM when both are pushed past their compute-optimal points. This is significant for production scenarios where small, cheap-to-serve models are over-trained on large data budgets.

Method | Compute gap vs AR | Params vs AR (optimal)
--- | --- | ---
AR (reference) | |
MDLM (low var.) | 14× | 3.4×
RePlaid (s.c.), ours | 20× | 1.0×
Duo | 22× | 3.4×
RePlaid (no s.c.) | 27× |
Original Plaid | 64× |

RePlaid (s.c.) uses 3.4× fewer parameters than MDLM and Duo to reach its compute-optimal frontier — a side-effect of the low-dimensional (d_e=16) embeddings avoiding the large embedding tables discrete DLMs need.

07 — Results

State-of-the-Art Perplexity and Generation Quality

At the standard 0.1B-parameter, 1M-step benchmark, RePlaid (s.c.) achieves 22.1 PPL on OpenWebText — the best among all continuous DLMs, and even lower than the discrete MDLM (23.1). On LM1B, it reaches 31.6, surpassing Duo (33.7) but behind MDLM (30.8).
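For reference, perplexity is the exponential of the mean per-token negative log-likelihood measured in nats. For diffusion models that likelihood is an upper bound (the NELBO), so the reported PPL is a bound as well. The helper names below are illustrative.

```python
import math


def perplexity(token_nlls_nats: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls_nats) / len(token_nlls_nats))


def nats_per_token(ppl: float) -> float:
    """Inverse mapping: the mean NLL that corresponds to a given perplexity."""
    return math.log(ppl)


for name, ppl in [("RePlaid (s.c.)", 22.1), ("MDLM", 23.1)]:
    print(f"{name}: PPL {ppl} = {nats_per_token(ppl):.3f} nats/token")
```

Seen in nats per token, the spread between these two models is only a few hundredths of a nat.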

[Figure: test perplexity on OpenWebText; lower is better]

Family | Model | OWT PPL
--- | --- | ---
Autoregressive | AR Transformer-XL (reference) | 17.5
Discrete diffusion | MDLM (low var.) | 23.1
Discrete diffusion | Duo | 25.2
Continuous diffusion | RePlaid (s.c.), ours | 22.1
Continuous diffusion | RePlaid (no s.c.), ours | 23.6
Continuous diffusion | Plaid (original) | 24.4
Continuous diffusion | LangFlow | 38.4

For generation quality, RePlaid (no s.c.) consistently beats Duo at all sampling step counts T. Self-conditioning helps at large T (≥64 steps), reaching a GenPPL of ~21 at T=1024, the best among all diffusion LMs. At small T it hurts, because the self-conditioning loop sharpens the sampling distribution below the reference entropy, reducing diversity.
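Low generative perplexity can be reached by producing repetitive text, which is why it is usually read next to an entropy measure. The sketch below uses unigram token entropy as a diversity proxy; that choice is an assumption made here for illustration, since the text does not specify how the reference entropy is computed, and the token lists are toy data.

```python
import math
from collections import Counter


def unigram_entropy(tokens: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical token distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


reference = "the cat sat on the mat while a dog slept by the door".split()
sharpened = "the cat the cat the cat the cat the cat the cat the".split()

h_ref = unigram_entropy(reference)
h_sharp = unigram_entropy(sharpened)
print(f"reference entropy: {h_ref:.3f} nats")
print(f"sharpened entropy: {h_sharp:.3f} nats")
```

A sample set whose entropy sits well below that of reference text is a sign that the sampler has traded diversity for likelihood.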

[Figure: generative perplexity vs sampling steps T, showing the quality-compute tradeoff; lower is better. Curves: RePlaid s.c., RePlaid no s.c., Duo, MDLM, LangFlow]
Why does this matter? A discrete DLM's sampler moves through a sequence of discrete token-level jumps, whereas a continuous DLM follows a smooth trajectory in embedding space, so ODE solvers and distillation can collapse the full sampling trajectory into far fewer steps. RePlaid's strong performance at T=32–64 steps (competitive with discrete methods' best) makes it a compelling foundation for few-step text generation research.
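To show what a deterministic, ODE-style sampler looks like in embedding space, here is a toy DDIM-style update under a fixed cosine schedule. Everything in it is an assumption for illustration: RePlaid uses a learned schedule, and the "denoiser" here always returns one fixed target vector (the optimal answer only when the data is a single point), so the example demonstrates the mechanics of the update, not sample quality at low step counts.

```python
import math
import random


def alpha_sigma(t: float) -> tuple[float, float]:
    """Cosine variance-preserving schedule: alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def ddim_step(z_t: list[float], x_hat: list[float], t: float, s: float) -> list[float]:
    """Deterministic move from time t to s < t, given a clean-data estimate x_hat."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    eps_hat = [(z - a_t * x) / s_t for z, x in zip(z_t, x_hat)]  # implied noise
    return [a_s * x + s_s * e for x, e in zip(x_hat, eps_hat)]


def sample(denoiser, dim: int, num_steps: int, rng: random.Random) -> list[float]:
    """Integrate from t = 1 (pure noise) down to t = 0 in num_steps updates."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, s in zip(times[:-1], times[1:]):
        z = ddim_step(z, denoiser(z, t), t, s)
    return z


TARGET = [0.3, -1.2, 0.7]  # hypothetical clean embedding


def toy_denoiser(z: list[float], t: float) -> list[float]:
    return TARGET  # ignores its inputs; stands in for a trained network


for steps in (1, 4, 32):
    out = sample(toy_denoiser, len(TARGET), steps, random.Random(steps))
    print(steps, [round(v, 6) for v in out])
```

With a trained denoiser the estimate changes along the trajectory, and the number of steps then controls how accurately the ODE is integrated; decoding would finish by mapping the final vector to a token through the output distribution.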