A 64× Number That Was Never a Fair Fight
When Plaid (2023) reported that continuous diffusion language models needed 64× more compute than autoregressive models to reach the same validation loss, the community largely concluded continuous diffusion was impractical at scale. That number became a citation shorthand for "continuous DLMs don't scale."
The problem: Plaid used a different transformer architecture than the modern discrete DLMs it was being compared against. The dataset, optimizer hyperparameters, attention variant, and normalization all differed as well, so the 64× figure measures "everything different at once" rather than the continuous diffusion inductive bias itself.
RePlaid asks: what if we hold everything else equal? Use the same DiT backbone, same SlimPajama dataset, same AdamW schedule, same FLOP-counting methodology as MDLM and Duo — and just change continuous vs. discrete. The answer: the compute gap falls to 20×, placing continuous diffusion squarely between MDLM (14×) and Duo (22×).
How Plaid Works: Gaussian Noise on Token Embeddings
Plaid is a Variational Diffusion Model (VDM) for text. Rather than operating on discrete token IDs directly (as MDLM and Duo do), Plaid first embeds each token into a low-dimensional continuous vector space using a learned embedding matrix E ∈ ℝ^(V×d_e) with d_e = 16 (not 768). Gaussian noise is then added to these embeddings, not to one-hot token vectors.
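As a rough illustration of noising embeddings rather than one-hot vectors, here is a minimal NumPy sketch. It assumes the standard variance-preserving VDM parameterization (σ_t² = sigmoid(γ_t), α_t² = 1 − σ_t²); the function name `noisy_embeddings`, the toy vocabulary size, and the random embedding matrix are all illustrative stand-ins, not Plaid's actual code.

```python
import numpy as np

def noisy_embeddings(token_ids, E, gamma_t, rng):
    """Return z_t = alpha_t * E[token_ids] + sigma_t * eps (VDM-style forward process)."""
    x = E[token_ids]                             # [L, d_e] clean embeddings
    sigma2 = 1.0 / (1.0 + np.exp(-gamma_t))      # sigma_t^2 = sigmoid(gamma_t)
    alpha = np.sqrt(1.0 - sigma2)                # variance preserving: alpha^2 + sigma^2 = 1
    eps = rng.standard_normal(x.shape)           # Gaussian noise in embedding space
    return alpha * x + np.sqrt(sigma2) * eps

rng = np.random.default_rng(0)
V, d_e, L = 1000, 16, 8                          # toy vocabulary; d_e = 16 as in the text
E = rng.standard_normal((V, d_e)) / np.sqrt(d_e) # random stand-in for the learned matrix
token_ids = rng.integers(0, V, size=L)

z_low = noisy_embeddings(token_ids, E, gamma_t=-10.0, rng=rng)   # almost clean
z_high = noisy_embeddings(token_ids, E, gamma_t=10.0, rng=rng)   # almost pure noise
```

At very negative γ the output stays close to the clean embeddings; at very positive γ it is dominated by noise.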
This has a key efficiency benefit: projecting V-dimensional one-hot vectors (V ≈ 32K) through the transformer hidden size h requires [L, V] × [V, h] multiplications. Plaid reduces this to [L, d_e] × [d_e, h] — roughly 50× fewer FLOPs at V=32K, h=768, d_e=16.
Training minimizes the negative ELBO (NELBO), which has three terms: a prior KL loss, a reconstruction loss at t=0, and a diffusion loss across all timesteps. The denoiser x_θ(z_t, t) is a transformer with bidirectional attention, conditioned on the noise level t, that outputs a probability distribution over the vocabulary at each position.
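For reference, the three terms can be written in the generic continuous-time VDM form below. The notation is an assumption borrowed from the VDM literature rather than taken from Plaid or RePlaid: w denotes the token sequence, x its embedding, x̂_θ the denoiser's estimate of x (for example, the expected embedding under the predicted token distribution), and SNR(t) = exp(−γ(t)).

```latex
-\log p(w) \;\le\;
\underbrace{D_{\mathrm{KL}}\!\left(q(z_1 \mid x)\,\|\,p(z_1)\right)}_{\text{prior loss}}
\;+\;
\underbrace{\mathbb{E}_{q(z_0 \mid x)}\!\left[-\log p(w \mid z_0)\right]}_{\text{reconstruction loss}}
\;+\;
\underbrace{\tfrac{1}{2}\,\mathbb{E}_{t \sim \mathcal{U}(0,1),\,\epsilon \sim \mathcal{N}(0, I)}
\!\left[\gamma'(t)\, e^{-\gamma(t)}\,\lVert x - \hat{x}_\theta(z_t, t)\rVert_2^2\right]}_{\text{diffusion loss}}
```

The weight γ′(t)·e^(−γ(t)) equals −SNR′(t), which is why the diffusion term is often described as an SNR-weighted MSE.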
The noise schedule γ(t) controls how much signal remains at time t. Plaid inherits VDM's learnable schedule: the endpoints γ₀ and γ₁ plus a monotone neural net for the interior shape. Crucially, this schedule is learned by minimizing the Monte Carlo variance of the ELBO, a key mechanism explored in "The Schedule Secret" below.
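To make "endpoints plus a monotone interior" concrete, here is a small sketch that swaps the monotone neural net for a simpler stand-in: a piecewise-linear curve whose increments are forced positive with a softplus. The function `gamma_schedule`, the parameter values, and the endpoint values are hypothetical; only the structure (two endpoints, monotone shape in between) mirrors the description above.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def gamma_schedule(t, gamma_0, gamma_1, raw_weights):
    """Monotone piecewise-linear gamma(t) on [0, 1] with gamma(0)=gamma_0, gamma(1)=gamma_1."""
    increments = softplus(raw_weights)               # strictly positive, so gamma increases
    knots = np.concatenate([[0.0], np.cumsum(increments)])
    knots = knots / knots[-1]                        # normalized shape running from 0 to 1
    grid = np.linspace(0.0, 1.0, len(knots))
    shape = np.interp(t, grid, knots)
    return gamma_0 + (gamma_1 - gamma_0) * shape

raw = np.array([0.5, -1.0, 2.0, 0.0])                # hypothetical learnable parameters
t = np.linspace(0.0, 1.0, 11)
gamma = gamma_schedule(t, gamma_0=-10.0, gamma_1=10.0, raw_weights=raw)
signal_fraction = 1.0 / (1.0 + np.exp(gamma))        # alpha_t^2 = sigmoid(-gamma(t))
```

Because γ is increasing in t, the signal fraction α_t² decreases monotonically from nearly 1 to nearly 0.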
The Architecture Fix: Aligning with Modern Discrete DLMs
To make a fair comparison with MDLM and Duo, RePlaid adopts the exact same Diffusion Transformer (DiT) backbone: bidirectional attention, RoPE positional embeddings, and AdaLN-Zero modulation. The original Plaid used different choices for each of these, making its 64× gap impossible to attribute to the continuous diffusion training objective alone.
Five concrete changes bring Plaid's architecture in line. Each is small individually, but together they drop the compute gap from 64× to 27× (without self-conditioning) and to 20× (with self-conditioning).
Why ELBO Beats Cross-Entropy: The Structured Embedding Space
Other continuous DLMs — FLM, LangFlow, CDCD — train with a cross-entropy (CE) loss instead of the ELBO's MSE-style diffusion loss. A natural question: can RePlaid just use CE loss too? The answer is no, and the reason is visible in the embedding geometry.
RePlaid's ELBO objective enforces a low-rank, structured embedding space: after training, 90% of the variance in E is explained by just 6 principal components, and a t-SNE plot shows clear clustering by part-of-speech (nouns cluster together, verbs cluster separately, etc.). Adding an auxiliary CE loss disperses the embeddings — 90% variance now requires 13 principal components — and the PPL bound degrades from 22.1 to 26.1.
This low-rank geometry is not a coincidence — it's a consequence of the ELBO's signal-to-noise weighting. The ELBO's diffusion loss pays more attention to denoising at low noise levels (near clean data), which forces the model to create tight, discriminable clusters that are easy to recover from slightly corrupted versions. CE training, by contrast, applies uniform pressure across all token pairs, creating a more dispersed embedding geometry that is harder to denoise.
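The "90% of the variance in k principal components" statistic quoted above can be computed along these lines. This is a minimal sketch on synthetic matrices, not RePlaid's actual embeddings; the helper name `components_for_variance` is made up for illustration.

```python
import numpy as np

def components_for_variance(E, threshold=0.90):
    """Smallest number of principal components whose cumulative explained variance >= threshold."""
    centered = E - E.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)    # singular values, descending
    explained = s**2 / np.sum(s**2)                  # variance fraction per component
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

rng = np.random.default_rng(0)
V, d_e = 2000, 16

# Synthetic stand-ins (assumptions for illustration only).
structured = np.zeros((V, d_e))
structured[:, :3] = rng.standard_normal((V, 3))      # variance concentrated in 3 directions
structured += 0.01 * rng.standard_normal((V, d_e))   # small isotropic jitter
dispersed = rng.standard_normal((V, d_e))            # variance spread over all 16 directions

k_structured = components_for_variance(structured)
k_dispersed = components_for_variance(dispersed)
```

A low-rank matrix needs only a few components to reach the threshold, while an isotropic one needs most of its dimensions, which is the contrast the 6 vs. 13 comparison draws.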
| Variant | OWT PPL (1M steps) | Δ PPL vs. full model |
|---|---|---|
| RePlaid (s.c.) — full model | 22.1 | — |
| w/o output prior logits | 22.5 | +0.4 |
| w/o self-conditioning | 23.6 | +1.5 |
| w/o learnable noise schedule | 24.4 | +2.3 |
| w/o learnable embeddings (frozen random) | 39.4 | +17.3 |
Freezing the token embeddings causes the single largest PPL degradation, from 22.1 to 39.4, making the frozen-embedding variant the worst DLM on OWT. Embedding geometry is the primary driver of RePlaid's gains.
The Schedule Secret: Linear CE Emerges Automatically
A recurring challenge in diffusion language models is distributing denoising difficulty evenly across timesteps. If the model has to do most of its work at a narrow range of t values, training is inefficient. Recent CE-based methods (CDCD, FLM, LangFlow) hand-engineer time reparameterizations to achieve a near-linear per-timestep cross-entropy loss.
RePlaid doesn't need this heuristic. The ELBO's Monte Carlo variance is minimized by making the per-timestep diffusion loss flat (Proposition 1). And a flat diffusion loss, under a near-optimal denoiser, implies a near-linear per-timestep CE loss (Proposition 2) — exactly what CE-based methods manually engineer. The schedule learns it for free.
The linear trend term κ · t in the per-timestep CE loss dominates in practice, so the CE loss is approximately linear in t. No manual time reparameterization is therefore needed: minimizing the ELBO's Monte Carlo variance automatically provides what LangFlow engineers by hand.
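The intuition behind Proposition 1 can be seen in a toy experiment: a Monte Carlo estimate of an integral over t has the lowest variance when the integrand is flat in the sampling variable. The sketch below uses a made-up peaked loss profile (`per_t_loss`) and an explicit time warp through the profile's inverse CDF. That warp is a stand-in for what a learned schedule achieves, not the paper's procedure.

```python
import numpy as np

def per_t_loss(t):
    # Hypothetical per-timestep loss with a sharp peak near t = 0.3.
    return 0.2 + 5.0 * np.exp(-((t - 0.3) ** 2) / 0.005)

# Cumulative integral of the loss profile (trapezoid rule), normalized to a CDF.
t_grid = np.linspace(0.0, 1.0, 20001)
loss_grid = per_t_loss(t_grid)
cum = np.concatenate([[0.0], np.cumsum(0.5 * (loss_grid[1:] + loss_grid[:-1]) * np.diff(t_grid))])
total = cum[-1]                                   # integral of the loss over [0, 1]
cdf = cum / total

# Warp time with t = F^{-1}(u); the integrand in u is loss(t(u)) * dt/du.
u_grid = np.linspace(0.0, 1.0, 2001)
t_of_u = np.interp(u_grid, cdf, t_grid)
warped_integrand = per_t_loss(t_of_u) * np.gradient(t_of_u, u_grid)

rng = np.random.default_rng(0)
u = rng.uniform(size=10_000)
naive_samples = per_t_loss(u)                              # uniform t, peaked integrand
warped_samples = np.interp(u, u_grid, warped_integrand)    # uniform u, nearly flat integrand

print(naive_samples.mean(), warped_samples.mean(), total)  # all estimate the same integral
print(naive_samples.var(), warped_samples.var())           # warped variance is far smaller
```

Both estimators target the same integral, but the warped one samples a nearly constant integrand, so its variance collapses.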
On Par with Discrete: The First Fair Scaling Comparison
To measure scaling behavior, the authors perform an IsoFLOP analysis: for each of 5 compute budgets (6×10¹⁸ to 10²⁰ FLOPs), they train models of varying sizes and find the compute-optimal (Chinchilla-style) validation loss. These optimal (FLOPs, loss) pairs are then fit to a power law: L* ∝ C^α.
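A power law of this form is a straight line in log-log space, so a least-squares line fit recovers the exponent. The sketch below assumes a pure power law L* = a · C^α with no irreducible-loss term and uses synthetic points; the intermediate budgets, the prefactor, and the exponent are invented for illustration and are not the paper's measurements.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**alpha by least squares in log-log space; returns (a, alpha)."""
    alpha, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(log_a)), float(alpha)

# Synthetic compute-optimal points (illustrative only).
compute = np.array([6e18, 1e19, 3e19, 6e19, 1e20])
loss = 40.0 * compute ** -0.05

a, alpha = fit_power_law(compute, loss)
```

With real IsoFLOP data, each loss value would be the minimum of the loss-versus-model-size curve at that compute budget.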
All five methods — AR, MDLM, Duo, RePlaid (s.c.), RePlaid (no s.c.) — exhibit nearly identical power-law exponents. The curves are parallel on a log-log plot; only their horizontal positions differ. RePlaid (s.c.) sits 20× to the right of AR, between MDLM (14×) and Duo (22×).
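When two curves share the same exponent, the horizontal offset between them is a single compute multiplier. A minimal sketch of that calculation, assuming both methods follow L = a · C^α with a common α (the function name and the numbers are hypothetical):

```python
def compute_gap(a_method, a_ref, alpha):
    """Compute multiplier a method needs to match the reference loss, given shared exponent alpha."""
    # Solve a_method * (gap * C)**alpha == a_ref * C**alpha for gap.
    return (a_ref / a_method) ** (1.0 / alpha)

# Illustrative values: a prefactor chosen so that the gap comes out to 20x.
alpha = -0.05
a_ref = 40.0
a_method = a_ref * 20.0 ** (-alpha)

gap = compute_gap(a_method, a_ref, alpha)
```

Because α is shared, the gap does not depend on the compute budget, which is what "parallel on a log-log plot" means in practice.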
One unexpected finding: in the over-trained regime (training a smaller model for much longer than compute-optimal), RePlaid (s.c.) overtakes MDLM. For a 66M parameter model, RePlaid achieves lower loss than MDLM when both are pushed past their compute-optimal points. This is significant for production scenarios where small, cheap-to-serve models are over-trained on large data budgets.
| Method | Compute gap vs AR | Params vs AR (optimal) |
|---|---|---|
| AR (reference) | 1× | 1× |
| MDLM (low var.) | 14× | 3.4× |
| RePlaid (s.c.) — ours | 20× | 1.0× |
| Duo | 22× | 3.4× |
| RePlaid (no s.c.) | 27× | — |
| Original Plaid | 64× | — |
RePlaid (s.c.) uses 3.4× fewer parameters than MDLM and Duo to reach its compute-optimal frontier — a side-effect of the low-dimensional (d_e=16) embeddings avoiding the large embedding tables discrete DLMs need.
State-of-the-Art Perplexity and Generation Quality
At the standard 0.1B-parameter, 1M-step benchmark, RePlaid (s.c.) achieves 22.1 PPL on OpenWebText — the best among all continuous DLMs, and even lower than the discrete MDLM (23.1). On LM1B, it reaches 31.6, surpassing Duo (33.7) but behind MDLM (30.8).
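For readers less familiar with the metric: perplexity is the exponential of the average per-token negative log-likelihood in nats. For diffusion models the NELBO upper-bounds that quantity, so the reported number is a bound on perplexity. A minimal sketch, with a hypothetical helper name:

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Perplexity from a summed negative log-likelihood (or NELBO) measured in nats."""
    return math.exp(total_nll_nats / num_tokens)

# Example: an average of ln(22.1) nats per token corresponds to a PPL of 22.1.
ppl = perplexity(math.log(22.1) * 1000, 1000)
```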
Figure: perplexity comparison across methods (bar chart). The AR reference is 17.5 PPL.
For generation quality, RePlaid (no s.c.) consistently beats Duo at all sampling step counts T. Self-conditioning helps at large T (≥64 steps) and produces a GenPPL of ~21 at T=1024, the best among all diffusion LMs — but hurts at small T because the self-conditioning loop sharpens the sampling distribution below the reference entropy, reducing diversity.