RePlaid: Continuous Diffusion Rivals Discrete

01 — The 64× Myth

A 64× Number That Was Never a Fair Fight

When Plaid (2023) reported that continuous diffusion language models needed 64× more compute than autoregressive models to reach the same validation loss, the community largely concluded continuous diffusion was impractical at scale. That number became a citation shorthand for "continuous DLMs don't scale."

The problem: Plaid used a different transformer architecture than the modern discrete DLMs it was being compared against, along with a different dataset, optimizer hyperparameters, attention variant, and normalization. The 64× figure therefore measures "everything different at once" rather than the continuous diffusion inductive bias itself.

RePlaid asks: what if we hold everything else equal? Use the same DiT backbone, same SlimPajama dataset, same AdamW schedule, same FLOP-counting methodology as MDLM and Duo — and just change continuous vs. discrete. The answer: the compute gap falls to 20×, placing continuous diffusion squarely between MDLM (14×) and Duo (22×).

Compute gap to match AR validation loss (lower is better)
Method	Family	Compute gap vs AR
AR (reference)	Autoregressive baseline	1×
MDLM (low var.)	Discrete diffusion	14×
RePlaid (s.c., ours)	Continuous diffusion	20×
Duo	Discrete diffusion	22×
RePlaid (no s.c.)	Continuous diffusion	27×
Original Plaid	Continuous diffusion	64×

Key insight: The 64× figure was never an inherent limitation of continuous diffusion — it was an artifact of an apples-to-oranges architectural comparison. With a fair setup, continuous diffusion is competitive with discrete DLMs.

02 — Plaid Background

How Plaid Works: Gaussian Noise on Token Embeddings

Plaid is a Variational Diffusion Model (VDM) for text. Rather than operating on discrete token IDs directly (as MDLM and Duo do), Plaid first embeds each token into a low-dimensional continuous vector space using a learned embedding matrix E ∈ ℝ^(V×d_e) with d_e = 16, far below the transformer hidden width of 768. Gaussian noise is then added to these embeddings, not to one-hot token vectors.

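This has a key efficiency benefit. Feeding V-dimensional one-hot vectors (V ≈ 32K) into a transformer of hidden size h requires an [L, V] × [V, h] multiplication, whereas Plaid needs only [L, d_e] × [d_e, h]. For this input projection the multiply count shrinks by a factor of V/d_e, about 2000× at V = 32K and d_e = 16. Measured against noising full-width h = 768 embeddings instead, the factor is h/d_e = 48, or roughly 50×.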
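As a rough check on the projection cost, the sketch below counts multiply-accumulates for the input projection only. The sequence length `L = 1024` is an assumption (the ratios do not depend on it), and `projection_macs` is a hypothetical helper written for this illustration.

```python
def projection_macs(seq_len: int, in_dim: int, hidden: int) -> int:
    """Multiply-accumulates for a [seq_len, in_dim] x [in_dim, hidden] matmul."""
    return seq_len * in_dim * hidden

L, V, h, d_e = 1024, 32_000, 768, 16   # L is an assumed sequence length

one_hot = projection_macs(L, V, h)      # dense projection of V-dim one-hot inputs
full_width = projection_macs(L, h, h)   # projection of h-dim noised embeddings
low_dim = projection_macs(L, d_e, h)    # projection of d_e-dim noised embeddings

ratio_vs_one_hot = one_hot // low_dim        # equals V / d_e
ratio_vs_full_width = full_width // low_dim  # equals h / d_e
print(ratio_vs_one_hot, ratio_vs_full_width)
```

The first ratio is V/d_e and the second is h/d_e, so which saving applies depends on the baseline being compared against.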

Figure: Plaid forward and reverse diffusion, with text represented as continuous embeddings.

Training minimizes the Negative ELBO (NELBO), which has three terms: a prior KL loss, a reconstruction loss at t=0, and a diffusion loss across all timesteps. The denoiser x_θ(z_t, t) outputs a probability distribution over the vocabulary at each position — it's a transformer with bidirectional attention conditioned on the noise level t.

The noise schedule γ(t) controls how much signal remains at time t. Plaid inherits VDM's learnable schedule: the endpoints γ₀ and γ₁ plus a monotone neural net for the interior shape. Crucially, this schedule is learned by minimizing the Monte Carlo variance of the ELBO — a key mechanism explored in §05.
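The forward process can be sketched with the standard VDM variance-preserving parameterization, where α_t² = sigmoid(−γ(t)), σ_t² = sigmoid(γ(t)), and SNR(t) = exp(−γ(t)). This is a minimal illustration, not Plaid's code: the linear `gamma` and its endpoints ±6 are assumptions standing in for the learned monotone network.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gamma(t: float, gamma_0: float = -6.0, gamma_1: float = 6.0) -> float:
    """Stand-in schedule: linear between assumed endpoints (Plaid learns the shape)."""
    return gamma_0 + (gamma_1 - gamma_0) * t

def snr(t: float) -> float:
    """Signal-to-noise ratio alpha_t^2 / sigma_t^2 = exp(-gamma(t))."""
    return math.exp(-gamma(t))

def noise_embedding(e, t, rng):
    """Sample z_t = alpha_t * e + sigma_t * eps, with alpha_t^2 + sigma_t^2 = 1."""
    g = gamma(t)
    alpha = math.sqrt(sigmoid(-g))
    sigma = math.sqrt(sigmoid(g))
    return [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in e]

rng = random.Random(0)
e = [rng.gauss(0.0, 1.0) for _ in range(16)]  # one token's d_e = 16 embedding
z_early = noise_embedding(e, 0.0, rng)        # high SNR: close to the clean embedding
z_late = noise_embedding(e, 1.0, rng)         # low SNR: close to pure Gaussian noise
```

Because γ is monotone increasing, the SNR decreases from t = 0 to t = 1; learning γ changes how quickly signal is destroyed along the way, not the endpoints' roles.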

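Plaid NELBO (training objective):

\[
\mathcal{L} = \underbrace{\mathrm{KL}\big(q(z_1 \mid x) \,\|\, p(z_1)\big)}_{\text{prior loss}} + \underbrace{\mathbb{E}\big[-\log \langle x_\theta(z_0, 0),\, x \rangle\big]}_{\text{reconstruction at } t = 0} \underbrace{-\ \tfrac{1}{2}\, \mathbb{E}_{t,\, z_t}\big[\mathrm{SNR}'(t)\, \lVert \hat{e}_\theta(z_t, t) - e \rVert^2\big]}_{\text{diffusion loss}}
\]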

03 — RePlaid Architecture

The Architecture Fix: Aligning with Modern Discrete DLMs

To make a fair comparison with MDLM and Duo, RePlaid adopts the exact same Diffusion Transformer (DiT) backbone: bidirectional attention, RoPE positional embeddings, and AdaLN-Zero modulation. The original Plaid used different choices for each of these, making its 64× gap impossible to attribute to the continuous diffusion training objective alone.

Five concrete changes bring Plaid's architecture in line. Each is small individually, but together they drop the compute gap from 64× to 27× (without self-conditioning) and to 20× (with self-conditioning).

Bidirectional Attention + RoPE

Plaid used standard learned positional embeddings with unspecified attention. RePlaid switches to RoPE (rotary position embeddings) with bidirectional attention — matching MDLM/Duo exactly.

attn = BidirectionalAttention(RoPE=True)
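A minimal pure-Python sketch of the rotary idea follows. It uses the interleaved-pair convention and the common base of 10000; both are assumptions here, and the function names are illustrative rather than taken from any RePlaid code. The point it demonstrates is that the query-key score depends only on the relative offset between positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102: offset 3
```

In a bidirectional model these scores are computed for every (query, key) pair with no causal mask, so each position attends to both earlier and later tokens.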

LayerNorm + MLP Biases

Enable biases in all MLP layers and use LayerNorm with learnable scale/shift. These small capacity additions are standard in modern transformers but were not present in original Plaid.

LayerNorm(bias=True); MLP(bias=True)

GELU(tanh) Activation

Replace the MLP activation with GELU using the tanh approximation, the variant used in GPT-2 and in MDLM's backbone.

act = GELU(approximate='tanh')
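For reference, the tanh approximation and the exact erf-based GELU can be compared directly. This is a plain-Python sketch of the two standard formulas; the grid over [-5, 5] is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation: 0.5 x (1 + tanh(sqrt(2/pi) (x + 0.044715 x^3)))."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute difference on a grid over [-5, 5]
max_gap = max(abs(gelu_exact(i / 100) - gelu_tanh(i / 100)) for i in range(-500, 501))
```

The two agree closely, so the choice matters mainly for matching another model's backbone exactly rather than for capacity.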

AdaLN-Zero with Learnable Gating

Inject the noise timestep t via AdaLN-Zero modulation — scale and shift the LayerNorm outputs as a function of t. Learnable zero-initialized gates on the residual branches allow stable training.

y = x + gate * Attn(AdaLN(x, t))
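The gating formula above can be sketched on plain lists. This is an illustrative toy, not the RePlaid implementation: in a DiT-style block, `shift`, `scale`, and `gate` would be produced from the timestep embedding by a zero-initialized linear layer, and `sublayer` stands in for attention or the MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned affine here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_block(x, shift, scale, gate, sublayer):
    """y = x + gate * sublayer(LN(x) * (1 + scale) + shift), elementwise."""
    normed = layer_norm(x)
    modulated = [n * (1.0 + sc) + sh for n, sc, sh in zip(normed, scale, shift)]
    update = sublayer(modulated)
    return [xi + g * u for xi, g, u in zip(x, gate, update)]

x = [0.5, -1.0, 2.0, 0.25]
zeros = [0.0] * len(x)
toy_sublayer = lambda v: [2.0 * a for a in v]   # hypothetical stand-in for Attn/MLP

y_init = adaln_zero_block(x, zeros, zeros, zeros, toy_sublayer)           # gate = 0
y_open = adaln_zero_block(x, zeros, zeros, [1.0] * len(x), toy_sublayer)  # gate = 1
```

With all modulation parameters at zero the block is exactly the identity, which is why zero initialization gives a stable starting point: each residual branch is switched on gradually as its gate moves away from zero.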

Remove FP32 Logit Head

Original Plaid computed the final output prior logits in FP32 rather than BF16, a major numerical confounder in FLOP-matched comparisons. RePlaid drops the FP32 computation (the output prior logits themselves stay, as the ablation table in §04 shows) and runs the full forward pass in BF16.

head = Linear(h, V, dtype=bfloat16)
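To see why logit precision is a real difference between the two setups, the sketch below rounds float32 values to bfloat16 using only the standard library. The helper `to_bfloat16` is hypothetical and written for this illustration (round-to-nearest-even on the low 16 bits, finite inputs only); it is not taken from Plaid or RePlaid.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    rounding = 0x7FFF + ((bits >> 16) & 1)                # round to nearest, ties to even
    bits = (bits + rounding) & 0xFFFF0000                 # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

logit_a = to_bfloat16(10.00)
logit_b = to_bfloat16(10.01)   # collapses onto the same bfloat16 value as 10.00
logit_c = to_bfloat16(10.04)   # rounds up to the next representable value
```

BF16 keeps 8 bits of significand precision, so values between 8 and 16 are spaced 0.0625 apart; logits that differ by 0.01 in FP32 can become identical in BF16.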

04 — Embedding Geometry

Why ELBO Beats Cross-Entropy: The Structured Embedding Space

Other continuous DLMs — FLM, LangFlow, CDCD — train with a cross-entropy (CE) loss instead of the ELBO's MSE-style diffusion loss. A natural question: can RePlaid just use CE loss too? The answer is no, and the reason is visible in the embedding geometry.

RePlaid's ELBO objective enforces a low-rank, structured embedding space: after training, 90% of the variance in E is explained by just 6 principal components, and a t-SNE plot shows clear clustering by part-of-speech (nouns cluster together, verbs cluster separately, etc.). Adding an auxiliary CE loss disperses the embeddings — 90% variance now requires 13 principal components — and the PPL bound degrades from 22.1 to 26.1.
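The "number of principal components for 90% of the variance" statistic can be computed from the eigenvalues of the embedding covariance. The sketch below assumes the eigenvalues are already available; the two spectra are invented for illustration and are not the paper's measurements.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest k such that the top-k eigenvalues explain >= threshold of total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Hypothetical spectra for a 16-dimensional embedding (not the paper's values)
concentrated = [2.0 ** -i for i in range(16)]   # fast decay: low effective rank
dispersed = [1.0] * 16                          # flat: variance spread over all axes

k_concentrated = components_for_variance(concentrated)
k_dispersed = components_for_variance(dispersed)
```

A smaller k means the embeddings occupy a lower-dimensional subspace, which is the sense in which the ELBO-trained space (6 components) is more structured than the CE-augmented one (13 components).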

Figure: learned token embedding geometry (d_e = 16, projected to 2D via t-SNE), colored by part of speech: NOUN, VERB, ADJ, DET, PREP, PUNCT.

This low-rank geometry is not a coincidence — it's a consequence of the ELBO's signal-to-noise weighting. The ELBO's diffusion loss pays more attention to denoising at low noise levels (near clean data), which forces the model to create tight, discriminable clusters that are easy to recover from slightly corrupted versions. CE training, by contrast, applies uniform pressure across all token pairs, creating a more dispersed embedding geometry that is harder to denoise.

Configuration	OWT PPL (1M steps)	Δ PPL vs. full model
RePlaid (s.c.) — full model	22.1	—
w/o output prior logits	22.5	+0.4
w/o self-conditioning	23.6	+1.5
w/o learnable noise schedule	24.4	+2.3
w/o learnable embeddings (frozen random)	39.4	+17.3

Freezing token embeddings causes the single largest PPL collapse — from 22.1 to 39.4, making RePlaid the worst DLM on OWT. Embedding geometry is the primary driver of RePlaid's gains.

05 — Noise Schedule

The Schedule Secret: Linear CE Emerges Automatically

A recurring challenge in diffusion language models is distributing denoising difficulty evenly across timesteps. If the model has to do most of its work at a narrow range of t values, training is inefficient. Recent CE-based methods (CDCD, FLM, LangFlow) hand-engineer time reparameterizations to achieve a near-linear per-timestep cross-entropy loss.

RePlaid doesn't need this heuristic. The ELBO's Monte Carlo variance is minimized by making the per-timestep diffusion loss flat (Proposition 1). And a flat diffusion loss, under a near-optimal denoiser, implies a near-linear per-timestep CE loss (Proposition 2) — exactly what CE-based methods manually engineer. The schedule learns it for free.
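The intuition behind Proposition 1 can be illustrated with a toy calculation. When t is sampled uniformly, a single-sample estimate of the integrated loss has variance equal to the variance of the per-timestep loss over t, so among profiles with the same integral the flat one has the lowest variance. The two profiles below are invented for illustration and share an integral of 1.

```python
def mc_variance(per_timestep_loss, n_grid=1000):
    """Variance of f(t) for t ~ Uniform(0, 1), evaluated on a midpoint grid."""
    ts = [(i + 0.5) / n_grid for i in range(n_grid)]
    values = [per_timestep_loss(t) for t in ts]
    mean = sum(values) / n_grid
    return sum((v - mean) ** 2 for v in values) / n_grid

flat = lambda t: 1.0                      # difficulty spread evenly over t
peaked = lambda t: 6.0 * t * (1.0 - t)    # same integral, concentrated mid-trajectory

var_flat = mc_variance(flat)
var_peaked = mc_variance(peaked)
```

A learned schedule effectively reparameterizes time so that the profile moves from the peaked shape toward the flat one, which reduces the noise in each training step's loss estimate.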

per-timestep CE loss vs t — linear distribution = even denoising difficulty

Proposition 2 (per-timestep CE under optimality):

\[
\mathrm{CE}^{*}_{\gamma^{*}}(t) = \underbrace{H(x) - I(e;\, z_0)}_{\text{constant offset}} + \underbrace{\kappa\, t}_{\text{linear trend}} + \underbrace{C(x \mid z_t)}_{\text{conditional total correlation}}
\]

Here κ is the average diffusion loss, and C(x | z_t) is non-negative and increasing in t.

The linear trend term κ · t dominates in practice, so the CE loss is approximately linear in t. No manual time reparameterization is needed: minimizing the variance of the ELBO estimate automatically provides what LangFlow engineers by hand.

Practical upshot: A frozen cosine noise schedule (standard for image diffusion) concentrates denoising difficulty in the middle of the trajectory — easy at both ends, hard in the middle. The learned schedule redistributes work evenly, which improves both training efficiency and final likelihood.

06 — Scaling Laws

On Par with Discrete: The First Fair Scaling Comparison

To measure scaling behavior, the authors perform an IsoFLOP analysis: for each of 5 compute budgets (6×10¹⁸ to 10²⁰ FLOPs), they train models of varying sizes and find the compute-optimal (Chinchilla-style) validation loss. These optimal (FLOPs, loss) pairs are then fit to a power law: L* ∝ C^α.

All five methods — AR, MDLM, Duo, RePlaid (s.c.), RePlaid (no s.c.) — exhibit nearly identical power-law exponents. The curves are parallel on a log-log plot; only their horizontal positions differ. RePlaid (s.c.) sits 20× to the right of AR, between MDLM (14×) and Duo (22×).
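When two scaling curves share an exponent, the compute gap is the horizontal shift between them on a log-log plot. The sketch below fits L* = a · C^α by least squares in log space and converts the ratio of prefactors into a compute multiplier. All numbers are synthetic: the intermediate budgets, the exponent of -0.05, and the prefactors are assumptions chosen so that the recovered gap is 20×.

```python
import math

def fit_power_law(flops, losses):
    """Least-squares fit of log L = log a + alpha * log C; returns (a, alpha)."""
    xs = [math.log(c) for c in flops]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - alpha * mx), alpha

def compute_gap(a_ref, a_other, alpha):
    """Factor k such that a_other * (k * C)^alpha == a_ref * C^alpha."""
    return (a_ref / a_other) ** (1.0 / alpha)

budgets = [6e18, 1.2e19, 2.5e19, 5e19, 1e20]   # five budgets; interior values assumed
true_alpha, a_ar = -0.05, 20.0                 # synthetic exponent and AR prefactor
a_dlm = a_ar * 20.0 ** (-true_alpha)           # shifts the curve 20x to the right

ar_losses = [a_ar * c ** true_alpha for c in budgets]
dlm_losses = [a_dlm * c ** true_alpha for c in budgets]

a1, alpha1 = fit_power_law(budgets, ar_losses)
a2, alpha2 = fit_power_law(budgets, dlm_losses)
gap = compute_gap(a1, a2, alpha1)
```

If the fitted exponents differed materially, a single multiplier would no longer describe the gap, since the shift would then depend on the compute budget.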

Figure: compute-optimal scaling laws, validation loss versus non-embedding FLOPs, for AR (baseline), MDLM (14×), RePlaid s.c. (20×), Duo (22×), and RePlaid no s.c. (27×).

One unexpected finding: in the over-trained regime (training a smaller model for much longer than compute-optimal), RePlaid (s.c.) overtakes MDLM. For a 66M parameter model, RePlaid achieves lower loss than MDLM when both are pushed past their compute-optimal points. This is significant for production scenarios where small, cheap-to-serve models are over-trained on large data budgets.

Method	Compute gap vs AR	Params vs AR (optimal)
AR (reference)	1×	1×
MDLM (low var.)	14×	3.4×
RePlaid (s.c.) — ours	20×	1.0×
Duo	22×	3.4×
RePlaid (no s.c.)	27×	—
Original Plaid	64×	—

RePlaid (s.c.) uses 3.4× fewer parameters than MDLM and Duo to reach its compute-optimal frontier — a side-effect of the low-dimensional (d_e=16) embeddings avoiding the large embedding tables discrete DLMs need.

07 — Results

State-of-the-Art Perplexity and Generation Quality

At the standard 0.1B-parameter, 1M-step benchmark, RePlaid (s.c.) achieves 22.1 PPL on OpenWebText — the best among all continuous DLMs, and even lower than the discrete MDLM (23.1). On LM1B, it reaches 31.6, surpassing Duo (33.7) but behind MDLM (30.8).
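For readers converting between units: perplexity is the exponential of the average per-token negative log-likelihood in nats, so PPL itself is unitless. For diffusion models the likelihood is bounded by the NELBO, which makes the reported PPL an upper bound. The helper names below are illustrative.

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """PPL = exp(average negative log-likelihood per token, in nats)."""
    return math.exp(total_nll_nats / num_tokens)

def nats_per_token(ppl: float) -> float:
    """Inverse mapping: average per-token loss in nats for a given PPL."""
    return math.log(ppl)

replaid_nats = nats_per_token(22.1)   # RePlaid (s.c.) on OpenWebText
mdlm_nats = nats_per_token(23.1)      # MDLM on OpenWebText
gap_nats = mdlm_nats - replaid_nats   # per-token difference in nats
```

Seen in nats, the 1.0 PPL difference between RePlaid (s.c.) and MDLM corresponds to a few hundredths of a nat per token.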

Test perplexity on OpenWebText (lower is better)
Method	Family	OWT PPL
AR Transformer-XL	Autoregressive	17.5
MDLM (low var.)	Discrete diffusion	23.1
Duo	Discrete diffusion	25.2
RePlaid (s.c., ours)	Continuous diffusion	22.1
RePlaid (no s.c., ours)	Continuous diffusion	23.6
Plaid (original)	Continuous diffusion	24.4
LangFlow	Continuous diffusion	38.4

For generation quality, RePlaid (no s.c.) consistently beats Duo at all sampling step counts T. Self-conditioning helps at large T (≥64 steps) and produces a GenPPL of ~21 at T=1024, the best among all diffusion LMs — but hurts at small T because the self-conditioning loop sharpens the sampling distribution below the reference entropy, reducing diversity.

Figure: generative perplexity versus sampling steps T (lower is better), showing the quality-compute tradeoff for RePlaid s.c., RePlaid no s.c., Duo, MDLM, and LangFlow.

Why does this matter? Discrete DLMs update tokens through stochastic jumps between discrete states, so they lack the deterministic probability-flow ODE that continuous diffusion provides. Continuous DLMs can use ODE solvers and distillation to collapse the full sampling trajectory into far fewer steps. RePlaid's strong performance at T=32–64 steps (competitive with discrete methods' best) makes it a compelling foundation for few-step text generation research.