Domino: Causal Quality at Parallel Speed

01 — The Trade-off

Two Methods, One Impossible Choice

Speculative decoding works by having a cheap draft model propose multiple tokens, then letting the expensive target model verify them all in one parallel pass. The key insight: if the draft model proposes a run of tokens that the target would have chosen anyway, the target gets to advance by multiple positions per verification call — effectively multiplying its throughput.
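The acceptance rule can be sketched for the greedy case. This is a minimal illustration, assuming the target's preferred token at every block position is already available from the single parallel verification pass; the function name and list-based interface are hypothetical and not taken from the paper.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Return the tokens committed after one verification pass (greedy case).

    draft_tokens:  the gamma tokens proposed by the draft model.
    target_tokens: the gamma + 1 tokens the target would pick at each
                   position, given the draft prefix before that position.
    """
    committed = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the target's own token and stop.
            committed.append(wanted)
            return committed
        committed.append(drafted)
    # Every draft token matched, so the target's extra token is a bonus.
    committed.append(target_tokens[len(draft_tokens)])
    return committed


partial = verify_greedy(["Paris", "is", "a", "city"],
                        ["Paris", "is", "the", "capital", "of"])
full = verify_greedy(["Paris", "is"], ["Paris", "is", "the"])
print(partial)  # two accepted draft tokens plus the target's correction
print(full)     # both draft tokens plus the bonus token
```

In both cases the target advances by more than one token per verification call, which is where the throughput gain comes from.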

But there's a fundamental tension in how you build the draft model. The two main camps are:

Autoregressive drafters (e.g. EAGLE-3): Generate tokens one at a time, each conditioned on all previous draft tokens. High quality — the draft mirrors the target's causal factorization — but sequential: drafting γ tokens costs γ forward passes.
Parallel drafters (e.g. DFlash, DART): Generate the entire draft block in one non-autoregressive forward pass. Fast, but blind to intra-block causal dependencies — each position is conditioned only on the prefix, not on earlier draft tokens.

Figure: quality–cost trade-off across drafting methods

The numbers from Figure 1 of the paper tell the story clearly: on Qwen3-8B with a 16-token budget, EAGLE-3 achieves an acceptance length τ = 4.86 but only 3.28× speedup, because its sequential draft execution eats into the gains. DFlash reaches 3.42× speedup because drafting is cheap, but τ drops to 4.03. Domino hits τ = 4.70 and 3.84× speedup in that micro-comparison: it drafts with far lower latency than EAGLE-3 while recovering most of EAGLE-3's causal quality advantage.

Core question: Can we get causal dependency modeling without paying for sequential autoregressive execution? Domino's answer is yes — by separating where causal modeling happens from how draft tokens are generated.

02 — Speedup Formula

What Actually Determines Speedup

The speedup of speculative decoding over standard autoregressive decoding has a clean formula. Let τ be the expected number of tokens accepted per cycle (including the target model's bonus token), T_draft be drafting time, T_verify be verification time, and L_target be per-token latency without speculative decoding:

Speedup formula:

$$\eta = \frac{L_{\text{target}}}{L_{\text{spec}}} = \frac{\tau \cdot L_{\text{target}}}{T_{\text{draft}} + T_{\text{verify}}}$$

Two independent levers: increase τ (draft quality) or decrease T_draft (draft efficiency).

This formula clarifies the trade-off precisely. T_verify is roughly fixed — it's one full target-model forward pass. So speedup is determined by two factors:

Higher τ (acceptance length) means each expensive target call advances more tokens. EAGLE-3 excels here: τ ≈ 4.86 on GSM8K (16-token budget, Qwen3-8B).
Lower T_draft means draft generation wastes less time. DFlash excels here: it costs roughly one draft-model forward + one LM-head call, regardless of block size γ.
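As a rough numerical check, the formula can be evaluated with the timings quoted in this article (5.9 ms DFlash drafting, 1.2 ms Domino head, 35.5 ms verification). The sketch below assumes L_target equals the 35.5 ms verification pass, which is an approximation of mine rather than a figure from the paper, so the results land near the reported speedups without matching them exactly.

```python
def speedup(tau, l_target_ms, t_draft_ms, t_verify_ms):
    """eta = tau * L_target / (T_draft + T_verify)."""
    return tau * l_target_ms / (t_draft_ms + t_verify_ms)


# Assumption: per-token target latency equals one verification pass.
L_TARGET_MS = 35.5
T_VERIFY_MS = 35.5

dflash = speedup(tau=4.03, l_target_ms=L_TARGET_MS,
                 t_draft_ms=5.9, t_verify_ms=T_VERIFY_MS)
domino = speedup(tau=4.70, l_target_ms=L_TARGET_MS,
                 t_draft_ms=5.9 + 1.2, t_verify_ms=T_VERIFY_MS)

print(f"DFlash ~{dflash:.2f}x, Domino ~{domino:.2f}x")
```

Under this assumption Domino's higher τ outweighs its slightly longer drafting time, which is the same ordering the paper reports (3.42× versus 3.84×).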

Figure: latency breakdown (Qwen3-8B, 16-token budget, A100; from paper Figure 1)

For autoregressive drafters, T_draft ≈ γ · (t_net + t_head) — it scales linearly with draft length. For parallel drafters, T_draft ≈ t_net^block + t_head^block — paid once for the whole block. Domino adds a tiny overhead: the Domino head costs 1.2ms on top of DFlash's 5.9ms draft time, a 2.8% increase in total draft-then-verify latency.
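The two cost models can be compared directly. The per-step timings below are hypothetical placeholders chosen only to show the scaling behaviour; they are not measurements from the paper.

```python
def t_draft_autoregressive(gamma, t_net_ms, t_head_ms):
    """Sequential drafting: one network pass and one head call per token."""
    return gamma * (t_net_ms + t_head_ms)


def t_draft_parallel(t_net_block_ms, t_head_block_ms):
    """Block drafting: one network pass and one head call for the whole block."""
    return t_net_block_ms + t_head_block_ms


# Hypothetical timings in milliseconds, for illustration only.
for gamma in (4, 8, 16):
    ar = t_draft_autoregressive(gamma, t_net_ms=1.0, t_head_ms=0.5)
    par = t_draft_parallel(t_net_block_ms=4.0, t_head_block_ms=1.9)
    print(f"gamma={gamma:2d}  autoregressive={ar:5.1f} ms  parallel={par:4.1f} ms")
```

The autoregressive cost grows linearly with γ, while the parallel cost stays flat as long as the block fits in one forward pass.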

03 — Domino Architecture

Parallel Backbone + Lightweight Causal Head

Domino has two components that compose cleanly:

Figure: Domino architecture, data-flow diagram

Parallel Draft Backbone

Domino instantiates the backbone as DFlash, a block-diffusion-style drafter that generates representations for all γ positions in one non-autoregressive forward pass. Given the last verified token x_t as an anchor, it constructs a masked input block [x_t, MASK, MASK, …] and runs it through a 5-layer transformer alongside target-model context features C_t. This produces a hidden state H_i for every draft position i in the block at once. The frozen target LM head then converts each H_i to base logits L_base_i.

Domino Head

The Domino head sits on top of the backbone and has two parts: a causal encoder (a GRU with hidden dimension 1024) that summarizes all previously sampled draft token embeddings, and a low-rank correction head (bottleneck dimension r=256) that maps [H_i; S_i-1] to a logit-space residual ΔL_i. The final logits are simply L_base_i + ΔL_i. The head adds only 56M parameters, and it requires no repeated full LM-head calls.
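The 56M figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes a standard three-gate GRU, a two-matrix low-rank head (W1 then W2), no bias terms, and Qwen3-8B dimensions of 4096 for the hidden size and 151,936 for the vocabulary. Only the 1024 GRU width, the rank of 256, and the 56M total come from the text; the rest is my assumption.

```python
def domino_head_params(d_model, vocab, gru_hidden=1024, rank=256):
    """Approximate weight count of the Domino head (biases ignored)."""
    # GRU: three gates, each with input-to-hidden and hidden-to-hidden weights.
    # Assumes token embeddings have the same width as the model hidden size.
    gru = 3 * (d_model * gru_hidden + gru_hidden * gru_hidden)
    # W1 projects the concatenation [H_i; S_{i-1}] down to the bottleneck.
    down = (d_model + gru_hidden) * rank
    # W2 projects the bottleneck up to a logit-space residual.
    up = rank * vocab
    return gru + down + up


total = domino_head_params(d_model=4096, vocab=151_936)
print(f"{total / 1e6:.1f}M parameters")  # 55.9M parameters
```

Under these assumptions most of the budget sits in the rank-to-vocabulary projection, with the GRU accounting for roughly a quarter.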

Why logit space, not hidden space? If Domino corrected in hidden space, it would need to run the LM head again after each causal update — reintroducing the expensive sequential bottleneck. Correcting logits directly keeps the expensive LM-head computation parallel while the causal branch is cheap and sequential.

04 — Causal Correction

How the GRU Injects Causal Information

The central mechanism is surprisingly simple: a GRU reads the embeddings of sampled draft tokens one by one, maintaining a rolling causal state S_i. When predicting draft token at position i, the causal state S_i-1 captures everything about the preceding draft tokens — without any additional draft-model forward passes.

Causal correction walkthrough: positions 1–4

Position 1 — no causal history yet

The GRU state S₀ is zero (no prior draft tokens), so the correction ΔL₁ carries no causal information and is approximately zero. The final logits are approximately the base logits from the backbone.

    S_0 = zeros(1024)
    ΔL_1 = W2(σ(W1([H_1; S_0])))   # ≈ 0
    L_1 = L_base_1 + ΔL_1          # ≈ L_base_1
    d_1 = sample(L_1)

Position 2 — first causal update

Sampled token d₁ is embedded and fed into the GRU, producing state S₁. Now ΔL₂ depends on what token was just drafted — if d₁ = "Paris", that shifts predictions for position 2.

    E_1 = embed(d_1)               # e.g. embed("Paris")
    S_1 = GRU(E_1, S_0)
    ΔL_2 = W2(σ(W1([H_2; S_1])))   # informed by "Paris"
    d_2 = sample(L_base_2 + ΔL_2)

Position 3 — accumulating context

The GRU now holds a summary of both d₁ and d₂. If d₂ = "is", the state S₂ encodes the partial draft "Paris is", steering position 3 toward likely continuations like "the" or "France's".

    E_2 = embed(d_2)               # e.g. embed("is")
    S_2 = GRU(E_2, S_1)            # summarizes "Paris is"
    ΔL_3 = W2(σ(W1([H_3; S_2])))   # conditioned on "Paris is"
    d_3 = sample(L_base_3 + ΔL_3)

Position 4 — full causal chain

Each new position benefits from the full draft prefix so far. This replicates the quality benefit of autoregressive drafting without re-running the large draft network — just one lightweight GRU step per position.

    E_3 = embed(d_3)
    S_3 = GRU(E_3, S_2)
    ΔL_4 = W2(σ(W1([H_4; S_3])))
    d_4 = sample(L_base_4 + ΔL_4)
    # Total GRU cost: one GRU(1024) step per sampled draft token (tiny)


The key insight is that the GRU state S_i is dirt cheap to compute — just an RNN step — while the backbone's hidden state H_i and base logits are already available from the parallel forward pass. The entire correction loop runs in 1.2ms on an A100, compared to 35.5ms for a full target-model verification pass.
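The walkthrough can be condensed into a small runnable loop. This is a toy sketch with random weights and tiny dimensions, assuming a standard GRU cell without biases, a ReLU as the nonlinearity σ, and greedy selection in place of sampling. The class name `TinyDominoHead` and all sizes are hypothetical; the sketch illustrates the data flow, not the paper's implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyDominoHead:
    """Toy causal correction head: a GRU plus a low-rank logit residual."""

    def __init__(self, d_model, d_embed, vocab, gru_hidden, rank, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, size=shape)

        self.embed = init(vocab, d_embed)
        self.W_in = init(3 * gru_hidden, d_embed)      # update/reset/candidate
        self.W_rec = init(3 * gru_hidden, gru_hidden)
        self.W1 = init(rank, d_model + gru_hidden)     # down-projection
        self.W2 = init(vocab, rank)                    # up-projection to logits
        self.gru_hidden = gru_hidden

    def gru_step(self, x, h):
        zi, ri, ni = np.split(self.W_in @ x, 3)
        zh, rh, nh = np.split(self.W_rec @ h, 3)
        z = sigmoid(zi + zh)                # update gate
        r = sigmoid(ri + rh)                # reset gate
        n = np.tanh(ni + r * nh)            # candidate state
        return (1.0 - z) * n + z * h

    def correction(self, h_i, s_prev):
        joint = np.concatenate([h_i, s_prev])          # [H_i; S_{i-1}]
        return self.W2 @ np.maximum(self.W1 @ joint, 0.0)  # ReLU is assumed

    def draft(self, hidden_states, base_logits):
        s = np.zeros(self.gru_hidden)       # S_0: no draft history yet
        tokens = []
        for h_i, l_base in zip(hidden_states, base_logits):
            logits = l_base + self.correction(h_i, s)
            token = int(np.argmax(logits))  # greedy pick instead of sampling
            tokens.append(token)
            s = self.gru_step(self.embed[token], s)    # fold token into state
        return tokens


rng = np.random.default_rng(1)
gamma, d_model, vocab = 4, 8, 20
head = TinyDominoHead(d_model=d_model, d_embed=6, vocab=vocab, gru_hidden=5, rank=3)
H = rng.normal(size=(gamma, d_model))       # stand-in for backbone states
base = rng.normal(size=(gamma, vocab))      # stand-in for base logits
tokens = head.draft(H, base)
print(tokens)
```

Note that `H` and `base` are computed once before the loop; only the GRU step and the low-rank correction run sequentially, which mirrors the split between the parallel backbone and the cheap causal branch.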

Ablation result: Enabling the Domino head on the same backbone improves average acceptance length from τ = 3.49 to τ = 4.19 and average speedup from 2.84× to 3.31×. The correction itself — not training tricks — is the primary source of improvement.

05 — Training Strategy

Why Training Is Tricky (And How Domino Fixes It)

Training the causal correction branch introduces two failure modes that don't exist in standard parallel drafting. Domino addresses both with a two-part training strategy.

Problem 1: What prefix should the GRU see?

The GRU needs to read prefix draft tokens during training. The natural choice is self-generated prefixes (training-time testing, TTT) — sample from the model itself and train on those. EAGLE-3 uses this approach. But Domino instead uses teacher forcing: feed the GRU ground-truth token embeddings during training.

Training strategy comparison: τ by method (Qwen3-8B, ShareGPT; data from paper Figure 4, right)

Method	τ
TTT (self-generated)	3.80
Teacher Forcing	3.96
TF + Curriculum	4.19
DFlash reference	3.62

Teacher forcing works better for a subtle reason: the causal encoder's correction at position i only matters when all preceding draft tokens have been accepted by the target model. If position 1 is rejected, positions 2–4 are never reached. So training the GRU on clean ground-truth prefixes focuses learning exactly on the regime that matters — the accepted-prefix regime. Noisy self-generated prefixes create an input–output mapping that doesn't exist in the data distribution.
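The difference between the two regimes comes down to which token sequence feeds the GRU. A minimal sketch, with a hypothetical helper name and toy tokens:

```python
def gru_prefix_inputs(ground_truth, self_generated, mode):
    """Return, for each block position, the prefix tokens the GRU reads.

    Position 1 sees an empty prefix; position i sees tokens 1..i-1 from
    either the ground truth (teacher forcing) or the model's own draft (TTT).
    """
    if mode == "teacher_forcing":
        source = ground_truth
    elif mode == "ttt":
        source = self_generated
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [source[:i] for i in range(len(source))]


truth = ["Paris", "is", "the"]
drafted = ["Paris", "was", "a"]   # toy self-generated draft with an early error

tf_prefixes = gru_prefix_inputs(truth, drafted, "teacher_forcing")
ttt_prefixes = gru_prefix_inputs(truth, drafted, "ttt")
print(tf_prefixes)
print(ttt_prefixes)
```

In the TTT case the prefix for position 3 already contains a token the target would have rejected, so the correction learned there would never be used at inference time.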

Problem 2: Backbone collapse

Teacher forcing introduces a new failure mode: since the correction branch receives clean, informative prefixes during training, it can shortcut the parallel backbone. The backbone's base logits become weak — all the predictive signal flows through the correction branch, which has no fallback if backbone representations degrade.

The fix is a base-anchored curriculum: jointly train on both base logits and final logits, with a loss weight λ_t that starts at 1.0 (pure backbone loss) and linearly anneals to 0.0 (pure final-logit loss):

Base-anchored curriculum loss:

$$\mathcal{L} = (1 - \lambda_t) \cdot \mathcal{L}_{\text{final}} + \lambda_t \cdot \mathcal{L}_{\text{base}}$$

λ_t anneals linearly from 1.0 to 0.0 over training; both terms are cross-entropy with exponential position decay.

This forces the backbone to develop a strong base distribution early in training, before the Domino head takes over residual correction. Without it, backbone loss plateaus high; with it, both components learn complementary representations and backbone loss decreases steadily.
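The schedule and loss mix can be sketched in a few lines. The exact form of the position decay is not given here, so the sketch assumes weights of the form decay^i with a hypothetical `decay` hyperparameter, normalized by their sum; the per-position cross-entropy values are made up for illustration.

```python
def lambda_at(step, total_steps):
    """Linear anneal from 1.0 at step 0 to 0.0 at the final step."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - frac


def position_weights(gamma, decay=0.9):
    """Assumed exponential decay: later block positions count less."""
    return [decay ** i for i in range(gamma)]


def weighted_ce(per_position_ce, weights):
    """Position-weighted average of per-position cross-entropy values."""
    return sum(w * ce for w, ce in zip(weights, per_position_ce)) / sum(weights)


def curriculum_loss(ce_final, ce_base, step, total_steps, decay=0.9):
    """(1 - lambda_t) * L_final + lambda_t * L_base."""
    lam = lambda_at(step, total_steps)
    weights = position_weights(len(ce_final), decay)
    return (1.0 - lam) * weighted_ce(ce_final, weights) + lam * weighted_ce(ce_base, weights)


ce_final = [1.0, 1.2, 1.5, 1.9]   # made-up per-position losses (final logits)
ce_base = [1.4, 1.8, 2.3, 2.9]    # made-up per-position losses (base logits)
for step in (0, 500, 1000):
    print(step, round(curriculum_loss(ce_final, ce_base, step, 1000), 4))
```

At step 0 the loss depends only on the base logits, and by the last step it depends only on the final logits, matching the schedule described above.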

Three-way ablation summary: TTT → τ = 3.80. Teacher forcing alone → τ = 3.96 (+4.2%). Teacher forcing + curriculum → τ = 4.19 (+10.3% over TTT). Each component contributes independently.

06 — Results

Up to 7.92× Speedup on Qwen3

Domino is evaluated on Qwen3-4B and Qwen3-8B across 8 benchmarks spanning math, code, and dialogue. All models use a 16-token draft block. The comparison includes EAGLE-3 (autoregressive), DART and DFlash (parallel), and FR-Spec (vocabulary-efficient).

End-to-end speedup (Qwen3-8B, greedy T=0, Transformers backend)

Method	Drafter type	Speedup
EAGLE-3 (budget=16)	autoregressive	2.21×
EAGLE-3 (budget=60)	autoregressive	2.56×
DART (budget=60)	parallel	2.29×
DFlash (budget=16)	parallel	4.66×
Domino (budget=16)	parallel	5.49×

Average over GSM8K, MATH-500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca.

The headline numbers: on Qwen3-8B with greedy decoding, Domino improves over DFlash from 4.66× to 5.49× average speedup (17.8% relative gain). The peak is 7.92× on GSM8K, where acceptance length reaches τ = 10.03. On sampling (T=1), average speedup goes from 3.96× (DFlash) to 4.46× (Domino).

Per-benchmark breakdown (Qwen3-8B, T=0)

Method	GSM8K	MATH-500	HumanEval	MBPP	MT-Bench	Avg (all 8 benchmarks)
EAGLE-3 (16)	2.21×	2.09×	2.17×	1.93×	1.82×	1.97×
DART (60)	2.28×	2.29×	2.52×	2.39×	2.27×	2.29×
DFlash (16)	5.21×	6.18×	5.21×	4.71×	2.73×	4.66×
Domino (16)	7.92×	7.38×	5.89×	5.53×	3.29×	5.49×

Relative to DFlash, the gain is largest on GSM8K (+52%). The other benchmarks shown improve by roughly 13% to 20%: MATH-500 +19%, HumanEval +13%, MBPP +17%, and MT-Bench +20%. In absolute terms, speedups remain far higher on math and code than on open-ended dialogue, which fits the hypothesis that structured reasoning tasks have more predictable token sequences and so give causal correction more to exploit.
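The relative gains over DFlash can be recomputed from the per-benchmark table. The snippet below uses only the DFlash and Domino rows as printed there.

```python
dflash = {"GSM8K": 5.21, "MATH-500": 6.18, "HumanEval": 5.21,
          "MBPP": 4.71, "MT-Bench": 2.73, "Avg": 4.66}
domino = {"GSM8K": 7.92, "MATH-500": 7.38, "HumanEval": 5.89,
          "MBPP": 5.53, "MT-Bench": 3.29, "Avg": 5.49}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


gains = {name: relative_gain(domino[name], dflash[name]) for name in dflash}
for name, gain in gains.items():
    print(f"{name:10s} +{gain:.1f}%")
```

The average row reproduces the 17.8% relative gain quoted above, and GSM8K stands out as the largest single improvement.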

High-concurrency serving (SGLang)

At serving scale (Table 2), Domino achieves up to 5.8× throughput on Qwen3-8B at concurrency=2 on GSM8K. The gains persist across all concurrency levels tested (2–32), confirming that Domino's improved draft quality translates to higher tokens-per-second in production serving environments, not just low-concurrency benchmarks.

Bottom line: By spending 56M parameters and 1.2ms to inject causal information into a block-parallel drafter, Domino consistently outperforms all autoregressive and parallel baselines on the same 16-token budget — achieving EAGLE-3-quality acceptance lengths with DFlash-quality drafting overhead.