
Domino: Causal Quality at Parallel Speed

Autoregressive drafters have quality but pay a sequential tax; parallel drafters are fast but miss intra-block causal dependencies. Domino decouples these concerns — a block-parallel backbone drafts the whole block at once, then a lightweight GRU head injects causal information through residual logit correction, gaining 16.6% acceptance length with only 2.8% extra latency.

June 1, 2026 ~14 min read Paper: arXiv:2605.29707
01 — The Trade-off

Two Methods, One Impossible Choice

Speculative decoding works by having a cheap draft model propose multiple tokens, then letting the expensive target model verify them all in one parallel pass. The key insight: if the draft model proposes a run of tokens that the target would have chosen anyway, the target gets to advance by multiple positions per verification call — effectively multiplying its throughput.
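The verification step described above can be sketched for the greedy (T=0) case. This is a minimal illustration, not the paper's implementation: the function name and the `target_choice` list are hypothetical, and sampling-based verification uses a probabilistic acceptance rule that is not shown here.

```python
def accepted_tokens(draft, target_choice):
    """Greedy verification sketch (hypothetical helper, T=0 only).

    draft: proposed token ids d_1..d_gamma.
    target_choice: gamma + 1 token ids; target_choice[i] is the token the
        target model would pick after the prefix plus draft[:i].
    Returns the tokens committed in this verification cycle.
    """
    committed = []
    for i, token in enumerate(draft):
        if token != target_choice[i]:
            break  # first mismatch ends the accepted run
        committed.append(token)
    # The target always contributes one token of its own: the correction at
    # the first mismatch, or the bonus token after a fully accepted draft.
    committed.append(target_choice[len(committed)])
    return committed


print(accepted_tokens([5, 9, 2, 7], [5, 9, 4, 1, 8]))  # [5, 9, 4]
```

Here two draft tokens match, so one target call commits three tokens instead of one.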

But there's a fundamental tension in how you build the draft model. The two main camps are:

  • Autoregressive drafters (e.g. EAGLE-3): Generate tokens one at a time, each conditioned on all previous draft tokens. High quality — the draft mirrors the target's causal factorization — but sequential: drafting γ tokens costs γ forward passes.
  • Parallel drafters (e.g. DFlash, DART): Generate the entire draft block in one non-autoregressive forward pass. Fast, but blind to intra-block causal dependencies — each position is conditioned only on the prefix, not on earlier draft tokens.
quality–cost trade-off by drafting method

The numbers from Figure 1 of the paper tell the story clearly: on Qwen3-8B with a 16-token budget, EAGLE-3 achieves an acceptance length τ = 4.86 but only 3.28× speedup — its sequential draft execution eats into the gains. DFlash reaches 3.42× speedup because drafting is cheap, but τ drops to 4.03. Domino hits τ = 4.70 and 3.84× speedup in that micro-comparison, with the key difference being it uses far less per-draft latency than EAGLE-3 while recovering most of EAGLE-3's causal quality advantage.

Core question: Can we get causal dependency modeling without paying for sequential autoregressive execution? Domino's answer is yes — by separating where causal modeling happens from how draft tokens are generated.
02 — Speedup Formula

What Actually Determines Speedup

The speedup of speculative decoding over standard autoregressive decoding has a clean formula. Let τ be the expected number of tokens accepted per cycle (including the target model's bonus token), T_draft the drafting time per cycle, T_verify the verification time per cycle, L_target the per-token latency without speculative decoding, and L_spec = (T_draft + T_verify) / τ the per-token latency with it:

speedup formula
η = L_target / L_spec = τ · L_target / (T_draft + T_verify)
// two independent levers: increase τ (draft quality) OR decrease T_draft (draft efficiency)

This formula clarifies the trade-off precisely. T_verify is roughly fixed, since it is one full target-model forward pass. So speedup is determined by two factors:

  • Higher τ (acceptance length) means each expensive target call advances more tokens. EAGLE-3 excels here: τ ≈ 4.86 on GSM8K (16-token budget, Qwen3-8B).
  • Lower T_draft means draft generation wastes less time. DFlash excels here: it costs roughly one draft-model forward pass plus one LM-head call, regardless of block size γ.
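The two levers can be seen by plugging numbers into the formula. The sketch below uses made-up round latencies purely for illustration (they are not measurements from the paper), and the function name is hypothetical.

```python
def speculative_speedup(tau, l_target_ms, t_draft_ms, t_verify_ms):
    """eta = tau * L_target / (T_draft + T_verify)."""
    return tau * l_target_ms / (t_draft_ms + t_verify_ms)


# Illustrative values only: a 30 ms target step and a 30 ms verification pass.
slow_but_accurate = speculative_speedup(tau=5.0, l_target_ms=30.0,
                                        t_draft_ms=20.0, t_verify_ms=30.0)
fast_but_weaker = speculative_speedup(tau=4.0, l_target_ms=30.0,
                                      t_draft_ms=6.0, t_verify_ms=30.0)
print(f"{slow_but_accurate:.2f}x vs {fast_but_weaker:.2f}x")
```

With these toy numbers the drafter with the lower τ still wins, because its drafting time is so much smaller. That is the same pattern the article reports for EAGLE-3 versus DFlash.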
latency breakdown · Qwen3-8B, 16-token budget, A100 (from paper Figure 1)

For autoregressive drafters, T_draft ≈ γ · (t_net + t_head), so it scales linearly with draft length. For parallel drafters, T_draft ≈ t_net^block + t_head^block, paid once for the whole block. Domino adds a tiny overhead: the Domino head costs 1.2ms on top of DFlash's 5.9ms draft time, a 2.8% increase in total draft-then-verify latency.
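The two cost models and the overhead figure can be checked with a few lines of arithmetic. The helper names are hypothetical. The 35.5 ms verification time is the A100 figure quoted elsewhere in this article. Treating the 2.8% as the head's share of the full Domino cycle is one plausible reading, not something the article states explicitly.

```python
def draft_time_autoregressive(gamma, t_net, t_head):
    """Sequential drafter: one network pass and one LM-head call per token."""
    return gamma * (t_net + t_head)


def draft_time_parallel(t_net_block, t_head_block):
    """Parallel drafter: one block-wide pass and one block-wide LM-head call."""
    return t_net_block + t_head_block


DFLASH_DRAFT_MS = 5.9   # from the article
DOMINO_HEAD_MS = 1.2    # from the article
VERIFY_MS = 35.5        # target verification pass, quoted elsewhere in the article

cycle_ms = DFLASH_DRAFT_MS + DOMINO_HEAD_MS + VERIFY_MS
head_share = DOMINO_HEAD_MS / cycle_ms
print(f"{cycle_ms:.1f} ms per cycle, head share {head_share:.1%}")
```

Under that reading, the head accounts for about 2.8% of a 42.6 ms draft-then-verify cycle.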

03 — Domino Architecture

Parallel Backbone + Lightweight Causal Head

Domino has two components that compose cleanly:

domino architecture · data flow
  • Inputs: target context C_t and masked block [x_t, M, M, M…]
  • Parallel backbone (DFlash): 5 layers, one block-parallel forward pass, producing hidden states H_i for all i
  • LM head (frozen): maps each H_i to base logits L_base_i
  • Domino head: GRU causal encoder + low-rank correction (r=256), 56M params (+5.3%), 1.2ms overhead, producing the correction ΔL_i
  • Final logits: L_i = L_base_i + ΔL_i, from which draft token d_i is sampled
  • Feedback: the sampled token feeds back into the GRU for the next position

Parallel Draft Backbone

Domino instantiates the backbone as DFlash, a block-diffusion-style drafter that generates representations for all γ positions in one non-autoregressive forward pass. Given the last verified token x_t as an anchor, it constructs a masked input block [x_t, MASK, MASK, …] and runs it through a 5-layer transformer alongside target-model context features C_t. This produces hidden states H_t … H_{t+γ-1} for the whole block at once. The frozen target LM head then converts each hidden state H_i to base logits L_base_i.

Domino Head

The Domino head sits on top of the backbone and has two parts: a causal encoder (a GRU with hidden dimension 1024) that summarizes all previously sampled draft token embeddings, and a low-rank correction head (bottleneck dimension r=256) that maps [H_i; S_{i-1}] to a logit-space residual ΔL_i. The final distribution is simply L_base_i + ΔL_i. The head adds only 56M parameters and avoids repeated full LM-head calls.
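As a back-of-envelope check on the 56M figure, the sketch below counts parameters for one plausible layout of the head. Everything beyond the GRU width (1024) and the rank (256) is an assumption: a hidden and embedding width of 4096, a vocabulary of 151,936 entries, a GRU with two bias vectors per gate, and a two-layer correction head with biases. The paper's actual accounting may differ.

```python
HIDDEN = 4096      # assumed width of H_i and of the token embeddings
VOCAB = 151_936    # assumed vocabulary size
GRU_STATE = 1024   # from the article
RANK = 256         # from the article


def gru_params(input_dim, state_dim):
    """Three gates, each with input weights, recurrent weights, two biases."""
    per_gate = input_dim * state_dim + state_dim * state_dim + 2 * state_dim
    return 3 * per_gate


def low_rank_head_params(in_dim, rank, vocab):
    """Down-projection to the bottleneck, then up-projection to the logits."""
    down = in_dim * rank + rank
    up = rank * vocab + vocab
    return down + up


total = gru_params(HIDDEN, GRU_STATE) + low_rank_head_params(
    HIDDEN + GRU_STATE, RANK, VOCAB)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count lands close to the reported 56M, with most of the budget in the rank-256 projection up to the vocabulary.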

Why logit space, not hidden space? If Domino corrected in hidden space, it would need to run the LM head again after each causal update — reintroducing the expensive sequential bottleneck. Correcting logits directly keeps the expensive LM-head computation parallel while the causal branch is cheap and sequential.
04 — Causal Correction

How the GRU Injects Causal Information

The central mechanism is surprisingly simple: a GRU reads the embeddings of sampled draft tokens one by one, maintaining a rolling causal state S_i. When predicting the draft token at position i, the causal state S_{i-1} summarizes the preceding draft tokens, without any additional draft-model forward passes.

causal correction walkthrough · positions 1–4
Position 1 — no causal history yet
The GRU state S_0 is zero (no prior draft tokens), so the correction ΔL_1 carries no causal information and is approximately zero. The final logit is essentially the base logit from the backbone.
    S_0  = zeros(1024)
    ΔL_1 = W2(σ(W1([H_1; S_0])))   # ≈ 0
    L_1  = L_base_1 + ΔL_1         # ≈ L_base_1
    d_1  = sample(L_1)
Position 2 — first causal update
Sampled token d_1 is embedded and fed into the GRU, producing state S_1. Now ΔL_2 depends on which token was just drafted: if d_1 = "Paris", that shifts predictions for position 2.
    E_1  = embed(d_1)              # e.g. embed("Paris")
    S_1  = GRU(E_1, S_0)
    ΔL_2 = W2(σ(W1([H_2; S_1])))   # informed by "Paris"
    d_2  = sample(L_base_2 + ΔL_2)
Position 3 — accumulating context
The GRU now holds a summary of both d_1 and d_2. If d_2 = "is", the state S_2 encodes the partial draft "Paris is", steering position 3 toward likely continuations like "the" or "France's".
    E_2  = embed(d_2)              # e.g. embed("is")
    S_2  = GRU(E_2, S_1)           # summarizes "Paris is"
    ΔL_3 = W2(σ(W1([H_3; S_2])))   # conditioned on "Paris is"
    d_3  = sample(L_base_3 + ΔL_3)
Position 4 — full causal chain
Each new position benefits from the full draft prefix so far. This recovers much of the quality benefit of autoregressive drafting without re-running the large draft network, using just one lightweight GRU step per position.
    E_3  = embed(d_3)
    S_3  = GRU(E_3, S_2)
    ΔL_4 = W2(σ(W1([H_4; S_3])))
    d_4  = sample(L_base_4 + ΔL_4)
    # Total GRU cost for 4 positions: 3 GRU(1024) steps (S_1..S_3), which is tiny

The key insight is that the GRU state S_i is dirt cheap to compute (just one RNN step), while the backbone's hidden state H_i and base logits are already available from the parallel forward pass. The entire correction loop runs in 1.2ms on an A100, compared to 35.5ms for a full target-model verification pass.
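The walkthrough above can be condensed into a toy loop. This is a pure-Python sketch with tiny random weights, not the paper's code. The sizes, the bias-free GRU cell, the tanh nonlinearity in the correction head, and the greedy pick are all assumptions made for illustration, and `H` and `L_BASE` are random stand-ins for what the parallel backbone and frozen LM head would provide.

```python
import math
import random

random.seed(0)
HID, STATE, RANK, VOCAB = 4, 3, 2, 5   # toy sizes; the article uses 1024 and r=256


def matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


EMBED = matrix(VOCAB, HID)                                    # token embeddings
WZ, WR, WN = (matrix(STATE, HID + STATE) for _ in range(3))   # GRU gate weights
W1 = matrix(RANK, HID + STATE)                                # down-projection
W2 = matrix(VOCAB, RANK)                                      # up-projection


def gru_step(x, s):
    """Textbook GRU cell without biases."""
    z = [sigmoid(a) for a in matvec(WZ, x + s)]               # update gate
    r = [sigmoid(a) for a in matvec(WR, x + s)]               # reset gate
    gated = [ri * si for ri, si in zip(r, s)]
    n = [math.tanh(a) for a in matvec(WN, x + gated)]         # candidate state
    return [(1 - zi) * ni + zi * si for zi, ni, si in zip(z, n, s)]


def correction(h, s):
    """Low-rank logit residual from [H_i; S_{i-1}]."""
    return matvec(W2, [math.tanh(a) for a in matvec(W1, h + s)])


def draft_block(hidden_states, base_logits):
    """Sequential part of drafting; the backbone outputs are precomputed."""
    state, tokens = [0.0] * STATE, []                         # state starts as S_0
    for h, base in zip(hidden_states, base_logits):
        if tokens:                                            # feed back last draft token
            state = gru_step(EMBED[tokens[-1]], state)
        final = [b + d for b, d in zip(base, correction(h, state))]
        tokens.append(max(range(VOCAB), key=final.__getitem__))  # greedy pick
    return tokens


H = matrix(4, HID)          # stand-in for backbone hidden states H_1..H_4
L_BASE = matrix(4, VOCAB)   # stand-in for base logits from the frozen LM head
print(draft_block(H, L_BASE))
```

The only sequential work inside the loop is a small GRU step and a low-rank projection. The backbone and the full LM head are never called again.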

Ablation result: Enabling the Domino head on the same backbone improves average acceptance length from τ = 3.49 to τ = 4.19 and average speedup from 2.84× to 3.31×. The correction itself — not training tricks — is the primary source of improvement.
05 — Training Strategy

Why Training Is Tricky (And How Domino Fixes It)

Training the causal correction branch introduces two failure modes that don't exist in standard parallel drafting. Domino addresses both with a two-part training strategy.

Problem 1: What prefix should the GRU see?

The GRU needs to read prefix draft tokens during training. The natural choice is self-generated prefixes (training-time testing, TTT) — sample from the model itself and train on those. EAGLE-3 uses this approach. But Domino instead uses teacher forcing: feed the GRU ground-truth token embeddings during training.

training strategy comparison · τ by method (Qwen3-8B, ShareGPT), data from paper Figure 4 (right)
  • TTT (self-generated): τ = 3.80
  • Teacher Forcing: τ = 3.96
  • TF + Curriculum: τ = 4.19
  • DFlash reference: τ = 3.62

Teacher forcing works better for a subtle reason: the causal encoder's correction at position i only matters when all preceding draft tokens have been accepted by the target model. If position 1 is rejected, positions 2–4 are never reached. So training the GRU on clean ground-truth prefixes focuses learning exactly on the regime that matters — the accepted-prefix regime. Noisy self-generated prefixes create an input–output mapping that doesn't exist in the data distribution.
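The "accepted-prefix regime" argument can be made concrete with a small calculation. In the sketch below, `accept_probs[i]` is a hypothetical probability that draft position i+1 is accepted given that every earlier position was accepted. The values are illustrative, not taken from the paper.

```python
def reach_probabilities(accept_probs):
    """P(position i is evaluated) = product of acceptance probabilities before it."""
    reach, running = [], 1.0
    for p in accept_probs:
        reach.append(running)
        running *= p
    return reach


def expected_acceptance_length(accept_probs):
    """Expected tokens per cycle: accepted draft tokens plus the target's own token."""
    total, running = 1.0, 1.0
    for p in accept_probs:
        running *= p          # probability that this whole prefix was accepted
        total += running
    return total


probs = [0.8, 0.8, 0.8, 0.8]   # illustrative values
print([round(r, 3) for r in reach_probabilities(probs)])
print(round(expected_acceptance_length(probs), 4))
```

Position 4 is only evaluated when the first three draft tokens were all accepted, which happens about half the time in this toy setting. A correction trained on prefixes the target would reject spends capacity on cases that never contribute to τ.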

Problem 2: Backbone collapse

Teacher forcing introduces a new failure mode: since the correction branch receives clean, informative prefixes during training, it can shortcut the parallel backbone. The backbone's base logits become weak — all the predictive signal flows through the correction branch, which has no fallback if backbone representations degrade.

The fix is a base-anchored curriculum: jointly train on both base logits and final logits, with a loss weight λ_t that starts at 1.0 (pure backbone loss) and linearly anneals to 0.0 (pure final-logit loss):

base-anchored curriculum loss
L = (1 − λ_t) · L_final + λ_t · L_base
// λ_t: 1.0 → 0.0 linearly over training; both terms are cross-entropy with exponential position decay

This forces the backbone to develop a strong base distribution early in training, before the Domino head takes over residual correction. Without it, backbone loss plateaus high; with it, both components learn complementary representations and backbone loss decreases steadily.
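A minimal sketch of the schedule and the blended loss follows. It operates on per-position cross-entropy values that are assumed to be computed elsewhere. The decay rate of 0.8 and the normalisation of the position weights are hypothetical choices, since the article only says the decay is exponential.

```python
def lambda_schedule(step, total_steps):
    """Linear anneal from 1.0 (base loss only) to 0.0 (final loss only)."""
    return max(0.0, 1.0 - step / total_steps)


def position_weights(block_size, decay):
    """Exponentially decaying per-position weights, normalised to sum to 1."""
    raw = [decay ** i for i in range(block_size)]
    total = sum(raw)
    return [w / total for w in raw]


def curriculum_loss(ce_final, ce_base, step, total_steps, decay=0.8):
    """L = (1 - lambda_t) * L_final + lambda_t * L_base for one draft block.

    ce_final, ce_base: per-position cross-entropy of the final and base logits.
    """
    lam = lambda_schedule(step, total_steps)
    weights = position_weights(len(ce_final), decay)
    l_final = sum(w * c for w, c in zip(weights, ce_final))
    l_base = sum(w * c for w, c in zip(weights, ce_base))
    return (1.0 - lam) * l_final + lam * l_base


for step in (0, 500, 1000):
    print(step, round(curriculum_loss([1.0, 1.2], [2.0, 2.4], step, 1000), 4))
```

Early in training the loss is dominated by the base term, so gradients push the backbone. By the end only the corrected logits are scored.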

Three-way ablation summary: TTT → τ = 3.80. Teacher forcing alone → τ = 3.96 (+4.2%). Teacher forcing + curriculum → τ = 4.19 (+10.3% over TTT). Each component contributes independently.
06 — Results

Up to 7.92× Speedup on Qwen3

Domino is evaluated on Qwen3-4B and Qwen3-8B across 8 benchmarks spanning math, code, and dialogue. Domino and DFlash use a 16-token draft block; EAGLE-3 is reported at budgets of 16 and 60, and DART at 60. The comparison includes EAGLE-3 (autoregressive), DART and DFlash (parallel), and FR-Spec (vocabulary-efficient).

end-to-end speedup · Qwen3-8B, greedy (T=0), Transformers backend
  • EAGLE-3 (autoregressive, budget=16): 2.21×
  • EAGLE-3 (autoregressive, budget=60): 2.56×
  • DART (parallel, budget=60): 2.29×
  • DFlash (parallel, budget=16): 4.66×
  • Domino (budget=16): 5.49×

Average over GSM8K, MATH-500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, and Alpaca.

The headline numbers: on Qwen3-8B with greedy decoding, Domino improves over DFlash from 4.66× to 5.49× average speedup (a 17.8% relative gain). The peak is 7.92× on GSM8K, where acceptance length reaches τ = 10.03. With sampling (T=1), average speedup goes from 3.96× (DFlash) to 4.46× (Domino).

Per-benchmark breakdown (Qwen3-8B, T=0)

Method         GSM8K   MATH-500   HumanEval   MBPP    MT-Bench   Avg (all 8)
EAGLE-3 (16)   2.21×   2.09×      2.17×       1.93×   1.82×      1.97×
DART (60)      2.28×   2.29×      2.52×       2.39×   2.27×      2.29×
DFlash (16)    5.21×   6.18×      5.21×       4.71×   2.73×      4.66×
Domino (16)    7.92×   7.38×      5.89×       5.53×   3.29×      5.49×

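The largest relative gain over DFlash is on GSM8K (+52%). The other columns sit closer together: MT-Bench about +20%, MATH-500 +19%, MBPP +17%, and HumanEval +13%. The GSM8K result fits the hypothesis that reasoning tasks have more predictable token sequences, which gives causal correction more room to improve acceptance length, though on these numbers open-ended dialogue gains about as much as MATH-500 and the code benchmarks. In absolute terms, speedup remains lowest on MT-Bench (3.29×).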
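The relative gains can be recomputed directly from the DFlash and Domino rows of the table above. The snippet only uses the speedups printed there; the dictionary names are illustrative.

```python
dflash = {"GSM8K": 5.21, "MATH-500": 6.18, "HumanEval": 5.21,
          "MBPP": 4.71, "MT-Bench": 2.73}
domino = {"GSM8K": 7.92, "MATH-500": 7.38, "HumanEval": 5.89,
          "MBPP": 5.53, "MT-Bench": 3.29}

# Relative gain of Domino over DFlash on each benchmark column.
gains = {name: domino[name] / dflash[name] - 1.0 for name in dflash}

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {gain:+.1%}")
```

Sorting by gain makes it easy to see which benchmark drives the average and how the dialogue column compares with the math and code columns.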

High-concurrency serving (SGLang)

At serving scale (Table 2), Domino achieves up to 5.8× throughput on Qwen3-8B at concurrency=2 on GSM8K. The gains persist across all concurrency levels tested (2–32), confirming that Domino's improved draft quality translates to higher tokens-per-second in production serving environments, not just low-concurrency benchmarks.

Bottom line: By spending 56M parameters and 1.2ms to inject causal information into a block-parallel drafter, Domino outperforms the autoregressive and parallel baselines reported here at a 16-token budget, reaching acceptance lengths close to EAGLE-3's with drafting overhead close to DFlash's.