Why does AR generate one token at a time?
When a standard autoregressive model generates "the cat sat on the mat," it makes six sequential categorical decisions, one per token. Each decision samples from a softmax distribution over the entire vocabulary (often 128k entries), appends the result, and starts again. The sequential dependency is baked into the architecture: token i+1 can only be produced once token i is known.
This interface has two costs. At training time, the model learns to predict one token conditional on all previous ones — it never learns to jointly commit to, say, the noun phrase "the cat" as a unit. At inference time, each token requires a full forward pass through the backbone, making generation latency proportional to output length.
Speculative decoding and multi-token prediction heads are the usual answers — but both keep the large-vocabulary softmax intact and add machinery on top. BitLM's answer is more radical: replace the softmax with a completely different output interface.
In the animation above, AR spends one backbone step per token, while BitLM's backbone runs once per block of 4 tokens and the diffusion head handles the denoising sub-steps internally. The result: two backbone calls instead of eight for the same 8-token output.
Each token becomes a point on a binary hypercube
BitLM's core idea starts with a simple observation: a vocabulary of size V can be uniquely indexed with B = ⌈log₂ V⌉ bits. For a typical 128k tokenizer, that's B = 17. BitLM uses B = 18 to give a comfortable margin. Every token id y ∈ {0, ..., V−1} gets a fixed binary code φ(y) ∈ {−1, +1}ᴮ: the B-bit binary expansion of y, with each bit written as ±1.
The crucial point: this is a fixed identifier, not a learned codebook. The tokenizer doesn't change. Only the output layer changes — instead of a 128k-dimensional softmax, the model now deals with an 18-dimensional binary vector.
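The fixed code described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `encode` and `decode` are hypothetical, and the bit order (most significant bit first) and the convention that bit 1 maps to +1 are assumptions.

```python
NUM_BITS = 18  # B in the text

def encode(token_id: int, num_bits: int = NUM_BITS) -> list:
    """Map a token id to a vector in {-1, +1}^num_bits, most significant bit first.

    Assumed convention: bit 1 -> +1, bit 0 -> -1.
    """
    if not 0 <= token_id < 2 ** num_bits:
        raise ValueError("token id does not fit in num_bits bits")
    return [1 if (token_id >> i) & 1 else -1 for i in reversed(range(num_bits))]

def decode(code: list) -> int:
    """Inverse map: read the signs back as bits and rebuild the integer id."""
    token_id = 0
    for value in code:
        token_id = (token_id << 1) | (1 if value > 0 else 0)
    return token_id

code = encode(1000)
round_trip = decode(code)  # recovers 1000
```

Because the code is just the binary expansion of the id, nothing here is trained, and the round trip is exact for every id that fits in 18 bits.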
Why does this help?
On a softmax simplex, every vocabulary item is equidistant from every other, so there's no geometry to exploit. On a binary hypercube, distance is the number of differing bits, so some tokens are closer than others. Two tokens with neighboring ids (say, token 1000 = 1111101000 and token 1001 = 1111101001) can differ by exactly one bit. The geometry now carries information about the token index structure.
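A quick way to check distances on the hypercube is to count differing bits (the Hamming distance). The helper name `hamming` is illustrative. The second call shows a caveat worth keeping in mind: neighboring ids are not always neighbors on the cube.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two token ids differ."""
    return bin(a ^ b).count("1")

near = hamming(1000, 1001)  # 1: only the lowest bit differs
far = hamming(1023, 1024)   # 11: the increment carries through every low bit
```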
More importantly, the hypercube geometry makes joint denoising natural. A diffusion process can start from continuous Gaussian noise anywhere in ℝ¹⁸ and iteratively commit to the nearest ±1 vertex. This is the same idea as "Analog Bits" in image generation — applied to text.
The 3-bit toy above shows 8 tokens at the corners of a cube. In BitLM's actual 18-bit space, 128k tokens occupy 128k of the 2¹⁸ = 262,144 possible vertices. The remaining 134,144 corners are unused "void" states; when the sign() projection lands on one of them, the result is mapped to a real token by clamping.
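The vertex count and the projection step can be sketched as follows. The text says only that void vertices are handled "by clamping"; the specific rule used here (clamp to the largest valid id) is an assumption, as are the names `sign_project` and `bits_to_token` and the tie-breaking at exactly zero.

```python
VOCAB_SIZE = 128_000  # "128k" read as 128,000, matching the counts in the text
NUM_BITS = 18

def sign_project(values):
    """Hard sign() projection onto {-1, +1}; an exact 0 goes to +1 (assumption)."""
    return [1 if v >= 0 else -1 for v in values]

def bits_to_token(values, vocab_size=VOCAB_SIZE):
    """Decode continuous values to a token id, most significant bit first."""
    token_id = 0
    for bit in sign_project(values):
        token_id = (token_id << 1) | (1 if bit > 0 else 0)
    # Assumed clamping rule: void vertices map to the largest valid id.
    return min(token_id, vocab_size - 1)

unused_vertices = 2 ** NUM_BITS - VOCAB_SIZE     # 134144
void_example = bits_to_token([0.7] * NUM_BITS)   # all bits +1 -> 262143, then clamped
```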
A causal backbone plus a lightweight diffusion head
BitLM's design separates two responsibilities cleanly: the backbone reasons about what comes next, and the diffusion head realizes it as discrete tokens.
The block-causal mask (shown in the figure inset) is the key structural choice. Within a block, all positions can attend to each other, so the backbone builds a joint representation of the block before the diffusion head decides which tokens to emit. Across blocks, strictly causal order is maintained, so left-to-right language structure is preserved.
When block size m = 1, the block-causal mask degenerates to standard causal attention and BitLM is equivalent to standard AR in binary space. At m = 4 (the setting used in experiments), the model emits 4 tokens per backbone call — a 4× inference parallelism factor in the backbone.
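The mask described above is easy to write down explicitly. The sketch below is an illustration under the stated rule (full attention within a block, causal across blocks); the function name `block_causal_mask` and the boolean convention (True means "may attend") are assumptions.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list:
    """mask[i][j] is True when position i may attend to position j.

    Position i sees every position whose block index is not later than its own.
    """
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_causal_mask(8, 4)     # two blocks of 4 tokens
causal = block_causal_mask(4, 1)   # block size 1: ordinary lower-triangular mask
```

With `block_size=1` the condition reduces to `j <= i`, which is the standard causal mask, matching the m = 1 case in the text.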
From Gaussian noise to committed tokens in K steps
Once the backbone produces the context latent C⁽ⁿ⁻¹⁾ for block n, the diffusion head takes over. It starts from pure Gaussian noise and iteratively "commits" bit-by-bit toward the target binary block.
Each step:

$$\hat{A}_0 = \mathrm{DiffHead}\big(A_{t_k},\, t_k;\, C^{(n-1)}\big)$$

$$A_{t_{k-1}} = \frac{t_{k-1}}{t_k}\, A_{t_k} + \Big(1 - \frac{t_{k-1}}{t_k}\Big)\, \hat{A}_0$$
After K=15 steps, the model applies a hard sign() projection to snap the continuous values back to the ±1 binary hypercube. The 18 resulting bits are interpreted as a token ID via the inverse map φ⁻¹.
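The sampling loop above can be sketched with a stand-in for the diffusion head. Everything model-specific here is an assumption: the linear time schedule from t = 1 down to t = 0, the name `sample_block`, and the toy `oracle_head`, which simply returns the target bits and exists only to make the loop runnable. A real block would hold 4 × 18 values rather than the 4 used here.

```python
import random

def sample_block(diff_head, context, num_values, num_steps=15, seed=0):
    """Run the iterative update from Gaussian noise, then apply sign()."""
    rng = random.Random(seed)
    # Assumed linear schedule: ts[num_steps] = 1 (pure noise) down to ts[0] = 0.
    ts = [k / num_steps for k in range(num_steps + 1)]
    a = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for k in range(num_steps, 0, -1):
        a0_hat = diff_head(a, ts[k], context)   # predicted clean block
        ratio = ts[k - 1] / ts[k]
        # Interpolate between the current noisy state and the prediction.
        a = [ratio * x + (1.0 - ratio) * x0 for x, x0 in zip(a, a0_hat)]
    return [1 if x >= 0 else -1 for x in a]     # hard sign() projection

def oracle_head(a_t, t, context):
    """Toy stand-in: ignores the noisy input and returns the target bits."""
    return [float(b) for b in context]

target = [1, -1, -1, 1]
bits = sample_block(oracle_head, target, len(target))
```

On the final step the ratio is zero, so the state is replaced entirely by the head's last prediction before the sign projection.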
The visualization below shows how a block of 4 token slots starts from noise and progressively commits as denoising steps accumulate. Each row represents one token in the block; each column represents one of the 18 bits. Gray = uncertain, blue = committed to +1, red = committed to −1.
Notice that commitment isn't uniform: some bits "lock in" early (usually the high-order bits that determine the token's coarse identity), while lower-order bits stay uncertain longer. This is loosely analogous to how humans resolve ambiguous words: first the grammatical category, then the specific lexeme.
K = 15 steps and CFG = 9.0 are the sweet spot
Two inference hyperparameters determine output quality: the number of denoising steps K and the classifier-free guidance (CFG) scale ω. Both follow the familiar diffusion quality-speed trade-off.
Classifier-free guidance amplifies the conditional signal: the model is run twice per denoising step, once conditioned on the context C⁽ⁿ⁻¹⁾ and once with a null context, and the prediction is extrapolated from the unconditional output in the direction of the conditional one.
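One common way to write this extrapolation is sketched below. The exact parameterization BitLM uses is not given here, so the convention is an assumption: a scale of 1 returns the conditional prediction unchanged and a scale of 0 returns the unconditional one. The name `guided_prediction` is illustrative.

```python
def guided_prediction(cond, uncond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, -1.0]     # prediction given the context
uncond = [0.5, -0.5]   # prediction given a null context

plain = guided_prediction(cond, uncond, 1.0)   # [1.0, -1.0]
strong = guided_prediction(cond, uncond, 9.0)  # [5.0, -5.0]
```

With a large scale the guided values land far outside the ±1 range of the hypercube, which gives some intuition for why strong guidance benefits from several denoising steps.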
More steps help up to a point: beyond K≈15 the improvement plateaus while computation keeps growing linearly. High CFG scales require more steps to stabilize (the guided update is more aggressive), which is why the K=15, CFG=9 combination wins: enough steps to absorb the strong guidance, not so many as to waste compute.
There's also an interesting interaction: at K=1 with strong CFG, quality is worse than no guidance at all (the model overshoots in a single step). This mirrors observations in continuous diffusion for images.
Smooth scaling, promising XSum numbers
BitLM was pretrained on FineWeb-350BT (350B tokens from FineWeb) and fine-tuned on XSum summarization. The model follows the Qwen-3 backbone architecture with a BitDance-style diffusion head, trained at four scales: 0.6B, 1.7B, 4B, and 8B parameters.
Scaling behavior
Pretraining loss decreases smoothly across all four scales with no signs of instability — evidence that the binary-space output interface is compatible with standard scaling recipes. The model doesn't require any special architecture modifications beyond swapping the softmax head for the diffusion head.
XSum ROUGE results
After supervised fine-tuning on XSum, BitLM 8B with the diffusion head achieves ROUGE-1/2/L of 26.05 / 6.44 / 20.12, above the Lead-3 baseline and approaching (but not yet matching) the pointer-generator baselines from 2017. The paper is transparent about this gap: the current result validates that the binary-space approach is viable, while identifying concrete directions for improvement (better fine-tuning, adaptive block sizes, hybrid designs).
Notably, after fine-tuning the diffusion head outperforms the LM head on the same backbone (ROUGE-1 of 26.05 vs. 23.20): the binary denoising interface is not just comparable to the softmax, it's better on this task. This suggests that joint within-block token generation may carry a useful inductive bias for language tasks with multi-word coherence requirements.
Full results table
| Method | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Lead-3 | 16.30 | 1.60 | 11.95 |
| PTGEN (See et al., 2017) | 29.70 | 9.21 | 23.24 |
| PTGEN + Coverage | 28.10 | 8.02 | 21.72 |
| BitLM 8B — LM Head (pretrain) | 10.06 | 2.64 | 8.78 |
| BitLM 8B — LM Head (fine-tune) | 23.20 | 4.45 | 18.04 |
| BitLM 8B — Diff Head (pretrain) | 19.49 | 2.03 | 15.19 |
| BitLM 8B — Diff Head (fine-tune) | 26.05 | 6.44 | 20.12 |