BitLM: Generating Phrases, Not Tokens, with Binary Diffusion

01 — The One-Token Bottleneck

Why does AR generate one token at a time?

When a standard autoregressive model generates "the cat sat on the mat," it makes six sequential categorical decisions, one per token. Each decision samples from a softmax distribution over the entire vocabulary (often 128k entries), appends the result, and starts again. The sequential dependency is baked into the architecture: token i+1 can only be produced once token i is known.

This interface has two costs. At training time, the model learns to predict one token conditional on all previous ones — it never learns to jointly commit to, say, the noun phrase "the cat" as a unit. At inference time, each token requires a full forward pass through the backbone, making generation latency proportional to output length.

The key question BitLM asks: Is the vocabulary softmax a fundamental requirement of language modeling, or is it just a historical interface choice? If tokens can be represented differently, do better generation regimes become possible?

Speculative decoding and multi-token prediction heads are the usual answers — but both keep the large-vocabulary softmax intact and add machinery on top. BitLM's answer is more radical: replace the softmax with a completely different output interface.

[Interactive animation: Standard AR (one token at a time) vs. BitLM (block of 4, committed together)]

In the animation above, AR spends one step per token while BitLM's backbone runs once per block of 4 tokens, with the diffusion head handling sub-steps internally. The result: 2 backbone calls instead of 8 for the same 8-token output, a quarter as many.
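The call counts can be checked with a few lines of arithmetic. This is a minimal sketch: `backbone_calls` is a hypothetical helper name, and it counts backbone forward passes only, not the denoising sub-steps run inside the diffusion head.

```python
import math

def backbone_calls(num_tokens: int, block_size: int) -> int:
    """Backbone forward passes needed to emit num_tokens, one pass per block."""
    return math.ceil(num_tokens / block_size)

ar_calls = backbone_calls(8, block_size=1)     # standard AR: one call per token
bitlm_calls = backbone_calls(8, block_size=4)  # BitLM with blocks of 4
print(ar_calls, bitlm_calls)
```

The total cost per block also depends on how cheap the diffusion head is relative to the backbone, which this count deliberately leaves out.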

02 — Tokens as Binary Codes

Each token becomes a point on a binary hypercube

BitLM's core idea starts with a simple observation: a vocabulary of size V can be uniquely indexed with B = ⌈log₂ V⌉ bits. For a typical 128k tokenizer, that's B = 17. BitLM uses B = 18 to give a comfortable margin. Every token id y ∈ {0, ..., V−1} gets a fixed binary code:

Binary Encoding

$$\phi(y) = 2 \cdot \mathrm{bin}_{18}(y) - 1 \in \{-1, +1\}^{18}$$

This maps bit 0 to −1 and bit 1 to +1, placing each token on a vertex of the 18-dimensional hypercube.

The crucial point: this is a fixed identifier, not a learned codebook. The tokenizer doesn't change. Only the output layer changes — instead of a 128k-dimensional softmax, the model now deals with an 18-dimensional binary vector.
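The encoding and its inverse can be sketched in a few lines of NumPy. The function names `encode` and `decode` and the most-significant-bit-first ordering are assumptions made for illustration; the text does not specify a bit order.

```python
import numpy as np

B = 18  # bits per token, as stated in the text

def encode(token_id: int, num_bits: int = B) -> np.ndarray:
    """Map a token id to its +/-1 code (assumed order: most significant bit first)."""
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = np.right_shift(token_id, shifts) & 1
    return (2 * bits - 1).astype(np.float32)

def decode(code: np.ndarray) -> int:
    """Invert encode(): threshold each coordinate at zero, then read the bits as an integer."""
    bits = (np.asarray(code) > 0).astype(np.int64)
    weights = 2 ** np.arange(len(bits) - 1, -1, -1)
    return int(bits @ weights)

code = encode(1000)
print(code.shape, decode(code))
```

Because `decode` only looks at the sign of each coordinate, it also recovers the id from a mildly perturbed code, which is what lets a continuous process end in a discrete token.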

Why does this help?

On a softmax simplex, every vocabulary item is equidistant from every other, so there is no geometry to exploit. On a binary hypercube, tokens whose codes share many bits are nearby in Hamming distance. Two tokens with adjacent ids (say, token 1000 = 1111101000 and token 1001 = 1111101001) can differ by exactly one bit. The geometry now carries information about the token index structure.
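A quick way to see this geometry is to count differing bits between token ids. The helper name `hamming` is an illustrative choice. The second call shows a caveat worth keeping in mind: ids that are numerically adjacent are not always close on the hypercube.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two token ids differ."""
    return bin(a ^ b).count("1")

print(hamming(1000, 1001))  # 1: only the lowest bit differs
print(hamming(1023, 1024))  # 11: a carry flips every bit from position 0 to 10
```

In the ±1 representation, each differing bit contributes 2² = 4 to the squared Euclidean distance, so squared distance between two codes is four times their Hamming distance.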

More importantly, the hypercube geometry makes joint denoising natural. A diffusion process can start from continuous Gaussian noise anywhere in ℝ¹⁸ and iteratively commit to the nearest ±1 vertex. This is the same idea as "Analog Bits" in image generation — applied to text.

[Interactive figure: 3-bit toy vocabulary, 8 tokens on the corners of a cube]

The 3-bit toy above shows 8 tokens at the corners of a cube. In BitLM's actual 18-bit space, 128k tokens occupy 128k of the 2¹⁸ = 262,144 possible vertices, leaving 134,144 unused corners. These act as "void" states: the sign() projection can land on one of them, in which case the decoded index is clamped back to a real token.
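The vertex counts follow from simple arithmetic, shown below. The vocabulary size of exactly 128,000 is inferred from the 134,144 figure in the text. The fallback in `to_valid_id` is a hypothetical reading of "clamping" (pin any out-of-range id to the last valid token); the text does not spell out the exact rule.

```python
V = 128_000            # vocabulary size implied by the numbers in the text
B = 18                 # bits per token
num_vertices = 2 ** B  # corners of the 18-dimensional hypercube
unused = num_vertices - V
print(num_vertices, unused)

def to_valid_id(decoded_id: int, vocab_size: int = V) -> int:
    """Hypothetical fallback: clamp an out-of-range decoded id into [0, vocab_size - 1]."""
    return max(0, min(decoded_id, vocab_size - 1))
```

Other fallbacks are possible, for example snapping to the nearest used vertex in Hamming distance, so this should be read as one option among several.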

03 — Architecture

A causal backbone plus a lightweight diffusion head

BitLM's design separates two responsibilities cleanly: the backbone reasons about what comes next, and the diffusion head realizes it as discrete tokens.

The block-causal mask (shown as an inset in the architecture diagram) is the key structural choice. Within a block, all positions can attend to each other, so the backbone builds a joint representation of the block before the diffusion head decides which tokens to emit. Across blocks, strictly causal order is maintained, so left-to-right language structure is preserved.

When block size m = 1, the block-causal mask degenerates to standard causal attention and BitLM is equivalent to standard AR in binary space. At m = 4 (the setting used in experiments), the model emits 4 tokens per backbone call — a 4× inference parallelism factor in the backbone.
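The mask described above can be built by comparing block indices. This is a minimal sketch, with `block_causal_mask` as an illustrative name and a boolean convention (True means "may attend") chosen for readability.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True if position i may attend to position j."""
    block_ids = np.arange(seq_len) // block_size
    # Position i sees every position whose block is the same as or earlier than its own.
    return block_ids[:, None] >= block_ids[None, :]

print(block_causal_mask(8, 4).astype(int))
```

With `block_size=1` the result is the ordinary lower-triangular causal mask, matching the m = 1 case in the text; with `block_size=4` each 4×4 diagonal block is fully visible.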

Separation of concerns: The backbone handles what to say (left-to-right causal computation), the diffusion head handles how to say it (joint lexical realization). This is different from speculative decoding, where a draft model handles "what" while the verifier corrects it — here there is no correction step, just progressive commitment.

04 — Iterative Denoising

From Gaussian noise to committed tokens in K steps

Once the backbone produces the context latent C⁽ⁿ⁻¹⁾ for block n, the diffusion head takes over. It starts from pure Gaussian noise and iteratively "commits" bit-by-bit toward the target binary block.

Denoising Schedule (straight-line ODE)

$$A_t = (1 - t)\,A_0 + t\,\epsilon$$

Here t = 1 is pure noise and t = 0 is the clean binary block. Each step:

$$\hat{A}_0 = \mathrm{DiffHead}\big(A_{t_k},\, t_k;\, C^{(n-1)}\big)$$

$$A_{t_{k-1}} = \frac{t_{k-1}}{t_k}\,A_{t_k} + \Big(1 - \frac{t_{k-1}}{t_k}\Big)\,\hat{A}_0$$

After K=15 steps, the model applies a hard sign() projection to snap the continuous values back to the ±1 binary hypercube. The 18 resulting bits are interpreted as a token ID via the inverse map φ⁻¹.
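The sampling loop can be sketched as follows. Several pieces are assumptions for illustration: the uniform time grid from t = 1 to t = 0, the function name `denoise_block`, and the toy `oracle_head`, which stands in for the learned diffusion head by always returning a fixed target block.

```python
import numpy as np

def denoise_block(diff_head, context, shape, num_steps=15, seed=0):
    """Run the straight-line sampler from t=1 (noise) to t=0, then snap to +/-1."""
    rng = np.random.default_rng(seed)
    a_t = rng.standard_normal(shape)               # state at t = 1: pure Gaussian noise
    times = np.linspace(1.0, 0.0, num_steps + 1)   # assumed uniform grid t_K=1, ..., t_0=0
    for t_k, t_prev in zip(times[:-1], times[1:]):
        a0_hat = diff_head(a_t, t_k, context)      # head's estimate of the clean block
        ratio = t_prev / t_k
        a_t = ratio * a_t + (1.0 - ratio) * a0_hat # interpolate toward the estimate
    return np.where(a_t >= 0, 1, -1)               # hard sign projection onto the hypercube

# Toy stand-in for the learned head: an oracle that always predicts the target block.
target = np.where(np.random.default_rng(1).random((4, 18)) < 0.5, -1, 1)
oracle_head = lambda a_t, t, context: target
block = denoise_block(oracle_head, context=None, shape=(4, 18))
print((block == target).all())
```

On the final step the ratio is zero, so the state becomes the head's last estimate before the sign projection. A real head would refine its estimate across steps rather than return the same block each time.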

The visualization below shows how a block of 4 token slots starts from noise and progressively commits as denoising steps accumulate. Each row represents one token in the block; each column represents one of the 18 bits. Gray = uncertain, blue = committed to +1, red = committed to −1.

[Interactive figure: denoising of a 4-token block over 15 steps, target "the" | "cat" | "sat" | "on"; step 0 is pure noise with no commitment yet]

Notice that commitment isn't uniform: some bits "lock in" early (usually the high-order bits, which fix the coarse range of the token id), while lower-order bits stay uncertain longer. A loose analogy is how humans resolve ambiguous words: first the grammatical category, then the specific lexeme.

05 — CFG & Denoising Steps

K = 15 steps and CFG = 9.0 are the sweet spot

Two inference hyperparameters determine output quality: the number of denoising steps K and the classifier-free guidance (CFG) scale ω. Both follow the familiar diffusion quality-speed trade-off.

Classifier-free guidance amplifies the conditional signal: the model is run twice per denoising step — once conditioned on the context C⁽ⁿ⁻¹⁾ and once with a null context — and the prediction is extrapolated in the direction of the conditional:

Classifier-Free Guidance

$$\hat{A}_{0,\mathrm{CFG}} = (1 + \omega) \cdot \mathrm{DiffHead}(A_t, t; C) - \omega \cdot \mathrm{DiffHead}(A_t, t; \emptyset)$$

Here ω = 0 means no guidance, and ω = 9 is the strong guidance used in the BitLM experiments.
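The guidance rule is a single linear combination of the two head outputs. In this sketch, `cfg_combine` is an illustrative name and the input arrays are made-up numbers standing in for the conditional and null-context predictions.

```python
import numpy as np

def cfg_combine(cond: np.ndarray, uncond: np.ndarray, omega: float) -> np.ndarray:
    """Extrapolate from the null-context prediction toward the conditional one."""
    return (1.0 + omega) * cond - omega * uncond

cond = np.array([0.5, -0.4, 0.1])    # made-up conditional prediction
uncond = np.array([0.2, -0.1, 0.3])  # made-up null-context prediction
guided = cfg_combine(cond, uncond, omega=9.0)
print(guided)
```

Even with both inputs inside [−1, +1], the guided output at ω = 9 can land well outside that range, since the rule amplifies the difference between the two predictions by a factor of 1 + ω.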

[Interactive figure: ROUGE-1 vs. denoising steps for three CFG values: CFG = 1.0 (no guidance), CFG = 5.0, CFG = 9.0 (used in paper)]

More steps help only up to a point: beyond K≈15 the improvement plateaus while computation keeps growing linearly. High CFG scales need more steps to stabilize because each guided update extrapolates further from the unguided prediction, which is why the K=15, CFG=9 combination wins: enough steps to settle the strongly guided trajectory, not so many as to waste compute.

There's also an interesting interaction: at K=1 with strong CFG, quality is worse than no guidance at all (the model overshoots in a single step). This mirrors observations in continuous diffusion for images.

06 — Scaling & Results

Smooth scaling, promising XSum numbers

BitLM was pretrained on FineWeb-350BT (350B tokens from FineWeb) and fine-tuned on XSum summarization. The model follows the Qwen-3 backbone architecture with a BitDance-style diffusion head, trained at four scales: 0.6B, 1.7B, 4B, and 8B parameters.

Scaling behavior

Pretraining loss decreases smoothly across all four scales with no signs of instability — evidence that the binary-space output interface is compatible with standard scaling recipes. The model doesn't require any special architecture modifications beyond swapping the softmax head for the diffusion head.

XSum ROUGE results

After supervised fine-tuning on XSum, BitLM 8B with the diffusion head achieves ROUGE-1/2/L of 26.05 / 6.44 / 20.12, above the Lead-3 baseline and approaching (but not yet matching) the pointer-generator baselines from 2017. The paper is transparent about this gap: the current result validates that the binary-space approach is viable, while identifying concrete directions for improvement (better fine-tuning, adaptive block sizes, hybrid designs).

Method	ROUGE-1
Lead-3 (baseline)	16.30
PTGEN (See et al., 2017)	29.70
PTGEN + Coverage	28.10
BitLM 8B — LM Head (FT)	23.20
BitLM 8B — Diff Head (FT) ★	26.05

Notably, the diffusion head outperforms the LM head on the same backbone — the binary denoising interface is not just comparable to the softmax, it's better on this task after fine-tuning. This suggests that joint within-block token generation has a real inductive bias advantage for language tasks with multi-word coherence requirements.

Full results table

Method	ROUGE-1	ROUGE-2	ROUGE-L
Lead-3	16.30	1.60	11.95
PTGEN (See et al., 2017)	29.70	9.21	23.24
PTGEN + Coverage	28.10	8.02	21.72
BitLM 8B — LM Head (pretrain)	10.06	2.64	8.78
BitLM 8B — LM Head (fine-tune)	23.20	4.45	18.04
BitLM 8B — Diff Head (pretrain)	19.49	2.03	15.19
BitLM 8B — Diff Head (fine-tune)	26.05	6.44	20.12
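The margins in the table can be read off with a short calculation. The numbers below are copied from the fine-tuned rows and the PTGEN row; the dictionary keys and the `delta` helper are illustrative names.

```python
# (ROUGE-1, ROUGE-2, ROUGE-L) copied from the results table
rouge = {
    "ptgen": (29.70, 9.21, 23.24),
    "lm_head_ft": (23.20, 4.45, 18.04),
    "diff_head_ft": (26.05, 6.44, 20.12),
}

def delta(a: str, b: str) -> tuple:
    """Per-metric difference rouge[a] - rouge[b], rounded to two decimals."""
    return tuple(round(x - y, 2) for x, y in zip(rouge[a], rouge[b]))

head_gain = delta("diff_head_ft", "lm_head_ft")  # diffusion head vs. LM head, same backbone
ptgen_gap = delta("ptgen", "diff_head_ft")       # remaining gap to the pointer-generator
print(head_gain, ptgen_gap)
```

On these figures the diffusion head leads the LM head by about 2 to 3 points on each metric, and trails PTGEN by about 3 to 4 points.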

The bigger takeaway: BitLM shows that replacing the vocabulary softmax with binary-space diffusion is viable at scale. The remaining performance gap versus optimized seq2seq baselines looks more like a tuning gap than an architectural ceiling, though the results here do not settle that, and those baselines were themselves replaced by LLMs years ago. BitLM's contribution is showing the design space exists, not declaring it closed.