FP4 at Scale: NVFP4 Pretraining for LLMs

01 — Why FP4 Is Hard

A single outlier ruins the whole block

4-bit floating point (FP4 E2M1) has 16 bit patterns but only 15 distinct values, because +0 and −0 coincide: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6. To represent a tensor in FP4 you pick a scale factor that maps the largest-magnitude value to ±6, then round every other value to the nearest FP4 step. This works fine when a block of activations is well-behaved, but transformer layers are full of outliers: dimensions that carry unusually large activations.

If one value in a 16-element block is 18.3 while the rest sit between ±1.5, the scale is forced to accommodate 18.3. The scale factor becomes 18.3/6 ≈ 3.05, so the smallest nonzero representable magnitude is 0.5 × 3.05 ≈ 1.525. Any value with magnitude below about 0.76 (half of that first step) rounds to zero and simply vanishes from the representation. The figure below contrasts a clean block with one containing an outlier to show how one large value silences the rest.

Figure: FP4 block quantization, outlier effect

The scale is a shared resource. A single large outlier forces every other value in the block onto the same coarse grid. In the outlier block above, 6 of the 15 normal values round to zero entirely, so their information is lost for the forward pass and the resulting gradient update. This is the core challenge of 4-bit training: the dynamic range budget is tiny, and outliers are common.
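The effect can be reproduced in a few lines. The following is a minimal sketch, not the paper's implementation: the block values are made up for illustration, the helper names `round_to_fp4` and `quantize_block` are hypothetical, and ties are broken toward the smaller magnitude rather than with round-to-nearest-even.

```python
# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(v):
    """Round v to the nearest FP4 E2M1 value (ties go to the smaller magnitude)."""
    mag = min(FP4_MAGNITUDES, key=lambda g: abs(abs(v) - g))
    return mag if v >= 0 else -mag


def quantize_block(block):
    """Fake-quantize one block: scale so max |x| maps to 6, round, rescale."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    return [round_to_fp4(v / scale) * scale for v in block], scale


# Illustrative data (an assumption, not taken from the paper).
clean = [1.2, -0.4, 0.9, -1.5, 0.3, 0.7, -1.1, 0.5,
         -0.2, 1.4, -0.8, 0.6, 1.0, -0.6, 0.1, -1.3]
outlier = clean[:-1] + [18.3]   # same block, last element replaced by an outlier

for name, block in [("clean", clean), ("outlier", outlier)]:
    deq, scale = quantize_block(block)
    zeros = sum(1 for v in deq if v == 0.0)
    print(f"{name}: scale={scale:.3f}, values flushed to zero={zeros}")
```

With the clean block the scale is small enough that every value survives; once the outlier sets the scale, everything below half of the first grid step is flushed to zero.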

💡

FP8 handles this gracefully because its 256 bit patterns cover a much wider relative range. Going from FP8 (E4M3) to FP4 (E2M1) cuts the mantissa from 3 bits to 1 and the exponent from 4 bits to 2, so the grid has only 7 positive nonzero magnitudes instead of roughly 126. Even a modest outlier can steal precious resolution.

02 — NVFP4 Format

Two levels of scaling, half the block size

NVIDIA's NVFP4 addresses the dynamic-range problem with two design choices relative to the standard MXFP4 microscaling format:

Smaller blocks (16 vs 32 elements). Smaller blocks mean a narrower dynamic range within each group, so outliers affect fewer neighbours.
More precise per-block scale (E4M3 vs UE8M0). An E4M3 scale has 4 exponent + 3 mantissa bits, giving fractional precision. UE8M0 is power-of-two only — it can lose up to a full factor of 2 from rounding.
Two-level scaling. A per-tensor FP32 scale brings values into the range representable by the per-block E4M3 scale, which then fine-tunes each block's dynamic range.
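To make the scale-precision point concrete, here is a rough sketch comparing a power-of-two block scale with one that keeps three mantissa bits. The function names are hypothetical, both scales are rounded up so the block maximum never clips (an assumption, real kernels may round differently), and E4M3's exponent limits are ignored on the assumption that the per-tensor FP32 scale has already brought values into range.

```python
import math


def pow2_scale(amax, qmax=6.0):
    """UE8M0-style scale: smallest power of two that keeps amax within qmax."""
    return 2.0 ** math.ceil(math.log2(amax / qmax))


def mantissa3_scale(amax, qmax=6.0):
    """E4M3-style scale: round amax/qmax up to a value with 3 mantissa bits."""
    ideal = amax / qmax
    m, e = math.frexp(ideal)        # ideal = m * 2**e with 0.5 <= m < 1
    # 1 implicit bit + 3 mantissa bits: m is a multiple of 1/16 in [0.5, 1].
    m_up = math.ceil(m * 16) / 16
    return math.ldexp(m_up, e)


for amax in (6.1, 9.0, 12.5):
    p2 = pow2_scale(amax)
    m3 = mantissa3_scale(amax)
    # amax / scale is where the block maximum lands on the FP4 grid (ideal: 6).
    print(f"amax={amax}: pow2 maps max to {amax / p2:.2f}, "
          f"3-bit mantissa maps max to {amax / m3:.2f}")
```

When the block maximum sits just above a power-of-two boundary, the power-of-two scale leaves almost half of the FP4 range unused, while the finer scale wastes at most about one part in eight.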

The table below shows the hardware speedup on NVIDIA Blackwell Tensor Cores. On GB300, NVFP4 delivers 6× the arithmetic throughput of BF16 — the same as MXFP4, but with meaningfully better convergence.

Format	Element	Scale	Block	Speedup GB200	Speedup GB300
BF16	—	—	—	1×	1×
MXFP8	E5M2/E4M3	UE8M0	32	2×	2×
MXFP4	E2M1	UE8M0	32	4×	6×
NVFP4	E2M1	E4M3	16	4×	6×

Hardware throughput is identical to MXFP4 on Blackwell, so the advantage of NVFP4 is purely numerical. Experiments described later show that MXFP4 needs 36% more training tokens to reach the same validation loss as NVFP4, a significant efficiency gap even before counting the memory savings from 4-bit storage.

03 — Training Recipe

Four fixes that make FP4 training stable

Switching every matrix multiply to NVFP4 without any mitigations causes training to diverge on the 12B model before the first trillion tokens. NVIDIA's recipe adds four targeted techniques, described in turn below; each one addresses a distinct failure mode.

NVFP4 training methodology, step by step

Mixed Precision — protect sensitive layers

Keep the final ~15% of transformer blocks (FFN + Mamba-2 tails) in BF16. These layers require wider dynamic range. Training diverges if every layer is quantized to FP4. The first two blocks are also left in BF16 (16% total for the 12B model).

layers 0–1 → BF16 | layers 2–53 → NVFP4 | layers 54–61 → BF16
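The layer split above can be written down as a small configuration helper. This is only an illustrative sketch: `precision_plan` and its parameters are hypothetical names, and the defaults (62 blocks, 2 at the start and 8 at the end in BF16) are read off the layer ranges listed above.

```python
def precision_plan(num_blocks=62, first_bf16=2, last_bf16=8):
    """Return a per-block precision list: BF16 at both ends, NVFP4 in between."""
    plan = []
    for i in range(num_blocks):
        if i < first_bf16 or i >= num_blocks - last_bf16:
            plan.append("BF16")
        else:
            plan.append("NVFP4")
    return plan


plan = precision_plan()
bf16_fraction = plan.count("BF16") / len(plan)
print(f"BF16 blocks: {plan.count('BF16')}/{len(plan)} ({bf16_fraction:.0%})")
```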

Random Hadamard Transform — smooth weight-gradient outliers

Before quantizing the two inputs to the weight-gradient GEMM (Wgrad), multiply each by a random 16×16 Hadamard matrix. A single outlier spreads its energy across all 16 elements — no single value dominates the scale, so the whole block is representable in FP4.

Wgrad_input ← H · x, where H is randomised 16×16 Hadamard

2D Block Scaling — keep the chain rule intact for weights

Standard 1D scaling (1×16 blocks along rows) gives a different quantized weight in the backward pass because the matrix is transposed. 2D scaling uses 16×16 blocks — the same scale covers both forward and backward access patterns, so W_fprop = W_bprop.

W: 16×16 block scale | activations/gradients: 1×16 block scale

Stochastic Rounding — remove gradient quantization bias

Deterministic round-to-nearest can systematically round small gradient values in the same direction, accumulating bias. Stochastic rounding instead rounds each value up or down at random, choosing the nearer neighbour with proportionally higher probability, so the result is unbiased in expectation. It is applied only to gradient tensors (not weights or activations).

P(round up) = (x − floor_fp4(x)) / fp4_step_size
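A minimal sketch of the rounding rule above on the FP4 grid, assuming the input has already been divided by its block scale so that it lies in [−6, 6]. The function name `stochastic_round_fp4` is hypothetical, and Python's `random` module stands in for whatever random source a real kernel would use.

```python
import random

# Magnitudes representable by FP4 E2M1 (the sign is stored separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round_fp4(v, rng):
    """Round v (already scaled into [-6, 6]) to a neighbouring FP4 value at random."""
    mag = min(abs(v), 6.0)
    # The two grid points that bracket the magnitude.
    lo = max(g for g in FP4_MAGNITUDES if g <= mag)
    hi = min(g for g in FP4_MAGNITUDES if g >= mag)
    if hi == lo:
        rounded = lo                        # already exactly representable
    else:
        p_up = (mag - lo) / (hi - lo)       # P(round up), as in the formula above
        rounded = hi if rng.random() < p_up else lo
    return rounded if v >= 0 else -rounded


rng = random.Random(0)
x = 0.2                                     # sits between FP4 values 0 and 0.5
samples = [stochastic_round_fp4(x, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(f"round-to-nearest would give 0.0; stochastic mean = {mean:.3f}")
```

Round-to-nearest maps 0.2 to 0.5 every time on this grid only if it is closer to 0.5, which it is not, so the value would always become 0; averaged over many draws, the stochastic version recovers the original value instead.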


The ablation study in the paper removed one technique at a time starting from a checkpoint trained on 3.43T tokens. Every removal worsened convergence: without stochastic rounding the loss rose by ~1.5%, without the Hadamard transform by ~1.2%, and without 2D scaling by ~0.8%. No single technique is redundant.

04 — 2D Weight Scaling

Why the chain rule breaks with 1D quantization

This is the paper's most subtle algorithmic insight. During the forward pass, weight W is read row by row for the output projection. During the backward pass, the same matrix is accessed column by column (it's transposed for the activation-gradient GEMM). If quantization uses 1D blocks along a fixed dimension, the same physical weight W[i][j] gets a different scale factor — and therefore a different quantized value — depending on whether you're going forward or backward.

Why consistency matters

Forward pass: $y = W_{\text{fprop}}\,x$
Backward pass: $\partial x = W_{\text{bprop}}^{T}\,\partial y$
With 1D scaling: $W_{\text{fprop}} \neq W_{\text{bprop}}$, so the gradient is computing $\partial/\partial W$ of a different function than the forward pass used.
With 2D block scaling: $W_{\text{fprop}} = W_{\text{bprop}}$, so the chain rule holds correctly.

The interactive grid below shows a concrete 4×4 weight matrix. In Standard 1D mode each row uses its own maximum for scaling. Toggle to Backward pass to see how the same matrix, now accessed column-by-column, assigns completely different scales to most cells — the red cells are those where the quantized value would differ between forward and backward. With NVFP4 2D Scaling (2×2 blocks in this toy example), the same block scale covers both access patterns: every cell stays green.

Figure: Weight matrix W, quantization consistency. Green cells are consistent (same quantized value in forward and backward); red cells are inconsistent (chain rule broken). In Standard 1D mode, the scale is the maximum of each row.

With 1D scaling, 13 of 16 cells are quantized inconsistently between forward and backward passes. The gradient computed by the backward pass is not the gradient of the loss with respect to the actual forward-pass function — the chain rule is silently broken. At small scales this introduces tolerable noise, but over 10 trillion tokens it compounds into measurable loss degradation.
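The mismatch is easy to reproduce on a toy matrix. The sketch below is an illustration under stated assumptions, not the paper's kernel: the 4×4 values are made up, the helpers (`quantize`, `transpose`, `mismatches`) are hypothetical, 1D blocks span a full row of 4, and 2D blocks are 2×2 as in the toy example above. The backward pass is modelled by quantizing the transposed matrix and transposing back.

```python
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_fp4(v):
    """Nearest FP4 E2M1 value (ties go to the smaller magnitude)."""
    mag = min(FP4_MAGNITUDES, key=lambda g: abs(abs(v) - g))
    return mag if v >= 0 else -mag


def quantize(W, block_rows, block_cols):
    """Fake-quantize W with one shared scale per block_rows x block_cols tile."""
    n_rows, n_cols = len(W), len(W[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block_rows, n_rows))
                     for c in range(c0, min(c0 + block_cols, n_cols))]
            amax = max(abs(W[r][c]) for r, c in cells)
            scale = amax / 6.0 if amax > 0 else 1.0
            for r, c in cells:
                out[r][c] = round_to_fp4(W[r][c] / scale) * scale
    return out


def transpose(M):
    return [list(col) for col in zip(*M)]


def mismatches(A, B):
    """Number of cells whose quantized values differ."""
    return sum(1 for ra, rb in zip(A, B) for a, b in zip(ra, rb) if a != b)


# Illustrative weights (an assumption, not the matrix from the figure).
W = [[0.9, -0.2, 3.1, 0.4],
     [-1.7, 0.3, 0.6, -0.1],
     [0.2, 5.5, -0.8, 1.2],
     [0.7, -0.4, 0.1, -2.6]]

# 1D: the forward pass scales along rows of W, the backward pass along rows of W^T.
fwd_1d = quantize(W, 1, 4)
bwd_1d = transpose(quantize(transpose(W), 1, 4))

# 2D: a 2x2 tile covers the same cells whether or not the matrix is transposed.
fwd_2d = quantize(W, 2, 2)
bwd_2d = transpose(quantize(transpose(W), 2, 2))

print("1D mismatched cells:", mismatches(fwd_1d, bwd_1d))
print("2D mismatched cells:", mismatches(fwd_2d, bwd_2d))
```

Because a square tile maps onto itself under transposition, the 2D variant sees the same maximum and therefore the same scale in both passes, which is the whole point of the technique.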

05 — Hadamard Outlier Smoothing

Spreading one big value across 16 small ones

The Random Hadamard Transform (RHT) is applied to both inputs of the weight-gradient GEMM before quantization. A Hadamard matrix H is an orthogonal matrix whose entries are all ±1/√N — it preserves the L2 norm of any vector while redistributing its energy.

The key property: if one element dominates (say, value 15.8 versus 15 values near 0.2), after multiplying by H₁₆ each of the 16 output elements picks up roughly ±15.8/4 ≈ ±3.95 from that outlier, plus small contributions from the rest. The result is 16 values all near ±4 — well within FP4's representable range of ±6. No single element steals the scale factor.

Because H is orthogonal (H^TH = I), the transform cancels when both GEMM operands are transformed: (Hx)^T(Hy) = x^Ty. The dot-product result is unchanged, so the mathematical computation is exact — only the quantization grid alignment improves.
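Both properties, outlier spreading and dot-product preservation, can be checked numerically. This is a sketch under assumptions: the Hadamard matrix is built with the Sylvester construction and randomized by a random ±1 sign per column, which is one common way to build a random Hadamard transform and not necessarily the exact scheme used in the paper. All helper names are hypothetical.

```python
import math
import random


def hadamard(n):
    """Normalized n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]


def random_hadamard(n, rng):
    """Randomize by flipping the sign of each column; the result stays orthogonal."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return [[v * sgn for v, sgn in zip(row, signs)] for row in hadamard(n)]


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


rng = random.Random(0)
H = random_hadamard(16, rng)

x = [0.2] * 16
x[5] = 15.8                          # one outlier among small values
y = [rng.uniform(-1, 1) for _ in range(16)]

hx, hy = matvec(H, x), matvec(H, y)
print("max |x|  =", max(abs(v) for v in x))
print("max |Hx| =", round(max(abs(v) for v in hx), 3))
print("x.y - Hx.Hy =", dot(x, y) - dot(hx, hy))   # zero up to float error
```

The outlier contributes ±15.8/4 to every output element, so the largest transformed magnitude drops well below the original 15.8, while the inner product of the two transformed vectors matches the original up to floating-point error.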

Figure: Random Hadamard Transform, outlier redistribution (x-axis: block element index; panel shown: before transform)

The paper uses a 16×16 Hadamard matrix (d=16), applied to Wgrad inputs only. Larger matrices distribute energy more uniformly but add compute overhead. Experiments showed no measurable gain beyond d=16 for the 12B model, and d=4 was insufficient — larger models have more structured outliers that require wider mixing.

⚠

RHT is applied only to Wgrad inputs, not to W directly. Applying it to W would break the 2D scaling consistency (the transform changes which values share a block), so RHT and 2D scaling target different GEMMs: Hadamard handles Wgrad, 2D scaling handles Fprop and Dgrad through consistent weight quantization.

06 — Results

10 trillion tokens, loss gap under 1.5%

The paper trains a 12B-parameter hybrid Mamba-Transformer (Nemotron-H architecture) on 10 trillion tokens — the longest publicly documented 4-bit pretraining run. The FP8 baseline follows the same training schedule and data mixture. Evaluations are run in BF16 after training.

The NVFP4 validation loss tracks FP8 within 1% during the stable phase and widens to ~1.5% during learning-rate decay. Downstream task accuracy is nearly indistinguishable:

Task	FP8	NVFP4	Delta
General
MMLU (5-shot)	77.36	76.57	−0.79
MMLU-Pro (5-shot)	62.62	62.58	−0.04
AGIEval English CoT	67.01	70.31	+3.30
Math
GSM8k CoT	89.08	92.27	+3.19
MATH	83.32	81.48	−1.84
Multilingual
Global MMLU	74.00	74.94	+0.94
MGSM	81.87	85.53	+3.66
Code
HumanEval+	59.93	57.43	−2.50
MBPP+	59.11	55.91	−3.20
Commonsense
ARC Challenge	91.81	91.81	0.00
HellaSwag	83.83	83.09	−0.74
PIQA	82.64	82.70	+0.06

The code benchmarks (HumanEval+, MBPP+) show the largest degradation, but the paper notes that these evaluations are noisy and that an earlier checkpoint matches FP8 on coding tasks. In most other domains the difference is within noise, or NVFP4 is slightly ahead. Coding performance under NVFP4 remains an open area for future work.

NVFP4 vs MXFP4: token efficiency

A separate 8B model experiment compares NVFP4 and MXFP4 directly against a BF16 reference. MXFP4's loss gap is ~2.5% vs BF16 while NVFP4's is ~1.5%. More concretely, MXFP4 needs to train on more tokens to close the gap:

Tokens needed to reach the same validation loss as NVFP4 at 1T tokens

Format	Tokens
NVFP4 (baseline)	1.00T
MXFP4 (to match)	1.36T (+36%)

🔬

Key takeaway: MXFP4 requires training on 36% more tokens to match the validation loss of NVFP4 trained on 1T tokens. On GB300 hardware, where both formats run at 6× BF16 throughput, that translates directly into compute: MXFP4 needs about 36% more training compute for the same loss, or equivalently NVFP4 needs about 26% less (1/1.36 ≈ 0.74). Future work includes quantizing attention layers and extending to mixture-of-experts architectures.