Your GPU's report card
You just paid $2/hr for an A100. Its spec sheet says 312 TFLOPS (BF16, no sparsity). But your training loop is actually computing at 94 TFLOPS, about 30% of peak. The other 70%, the wasted compute, is what MFU exposes. Model FLOPs Utilization is simply:
MFU = (model FLOPs per second) / (peak hardware FLOPs per second)
The numerator is how many floating-point operations your model actually needs per second. The denominator is how many the GPU could do. The ratio tells you what fraction of theoretical peak you are exploiting.
Critically, MFU uses the model's theoretical minimum FLOPs — the math the model absolutely must do — not what the hardware actually executed. So MFU can only approach, never exceed, 100%. A value near 100% would mean every transistor is doing useful arithmetic all the time, with zero overhead for memory transfers, communication, or kernel launch latency.
The efficiency spectrum
Published systems span a wide range. The contrast between the original GPT-3 training run (~21%, on V100 GPUs) and a FlashAttention-2 run on A100 (~72%) captures three years of systems engineering progress.
An MFU above 50% is generally considered excellent. Below 30% usually signals that something is wrong, such as memory bottlenecks, excessive communication, or fragmented kernel launches. Between 30% and 50% is typical for a production training run with model/pipeline/tensor parallelism. These are coarse cutoffs; the rules-of-thumb table at the end draws finer bands.
Where does all the math come from?
Before you can calculate MFU, you need to know how many FLOPs training actually requires per token. For a decoder-only transformer trained with the Adam optimizer, the canonical formula (from PaLM Appendix B, implemented in nanoGPT) is:
FLOPs per token = 6N + 12·L·d·T
where N = parameters, L = layers, d = d_model, T = sequence length
The 6N term dominates for large models with short sequences. It breaks down as: 2N for the forward pass (each parameter participates in one multiply-add per token) plus 4N for the backward pass (roughly 2× the forward cost, for both weight gradients and activation gradients).
The 12·L·d·T term captures attention score computation: the Q·Kᵀ and softmax·V operations that scale with sequence length. For a 7B model (32 layers, d_model = 4096) at T=2048, this is only ~7% of total FLOPs. But push the sequence to T=8192 and it climbs to ~23%.
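To check how the attention share moves with sequence length, the formula can be evaluated directly. This is a minimal sketch; the function names are made up for illustration, and the model shape (N = 7×10⁹, 32 layers, d_model = 4096) is an assumed 7B-class configuration rather than any specific published model.

```python
def attention_flops_per_token(n_layers: int, d_model: int, seq_len: int) -> float:
    """Attention-score term of the formula: 12 * L * d * T."""
    return 12 * n_layers * d_model * seq_len


def flops_per_token(n_params: float, n_layers: int, d_model: int, seq_len: int) -> float:
    """Training FLOPs per token: 6N + 12 * L * d * T."""
    return 6 * n_params + attention_flops_per_token(n_layers, d_model, seq_len)


def attention_share(n_params: float, n_layers: int, d_model: int, seq_len: int) -> float:
    """Fraction of per-token training FLOPs spent in the attention-score term."""
    attn = attention_flops_per_token(n_layers, d_model, seq_len)
    return attn / flops_per_token(n_params, n_layers, d_model, seq_len)


# Assumed 7B-class shape (illustrative, not a specific model).
N_PARAMS, N_LAYERS, D_MODEL = 7e9, 32, 4096

for seq_len in (2048, 8192):
    share = attention_share(N_PARAMS, N_LAYERS, D_MODEL, seq_len)
    print(f"T={seq_len}: attention is {share:.1%} of training FLOPs per token")
```

Because the 6N term is fixed while the attention term grows linearly in T, the share rises with sequence length but always stays below 100%.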
The 6N approximation
For most practical settings (large models, sequence lengths up to a few thousand tokens), ignoring the attention term introduces <10% error and is the standard shortcut. The rule becomes:
- Training: ~6N FLOPs per token (2 forward + 4 backward)
- Inference: ~2N FLOPs per token (forward pass only)
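One way to judge when the shortcut is safe is to solve for the sequence length at which the dropped attention term reaches a chosen share of the total. Setting 12·L·d·T / (6N + 12·L·d·T) = ε and solving gives T = ε·N / (2·(1 − ε)·L·d). The sketch below assumes that definition of "error" (the fraction of true FLOPs the 6N rule leaves out); `max_seq_len_for_error` is a hypothetical helper, and the 7B-class shape is an assumption.

```python
def max_seq_len_for_error(n_params: float, n_layers: int, d_model: int,
                          max_error: float) -> float:
    """Sequence length at which the 6N rule under-counts FLOPs by `max_error`.

    Solves 12*L*d*T / (6*N + 12*L*d*T) = max_error for T.
    """
    if not 0.0 < max_error < 1.0:
        raise ValueError("max_error must be strictly between 0 and 1")
    return n_params * max_error / (2 * (1 - max_error) * n_layers * d_model)


# Assumed 7B-class shape (illustrative): N = 7e9, L = 32, d_model = 4096.
threshold = max_seq_len_for_error(7e9, 32, 4096, max_error=0.10)
print(f"6N stays within 10% up to roughly T = {threshold:.0f} tokens")
```

For this assumed shape the threshold lands at a few thousand tokens, and it scales with N / (L·d), so wider and deeper models tolerate longer sequences before the attention term matters.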
Calculating MFU in practice
In a real training loop, you measure throughput (tokens processed per second) and convert it to an MFU. The steps are:
- Measure: tokens processed per second across your full GPU cluster
- Compute: FLOPs per token = 6N + 12·L·d·T (or just 6N for large models)
- Divide: (tokens/sec × FLOPs/token) / (num_GPUs × peak_FLOPs_per_GPU)
Use the per-GPU version to get a number independent of cluster size. That way a 512-GPU job and a 4096-GPU job that are equally well-optimized report the same MFU.
MFU = (tokens_per_sec_per_GPU × FLOPs_per_token) / peak_FLOPs_GPU
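The two forms of the formula can be sketched as below to confirm that they agree and that cluster size cancels out. The function names are illustrative, and the throughput and model size are made-up inputs, not measurements.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense BF16 peak quoted in the text


def mfu_cluster(tokens_per_sec: float, flops_per_token: float,
                num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Cluster form: achieved model FLOPs/sec over aggregate peak FLOPs/sec."""
    return (tokens_per_sec * flops_per_token) / (num_gpus * peak_flops_per_gpu)


def mfu_per_gpu(tokens_per_sec_per_gpu: float, flops_per_token: float,
                peak_flops_per_gpu: float) -> float:
    """Per-GPU form: the same ratio with num_gpus already divided out."""
    return (tokens_per_sec_per_gpu * flops_per_token) / peak_flops_per_gpu


# Illustrative inputs (assumptions): 6N shortcut for a 7B model,
# and 3,000 tokens/sec on every GPU regardless of job size.
FLOPS_PER_TOKEN = 6 * 7e9
PER_GPU_THROUGHPUT = 3_000

small_job = mfu_cluster(512 * PER_GPU_THROUGHPUT, FLOPS_PER_TOKEN, 512, A100_BF16_PEAK_FLOPS)
large_job = mfu_cluster(4096 * PER_GPU_THROUGHPUT, FLOPS_PER_TOKEN, 4096, A100_BF16_PEAK_FLOPS)
per_gpu = mfu_per_gpu(PER_GPU_THROUGHPUT, FLOPS_PER_TOKEN, A100_BF16_PEAK_FLOPS)

print(f"512 GPUs: {small_job:.1%}, 4096 GPUs: {large_job:.1%}, per-GPU form: {per_gpu:.1%}")
```

In practice the larger job would usually see lower per-GPU throughput because of communication overhead, which is exactly what a drop in MFU would reveal.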
A worked example
Training a 7B model (32 layers, d_model = 4096) on A100s at 4,000 tokens/sec per GPU, sequence length 2048:
- FLOPs/token ≈ 6 × 7×10⁹ + 12 × 32 × 4096 × 2048 ≈ 45.2 billion
- Achieved model FLOPs/sec per GPU = 4,000 × 45.2×10⁹ ≈ 180.8×10¹² = 180.8 TFLOPS
- MFU = 180.8 / 312 ≈ 57.9%, which is excellent
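The same arithmetic can be run in reverse to ask what throughput a target MFU implies. This is a sketch under the worked example's assumptions (7B parameters, 32 layers, d_model = 4096, T = 2048, A100 at 312 TFLOPS); `required_tokens_per_sec_per_gpu` is a hypothetical helper name.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense BF16 peak quoted in the text

# FLOPs per token for the worked example: 6N + 12 * L * d * T.
FLOPS_PER_TOKEN = 6 * 7e9 + 12 * 32 * 4096 * 2048


def required_tokens_per_sec_per_gpu(target_mfu: float, flops_per_token: float,
                                    peak_flops_per_gpu: float) -> float:
    """Per-GPU throughput needed to hit `target_mfu` (a fraction in (0, 1])."""
    if not 0.0 < target_mfu <= 1.0:
        raise ValueError("target_mfu must be in (0, 1]")
    return target_mfu * peak_flops_per_gpu / flops_per_token


for target in (0.30, 0.50, 0.70):
    needed = required_tokens_per_sec_per_gpu(target, FLOPS_PER_TOKEN, A100_BF16_PEAK_FLOPS)
    print(f"{target:.0%} MFU needs about {needed:,.0f} tokens/sec per GPU")

# Forward check: 4,000 tokens/sec per GPU, as in the worked example.
example_mfu = 4_000 * FLOPS_PER_TOKEN / A100_BF16_PEAK_FLOPS
print(f"4,000 tokens/sec per GPU corresponds to {example_mfu:.1%} MFU")
```

The forward check uses the unrounded FLOPs-per-token value, so its last digit can differ slightly from the hand calculation, which rounds to 45.2 billion first.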
Why you never reach 100%
If MFU measures how much of the GPU's peak you use, what accounts for the missing 30–70%? There are five main categories of loss, roughly in order of how often they dominate in practice.
Accounting for gradient checkpointing
MFU measures efficiency against the theoretical minimum FLOPs — the math your model must do with no checkpointing. But what if you're using gradient checkpointing (activation recomputation)? In that case, the GPU is doing more arithmetic than the theoretical minimum, even though it's being used efficiently.
HFU (Hardware FLOPs Utilization) accounts for recomputation. It uses the number of FLOPs actually executed as the numerator, including recomputed forward passes:
HFU = MFU × recomputation factor
- Full checkpointing: factor ≈ 4/3 (forward done twice, backward once)
- Selective checkpointing (attention only): factor ≈ 1.05–1.15
With full gradient checkpointing, instead of 2N (forward) + 4N (backward) = 6N FLOPs per token, you execute 2N + 2N (recomputed forward) + 4N = 8N FLOPs. So HFU = MFU × 8/6 = MFU × 1.33.
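The conversion can be sketched as a small helper. Expressing recomputation as "the fraction of the forward pass that is recomputed" is an assumption made here for illustration; under it, full checkpointing is a fraction of 1.0, and the 1.05–1.15 range quoted for selective checkpointing corresponds to roughly 0.15 to 0.45.

```python
def recompute_factor(recomputed_forward_fraction: float) -> float:
    """Executed FLOPs divided by model FLOPs, per token.

    Model FLOPs:    2N (forward) + 4N (backward) = 6N
    Executed FLOPs: 6N + fraction * 2N (the recomputed share of the forward pass)
    """
    if not 0.0 <= recomputed_forward_fraction <= 1.0:
        raise ValueError("recomputed_forward_fraction must be in [0, 1]")
    return (6 + 2 * recomputed_forward_fraction) / 6


def hfu_from_mfu(mfu: float, recomputed_forward_fraction: float) -> float:
    """HFU = MFU x recomputation factor."""
    return mfu * recompute_factor(recomputed_forward_fraction)


# Illustrative MFU of 45% (an assumed value, not a measurement).
print(f"no checkpointing:   HFU = {hfu_from_mfu(0.45, 0.0):.1%}")
print(f"full checkpointing: HFU = {hfu_from_mfu(0.45, 1.0):.1%}")
```

With no recomputation the factor is 1 and the two metrics coincide, which is why the distinction only matters for runs that use checkpointing.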
Which metric to report?
Report both if your run uses gradient checkpointing, along with whether you used full or selective recomputation. MFU gives a lower bound on hardware utilization; HFU gives the actual hardware utilization. Neither is "better" — they answer different questions.
- MFU: "How efficient is my training relative to the minimum possible compute?"
- HFU: "How hard is my hardware actually working?"
Published MFU values
Here are real MFU numbers from published systems. The A100 entries are measured against its 312 TFLOPS BF16 peak (no sparsity). They span a 3.4× range, from the original GPT-3 training run (~21%, which predates the A100 and ran on V100 GPUs) to a FlashAttention-2 optimized run on A100 (~72%).
Rules of thumb
| MFU Range | Signal | Common Causes |
|---|---|---|
| <20% | Something is broken | Micro-batch too small, heavy Python overhead, misconfigured parallelism |
| 20–35% | Room for improvement | Missing FlashAttention, poor batch size, unoptimized communication |
| 35–55% | Good / typical production | Standard optimized config with tensor+data parallelism |
| 55–75% | Excellent | FlashAttention-2, fused kernels, large batches, single-node or tight NVLink |
| >75% | Exceptional / narrow conditions | Small models on single GPU, highly tuned CUDA kernels, minimal overhead |
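For logging or dashboards, the table's bands can be encoded as a lookup. This is a minimal sketch: `mfu_signal` is a hypothetical name, and since the table does not say which band a boundary value such as exactly 35% belongs to, the code assigns boundaries to the higher band as an arbitrary choice.

```python
from bisect import bisect_right

# Upper edges of the first four bands, as fractions; anything above the last is the top band.
_BAND_EDGES = [0.20, 0.35, 0.55, 0.75]
_BAND_LABELS = [
    "Something is broken",
    "Room for improvement",
    "Good / typical production",
    "Excellent",
    "Exceptional / narrow conditions",
]


def mfu_signal(mfu: float) -> str:
    """Map an MFU fraction in [0, 1] to the rule-of-thumb label from the table."""
    if not 0.0 <= mfu <= 1.0:
        raise ValueError("mfu must be a fraction between 0 and 1")
    # bisect_right counts how many edges are <= mfu, which is the band index.
    return _BAND_LABELS[bisect_right(_BAND_EDGES, mfu)]


for value in (0.10, 0.30, 0.45, 0.58, 0.80):
    print(f"{value:.0%}: {mfu_signal(value)}")
```

A value above 1.0 is rejected rather than labeled, since an MFU over 100% points to a mistake in the FLOPs count or the peak figure.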