
MFU: The Number That Tells You How Hard Your GPUs Are Actually Working

Model FLOPs Utilization (MFU) answers a deceptively simple question: of the 312 trillion floating-point operations your A100 can theoretically do per second, how many are you actually using to train your model? The answer — typically 30–70% for a well-run training job — packs an enormous amount of information about your hardware setup, your parallelism strategy, and your kernel efficiency.

June 1, 2026 · Concept: MFU / HFU
01 — What Is MFU?

Your GPU's report card

You just paid $2/hr for an A100. Its spec sheet says 312 TFLOPS (BF16, no sparsity), but your training loop is actually computing at 94 TFLOPS. That is 30% utilization; the other 70% is wasted compute, and that gap is what MFU exposes. Model FLOPs Utilization is simply:

Definition
MFU = (tokens/sec × FLOPs per token) / peak hardware FLOPs

The numerator is how many floating-point operations your model actually needs per second. The denominator is how many the GPU could do. The ratio tells you what fraction of theoretical peak you are exploiting.

Critically, MFU uses the model's theoretical minimum FLOPs — the math the model absolutely must do — not what the hardware actually executed. So MFU can only approach, never exceed, 100%. A value near 100% would mean every transistor is doing useful arithmetic all the time, with zero overhead for memory transfers, communication, or kernel launch latency.
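
As a minimal sketch of the definition, the function below computes MFU from throughput, FLOPs per token, and peak hardware FLOPs. The name `mfu` and the throughput figures are illustrative assumptions: 2,000 tokens/s at 47 GFLOPs per token is just one hypothetical pairing that multiplies out to the 94 TFLOPS of the A100 example.

```python
A100_PEAK_BF16 = 312e12  # FLOPs/s, dense BF16, the spec-sheet figure quoted above


def mfu(tokens_per_sec: float, flops_per_token: float, peak_flops: float) -> float:
    """Return model FLOPs utilization as a fraction between 0 and 1."""
    achieved_flops = tokens_per_sec * flops_per_token  # required model math per second
    return achieved_flops / peak_flops


# Hypothetical throughput, chosen so the achieved compute is 94 TFLOPS.
example = mfu(tokens_per_sec=2_000, flops_per_token=47e9, peak_flops=A100_PEAK_BF16)
print(f"MFU = {example:.1%}")
```

Keeping the result as a fraction rather than a percentage makes it easy to reuse in later arithmetic; formatting is left to the caller.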

Origin. MFU was introduced by the PaLM team (Chowdhery et al., 2022) and popularized in the nanoGPT codebase by Karpathy. It has become the standard efficiency metric for LLM training runs.

The efficiency spectrum

Click any labeled system below to see where it falls. The contrast between the original GPT-3 training run (~21%, on V100s) and a modern FlashAttention-2 run (~72%, on A100s) captures three years of systems engineering progress.

[Interactive figure: MFU efficiency spectrum; click a system to select]

An MFU above 50% is generally considered excellent. Below 30% usually signals that something is wrong: memory bottlenecks, excessive communication, or fragmented kernel launches. Between 30% and 50% is typical for a production training run that combines data parallelism with pipeline and tensor parallelism.

02 — FLOPs in a Transformer

Where does all the math come from?

Before you can calculate MFU, you need to know how many FLOPs training actually requires per token. For a decoder-only transformer trained with the Adam optimizer, the canonical formula (from PaLM Appendix B, implemented in nanoGPT) is:

FLOPs per token (forward + backward, no checkpointing)
C = 6N + 12 · L · d · T
N = parameters, L = layers, d = d_model, T = sequence length

The 6N term dominates for large models with short sequences. It breaks down as: 2N for the forward pass (each parameter participates in one multiply-add per token) plus 4N for the backward pass (roughly 2× the forward cost, for both weight gradients and activation gradients).

The 12·L·d·T term captures attention score computation: the Q·Kᵀ and softmax·V operations that scale with sequence length. For a 7B model (L = 32, d = 4096) at T=2048, this is only ~7% of total FLOPs. But push the sequence to T=8192 and it climbs to ~23%.
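
The two terms of the formula can be evaluated directly to see how the attention share grows with sequence length. This is a sketch under stated assumptions: the helper name `flops_per_token_terms` is hypothetical, and the 7B configuration (N = 7×10⁹, 32 layers, d_model = 4096) is a round-number stand-in rather than any specific model's exact parameter count.

```python
def flops_per_token_terms(n_params: float, n_layers: int, d_model: int, seq_len: int):
    """Return (dense, attention) training FLOPs per token for C = 6N + 12*L*d*T."""
    dense = 6 * n_params                               # 2N forward + 4N backward
    attention = 12 * n_layers * d_model * seq_len      # Q.K^T and softmax.V, fwd + bwd
    return dense, attention


attention_share = {}
for seq_len in (2048, 8192):
    dense, attention = flops_per_token_terms(7e9, 32, 4096, seq_len)
    attention_share[seq_len] = attention / (dense + attention)
    print(f"T={seq_len}: total={dense + attention:.3e} FLOPs/token, "
          f"attention share={attention_share[seq_len]:.1%}")
```

Returning the two terms separately, instead of only their sum, makes it easy to check when the 6N shortcut is safe for a given sequence length.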

[Interactive figure: proportion of FLOPs per component; click to change model / sequence length]

The 6N approximation

For most practical settings (large models, sequence lengths up to a few thousand tokens), ignoring the attention term introduces <10% error and is the standard shortcut. The rule becomes:

  • Training: ~6N FLOPs per token (2 forward + 4 backward)
  • Inference: ~2N FLOPs per token (forward pass only)
Quick sanity check. A 7B model trained for 1 trillion tokens on a single A100 (312 TFLOPS) with 50% MFU would take: 6×7×10⁹ × 10¹² / (312×10¹² × 0.5) ≈ 2.69×10⁸ seconds ≈ 74,800 A100-hours, or about 8.5 years on one GPU. Real Llama 2 7B used ~184,320 A100-hours for 2T tokens, which is consistent with a fleet of GPUs at roughly 40% MFU.
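
The sanity check is easy to reproduce with the 6N rule. The sketch below estimates GPU-hours for a given MFU and, going the other way, the MFU implied by a reported GPU-hour budget. Both function names are hypothetical, and N = 7×10⁹ is a round-number assumption for a "7B" model.

```python
A100_PEAK_BF16 = 312e12  # FLOPs/s


def training_gpu_hours(n_params: float, n_tokens: float, peak_flops: float, mfu: float) -> float:
    """GPU-hours needed to train on n_tokens using the 6N FLOPs-per-token rule."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops * mfu)
    return seconds / 3600


def implied_mfu(n_params: float, n_tokens: float, gpu_hours: float, peak_flops: float) -> float:
    """MFU implied by a reported GPU-hour budget, again under the 6N rule."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (gpu_hours * 3600 * peak_flops)


hours = training_gpu_hours(7e9, 1e12, A100_PEAK_BF16, mfu=0.5)
print(f"1T tokens at 50% MFU: {hours:,.0f} A100-hours ({hours / 24 / 365:.1f} GPU-years)")

llama2_mfu = implied_mfu(7e9, 2e12, gpu_hours=184_320, peak_flops=A100_PEAK_BF16)
print(f"Implied MFU for 184,320 A100-hours on 2T tokens: {llama2_mfu:.1%}")
```

Because the attention term is ignored here, the implied MFU is a slight underestimate of what the full formula would give.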
03 — The MFU Formula

Calculating MFU in practice

In a real training loop, you measure throughput (tokens processed per second) and convert it to an MFU. The steps are:

  • Measure: tokens processed per second across your full GPU cluster
  • Compute: FLOPs per token = 6N + 12·L·d·T (or just 6N for large models)
  • Divide: (tokens/sec × FLOPs/token) / (num_GPUs × peak_FLOPs_per_GPU)

Use the per-GPU version to get a number independent of cluster size. That way a 512-GPU job and a 4096-GPU job that are equally well-optimized report the same MFU.

Per-GPU MFU
MFU = (tokens_per_sec_total × FLOPs_per_token) / (num_GPUs × peak_FLOPs_GPU)

≡ (tokens_per_sec_per_GPU × FLOPs_per_token) / peak_FLOPs_GPU
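
The measure, compute, divide steps above can be sketched as two small helpers. The names `tokens_per_second` and `cluster_mfu` are hypothetical, and the 3,000 tokens/s per GPU figure is an illustrative assumption; the point is only that two clusters with the same per-GPU throughput report the same MFU.

```python
def tokens_per_second(tokens_per_step: int, step_seconds: float) -> float:
    """Throughput from one timed training step (global batch tokens / wall time)."""
    return tokens_per_step / step_seconds


def cluster_mfu(tokens_per_sec_total: float, flops_per_token: float,
                num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU normalized by the aggregate peak of the whole cluster."""
    return tokens_per_sec_total * flops_per_token / (num_gpus * peak_flops_per_gpu)


FLOPS_PER_TOKEN = 6 * 7e9          # 6N shortcut for a 7B model
A100_PEAK_BF16 = 312e12            # FLOPs/s per GPU
PER_GPU_TOKENS_PER_SEC = 3_000     # hypothetical per-GPU throughput

results = {}
for num_gpus in (512, 4096):
    total = PER_GPU_TOKENS_PER_SEC * num_gpus
    results[num_gpus] = cluster_mfu(total, FLOPS_PER_TOKEN, num_gpus, A100_PEAK_BF16)
    print(f"{num_gpus} GPUs: {total:,} tok/s total, MFU = {results[num_gpus]:.1%}")
```

In a real loop, the step time should be measured after the GPU work has finished (for example after a device synchronization), otherwise asynchronous kernel launches make the throughput look better than it is.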

Interactive calculator

Select your hardware and model, then drag the throughput slider to see what MFU you'd achieve. The color bar gives an instant quality signal.

[Interactive calculator: live MFU estimate. Inputs: throughput (default 4,000 tok/s) and GPU peak (default 312 TFLOPS). Outputs: FLOPs per token, required TFLOPS per GPU, and MFU.]

A worked example

Training a 7B model (32 layers, d_model = 4096) on a single A100 at 4,000 tokens/sec per GPU, sequence length 2048:

  • FLOPs/token ≈ 6 × 7×10⁹ + 12 × 32 × 4096 × 2048 ≈ 45.2 billion
  • Required TFLOPS = 4,000 × 45.2×10⁹ ≈ 180.8 TFLOPS
  • MFU = 180.8 / 312 ≈ 57.9% — excellent!
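
The same arithmetic can be run in reverse to ask what throughput a target MFU requires. This is a sketch with a hypothetical helper name, `tokens_per_sec_for_target`, using the worked example's configuration (N = 7×10⁹, 32 layers, d_model = 4096, T = 2048) and the A100's 312 TFLOPS peak. Because the code keeps full precision instead of rounding intermediate values, its MFU can differ from the hand calculation in the last digit.

```python
A100_PEAK_BF16 = 312e12  # FLOPs/s

# Full formula from the worked example: 6N + 12 * L * d * T
FLOPS_PER_TOKEN = 6 * 7e9 + 12 * 32 * 4096 * 2048


def tokens_per_sec_for_target(target_mfu: float, flops_per_token: float, peak_flops: float) -> float:
    """Per-GPU throughput needed to reach target_mfu (a fraction, e.g. 0.5)."""
    return target_mfu * peak_flops / flops_per_token


worked_example_mfu = 4_000 * FLOPS_PER_TOKEN / A100_PEAK_BF16
needed_for_half = tokens_per_sec_for_target(0.5, FLOPS_PER_TOKEN, A100_PEAK_BF16)

print(f"MFU at 4,000 tok/s: {worked_example_mfu:.1%}")
print(f"Throughput needed for 50% MFU: {needed_for_half:,.0f} tok/s per GPU")
```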
04 — What Kills MFU?

Why you never reach 100%

If MFU measures how much of the GPU's peak you use, what accounts for the missing 30–70%? There are five main categories of loss, roughly in order of how often they dominate in practice. Step through them below.

1. Memory bandwidth bound (small batches)
Modern GPUs have far more compute than memory bandwidth. At small batch sizes, loading weights and activations from HBM takes longer than the actual math, so the compute units sit idle waiting for data. This is the "roofline" model: below the ridge point (the ratio of peak compute to memory bandwidth), you're bandwidth-bound, not compute-bound. Fix: increase batch size until you're compute-bound.
A100: 312 TFLOPS compute, 2 TB/s memory BW → 156 FLOPs/byte "ridge point"

2. Communication overhead (multi-GPU)
Data parallelism requires an AllReduce of gradients after each backward pass. Tensor parallelism requires all-reduces of activations inside every layer, and pipeline parallelism requires point-to-point sends at every stage boundary. These communication ops do zero useful model arithmetic. On typical DGX A100 boxes, well-overlapped communication costs 5–15% of training time; poor overlap doubles that.
Tensor parallel: send (batch × d) activations every layer → bandwidth-limited

3. Kernel efficiency (unfused ops, small tiles)
Every CUDA kernel launch has overhead. Unfused operations (separate LayerNorm, dropout, bias-add) each write intermediate results back to HBM and reload them. FlashAttention fuses the full attention computation in on-chip SRAM, cutting attention's HBM traffic by roughly 5–10×. Custom CUDA kernels for common patterns (fused softmax, RMSNorm) recover several MFU percentage points.
Fused ops: stay in L2/shared memory vs unfused: round-trip to HBM per op

4. Pipeline bubbles (pipeline parallelism)
Pipeline parallelism divides the model across GPUs by layer. During the startup (forward fill) and teardown (backward flush) of each batch's pipeline, some GPUs idle. The "bubble fraction" is (p−1)/(m+p−1), where p is the number of pipeline stages and m is the number of microbatches per batch. With p=8 stages and m=32 microbatches: (8−1)/(32+8−1) = 17.9% bubble.
bubble = (p-1) / (m + p - 1) → minimize by maximizing microbatches m

5. Framework / Python overhead
PyTorch's CPU-side dispatch, logging, checkpointing, and data loading all take time during which the GPU is idle. Async data prefetch, compilation with CUDA graphs (torch.compile), and minimizing Python loops in the training step can recover 2–5% MFU in CPU-bound situations.
torch.compile() + CUDA graphs: eliminate per-op CPU overhead
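
The roofline argument in the first loss category can be made concrete with a few lines of arithmetic. This is a minimal sketch of the standard roofline bound, attainable FLOPs = min(peak, bandwidth × arithmetic intensity), using the A100 figures quoted above. The function names and the sample intensities (10, 50, and 300 FLOPs/byte) are hypothetical values picked for illustration.

```python
A100_PEAK_BF16 = 312e12   # FLOPs/s
A100_MEM_BW = 2e12        # bytes/s of HBM bandwidth


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and memory limits meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline upper bound on achieved FLOPs/s at a given arithmetic intensity."""
    return min(peak_flops, intensity * mem_bw)


ridge = ridge_point(A100_PEAK_BF16, A100_MEM_BW)
print(f"Ridge point: {ridge:.0f} FLOPs/byte")

for intensity in (10, 50, 300):  # hypothetical kernels
    bound = attainable_flops(intensity, A100_PEAK_BF16, A100_MEM_BW)
    regime = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    print(f"{intensity:>4} FLOPs/byte: at most {bound / A100_PEAK_BF16:.0%} of peak ({regime})")
```

The bound is an upper limit on utilization, not a prediction: a kernel to the right of the ridge point can still fall short of peak for the other reasons in the list.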
The big wins, in order. Most engineers find the largest MFU gains from: (1) increasing batch size to exit the memory-bandwidth-bound regime, (2) switching to FlashAttention, and (3) using bf16/fp16 throughout. Reducing pipeline bubbles and improving communication overlap are the next tier.
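
The pipeline bubble formula from the list above, (p−1)/(m+p−1), is simple enough to explore numerically. The sketch below uses a hypothetical helper, `bubble_fraction`, and sweeps the microbatch count for a fixed p = 8 to show how quickly the idle fraction shrinks. The sweep values other than m = 32 are illustrative assumptions.

```python
def bubble_fraction(pipeline_stages: int, microbatches: int) -> float:
    """Idle fraction of a pipeline schedule: (p - 1) / (m + p - 1)."""
    p, m = pipeline_stages, microbatches
    return (p - 1) / (m + p - 1)


bubbles = {}
for m in (8, 32, 128):
    bubbles[m] = bubble_fraction(pipeline_stages=8, microbatches=m)
    print(f"p=8, m={m:>3}: bubble = {bubbles[m]:.1%}")
```

More microbatches shrink the bubble but also shrink each microbatch for a fixed global batch, which can push individual kernels back toward the bandwidth-bound regime, so the two knobs have to be tuned together.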
05 — MFU vs HFU

Accounting for gradient checkpointing

MFU measures efficiency against the theoretical minimum FLOPs — the math your model must do with no checkpointing. But what if you're using gradient checkpointing (activation recomputation)? In that case, the GPU is doing more arithmetic than the theoretical minimum, even though it's being used efficiently.

HFU (Hardware FLOPs Utilization) accounts for recomputation. It uses the number of FLOPs actually executed as the numerator, including recomputed forward passes.

Relationship
HFU = MFU × recomputation_factor

Full checkpointing: factor ≈ 4/3 (forward done twice, backward once)
Selective checkpointing (attention only): factor ≈ 1.05–1.15

With full gradient checkpointing, instead of 2N (forward) + 4N (backward) = 6N FLOPs per token, you execute 2N + 2N (recomputed forward) + 4N = 8N FLOPs. So HFU = MFU × 8/6 = MFU × 1.33.
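
The 8N/6N accounting generalizes to partial recomputation if you know what fraction of the forward pass is executed a second time. This is a sketch under that assumption: the parameter `recomputed_forward_fraction` and both function names are hypothetical, and modeling selective checkpointing as a single fraction of forward FLOPs is a simplification.

```python
def recomputation_factor(recomputed_forward_fraction: float) -> float:
    """HFU/MFU ratio: (2N forward + 4N backward + recomputed forward) / 6N."""
    forward, backward = 2.0, 4.0                     # in units of N FLOPs per token
    recomputed = forward * recomputed_forward_fraction
    return (forward + backward + recomputed) / (forward + backward)


def hfu_from_mfu(mfu: float, recomputed_forward_fraction: float) -> float:
    """Hardware FLOPs utilization given MFU and the share of forward work redone."""
    return mfu * recomputation_factor(recomputed_forward_fraction)


print(f"No checkpointing:   factor = {recomputation_factor(0.0):.2f}")
print(f"Full checkpointing: factor = {recomputation_factor(1.0):.2f}")
print(f"MFU 45% with full checkpointing gives HFU = {hfu_from_mfu(0.45, 1.0):.0%}")
```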

[Interactive figure: what the hardware actually executes. MFU measures model math only; HFU counts recomputation too.]
  • No gradient checkpointing: forward pass 2N + backward pass 4N = 6N model FLOPs per token. MFU = HFU; no extra FLOPs, no gap between the two metrics.
  • Full gradient checkpointing: forward pass 2N + recomputed forward 2N + backward pass 4N = 8N executed FLOPs per token. HFU > MFU; HFU/MFU = 8N/6N ≈ 1.33.

Which metric to report?

Report both if your run uses gradient checkpointing, along with whether you used full or selective recomputation. MFU gives a lower bound on hardware utilization; HFU gives the actual hardware utilization. Neither is "better" — they answer different questions.

  • MFU: "How efficient is my training relative to the minimum possible compute?"
  • HFU: "How hard is my hardware actually working?"
Selective checkpointing (e.g., only recomputing attention, not the FFN) is a common middle ground: it saves ~40–50% of activation memory vs no checkpointing, while adding only 5–15% extra FLOPs. The HFU/MFU ratio stays around 1.05–1.15.
06 — In the Wild

Published MFU values

Here are real MFU numbers from published systems. Most were measured on A100 (BF16, 312 TFLOPS peak, no sparsity); the exceptions are PaLM, which ran on TPU v4, and the original GPT-3 run, which used V100s. Each figure is relative to the peak of its own hardware. They span a 3.4× range, from the original GPT-3 training run to a modern FlashAttention-2 optimized run.

MFU (%) by system (A100 BF16 unless noted)
  • GPT-3 (175B), original training run on V100s, 2020: 21%
  • Typical Megatron run, stock config, no FlashAttention: 35%
  • PaLM 540B (TPUv4), Chowdhery et al., 2022: 46.2%
  • Megatron-LM 1T, 3072 A100s, Narayanan et al.: 52%
  • nanoGPT (GPT-2 scale), Karpathy, single A100: ~57%
  • FlashAttention-2, end-to-end training, Dao 2023: 72%

The 50% milestone. Crossing 50% MFU is often cited as the threshold for a "well-optimized" training run. The key techniques that get you there: FlashAttention, full BF16 training, large micro-batch sizes to stay compute-bound, and overlapping all-reduce with the backward pass.

Rules of thumb

| MFU Range | Signal | Common Causes |
| --- | --- | --- |
| <20% | Something is broken | Micro-batch too small, heavy Python overhead, misconfigured parallelism |
| 20–35% | Room for improvement | Missing FlashAttention, poor batch size, unoptimized communication |
| 35–55% | Good / typical production | Standard optimized config with tensor+data parallelism |
| 55–75% | Excellent | FlashAttention-2, fused kernels, large batches, single-node or tight NVLink |
| >75% | Exceptional / narrow conditions | Small models on single GPU, highly tuned CUDA kernels, minimal overhead |
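
If you log MFU during training, the table can be turned into a quick triage helper. This is only a sketch: the function name `mfu_signal` is hypothetical, and because the table does not say which band owns the shared boundaries at 35% and 55%, the code assumes they belong to the higher band.

```python
def mfu_signal(mfu_percent: float) -> str:
    """Map an MFU percentage to the rule-of-thumb signal from the table."""
    if mfu_percent < 20:
        return "Something is broken"
    if mfu_percent < 35:          # assumption: 35 itself counts as the next band
        return "Room for improvement"
    if mfu_percent < 55:          # assumption: 55 itself counts as the next band
        return "Good / typical production"
    if mfu_percent <= 75:         # the table's top band is strictly above 75
        return "Excellent"
    return "Exceptional / narrow conditions"


for value in (21, 46.2, 72):
    print(f"{value}% MFU: {mfu_signal(value)}")
```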