Pretraining is a Club Most Can't Afford to Join
The dominant pretraining recipe, autoregressive training on internet-scale raw text, works, but it concentrates foundational AI research inside compute-rich organizations. OLMo3 7B consumes 252× the training FLOPs of HRM-Text. Qwen3.5 2B is trained on 36 trillion tokens, 900× the 40 billion tokens HRM-Text uses. For academic groups and small labs, this is an insurmountable wall.
HRM-Text asks: what if you co-designed the architecture and the objective together, specifically for the data- and compute-limited regime? Can smarter architecture and smarter training signal make each token worth far more?
HRM-Text sits in the same performance band as models trained on 100–900× more tokens, yet at the far low end of the training-token axis. Three ingredients make this possible: a hierarchical recurrent architecture, MagicNorm stabilization, and response-only PrefixLM training.
Slow Strategy, Fast Execution
Standard Transformers process a sequence in one forward pass. The Hierarchical Recurrent Model (HRM) runs the same parameters multiple times in two nested loops, inspired by the frontoparietal loop in the brain: a fast layer for execution and a slow layer for strategic context.
HRM uses 2 outer H-cycles, each with 3 fast L-module steps followed by 1 slow H-module step, for 2 × (3 + 1) = 8 total forward steps through the same recurrent core. This is equivalent to 4× the compute of a single-pass Transformer, without 4× the parameters. This is depth through time, not through width.
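The nested loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `l_module(z_l, z_h, x)` and `h_module(z_h, z_l)` signatures, and the toy counter modules are all assumptions made for the example. Only the 2 × (3 + 1) schedule comes from the text.

```python
def hrm_schedule(h_cycles=2, l_steps=3):
    """Return the order of module calls as a list of 'L' and 'H' labels."""
    schedule = []
    for _ in range(h_cycles):
        schedule.extend(["L"] * l_steps)  # fast execution steps
        schedule.append("H")              # one slow strategic update
    return schedule


def run_hrm(z_l, z_h, x, l_module, h_module, h_cycles=2, l_steps=3):
    """Apply the same two modules repeatedly (hypothetical signatures)."""
    for _ in range(h_cycles):
        for _ in range(l_steps):
            # The fast state sees the input and the current slow state.
            z_l = l_module(z_l, z_h, x)
        # The slow state is updated once per cycle from the fast state.
        z_h = h_module(z_h, z_l)
    return z_l, z_h


# Toy modules that only count how often they are called.
count_l = lambda z_l, z_h, x: z_l + 1
count_h = lambda z_h, z_l: z_h + 1

schedule = hrm_schedule()
l_calls, h_calls = run_hrm(0, 0, None, count_l, count_h)
print(schedule, l_calls, h_calls)
```

The toy run makes the step count concrete: six L-module calls and two H-module calls, all reusing the same parameters.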
Making Deep Recurrence Trainable
PostNorm keeps activations bounded but blocks gradients in deep networks. PreNorm lets gradients flow but allows residual variance to grow unboundedly. In a recurrent model applied N times, both problems compound multiplicatively. Neither option survives deep recurrence at language-model scale.
MagicNorm exploits the asymmetry between forward and backward horizons under truncated BPTT. Each module has internal PreNorm blocks capped with a final normalization at its exit. In the forward pass, the exit norm is applied at each of the N = 8 steps, so activation variance stays bounded. In the backward pass, truncated to K ≤ 5 steps, the PreNorm identity path dominates, so optimization stays stable.
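The forward half of this argument can be checked with a toy experiment. The sketch below is not the MagicNorm implementation: the cyclic-shift sublayer, the two-block module, and the use of RMS normalization are assumptions chosen so the example runs with the standard library alone. It only shows that PreNorm residual blocks let the activation scale grow across 8 recurrent steps, while an exit norm pins it; it does not model the truncated backward pass.

```python
import math


def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, eps=1e-8):
    """Rescale a vector to (approximately) unit RMS."""
    scale = rms(x) + eps
    return [v / scale for v in x]


def toy_sublayer(x):
    # Hypothetical stand-in for attention/MLP: a cyclic shift.
    return x[-1:] + x[:-1]


def prenorm_block(x):
    # PreNorm residual: the identity path plus sublayer(norm(x)).
    update = toy_sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]


def module(x, n_blocks=2, exit_norm=True):
    # Internal PreNorm blocks, optionally capped by a final normalization.
    for _ in range(n_blocks):
        x = prenorm_block(x)
    return rms_norm(x) if exit_norm else x


def run(x, n_steps=8, exit_norm=True):
    """Apply the same module n_steps times and record the RMS after each."""
    history = []
    for _ in range(n_steps):
        x = module(x, exit_norm=exit_norm)
        history.append(rms(x))
    return history


x0 = [1.0, 2.0, 3.0, 4.0]
bounded = run(x0, exit_norm=True)
unbounded = run(x0, exit_norm=False)
print("with exit norm:   ", [round(r, 3) for r in bounded])
print("without exit norm:", [round(r, 3) for r in unbounded])
```

With the exit norm, the recorded RMS stays at 1 after every step. Without it, each PreNorm block adds a roughly unit-scale update to the residual stream, so the scale keeps climbing as the module is reapplied.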
Every Gradient Should Count
Standard pretraining computes loss over every token — including prompt tokens the model will never need to generate. HRM-Text applies a task-completion objective: compute loss exclusively on response tokens, conditioned on the instruction. The PrefixLM mask lets instruction tokens attend bidirectionally, giving the model an encoder-like view of the full prompt.
Example: the prompt “What is the capital of France?” (5 tokens) is followed by the response “Paris” (1 token). HRM-Text computes loss only on “Paris”.
The empirical payoff: attention entropy increases, response-token NLL decreases — more informative gradients per training token.
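The masking and loss described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's code: the function names are hypothetical, the mask follows the common PrefixLM convention (prefix positions see the whole prefix, response positions see everything before them), and the per-token NLL values are made-up numbers. The 5 + 1 token split comes from the example above.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len, total_len):
    """1 for response tokens (counted in the loss), 0 for prompt tokens."""
    return [0 if i < prefix_len else 1 for i in range(total_len)]


def response_only_nll(token_nll, loss_mask):
    """Average negative log-likelihood over response tokens only."""
    kept = [n for n, m in zip(token_nll, loss_mask) if m]
    return sum(kept) / len(kept)


# 5 prompt tokens followed by 1 response token, as in the example.
prefix_len, total_len = 5, 6
mask = prefix_lm_mask(prefix_len, total_len)
loss_mask = response_loss_mask(prefix_len, total_len)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(loss_mask)

# Made-up per-token NLLs: only the last one contributes to the loss.
print(response_only_nll([9.0, 9.0, 9.0, 9.0, 9.0, 0.5], loss_mask))
```

The printed mask shows a fully connected 5 × 5 block for the prompt, with the response row attending to all six positions. No prompt position can see the response token.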
Three Ingredients, Each One Matters
An incremental FLOPs-matched ablation across all three axes shows each ingredient contributes meaningfully.
Task-completion alone: +21 GSM8K. PrefixLM adds +21 DROP. HRM adds +7–9 across reasoning benchmarks. The combination is multiplicative.
Competitive Performance at 100–900× Lower Cost
HRM-Text achieves near-competitive performance on reasoning-heavy benchmarks (MATH, GSM8K, DROP, ARC-C) while trailing on broad factual-knowledge benchmarks (MMLU) — expected, since factual coverage scales more with data breadth than reasoning depth.
[Figure panels: MMLU (general knowledge); GSM8K (math reasoning)]
Full benchmark comparison
| Model | FLOPs | Tokens | MMLU | ARC-C | DROP | GSM8K | MATH |
|---|---|---|---|---|---|---|---|
| HRM-Text 1B | 1× | 40B | 60.7 | 81.9 | 82.2 | 84.5 | 56.2 |
| Huginn 3.5B | 127× | 0.8T | 31.4 | 38.2 | 17.8 | 34.6 | 12.6 |
| OLMo3 7B | 252× | 6T | 65.8 | 81.6 | 71.5 | 75.5 | 40.0 |
| Llama3.2 3B | 162× | 9T | 58.0 | 69.1 | 45.2 | 77.7 | 48.0 |
| Gemma3 4B | 96× | 4T | 59.6 | 56.2 | 60.1 | 38.4 | 24.2 |
| Qwen3.5 2B | 432× | 36T | 64.5 | 81.0 | 30.8 | 53.0 | 34.2 |
HRM-Text outperforms Huginn 3.5B on every benchmark while using 127× fewer FLOPs and 20× fewer tokens. This is an existence proof: there is at least one point in architecture+objective space where the compute-to-performance ratio is radically better. Exploring that space is now accessible to anyone with two 8-GPU nodes and $1,500.
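As a quick sanity check on the token multiples quoted in this article, the ratios can be recomputed from the Tokens column of the table. The snippet is only arithmetic on the table's figures; the variable names are arbitrary.

```python
# Training tokens in billions, copied from the benchmark table.
HRM_TEXT_TOKENS_B = 40
baseline_tokens_b = {
    "Huginn 3.5B": 800,
    "OLMo3 7B": 6_000,
    "Llama3.2 3B": 9_000,
    "Gemma3 4B": 4_000,
    "Qwen3.5 2B": 36_000,
}

# How many times more tokens each baseline used than HRM-Text.
token_ratio = {
    name: tokens // HRM_TEXT_TOKENS_B for name, tokens in baseline_tokens_b.items()
}

for name, ratio in token_ratio.items():
    print(f"{name}: {ratio}x the tokens of HRM-Text")
```

The results range from 20× for Huginn 3.5B to 900× for Qwen3.5 2B, with Gemma3 4B at 100×.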