
HRM-Text: $1,500 to Match Trillion-Token Baselines

The dominant belief is that you need trillions of tokens and millions of dollars to pretrain a competitive language model from scratch. HRM-Text breaks this: a 1B-parameter hierarchical recurrent model trained on 40 billion tokens for $1,500 reaches the same benchmark neighborhood as 2–7B models trained on 4–36 trillion tokens — using 96–432× less compute and 100–900× fewer tokens.

June 1, 2026 · ~12 min read · Paper: arXiv:2605.20613
01 — The Compute Divide

Pretraining Is a Club Most Can't Afford to Join

The dominant pretraining recipe, autoregressive training on internet-scale raw text, works, but it concentrates foundational AI research inside compute-rich organizations. OLMo3 7B consumes 252× the training FLOPs of HRM-Text. Qwen3.5 2B uses 36 trillion training tokens, 900× HRM-Text's 40 billion. For academic groups and small labs, this is an insurmountable wall.

HRM-Text asks: what if you co-designed the architecture and the objective together, specifically for the data- and compute-limited regime? Can smarter architecture and smarter training signal make each token worth far more?

Figure (efficiency_scatter.py): training tokens vs. average benchmark score.

HRM-Text sits in the same performance band as models trained on 100–900× more tokens — but at the leftmost edge of the x-axis. Three ingredients make this possible: hierarchical recurrent architecture, MagicNorm stabilization, and response-only PrefixLM training.

02 — Dual-Timescale HRM

Slow Strategy, Fast Execution

Standard Transformers process a sequence in one forward pass. The Hierarchical Recurrent Model (HRM) runs the same parameters multiple times in two nested loops, inspired by the frontoparietal loop in the brain: a fast layer for execution and a slow layer for strategic context.

Figure (hrm_architecture): dual-timescale recurrent design. H = slow, strategic module; L = fast, execution module.

HRM uses 2 outer H-cycles, each with 3 fast L-module steps followed by 1 slow H-module step, for 2 × (3 + 1) = 8 total forward steps through the same recurrent core. This is equivalent to 4× the compute of a single-pass Transformer, without 4× the parameters. Depth comes from repeating the same weights through time, not from adding more of them.
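The loop schedule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `hrm_forward`, the toy `l_step` and `h_step` updates, the zero-initialized states, and the way the input is injected into the fast module are all assumptions made for the example. Only the 2 × (3 + 1) = 8 step schedule comes from the text.

```python
import numpy as np

def hrm_forward(x, l_step, h_step, n_cycles=2, l_steps_per_cycle=3):
    """Run the nested fast/slow loop and count module applications."""
    z_l = np.zeros_like(x)  # fast (L) state; zero init is an assumption
    z_h = np.zeros_like(x)  # slow (H) state; zero init is an assumption
    n_steps = 0
    for _ in range(n_cycles):
        # Fast module: several execution steps under a fixed slow context.
        for _ in range(l_steps_per_cycle):
            z_l = l_step(z_l, z_h, x)
            n_steps += 1
        # Slow module: one strategic update that reads the fast state.
        z_h = h_step(z_h, z_l)
        n_steps += 1
    return z_h, n_steps

# Toy stand-ins for the two modules (hypothetical; real modules would be
# attention/MLP stacks). The same weights are reused at every step.
rng = np.random.default_rng(0)
d = 16
w_l = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d))
w_h = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d))

def l_step(z_l, z_h, x):
    return np.tanh((z_l + z_h + x) @ w_l)

def h_step(z_h, z_l):
    return np.tanh((z_h + z_l) @ w_h)

x = rng.normal(size=(4, d))
out, n_steps = hrm_forward(x, l_step, h_step)
print(out.shape, n_steps)
```

The parameter count is fixed by `w_l` and `w_h`, while the number of module applications is set only by the loop bounds, which is the sense in which the extra compute comes from reuse rather than from more weights.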

03 — MagicNorm

Making Deep Recurrence Trainable

PostNorm keeps activations bounded but blocks gradients in deep networks. PreNorm lets gradients flow but allows residual variance to grow unboundedly. In a recurrent model applied N times, both problems compound multiplicatively. Neither option survives deep recurrence at language-model scale.

Figure (magicnorm.py): forward variance vs. backward gradient health.

MagicNorm exploits the asymmetry between forward and backward horizons under truncated BPTT. Each module has internal PreNorm blocks capped with a final normalization at its exit. In the forward pass, the exit norm is applied at each of the N=8 steps, so activation variance stays bounded. In the backward pass, gradients flow through only K≤5 steps, over which the PreNorm identity path dominates, so optimization stays stable.
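The forward half of this argument can be illustrated with a toy module. The sketch below is a hedged illustration only: RMS normalization, the `tanh` layer standing in for an attention or MLP block, and the names `rms_norm`, `module`, and `rms_after` are assumptions, since the text does not specify them. It shows PreNorm residual blocks with and without a normalization at the module exit, applied 8 times with shared weights. It does not model the backward pass or truncated BPTT.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Normalize each row to (approximately) unit root-mean-square."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def module(x, weights, exit_norm=True):
    """PreNorm residual blocks, optionally capped with an exit norm."""
    for w in weights:
        # PreNorm: normalize the branch input, keep the identity path.
        x = x + np.tanh(rms_norm(x) @ w)
    return rms_norm(x) if exit_norm else x

rng = np.random.default_rng(0)
d = 64
weights = [rng.normal(0.0, 1.0 / np.sqrt(d), (d, d)) for _ in range(2)]
x0 = rng.normal(size=(4, d))

def rms_after(n_steps, exit_norm):
    """Apply the same module n_steps times and report the activation RMS."""
    x = x0
    for _ in range(n_steps):
        x = module(x, weights, exit_norm=exit_norm)
    return float(np.sqrt(np.mean(x**2)))

with_exit = rms_after(8, exit_norm=True)
without_exit = rms_after(8, exit_norm=False)
print(f"with exit norm: {with_exit:.3f}  without: {without_exit:.3f}")
```

With the exit norm, the activation scale is reset to about 1 after every application regardless of how many steps are unrolled. Without it, each residual block adds to the stream and the scale drifts upward with the step count.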

04 — Task-Completion + PrefixLM

Every Gradient Should Count

Standard pretraining computes loss over every token — including prompt tokens the model will never need to generate. HRM-Text applies a task-completion objective: compute loss exclusively on response tokens, conditioned on the instruction. The PrefixLM mask lets instruction tokens attend bidirectionally, giving the model an encoder-like view of the full prompt.

Figure (prefixlm_attn.py): causal vs. PrefixLM attention patterns, with prompt tokens in amber and response tokens in white.
Causal: triangular, instruction tokens autoregressively predict each other
PrefixLM: instruction tokens attend bidirectionally; loss only on response

Example: the prompt “What is the capital of France?” (5 tokens) is followed by the response “Paris” (1 token). HRM-Text computes loss only on “Paris”.
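The two pieces of this objective, the attention mask and the response-only loss, can be sketched for the 5 + 1 token example above. This is a minimal sketch under stated assumptions: the helper names `causal_mask`, `prefix_lm_mask`, and `response_only_nll` are hypothetical, the per-token log-probabilities are made-up numbers, and `mask[i, j] = True` is taken to mean that query position `i` may attend to key position `j`.

```python
import numpy as np

def causal_mask(n_total):
    """Lower-triangular mask: each token sees itself and earlier tokens."""
    return np.tril(np.ones((n_total, n_total), dtype=bool))

def prefix_lm_mask(n_prompt, n_total):
    """Causal mask, except prompt tokens attend to the whole prompt."""
    mask = causal_mask(n_total)
    mask[:n_prompt, :n_prompt] = True  # bidirectional block over the prompt
    return mask

def response_only_nll(token_logprobs, n_prompt):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs[i] is the log-probability the model assigned to
    token i as a prediction target (illustrative values here).
    """
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    loss_mask = np.arange(token_logprobs.shape[0]) >= n_prompt
    return float(-token_logprobs[loss_mask].mean())

n_prompt, n_total = 5, 6  # 5 prompt tokens + 1 response token
mask = prefix_lm_mask(n_prompt, n_total)
print(mask.astype(int))

# Made-up per-token probabilities; only the last one enters the loss.
logprobs = np.log([0.2, 0.1, 0.3, 0.25, 0.5, 0.9])
loss = response_only_nll(logprobs, n_prompt)
print(loss)
```

In the printed mask, the top-left 5×5 block is fully ones (the prompt sees itself bidirectionally), the prompt rows cannot see the response column, and the response row sees everything before it. The loss depends only on the probability assigned to the single response token.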

The empirical payoff: attention entropy increases, response-token NLL decreases — more informative gradients per training token.

05 — Ablation

Three Ingredients, Each One Matters

An incremental FLOPs-matched ablation across all three axes shows each ingredient contributes meaningfully.

Figure (ablation.py): incremental contribution of each ingredient.

1. Baseline: Transformer + P(x) + Causal
Standard autoregressive pretraining. Loss on all tokens. Causal masking throughout.
MMLU: 40.6 · ARC-C: 51.9 · DROP: 38.2 · GSM8K: 48.4 · MATH: 35.4

2. + task-completion P(xₐ|xₙ) + Causal
Response-only loss. Every gradient update targets inference-time behavior. Same Transformer, same causal mask.
MMLU: 47.7 (+7.2) · ARC-C: 62.9 (+11.0) · DROP: 54.2 (+16.0) · GSM8K: 69.8 (+21.4) · MATH: 47.0 (+11.6)

3. + PrefixLM attention mask
Instruction tokens attend bidirectionally. Higher attention entropy: the model uses the full prompt. Still Transformer.
MMLU: 53.2 (+5.5) · ARC-C: 74.3 (+11.4) · DROP: 75.3 (+21.1) · GSM8K: 75.1 (+5.3) · MATH: 48.4 (+1.4)

4. + HRM architecture (replace Transformer)
Hierarchical dual-timescale recurrence. Same objective, same PrefixLM mask, same FLOPs.
MMLU: 60.7 (+7.5) · ARC-C: 81.9 (+7.6) · DROP: 82.2 (+6.9) · GSM8K: 84.5 (+9.4) · MATH: 56.2 (+7.8)

Task-completion alone adds +21 on GSM8K. PrefixLM adds +21 on DROP. HRM adds about +7 to +9 on every benchmark. The gains stack: each ingredient improves all five benchmarks on top of the previous step.
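The per-step gains can be recomputed directly from the reported scores, which is a quick way to check the ablation arithmetic. The sketch below uses only the numbers listed in the four steps; the names `STEPS` and `step_deltas` are hypothetical. Deltas recomputed from the rounded scores can differ from a listed delta by 0.1: MMLU at step 2 comes out as 7.1 against the listed +7.2, which may be a rounding effect in the source.

```python
BENCHMARKS = ["MMLU", "ARC-C", "DROP", "GSM8K", "MATH"]

# Scores copied from the four ablation steps, in BENCHMARKS order.
STEPS = [
    ("Baseline", [40.6, 51.9, 38.2, 48.4, 35.4]),
    ("+ task-completion", [47.7, 62.9, 54.2, 69.8, 47.0]),
    ("+ PrefixLM", [53.2, 74.3, 75.3, 75.1, 48.4]),
    ("+ HRM", [60.7, 81.9, 82.2, 84.5, 56.2]),
]

def step_deltas(steps):
    """Gain of each step over the previous one, rounded to one decimal."""
    out = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        out.append((name, [round(c - p, 1) for p, c in zip(prev, cur)]))
    return out

deltas = step_deltas(STEPS)
# Cumulative gain from the baseline to the full recipe.
total = [round(last - first, 1)
         for first, last in zip(STEPS[0][1], STEPS[-1][1])]

for name, gains in deltas:
    print(name, dict(zip(BENCHMARKS, gains)))
print("total", dict(zip(BENCHMARKS, total)))
```

The cumulative row shows the whole recipe moving GSM8K by 36.1 points and DROP by 44.0 points over the baseline Transformer at matched FLOPs.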

06 — Results

Competitive Performance with 100–900× Fewer Tokens

On the reasoning-heavy benchmarks (MATH, GSM8K, DROP, ARC-C), HRM-Text scores highest among the models compared here. It trails OLMo3 7B and Qwen3.5 2B on the broad factual-knowledge benchmark (MMLU), which is expected, since factual coverage scales more with data breadth than with reasoning depth.

Figure (results.py): MMLU and GSM8K scores vs. baselines.

MMLU (general knowledge)

HRM-Text 1B (40B tokens, 1× FLOPs): 60.7
Huginn 3.5B (0.8T tokens, 127× FLOPs): 31.4
Llama3.2 3B (9T tokens, 162× FLOPs): 58.0
Gemma3 4B (4T tokens, 96× FLOPs): 59.6
Qwen3.5 2B (36T tokens, 432× FLOPs): 64.5
OLMo3 7B (6T tokens, 252× FLOPs): 65.8

GSM8K (math reasoning)

HRM-Text 1B (40B tokens, 1× FLOPs): 84.5
Huginn 3.5B (0.8T tokens, 127× FLOPs): 34.6
Llama3.2 3B (9T tokens, 162× FLOPs): 77.7
Gemma3 4B (4T tokens, 96× FLOPs): 38.4
Qwen3.5 2B (36T tokens, 432× FLOPs): 53.0
OLMo3 7B (6T tokens, 252× FLOPs): 75.5

Full benchmark comparison

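| Model | FLOPs (relative to HRM-Text) | Tokens | MMLU | ARC-C | DROP | GSM8K | MATH |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HRM-Text 1B | 1× | 40B | 60.7 | 81.9 | 82.2 | 84.5 | 56.2 |
| Huginn 3.5B | 127× | 0.8T | 31.4 | 38.2 | 17.8 | 34.6 | 12.6 |
| OLMo3 7B | 252× | 6T | 65.8 | 81.6 | 71.5 | 75.5 | 40.0 |
| Llama3.2 3B | 162× | 9T | 58.0 | 69.1 | 45.2 | 77.7 | 48.0 |
| Gemma3 4B | 96× | 4T | 59.6 | 56.2 | 60.1 | 38.4 | 24.2 |
| Qwen3.5 2B | 432× | 36T | 64.5 | 81.0 | 30.8 | 53.0 | 34.2 |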
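The token multiples quoted throughout, and a single summary score per model, can be derived from the comparison above. The sketch below is a convenience calculation, not something the source reports: the unweighted mean over the five benchmarks is an assumption about how an "average benchmark score" might be formed, and the names `MODELS` and `summarize` are hypothetical.

```python
# name: (training tokens, [MMLU, ARC-C, DROP, GSM8K, MATH])
MODELS = {
    "HRM-Text 1B": (40e9, [60.7, 81.9, 82.2, 84.5, 56.2]),
    "Huginn 3.5B": (0.8e12, [31.4, 38.2, 17.8, 34.6, 12.6]),
    "OLMo3 7B": (6e12, [65.8, 81.6, 71.5, 75.5, 40.0]),
    "Llama3.2 3B": (9e12, [58.0, 69.1, 45.2, 77.7, 48.0]),
    "Gemma3 4B": (4e12, [59.6, 56.2, 60.1, 38.4, 24.2]),
    "Qwen3.5 2B": (36e12, [64.5, 81.0, 30.8, 53.0, 34.2]),
}

def summarize(models, reference="HRM-Text 1B"):
    """Map each model to (token multiple vs. reference, mean score)."""
    ref_tokens = models[reference][0]
    return {
        name: (round(tokens / ref_tokens),
               round(sum(scores) / len(scores), 1))
        for name, (tokens, scores) in models.items()
    }

summary = summarize(MODELS)
for name, (token_multiple, mean_score) in summary.items():
    print(f"{name:12s} {token_multiple:4d}x tokens  mean {mean_score}")
```

Under this unweighted mean, the baselines span 20× to 900× the training tokens of HRM-Text, with Huginn 3.5B at the low end and Qwen3.5 2B at the high end.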

HRM-Text outperforms Huginn 3.5B on every benchmark while using 127× fewer FLOPs and 20× fewer tokens. This is an existence proof: there is at least one point in architecture+objective space where the compute-to-performance ratio is radically better. Exploring that space is now accessible to anyone with two 8-GPU nodes and $1,500.