HRM-Text: Efficient Pretraining Beyond Scaling

01 — The Compute Divide

Pretraining is a Club Most Can't Afford to Join

The dominant pretraining recipe, autoregressive training on internet-scale raw text, works, but it concentrates foundational AI research inside compute-rich organizations. OLMo3 7B consumes 252× the training FLOPs of HRM-Text. Qwen3.5 2B uses 36 trillion training tokens, 900× HRM-Text's 40 billion. For academic groups and small labs, this is an insurmountable wall.

HRM-Text asks: what if you co-designed the architecture and the objective together, specifically for the data- and compute-limited regime? Can smarter architecture and smarter training signal make each token worth far more?

Figure (efficiency_scatter.py): training tokens vs. average benchmark score.

HRM-Text sits in the same performance band as models trained on 100–900× more tokens — but at the leftmost edge of the x-axis. Three ingredients make this possible: hierarchical recurrent architecture, MagicNorm stabilization, and response-only PrefixLM training.

02 — Dual-Timescale HRM

Slow Strategy, Fast Execution

Standard Transformers process a sequence in one forward pass. The Hierarchical Recurrent Model (HRM) runs the same parameters multiple times in two nested loops, inspired by the frontoparietal loop in the brain: a fast layer for execution and a slow layer for strategic context.

Figure (hrm_architecture): dual-timescale recurrent design. H = slow strategic module; L = fast execution module.

HRM uses 2 outer H-cycles, each with 3 fast L-module steps followed by 1 slow H-module step: 8 total forward steps through the same recurrent core — equivalent to 4× the compute of a single-pass Transformer, without 4× the parameters. This is depth through time, not through width.
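
The loop structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `hrm_forward`, the way the two states are initialised, and the scalar `toy_l` / `toy_h` modules are all assumptions made for the example. Only the cycle counts (2 H-cycles, 3 L-steps each) come from the text.

```python
def hrm_forward(x, l_module, h_module, h_cycles=2, l_steps=3):
    """Nested dual-timescale loop: l_steps fast updates, then one slow update, per cycle."""
    z_h = x  # slow (strategic) state; initialising from x is an assumption
    z_l = x  # fast (execution) state; initialising from x is an assumption
    schedule = []
    for _ in range(h_cycles):
        for _ in range(l_steps):
            z_l = l_module(z_l, z_h, x)  # fast step sees the current slow context
            schedule.append("L")
        z_h = h_module(z_h, z_l)  # slow step absorbs the result of the fast steps
        schedule.append("H")
    return z_h, schedule


# Toy scalar modules, purely illustrative (real modules would be Transformer blocks).
def toy_l(z_l, z_h, x):
    return 0.5 * z_l + 0.25 * z_h + 0.25 * x


def toy_h(z_h, z_l):
    return 0.5 * z_h + 0.5 * z_l


state, schedule = hrm_forward(1.0, toy_l, toy_h)
print("".join(schedule), len(schedule))  # prints: LLLHLLLH 8
```

The same two callables are reused on every cycle, so the step count, h_cycles * (l_steps + 1), can change without adding parameters.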

03 — MagicNorm

Making Deep Recurrence Trainable

PostNorm keeps activations bounded but blocks gradients in deep networks. PreNorm lets gradients flow but allows residual variance to grow unboundedly. In a recurrent model applied N times, both problems compound multiplicatively. Neither option survives deep recurrence at language-model scale.

Figure (magicnorm.py): forward variance vs. backward gradient health.

MagicNorm exploits the asymmetry between forward and backward horizons under truncated BPTT. Each module has internal PreNorm blocks capped with a final normalization at its exit. Forward: exit norm applied N=8 times → bounded variance. Backward with K≤5 steps: PreNorm identity path dominates → stable optimization.
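
The forward half of this argument can be checked with a small toy experiment. The sketch below is an assumption-laden illustration rather than the MagicNorm implementation. It uses RMS normalization as the norm. The `toy_sublayer`, a 90-degree rotation of coordinate pairs, stands in for attention or MLP branches so that the arithmetic is exact. The figure of 4 blocks per module is arbitrary. The sketch shows only the bounded-variance side; it does not simulate the truncated backward pass.

```python
import math


def rms(v):
    return math.sqrt(sum(a * a for a in v) / len(v))


def rms_norm(v, eps=1e-6):
    r = rms(v)
    return [a / (r + eps) for a in v]


def toy_sublayer(v):
    # Hypothetical stand-in for attention/MLP: rotate each coordinate pair by 90 degrees,
    # so the branch output is orthogonal to its input and has the same norm.
    out = []
    for i in range(0, len(v), 2):
        out.extend([-v[i + 1], v[i]])
    return out


def module(v, n_blocks, exit_norm):
    for _ in range(n_blocks):
        branch = toy_sublayer(rms_norm(v))      # PreNorm: only the branch input is normalised
        v = [a + b for a, b in zip(v, branch)]  # residual add keeps an identity path
    return rms_norm(v) if exit_norm else v      # optional normalisation at the module exit


def rms_trace(n_applications, n_blocks, exit_norm):
    v = rms_norm([1.0, 2.0, 3.0, 4.0])
    trace = []
    for _ in range(n_applications):
        v = module(v, n_blocks, exit_norm)  # same module reused, as in a recurrent model
        trace.append(rms(v))
    return trace


capped = rms_trace(n_applications=8, n_blocks=4, exit_norm=True)
uncapped = rms_trace(n_applications=8, n_blocks=4, exit_norm=False)
print([round(r, 3) for r in capped])
print([round(r, 3) for r in uncapped])
```

In this toy setting each PreNorm block adds about one unit of squared RMS, so the uncapped run grows with the total number of blocks traversed, while the exit norm resets the scale to about 1 after every application.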

04 — Task-Completion + PrefixLM

Every Gradient Should Count

Standard pretraining computes loss over every token — including prompt tokens the model will never need to generate. HRM-Text applies a task-completion objective: compute loss exclusively on response tokens, conditioned on the instruction. The PrefixLM mask lets instruction tokens attend bidirectionally, giving the model an encoder-like view of the full prompt.

Figure (prefixlm_attn.py): causal vs. PrefixLM attention patterns.

Causal: triangular, instruction tokens autoregressively predict each other

PrefixLM: instruction tokens attend bidirectionally; loss only on response

Example: What is the capital of France? (5 tokens) + Paris (1 response token). HRM-Text only computes loss on “Paris”.
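
For the 5-token prompt and 1-token response above, the two attention patterns and the response-only loss mask can be written out explicitly. This is a minimal sketch using plain Python lists. The function names are hypothetical, and a real training pipeline would build the same masks as tensors and pass them to its attention and loss functions.

```python
def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_lm_mask(n_prompt, n_response):
    n = n_prompt + n_response
    # Prompt columns are visible to every position (bidirectional within the prompt);
    # response columns are visible only causally.
    return [[j < n_prompt or j <= i for j in range(n)] for i in range(n)]


def response_loss_mask(n_prompt, n_response):
    # True marks target tokens that contribute to the loss
    return [False] * n_prompt + [True] * n_response


def show(mask):
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


print(show(causal_mask(6)))
print()
print(show(prefix_lm_mask(5, 1)))
print(response_loss_mask(5, 1))
```

Under the causal mask the first prompt token sees only itself. Under the PrefixLM mask it sees all five prompt tokens but not the response, and only the final target ("Paris") is counted in the loss.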

The empirical payoff: attention entropy increases, response-token NLL decreases — more informative gradients per training token.

05 — Ablation

Three Ingredients, Each One Matters

An incremental FLOPs-matched ablation across all three axes shows each ingredient contributes meaningfully.

Figure (ablation.py): incremental contribution of each ingredient.

Baseline: Transformer + P(x) + Causal

Standard autoregressive pretraining. Loss on all tokens. Causal masking throughout.

MMLU: 40.6 · ARC-C: 51.9 · DROP: 38.2 · GSM8K: 48.4 · MATH: 35.4

+ task-completion P(xₐ|xₙ) + Causal

Response-only loss. Every gradient update targets inference-time behavior. Same Transformer, same causal mask.

MMLU: 47.7 (+7.1) · ARC-C: 62.9 (+11.0) · DROP: 54.2 (+16.0) · GSM8K: 69.8 (+21.4) · MATH: 47.0 (+11.6)

+ PrefixLM attention mask

Instruction tokens attend bidirectionally. Higher attention entropy — the model uses the full prompt. Still Transformer.

MMLU: 53.2 (+5.5) · ARC-C: 74.3 (+11.4) · DROP: 75.3 (+21.1) · GSM8K: 75.1 (+5.3) · MATH: 48.4 (+1.4)

+ HRM architecture (replace Transformer)

Hierarchical dual-timescale recurrence. Same objective, same PrefixLM mask, same FLOPs.

MMLU: 60.7 (+7.5) · ARC-C: 81.9 (+7.6) · DROP: 82.2 (+6.9) · GSM8K: 84.5 (+9.4) · MATH: 56.2 (+7.8)


Task-completion alone adds +21 on GSM8K. PrefixLM adds +21 on DROP. HRM adds roughly +7 to +9 on every benchmark. The gains stack: each ingredient improves on a baseline already raised by the previous one.
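
The per-step gains can be recomputed directly from the scores listed in the four ablation steps. The snippet below is a small bookkeeping sketch with hypothetical helper names. Because it works from the rounded scores shown here, a delta may differ in the last digit from one computed on unrounded results.

```python
BENCHMARKS = ["MMLU", "ARC-C", "DROP", "GSM8K", "MATH"]

# Scores copied from the four ablation steps above.
STEPS = [
    ("Baseline", [40.6, 51.9, 38.2, 48.4, 35.4]),
    ("+ task-completion", [47.7, 62.9, 54.2, 69.8, 47.0]),
    ("+ PrefixLM", [53.2, 74.3, 75.3, 75.1, 48.4]),
    ("+ HRM", [60.7, 81.9, 82.2, 84.5, 56.2]),
]


def incremental_deltas(steps):
    """Gain of each step over the step immediately before it."""
    out = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        out.append((name, [round(c - p, 1) for p, c in zip(prev, cur)]))
    return out


def total_gain(steps):
    """Gain of the final configuration over the baseline."""
    first, last = steps[0][1], steps[-1][1]
    return [round(b - a, 1) for a, b in zip(first, last)]


for name, deltas in incremental_deltas(STEPS):
    print(name, dict(zip(BENCHMARKS, deltas)))
print("total", dict(zip(BENCHMARKS, total_gain(STEPS))))
```

Summing the three incremental deltas for a benchmark reproduces its total gain, which is a quick consistency check on the numbers.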

06 — Results

Competitive Performance at 100–900× Lower Cost

HRM-Text achieves near-competitive performance on reasoning-heavy benchmarks (MATH, GSM8K, DROP, ARC-C) while trailing on broad factual-knowledge benchmarks (MMLU) — expected, since factual coverage scales more with data breadth than reasoning depth.

Figure (results.py): MMLU and GSM8K vs. baselines.

MMLU (general knowledge)

Model	Tokens	FLOPs	MMLU
HRM-Text 1B	40B	1×	60.7
Huginn 3.5B	0.8T	127×	31.4
Llama3.2 3B	9T	162×	58.0
Gemma3 4B	4T	96×	59.6
Qwen3.5 2B	36T	432×	64.5
OLMo3 7B	6T	252×	65.8

GSM8K (math reasoning)

Model	Tokens	FLOPs	GSM8K
HRM-Text 1B	40B	1×	84.5
Huginn 3.5B	0.8T	127×	34.6
Llama3.2 3B	9T	162×	77.7
Gemma3 4B	4T	96×	38.4
Qwen3.5 2B	36T	432×	53.0
OLMo3 7B	6T	252×	75.5

Full benchmark comparison

Model	FLOPs	Tokens	MMLU	ARC-C	DROP	GSM8K	MATH
HRM-Text 1B	1×	40B	60.7	81.9	82.2	84.5	56.2
Huginn 3.5B	127×	0.8T	31.4	38.2	17.8	34.6	12.6
OLMo3 7B	252×	6T	65.8	81.6	71.5	75.5	40.0
Llama3.2 3B	162×	9T	58.0	69.1	45.2	77.7	48.0
Gemma3 4B	96×	4T	59.6	56.2	60.1	38.4	24.2
Qwen3.5 2B	432×	36T	64.5	81.0	30.8	53.0	34.2
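
The token multiples quoted throughout follow from the Tokens column. The snippet below is a small arithmetic sketch with a hypothetical helper name; token counts are entered in billions, exactly as listed in the table.

```python
# Training tokens in billions, copied from the table above.
TOKENS_B = {
    "HRM-Text 1B": 40,
    "Huginn 3.5B": 800,
    "OLMo3 7B": 6000,
    "Llama3.2 3B": 9000,
    "Gemma3 4B": 4000,
    "Qwen3.5 2B": 36000,
}


def token_ratios(tokens, reference="HRM-Text 1B"):
    """Training tokens of each model as a multiple of the reference model's."""
    base = tokens[reference]
    return {name: count / base for name, count in tokens.items()}


for name, ratio in token_ratios(TOKENS_B).items():
    print(f"{name}: {ratio:g}x")
```

On these figures Huginn 3.5B uses 20× the tokens of HRM-Text, and the other baselines fall between 100× and 900×.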

HRM-Text outperforms Huginn 3.5B on every benchmark while using 127× fewer FLOPs and 20× fewer tokens. This is an existence proof: there is at least one point in architecture+objective space where the compute-to-performance ratio is radically better. Exploring that space is now accessible to anyone with two 8-GPU nodes and $1,500.