r/LocalLLaMA · · 1 min read

Sapient Intelligence releases HRM-Text 1B: 40B tokens, ~$1k pretrain, beats Llama3.2 3B on MATH and DROP

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Sapient Intelligence (the HRM/hierarchical reasoning folks) dropped HRM-Text 1B today. Posting because the benchmark chart is interesting enough to be worth a look even if you're skeptical of the marketing.

The training numbers:

  • 1B params, trained from scratch on 16 GPUs in 1.9 days
  • 40B unique tokens (they claim ~1/1000 the data of comparable models — chart shows 100×–900× less than Gemma3 4B / Llama3.2 3B / Qwen3.5 2B / Olmo3 7B)
  • ~$1,000 reported budget
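A quick back-of-envelope check on those figures, as a rough sketch. It uses only the numbers quoted above. The per-GPU-hour price and the token counts implied for the comparison models are derived here, not reported by Sapient, and the variable names are mine. It also assumes the ~$1,000 covers compute only, which the post does not say.

```python
# Back-of-envelope arithmetic from the figures quoted in the post.
GPUS = 16
DAYS = 1.9
BUDGET_USD = 1_000   # reported as "~$1,000", so treat as approximate
TOKENS = 40e9        # 40B unique tokens
PARAMS = 1e9         # 1B parameters

gpu_hours = GPUS * DAYS * 24
usd_per_gpu_hour = BUDGET_USD / gpu_hours   # assumes the budget is compute only
tokens_per_param = TOKENS / PARAMS

# The chart's "100x to 900x less data" range implies these training-set
# sizes for the comparison models (derived, not quoted).
implied_low_tokens = TOKENS * 100
implied_high_tokens = TOKENS * 900

print(f"GPU-hours:         {gpu_hours:.1f}")
print(f"USD per GPU-hour:  {usd_per_gpu_hour:.2f}")
print(f"Tokens per param:  {tokens_per_param:.0f}")
print(f"Implied data for comparison models: "
      f"{implied_low_tokens / 1e12:.0f}T to {implied_high_tokens / 1e12:.0f}T tokens")
```

That works out to about 730 GPU-hours, roughly $1.37 per GPU-hour, and 40 tokens per parameter. The 100× to 900× range corresponds to comparison models trained on about 4T to 36T tokens, so "~1/1000 the data" is the generous end of the chart's own range.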

https://preview.redd.it/18dykreus22h1.png?width=1978&format=png&auto=webp&s=05c33d8682ccfec8d8ebb6e6ed96c7fba57bb2b1

Where it actually wins (per their chart):

  • MATH: 56.2 vs Llama3.2 3B 48.0, Olmo3 7B 40.0, GPT-3.5 34.1
  • DROP: 82.2 vs Olmo3 7B 71.5, Llama3.2 3B 45.2, GPT-3.5 64.1

Where it's roughly tied or behind:

  • ARC-C: 81.9 — basically a tie with Olmo3 7B (81.6) and Qwen3.5 2B (81.2)
  • MMLU: 60.7, behind Qwen3.5 2B (64.7) and Olmo3 7B (65.8)

So the pattern is what you'd expect from something called a "Hierarchical Reasoning Model": it punches well above its weight on multi-step reasoning (MATH, DROP) and is only middling on knowledge recall (MMLU). The MMLU gap is the part that makes the story believable, since 40B tokens is just not enough to pack in much world knowledge.
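To put numbers on "wins" versus "tied or behind", here is a small sketch that computes the margin against the best competitor listed for each benchmark. The scores are copied from the post (so they are Sapient's self-reported numbers), the post does not list every model on every benchmark, and the dictionary layout and function name are just illustrative.

```python
# Scores as quoted in the post (self-reported, taken from Sapient's chart).
SCORES = {
    "MATH":  {"HRM-Text 1B": 56.2, "Llama3.2 3B": 48.0, "Olmo3 7B": 40.0, "GPT-3.5": 34.1},
    "DROP":  {"HRM-Text 1B": 82.2, "Olmo3 7B": 71.5, "Llama3.2 3B": 45.2, "GPT-3.5": 64.1},
    "ARC-C": {"HRM-Text 1B": 81.9, "Olmo3 7B": 81.6, "Qwen3.5 2B": 81.2},
    "MMLU":  {"HRM-Text 1B": 60.7, "Qwen3.5 2B": 64.7, "Olmo3 7B": 65.8},
}
MODEL = "HRM-Text 1B"


def margin_vs_best_listed(benchmark):
    """Return (best listed rival, HRM score minus that rival's score)."""
    row = SCORES[benchmark]
    rivals = [(name, score) for name, score in row.items() if name != MODEL]
    rival, rival_score = max(rivals, key=lambda pair: pair[1])
    return rival, round(row[MODEL] - rival_score, 1)


margins = {bench: margin_vs_best_listed(bench) for bench in SCORES}
for bench, (rival, delta) in margins.items():
    print(f"{bench:6s} {delta:+.1f} vs {rival}")
```

Against the best model listed in each row, that is +8.2 on MATH, +10.7 on DROP, +0.3 on ARC-C, and -5.1 on MMLU.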

Caveats worth flagging before anyone gets too hyped:

  1. These are their own self-reported numbers on their own chart. Independent eval pending.
  2. MATH/DROP are exactly the kinds of benchmarks most vulnerable to test-set contamination in "structured token" pretraining curricula. Curious what people find with held-out reasoning evals.
  3. The original HRM paper got mixed reception on whether the hierarchical mechanism generalizes — would love to hear from anyone who actually runs it whether it feels qualitatively different from a normal 1B.
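On caveat 2, one generic way people sanity-check contamination is word-level n-gram overlap between benchmark items and training text. The sketch below assumes you have some sample of the pretraining corpus as plain strings, which the post gives no indication is available. The function names and the choice of n=8 are mine, and this is a crude heuristic, not anything Sapient describes.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in corpus_docs."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy example with made-up strings, not real benchmark or corpus data.
item = "what is the sum of two plus two and three plus three"
leaky_corpus = ["intro text what is the sum of two plus two and three plus three the end"]
clean_corpus = ["a completely unrelated document about training small language models cheaply"]

leaky = overlap_fraction(item, leaky_corpus)
clean = overlap_fraction(item, clean_corpus)
print(leaky, clean)
```

A high fraction flags an item for manual inspection. It will miss paraphrased or templated leakage, which is the more likely failure mode for a synthetic or structured curriculum, so held-out reasoning evals are still the better test.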

Anyone tried it yet?

submitted by /u/Turbulent-Sky5396