Nemotron-3-Super-120B-A12B (hybrid Mamba+MoE) holds perfect needle retrieval to 504K tokens on 4×3090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TLDR: The Mamba/SSM layers keep a constant-size recurrent state instead of a growing KV cache, so extra context costs very little memory. Full needle retrieval at half a million tokens, fully on-GPU, ~71GB. The new imatrix GGUF is here: https://huggingface.co/mradermacher/NVIDIA-Nemotron-3-Super-120B-A12B-BF16-i1-GGUF/resolve/main/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.i1-Q4_K_S.gguf

Solo setup, local only. Pulled NVIDIA's Nemotron-3-Super (nemotron_h: hybrid Mamba2 + periodic attention + MoE, A12B active, trained for 1M ctx) as the i1-Q4_K_S from mradermacher (71GB) and ran it across 4×3090.

## Numbers (llama.cpp-latest, i1-Q4_K_S, fully GPU-resident, q8_0 KV)

- Decode (t/s): 72tg short · 67tg 30K · 51tg 96K · 47tg 126K · 39tg 200K · 34tg 269K · 23tg 504K
- Prefill (t/s): ~2080pp 30K · 1469pp 200K · 885pp 504K
- Needle-in-haystack (codes planted at 10/50/90% depth): exact recall at EVERY depth tested, up to 504,482 tokens. No miss.
- VRAM: ~20GB/card

Full-attention models pay for a KV cache that grows with context, so decode craters as you fill it. Nemotron's Mamba layers carry a fixed-size state, and only the few attention layers have KV at all (2 KV heads, tiny). Net: decode at 500K (23 t/s) is about the speed a comparable full-attention MoE (MiniMax-M2.7-REAP, similar size at ~74GB, A10B) ran at 30K (24.5 t/s) on the same box/engine. Same-box head-to-head: Nemotron got ~2.7× the decode speed at a 30K spine (67 vs 24.5 t/s) and held exact recall to 500K.

One caveat: buried standing instructions lose to a later conflicting one (recency bias). A "frozen contract" planted near the top flipped when I contradicted it at the end. Put hard rules near the end or in the system prompt, not buried in a long spine.
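The memory argument in the post can be put into rough numbers. This is a minimal back-of-envelope sketch, not a measurement: the layer counts and head dimensions are hypothetical placeholders chosen for illustration, not the published Nemotron or MiniMax configs. The only values taken from the post are the 2 KV heads, the q8_0 KV type, and the tested context lengths. It counts KV cache only; the Mamba layers add a fixed-size state that does not depend on context length.

```python
# Back-of-envelope KV-cache sizing.
# ASSUMPTION: every layer count and head size below is a hypothetical
# placeholder, not a real model config.

Q8_0_BYTES_PER_ELEM = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale
GIB = 1024 ** 3


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx,
                   bytes_per_elem=Q8_0_BYTES_PER_ELEM):
    """Bytes needed to hold K and V for all attention layers at n_ctx tokens."""
    per_token_elems = 2 * n_attn_layers * n_kv_heads * head_dim  # 2 = K plus V
    return per_token_elems * n_ctx * bytes_per_elem


# Hypothetical hybrid: a handful of attention layers with 2 KV heads each.
hybrid = dict(n_attn_layers=8, n_kv_heads=2, head_dim=128)
# Hypothetical full-attention model: KV in every layer, 8 KV heads each.
full_attn = dict(n_attn_layers=60, n_kv_heads=8, head_dim=128)

for n_ctx in (30_000, 200_000, 504_482):
    h = kv_cache_bytes(**hybrid, n_ctx=n_ctx) / GIB
    f = kv_cache_bytes(**full_attn, n_ctx=n_ctx) / GIB
    print(f"{n_ctx:>7} tokens: hybrid {h:6.2f} GiB | full-attention {f:6.2f} GiB")
```

Both caches grow linearly with context, but with these placeholder numbers the hybrid's per-token cost is 30× smaller, since only a few layers keep KV and each has just 2 KV heads. Swap in the real values from the GGUF metadata to get actual figures.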