Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Just saw Xiaomi MiMo announce MiMo-V2.5-Pro UltraSpeed, claiming they broke the 1,000 tokens/sec output barrier on a 1-trillion-parameter MoE model. According to them, they’re doing it on a single standard 8-GPU node, not on custom wafer-scale hardware like Cerebras or SRAM-heavy hardware like Groq.
Crazy if true.