mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput. The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200: The full report includes all steps to reproduce these results. The results hold up across quantization type (eQ8_0, Q4K), model (dense and MoE), and GPU. Please see the full report for more details: https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md If you want to try this out, you can install mistral.rs easily: Then, you can start a OpenAI-compatible server on port 1234 and a web chat UI with built-in agentic features:
Reproductions, criticism, and benchmark suggestions are welcome! Check out the GitHub for more details, documentation, and examples: https://github.com/EricLBuehler/mistral.rs [link] [comments] |
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.