r/LocalLLaMA · · 1 min read

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput.

The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:

https://preview.redd.it/jmdsjkrbfo4h1.png?width=3312&format=png&auto=webp&s=8a69286b73a8fad4edc671cb9ca8ad3f3cd74d1c

The full report includes all steps to reproduce these results. The results hold up across quantization type (eQ8_0, Q4K), model (dense and MoE), and GPU. Please see the full report for more details: https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md

If you want to try this out, you can install mistral.rs easily:

# Mac/Linux: curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh # Windows irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex 

Then, you can start a OpenAI-compatible server on port 1234 and a web chat UI with built-in agentic features:

mistralrs serve --agent -m google/gemma-4-E4B-it --quant 4

Reproductions, criticism, and benchmark suggestions are welcome!

Check out the GitHub for more details, documentation, and examples: https://github.com/EricLBuehler/mistral.rs

https://reddit.com/link/1tttevw/video/z0ayf1f1go4h1/player

submitted by /u/EricBuehler
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA