Did a 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32gb VRAM GPU - two models tested Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working with Ubuntu 24.04. That container has it up in minutes, real timesaver. Anyway, my personal use case for LLM's is primarily for Frigate to review camera footage and cut down on "notification noise" (it's like having a human review footage to determine what I need to know about and what I don't). The other use is for HomeAssistant. I ditched all my Alexa devices and replaced it with this (it's amazing). Anyway, I wanted to be sure I was getting the absolute most of out my hardware for speed and efficiency. I had Claude write me a script that would do batch testing of of the two models I got great accuracy out for those two use cases.
The MI60 (and MI50) get a speed boost on the _0 and _1 quants inherently, which is why I use them. The only reason for not using 4_1 for both is the size. I use 3 slots, each with their own cache so the size difference between the qwen 4_0 and 4_1 was eating too much space for my desired context size. The final result of the testing had a HUGE impact on the speed of both HA (less than 1.2 seconds to complete my voice commands) and Frigate (less than 18 seconds for review summaries of footage). I figured I'd share this here in case it helps anyone else. The following is generated by Claude (summary of what the script did, and it generated the table of results from the outcome of running the script): The benchmark sweep script executed 30 total runs across 8 sections, testing two models — Gemma 4 26B Q4_1 and Qwen3 35B Q4_0 — against three KV cache pre-fill depths (0, 1,000, and 6,000 tokens) with a fixed 512-token prompt and 128 generation tokens per run, each repeated 5 times internally by llama-bench for statistical stability. The knobs turned were: flash attention on vs. off; KV cache quantisation at three levels (f16 default, q8_0, and q4_0); ubatch size at four values (512, 2048, 4096, and 8192); logical batch size at two values (2048 and 8192); CPU thread count at three values (8, 12, and 24); and two ROCm-specific environment variables — [link] [comments] |
More from r/LocalLLaMA
-
Any reason to run dense over MOE for RAGs?
May 23
-
Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.
May 23
-
GPT 5.5 "secret sauce" is just having the thinking be some stupid caveman mode?
May 23
-
What is the current best Small Language Model that can be run without GPU?
May 23
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.