r/LocalLLaMA · May 23, 2026 · 3 min read

Did 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32GB VRAM GPU - two models tested, Gemma4 and Qwen3.6 - figured I'd share in case it helps anyone else

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm running llama.cpp using this Docker container: https://github.com/mixa3607/ML-gfx906 (it's a lot easier than building from source, which I was doing previously). The MI60 (and the MI50) is a real pain in the behind to get working on Ubuntu 24.04. That container has it up in minutes, a real timesaver.

Anyway, my personal use case for LLMs is primarily Frigate: reviewing camera footage to cut down on "notification noise" (it's like having a human review the footage to decide what I need to know about and what I don't). The other use is HomeAssistant. I ditched all my Alexa devices and replaced them with this (it's amazing).

I wanted to be sure I was getting the absolute most out of my hardware for speed and efficiency, so I had Claude write me a script to batch-test the two models that gave me great accuracy for those two use cases:

Gemma 4 26B.A4B Q4_1
Qwen3 35B.A3B Q4_0

The MI60 (and MI50) inherently get a speed boost on the _0 and _1 quants, which is why I use them. The only reason I'm not using Q4_1 for both is size. I run 3 slots, each with its own cache, so the size difference between the Qwen Q4_0 and Q4_1 was eating too much VRAM for my desired context size.
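For a rough sense of that size gap, here's some back-of-the-envelope arithmetic. In GGML, a Q4_0 block stores 32 weights in 18 bytes (one fp16 scale plus 16 bytes of 4-bit values), while a Q4_1 block stores 32 weights in 20 bytes (fp16 scale, fp16 minimum, 16 bytes of values). The sketch assumes all of the nominal 35B parameters use the same quant type, which real GGUF files don't do (some tensors are kept at other precisions), so treat the result as a ballpark, not a measurement. The function names are made up for illustration.

```python
# Ballpark only: assumes every weight in the model uses the same quant type.
BLOCK_WEIGHTS = 32                       # weights per GGML quant block
BLOCK_BYTES = {"Q4_0": 18, "Q4_1": 20}   # bytes per block


def bits_per_weight(quant: str) -> float:
    """Average storage cost of one weight, in bits."""
    return BLOCK_BYTES[quant] * 8 / BLOCK_WEIGHTS


def approx_size_gib(n_params: float, quant: str) -> float:
    """Approximate weight storage in GiB for n_params weights."""
    return n_params * BLOCK_BYTES[quant] / BLOCK_WEIGHTS / 2**30


N_PARAMS = 35e9  # nominal count taken from the model name (assumption)

diff_gib = approx_size_gib(N_PARAMS, "Q4_1") - approx_size_gib(N_PARAMS, "Q4_0")

for quant in BLOCK_BYTES:
    print(f"{quant}: {bits_per_weight(quant)} bits/weight, "
          f"~{approx_size_gib(N_PARAMS, quant):.1f} GiB")
print(f"Q4_1 minus Q4_0: ~{diff_gib:.1f} GiB")
```

Under those assumptions the gap comes out at around 2 GiB, which is VRAM that could otherwise go to the KV caches of the three slots.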

The final result of the testing had a HUGE impact on the speed of both HA (less than 1.2 seconds to complete my voice commands) and Frigate (less than 18 seconds for review summaries of footage). I figured I'd share this here in case it helps anyone else. The following was generated by Claude: a summary of what the script did, plus the table of results it built from the script's output.

The benchmark sweep script executed 30 total runs across 8 sections, testing two models (Gemma 4 26B Q4_1 and Qwen3 35B Q4_0) against three KV cache pre-fill depths (0, 1,000, and 6,000 tokens) with a fixed 512-token prompt and 128 generation tokens per run, each repeated 5 times internally by llama-bench for statistical stability.

The knobs turned were: flash attention on vs. off; KV cache quantisation at three levels (f16 default, q8_0, and q4_0); ubatch size at four values (512, 2048, 4096, and 8192); logical batch size at two values (2048 and 8192); CPU thread count at three values (8, 12, and 24); and two ROCm-specific environment variables: GGML_ROCM_FORCE_MMQ (1 vs. 0, switching between quantised matmul kernels and rocBLAS GEMM) and HSA_ENABLE_SDMA (enabled vs. disabled, switching between DMA and blit-copy memory transfers).

Sections 1 through 7 each varied exactly one parameter while holding all others at the production baseline, enabling clean attribution of any performance change to a single cause. Section 8 then stacked three combinations of the most promising individual results (SDMA disabled with q8_0 KV, SDMA disabled with q4_0 KV, and SDMA disabled plus MMQ off plus q8_0 KV) to determine whether gains compounded or cancelled when applied together.

The production llama-server container was stopped before each run to ensure exclusive GPU access, and each model configuration was launched as a fresh throwaway container from the same image used in production, with identical device mappings, volume mounts, and environment variables.
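The one-knob-at-a-time idea described above can be sketched roughly like this. This is not the actual script: the baseline values and the model path are hypothetical (the production settings aren't listed here), it only prints the commands instead of running them, and it makes no attempt to reproduce the exact 30 runs or the Docker handling. The command-line flags are llama-bench's own (`-p`, `-n`, `-d`, `-r`, `-fa`, `-ctk`, `-ctv`, `-ub`, `-b`, `-t`), and the environment variable names are used as given in the summary.

```python
from typing import Dict, Iterator, List, Tuple

MODEL = "/models/model.gguf"  # hypothetical path

# Hypothetical baseline; substitute your own production settings.
BASELINE: Dict[str, str] = {
    "fa": "1", "kv": "f16", "ub": "512", "b": "2048",
    "t": "8", "mmq": "1", "sdma": "1",
}

# Values for each knob, as listed in the summary above.
SWEEPS: Dict[str, List[str]] = {
    "fa": ["0", "1"],
    "kv": ["f16", "q8_0", "q4_0"],
    "ub": ["512", "2048", "4096", "8192"],
    "b": ["2048", "8192"],
    "t": ["8", "12", "24"],
    "mmq": ["1", "0"],
    "sdma": ["1", "0"],
}


def one_at_a_time(baseline: Dict[str, str],
                  sweeps: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield the baseline, then configs that change exactly one knob."""
    yield dict(baseline)
    for knob, values in sweeps.items():
        for value in values:
            if value != baseline[knob]:
                yield {**baseline, knob: value}


def to_command(cfg: Dict[str, str],
               model: str = MODEL) -> Tuple[List[str], Dict[str, str]]:
    """Turn a config into a llama-bench argv plus extra environment variables."""
    argv = [
        "llama-bench", "-m", model,
        "-p", "512", "-n", "128",        # fixed prompt / generation lengths
        "-d", "0,1000,6000",             # KV cache pre-fill depths
        "-r", "5",                       # repetitions per test
        "-fa", cfg["fa"],
        "-ctk", cfg["kv"], "-ctv", cfg["kv"],
        "-ub", cfg["ub"], "-b", cfg["b"], "-t", cfg["t"],
    ]
    env = {
        "GGML_ROCM_FORCE_MMQ": cfg["mmq"],
        "HSA_ENABLE_SDMA": cfg["sdma"],
    }
    return argv, env


configs = list(one_at_a_time(BASELINE, SWEEPS))
for cfg in configs:
    argv, env = to_command(cfg)
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, " ".join(argv))
```

Because every non-baseline config differs from the baseline in a single key, any change in tokens per second can be pinned on that one knob. The stacked combinations from section 8 would just be a few extra hand-written dicts passed through `to_command`.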

https://preview.redd.it/mb0jdzqg1x2h1.png?width=1278&format=png&auto=webp&s=6f2f23c55b45bbb4b9bfebd1af4874f0a21069de

submitted by /u/FantasyMaster85