r/LocalLLaMA · · 2 min read

Optimizing speed & quality on Qwen3.6 27b

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Does the inference speed below seem optimal for the hardware, or could there be further room for improvement ?

I’ve been trying to use Qwen3.6 27b for agentic harnesses like Pi/Hermes. Because of the long horizon required of agentic tasks, I been trying to maximize speed while retaining as close to full precision as possible.

The inference speed can vary widely between ~300-500 tok/s for prompt processing, ~22-30 tok/sec of token generation at a context window of 100k. This is with 40GB of VRAM (1x2060super8gb, 2x5060ti16gb). I have a good amount of DDR4 3200 RAM running at 4-channel, but I didn’t want to compromise on speed at all. I tried to get to 128k context window as much as I can without spilling into RAM, but I had to compromise and land at 100k because there just didn’t seem any way.

Here’s my llama.cpp command, running on Ubuntu:

CUDA_DEVICE_ORDER=PCI_BUS_ID \

path/llama-server \

-m path/unsloth/Qwen3.6-27B-MTP-Q8_0.gguf \

-mm path/mmproj-BF16.gguf --image-min-tokens 1024 --no-mmproj-offload \

--port 8080 --host 0.0.0.0 --alias model\

--temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking": true}' \

--spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75 --spec-draft-type-k q4_0 --spec-draft-type-v q4_0 \

-t 12 -fa on -np 1 --kv-unified --cache-idle-slots --jinja \

-lv 4 -fitt 0,0,2250 -c 100000 \

My question to the community is whether this seems optimal or not, or if there are any other flags or variables that I’m not using that mould help further squeeze out more performance on my hardware?

(Lastly I hope that my llama.cpp setup, hardware info, and performance can serve as a useful reference for others. I started my obsessive local model journey in 11/2025 and it’s been a good opportunity to learn about how to run these models and what goes into it, before inevitably getting crushed by the big companies in the future. Looking forward to learning about how to train micro models and fine tuning next.)

submitted by /u/Ambitious_Fold_2874
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA