r/LocalLLaMA · · 1 min read

Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on?

For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded the vision model to CPU because it is infrequently used.

Currently running Qwen3.6-27B-Q3_K_S.gguf with 64 layers on GPU at the following speeds:

prompt eval time = 462.66 ms / 507 tokens ( 0.91 ms per token, 1095.83 tokens per second) eval time = 18710.17 ms / 884 tokens ( 21.17 ms per token, 47.25 tokens per second) total time = 19172.84 ms / 1391 tokens draft acceptance rate = 0.59677 ( 481 accepted / 806 generated) prompt eval time = 6001.34 ms / 8561 tokens ( 0.70 ms per token, 1426.51 tokens per second) eval time = 2404.46 ms / 147 tokens ( 16.36 ms per token, 61.14 tokens per second) total time = 8405.80 ms / 8708 tokens draft acceptance rate = 0.80357 ( 90 accepted / 112 generated) 

Config:

 -m /models/Qwen3.6-27B/Qwen3.6-27B-Q3_K_S.gguf --mmproj /models/Qwen3.6-27B/mmproj-BF16.gguf --no-mmproj-offload --host 0.0.0.0 --port 8080 --jinja -fa on --temp 0.6 --top-p 0.95 --top-k 20 --min_p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0 --cache-ram 0 --fit on -np 2 --fit-ctx 32000 --cache-type-k q8_0 --cache-type-v q8_0 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 --log-verbosity 4 --chat-template-kwargs '{"preserve_thinking": true}' --spec-type draft-mtp --spec-draft-n-max 2 
submitted by /u/InternationalNebula7
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA