Best Settings for 48GB VRAM + Qwen 3.6 27B
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey everyone, I've been running Qwen3.6 27B (Q8_0) across an RTX 4090 + RTX 3090 setup using llama.cpp with tensor split, and I wanted to share what's been working best for me so far. See if anyone has any better settings
Hardware: RTX 4090 (24GB) + RTX 3090 (24GB), 48GB VRAM total
OS Arch Linux (using igpu for display)
Settings:
- Quant: Q8_0
- Split mode:
tensor - Layers on GPU:
-ngl 999 - Context: 250k (
-c 250000) - Speculative decoding:
--spec-type draft-mtp --spec-draft-n-max 4 - parallel requests:
-np 3 - Unified KV cache:
-kvu - Chat template:
--chat-template-kwargs '{"preserve_thinking": true}' - Flags:
--no-mmap -fa on --jinja -fit off --no-op-offload - Vision: mmproj-F16 with
--no-mmproj-offload
This gives me 75-100t/s tg and 1500 pp 250k un quantized context + vision + MTP
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.