Qwen 27b MTP Config, Llama.cpp Single 3090
Mirrored from r/LocalLLaMA for archival readability.
What setup are you using for qwen 27b on a single 3090?
Here's what I started using today. It has to compact context often, but I'm worried about giving up more accuracy and reliability by dropping to a lower quant:
llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 -ngl -1 -t 8 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 2 --fit off --mmproj /Models/q3.6/mmproj-Qwen3.6-27B-f16.gguf --no-mmproj-offload
I'm getting around 65 tk/s.
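For weighing `-c 65536` against VRAM headroom, a rough KV-cache estimate helps. This is a sketch only: the layer/head numbers below are hypothetical placeholders, not the real Qwen3.6-27B values (llama-server prints the actual ones at startup), and q8_0 is treated as ~1 byte per element versus 2 for f16.

```python
# Back-of-envelope KV-cache size for a given context length and cache quant.
# n_layers / n_kv_heads / head_dim are PLACEHOLDERS -- read the real values
# from the GGUF metadata that llama-server logs at load time.

def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Example with placeholder shape (48 layers, 8 KV heads, head_dim 128),
# 64k context, ~1 byte/element for q8_0 (-ctk q8_0 -ctv q8_0):
gb = kv_cache_bytes(65536, 48, 8, 128, 1) / 2**30
print(f"~{gb:.1f} GiB")  # q8_0 is roughly half the f16 cache footprint
```

Whatever the real shape is, the takeaway is linear scaling: halving the context or the bytes-per-element halves the cache, which is why q8_0 K/V quant is what makes 64k feasible next to the weights on 24 GB.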
I've also seen these recommendations: https://github.com/noonghunna/club-3090/blob/master/docs/SINGLE_CARD.md
They seem to be using the Q4 quant. How are you weighing the tradeoffs?
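One way to frame the Q5 vs. Q4 question is raw file size: the bits-per-weight figures below are approximate averages for K-quants (real GGUF sizes vary with the tensor mix), so treat this as a sketch of the headroom a lower quant buys, not exact numbers.

```python
# Rough weight-file sizes for a 27B-parameter model at different quants.
# Bits-per-weight values are APPROXIMATE; actual GGUF files differ slightly.
APPROX_BPW = {"Q5_K_S": 5.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def weights_gib(n_params, bpw):
    # total bits / 8 -> bytes, then convert to GiB
    return n_params * bpw / 8 / 2**30

for quant, bpw in APPROX_BPW.items():
    print(f"{quant}: ~{weights_gib(27e9, bpw):.1f} GiB")
```

On these estimates, dropping from Q5_K_S to Q4_K_M frees roughly 2 GiB, which on a 24 GB card is the difference between compacting often and fitting a noticeably larger context, at the cost of some accuracy.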