Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Note: I'm running the latest version of llama.cpp (commit b4c0549a49be9e6dc59ac9d0a5bc21dbda910774).
My run command:
```
llama-server \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--presence_penalty 0.0 \
--min-p 0.00 \
--gpu-layers all \
-m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
-a llama.cpp \
--host 0.0.0.0 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--chat-template-kwargs '{"preserve_thinking":true}' \
--flash-attn on
```
The built-in web UI shows that the context size is 137k.
After adding `--spec-type draft-mtp --spec-draft-n-max 2` to the command, the reported context size drops to 14k. Is this normal?
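For anyone who wants to compare the two runs without relying on the web UI, the reported context size can also be read from the server directly. This is a minimal sketch, not something from my setup: it assumes llama-server is listening on its default port 8080 and that the `/props` endpoint returns the context size under `default_generation_settings.n_ctx`. The field layout may differ between builds, and the helper names `extract_n_ctx` and `fetch_props` are made up for illustration.

```python
import json
import urllib.request


def extract_n_ctx(props: dict) -> int:
    """Pull the context size out of a /props payload.

    Assumption: the value lives at default_generation_settings.n_ctx;
    adjust the keys if your build reports it elsewhere.
    """
    return int(props["default_generation_settings"]["n_ctx"])


def fetch_props(base_url: str = "http://localhost:8080") -> dict:
    """GET /props from a running llama-server (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/props", timeout=10) as resp:
        return json.load(resp)


# Usage, with the server already running:
#   print("n_ctx:", extract_n_ctx(fetch_props()))
```

Running it once with the MTP flags and once without shows whether the drop comes from the server itself or only from how the UI displays it.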