r/LocalLLaMA

Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal?

Note: I'm on the latest version of llama.cpp (commit b4c0549a49be9e6dc59ac9d0a5bc21dbda910774).

My run command:

```
llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on
```

With this command, the built-in web UI reports a context size of 137k.

After adding `--spec-type draft-mtp --spec-draft-n-max 2` to the command, the reported context size drops to 14k. Is this normal?

submitted by /u/regunakyle