r/LocalLLaMA · June 5, 2026 · 1 min read

PSA: You may not need to quantize spec draft when using MTP


Using `--spec-draft-type-k q4_0 --spec-draft-type-v q4_0` might actually decrease your context size!

With the quantized spec draft, my context size was 83200. Without it (i.e. using the default fp16 spec draft), it increased to 91648.
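
For scale, here is a quick check of the gap between those two numbers. It uses only the context sizes reported above and assumes nothing about how llama.cpp computes them. Skipping the quantized spec draft bought roughly 10% more context in this setup.

```python
# Context sizes reported in this post, in tokens
quantized_draft_ctx = 83200  # with --spec-draft-type-k q4_0 --spec-draft-type-v q4_0
default_draft_ctx = 91648    # with the default fp16 spec draft

# Absolute and relative gain from NOT quantizing the spec draft
gain = default_draft_ctx - quantized_draft_ctx
pct = 100 * gain / quantized_draft_ctx
print(f"{gain} extra tokens ({pct:.1f}% more context)")
```

Your numbers will depend on the model, the GPU, and the other flags you pass, so it is worth measuring both configurations on your own setup.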

I reported this in a llama.cpp discussion, and am17an (the GOAT behind MTP in llama.cpp) confirmed that my findings are expected.
