r/LocalLLaMA

PSA: You may not need to quantize spec draft when using MTP


Using `--spec-draft-type-k q4_0 --spec-draft-type-v q4_0` might actually decrease your context size!

With a quantized spec draft, my context size was 83200. Without it (i.e. using the default fp16 spec draft), my context size increased to 91648.
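To put the two numbers side by side, here is a quick bit of arithmetic in Python. It only uses the two context sizes reported above; the variable names are mine, and the result is specific to my setup, so yours may differ.

```python
# Context sizes reported in this post (tokens).
ctx_quantized_draft = 83200   # with --spec-draft-type-k q4_0 --spec-draft-type-v q4_0
ctx_default_draft = 91648     # with the default fp16 spec draft

# Extra context gained by NOT quantizing the spec draft.
extra_tokens = ctx_default_draft - ctx_quantized_draft
pct_gain = 100 * extra_tokens / ctx_quantized_draft

print(f"extra context: {extra_tokens} tokens ({pct_gain:.1f}% more)")
```

So on this one configuration, dropping the two flags gave roughly a tenth more context.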

I reported this in a llama.cpp discussion, and am17an (the GOAT behind MTP in llama.cpp) confirmed that this behavior is expected:

https://github.com/ggml-org/llama.cpp/discussions/24102

submitted by /u/regunakyle