PSA: You may not need to quantize spec draft when using MTP
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Using `--spec-draft-type-k q4_0 --spec-draft-type-v q4_0` might actually decrease your context size!
With a quantized spec draft, my context size was 83200. Without it (i.e. using the default fp16 spec draft), it increased to 91648.
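To put those two numbers in perspective, here is a quick back-of-the-envelope sketch of the difference. It only uses the two context sizes reported above and assumes they are measured in tokens. The helper name `context_gain` is made up for illustration and is not part of llama.cpp.

```python
# Context sizes reported in the post (assumed to be in tokens).
CTX_QUANTIZED_DRAFT = 83200  # with --spec-draft-type-k q4_0 --spec-draft-type-v q4_0
CTX_DEFAULT_DRAFT = 91648    # with the default fp16 spec draft


def context_gain(quantized: int, default: int) -> tuple[int, float]:
    """Return (extra tokens, percent gain) from dropping draft quantization.

    Hypothetical helper for illustration only.
    """
    extra = default - quantized
    return extra, 100.0 * extra / quantized


extra_tokens, percent = context_gain(CTX_QUANTIZED_DRAFT, CTX_DEFAULT_DRAFT)
print(f"{extra_tokens} extra tokens ({percent:.2f}% more context)")
# prints "8448 extra tokens (10.15% more context)"
```

So on this setup, leaving the spec draft at fp16 gave roughly 10% more context. Your numbers will likely differ depending on model, hardware, and other settings.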
I reported this in a llama.cpp discussion, and am17an (the GOAT behind MTP in llama.cpp) confirmed that my findings are expected.