llama.cpp docker images to run MTP models
Mirrored from r/LocalLLaMA for archival readability.
This is follow up from previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/
There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running them easier. If you are already using llama.cpp Docker images, it would be straightforward to switch over until official builds support MTP.
Here, pick your flavour:
- havenoammo/llama:cuda13-server
- havenoammo/llama:cuda12-server
- havenoammo/llama:vulkan-server
- havenoammo/llama:intel-server
- havenoammo/llama:rocm-server
I have not been able to test all of them, as I only run cuda13 for now. Feel free to give it a test and see if it works for your hardware.
Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:
- https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
- https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
I believe they quantize some of the MTP layers more aggressively. I kept mine at Q8_0 for better draft prediction. It is possible that keeping the MTP layers at higher precision makes their predictions more accurate, giving you more speed at the cost of more VRAM. I will keep my versions up for now, until I finish some benchmarks and am sure they are fully obsolete.
Quick edit: They do quantize MTP layers at lower quantization levels. Here is a comparison:
| Tensor | havenoammo (UD XL + Q8_0 MTP) | Unsloth (UD XL) |
|---|---|---|
| blk.64.attn_k.weight | Q8_0 | Q3_K |
| blk.64.attn_k_norm.weight | F32 | F32 |
| blk.64.attn_norm.weight | F32 | F32 |
| blk.64.attn_output.weight | Q8_0 | Q4_K |
| blk.64.attn_q.weight | Q8_0 | Q3_K |
| blk.64.attn_q_norm.weight | F32 | F32 |
| blk.64.attn_v.weight | Q8_0 | Q5_K |
| blk.64.ffn_down.weight | Q8_0 | Q4_K |
| blk.64.ffn_gate.weight | Q8_0 | Q3_K |
| blk.64.ffn_up.weight | Q8_0 | Q3_K |
| blk.64.nextn.eh_proj.weight | Q8_0 | Q8_0 |
| blk.64.nextn.enorm.weight | F32 | F32 |
| blk.64.nextn.hnorm.weight | F32 | F32 |
| blk.64.nextn.shared_head_norm.weight | F32 | F32 |
| blk.64.post_attention_norm.weight | F32 | F32 |
| MTP layers size | 430.41 MB | 222.33 MB |
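As a rough sanity check on that size gap: llama.cpp's quant formats have fixed block layouts, so each format costs a fixed number of bits per weight (Q8_0 stores 32 int8 weights plus an fp16 scale; the K-quants use 256-weight superblocks). A quick back-of-the-envelope sketch:

```python
# Approximate bits-per-weight for the quant formats in the table,
# computed from their block layouts: bytes per block / weights per block * 8.
BLOCK_LAYOUTS = {
    "Q8_0": (34, 32),    # 32x int8 + 2-byte fp16 scale
    "Q3_K": (110, 256),  # 256-weight superblock
    "Q4_K": (144, 256),
    "Q5_K": (176, 256),
}

def bits_per_weight(fmt: str) -> float:
    bytes_per_block, weights_per_block = BLOCK_LAYOUTS[fmt]
    return bytes_per_block * 8 / weights_per_block

for fmt in BLOCK_LAYOUTS:
    print(f"{fmt}: {bits_per_weight(fmt):.4f} bpw")
```

Q8_0 at 8.5 bpw versus mostly Q3_K/Q4_K at roughly 3.4 to 4.5 bpw lines up with the near-2x difference in MTP layer size above (430.41 MB vs 222.33 MB).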
Will do some benchmarks to see if the lower-precision MTP layers cause any precision or speed loss for multi-token prediction. Until then, if you have the VRAM, feel free to test out my releases.
- https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
- https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF
Finally, here is how I use it:
```shell
docker run --gpus all --rm \
  -p 8080:8080 \
  -v ./models:/models \
  havenoammo/llama:cuda13-server \
  -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -n -1 \
  --parallel 1 \
  --ctx-size 262144 \
  --fit-target 844 \
  --mmap \
  -ngl -1 \
  --flash-attn on \
  --metrics \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --ubatch-size 512 \
  --batch-size 2048 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

Adjust as you see fit. What matters most for MTP is `--spec-type mtp` and `--spec-draft-n-max 3`.
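Once the container is up, the server exposes llama.cpp's OpenAI-compatible HTTP API, and MTP speculative decoding is transparent to clients. A minimal client sketch, assuming the server is reachable on localhost:8080 (the prompt is a placeholder, and the sampling values just mirror the server flags above):

```python
import json
import urllib.request

SERVER = "http://localhost:8080"  # adjust to wherever the container is published

def build_payload(messages, temperature=1.0, top_p=0.95):
    # Per-request sampling values override the defaults passed to llama-server.
    return {
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(messages, **kwargs):
    """POST a chat completion to the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(build_payload(messages, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat([{"role": "user", "content": "Hello"}])  # requires the running server
```

Any OpenAI-style client library pointed at `http://localhost:8080/v1` should work the same way.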