Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey everyone,
I'm running Qwen3.6-MTP-27B (Q4_K_M) with the llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec.
I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output quality.
55 TPS seems lower than I expected for MTP on a V100, but I may be missing something obvious.
Current command:
llama-server \
  -m ../NewModels/Qwen3.6-MTP-27B-Q4_K_M.gguf \
  --port 9932 \
  --host 0.0.0.0 \
  -ngl 65 \
  --reasoning-budget 0 \
  --ctx-size 262144 \
  --parallel 2 \
  --no-mmproj \
  --cont-batching \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 \
  --spec-ngram-mod-n-max 64 \
  --chat-template-kwargs '{"enable_thinking":false}'

Hardware:
- GPU: Tesla V100 (32GB)
- llama.cpp: (latest commit)
- Model: Qwen3.6-MTP-27B-Q4_K_M.gguf
A few questions:
- Is 55 TPS roughly what you'd expect from a V100 with this setup?
- Are any of my current flags suboptimal?
- Has anyone benchmarked different values for:
  - --parallel
  - --spec-draft-n-max
  - KV cache quantization
  - MTP settings
- Is my very large --ctx-size 262144 hurting generation speed even when conversations are short?
- Any recent llama.cpp optimizations that significantly improved throughput on V100s?
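To make the comparison concrete, here is a rough sketch of the sweep I have in mind. It is a dry run that only prints one command per combination, so each can be started by hand and timed with the same prompt. The flag names are copied from my command above; the value lists (1 and 2 slots, 1 to 3 draft tokens, q4_0 vs q8_0 cache) are assumptions picked for illustration, and sweep_commands is just a hypothetical helper name. The printed commands are trimmed to the flags being varied plus the model, offload, flash-attention and context flags; the rest of my original flags would be appended unchanged.

```shell
#!/bin/sh
# Dry run: print one llama-server command per flag combination.
# Nothing is launched; the value lists are illustrative assumptions.
MODEL="../NewModels/Qwen3.6-MTP-27B-Q4_K_M.gguf"

sweep_commands() {
  for parallel in 1 2; do            # number of server slots
    for draft_n in 1 2 3; do         # max drafted tokens per step
      for kv in q4_0 q8_0; do        # KV cache quantization type
        echo "llama-server -m $MODEL --port 9932 -ngl 65 --flash-attn on" \
             "--ctx-size 262144 --parallel $parallel" \
             "--cache-type-k $kv --cache-type-v $kv" \
             "--spec-type draft-mtp --spec-draft-n-max $draft_n"
      done
    done
  done
}

sweep_commands
```

That gives 2 x 3 x 2 = 12 configurations. Changing one flag at a time from a fixed baseline would be cheaper, but the full grid also shows whether the settings interact.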
Would appreciate benchmark numbers from anyone running Qwen3.6 27B (or similar 30B-class models) on V100, A100, 3090, 4090, etc.
Note: I only hit 55 TPS once, on the first attempt; on average it's 44-48 TPS.
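Since a single peak run can be misleading, it probably makes sense to compare configurations on the mean over several runs. A minimal sketch, assuming the tokens/sec figure for each run has already been collected one per line (for example, copied from the server's timing output); tps_summary is a hypothetical helper name, and the three sample values are illustrative, not a real log.

```shell
#!/bin/sh
# Summarize tokens/sec samples read from stdin, one value per line.
tps_summary() {
  awk '
    NF == 0 { next }                       # skip blank lines
    {
      v = $1 + 0                           # force numeric interpretation
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      sum += v; n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "runs=%d mean=%.2f min=%.2f max=%.2f\n", n, sum / n, min, max
    }'
}

# Illustrative samples: one peak run and two typical runs.
printf '55\n44\n48\n' | tps_summary
# prints: runs=3 mean=49.00 min=44.00 max=55.00
```

Reporting min and max alongside the mean makes it obvious when one outlier, like my first 55 TPS run, is pulling the headline number up.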
Thanks!