r/LocalLLaMA · June 10, 2026 · 1 min read

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey everyone,

I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec.

I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output quality.

55 TPS seems lower than I expected for MTP on a V100, but I may be missing something obvious.

Current command:

llama-server \ -m ../NewModels/Qwen3.6-MTP-27B-Q4_K_M.gguf \ --port 9932 \ --host 0.0.0.0 \ -ngl 65 \ --reasoning-budget 0 \ --ctx-size 262144 \ --parallel 2 \ --no-mmproj \ --cont-batching \ --flash-attn on \ --cache-type-k q4_0 \ --cache-type-v q4_0 \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ --spec-type ngram-mod \ --spec-ngram-mod-n-match 24 \ --spec-ngram-mod-n-max 64 \ --chat-template-kwargs '{"enable_thinking":false}'

Hardware:

GPU: Tesla V100 (32GB)
llama.cpp: (latest commit)
Model: Qwen3.6-MTP-27B-Q4_K_M.gguf

A few questions:

Is 55 TPS roughly what you'd expect from a V100 with this setup?
Are any of my current flags suboptimal?
Has anyone benchmarked different values for:
- --parallel
- --spec-draft-n-max
- KV cache quantization
- MTP settings
Is my very large --ctx-size 262144 hurting generation speed even when conversations are short?
Any recent llama.cpp optimizations that significantly improved throughput on V100s?

Would appreciate benchmark numbers from anyone running Qwen3.6 27B (or similar 30B-class models) on V100, A100, 3090, 4090, etc.

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?

Discussion (0)

More from r/LocalLLaMA