r/LocalLLaMA

Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss?


Hey everyone,

I'm running Qwen3.6-MTP-27B (Q4_K_M) with the llama.cpp server (llama-server) on a Tesla V100, and I'm currently getting around 55 tokens/sec.

I'm trying to find out whether there are any configuration changes that could increase throughput further without reducing output quality.

55 TPS seems lower than I expected for MTP on a V100, but I may be missing something obvious.

Current command:

llama-server \
  -m ../NewModels/Qwen3.6-MTP-27B-Q4_K_M.gguf \
  --port 9932 \
  --host 0.0.0.0 \
  -ngl 65 \
  --reasoning-budget 0 \
  --ctx-size 262144 \
  --parallel 2 \
  --no-mmproj \
  --cont-batching \
  --flash-attn on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 \
  --spec-ngram-mod-n-max 64 \
  --chat-template-kwargs '{"enable_thinking":false}'

Hardware:

  • GPU: Tesla V100 (32GB)
  • llama.cpp: (latest commit)
  • Model: Qwen3.6-MTP-27B-Q4_K_M.gguf

A few questions:

  1. Is 55 TPS roughly what you'd expect from a V100 with this setup?
  2. Are any of my current flags suboptimal?
  3. Has anyone benchmarked different values for:
    • --parallel
    • --spec-draft-n-max
    • KV cache quantization
    • MTP settings
  4. Is my very large --ctx-size 262144 hurting generation speed even when conversations are short?
  5. Any recent llama.cpp optimizations that significantly improved throughput on V100s?
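
For question 4, a rough way to see how much VRAM the KV cache could reserve at a given --ctx-size is sketched below. This is only back-of-the-envelope arithmetic: the layer count, KV head count, and head dimension in the example call are placeholders, not the real Qwen3.6 values, and any layers that do not use standard attention would change the result. The one fixed input is the block layout of the cache types (q4_0 stores 32 values in 18 bytes, q8_0 stores 32 values in 34 bytes).

```python
# Approximate bytes per cached element for a few llama.cpp cache types.
# q4_0: 32 values packed into 18 bytes (16 bytes of 4-bit values + one fp16 scale).
# q8_0: 32 values packed into 34 bytes (32 int8 values + one fp16 scale).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate total KV cache size in bytes for n_ctx cached tokens."""
    elems_per_token = n_layers * n_kv_heads * head_dim  # per K (and per V)
    bytes_per_token = elems_per_token * (
        BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v]
    )
    return int(n_ctx * bytes_per_token)


# HYPOTHETICAL architecture values, for illustration only.
EXAMPLE_ARCH = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(262144, type_k=cache_type, type_v=cache_type,
                          **EXAMPLE_ARCH)
    print(f"{cache_type}: {size / 2**30:.1f} GiB")
```

Swapping in the real values from the GGUF metadata and comparing the result with the 32 GB on the card would show how much headroom is left next to the weights. It says nothing by itself about generation speed on short conversations.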

Would appreciate benchmark numbers from anyone running Qwen3.6 27B (or similar 30B-class models) on V100, A100, 3090, 4090, etc.

Note: I hit 55 TPS only once, on the first attempt; the average is 44-48 TPS.
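
Since a single run can be an outlier, comparing flag changes is easier with the min, median, and max over several runs. The sketch below assumes llama-server's /completion endpoint returns a `timings` object with `predicted_n` and `predicted_ms` fields; that is my understanding of the server's response format and should be checked against the build in use. The port comes from the command above, while the prompt and run count are arbitrary.

```python
import json
import statistics
import urllib.request


def generation_tps(timings):
    """Generation speed in tokens/sec from a `timings` dict (assumed fields)."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def summarize(tps_values):
    """Min, median, and max of a list of tokens/sec measurements."""
    return {
        "min": min(tps_values),
        "median": statistics.median(tps_values),
        "max": max(tps_values),
    }


def run_once(prompt, n_predict=256, url="http://127.0.0.1:9932/completion"):
    """Send one completion request and return its `timings` object."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["timings"]


def benchmark(prompt, runs=5):
    """Run several requests back to back and summarize generation speed."""
    return summarize([generation_tps(run_once(prompt)) for _ in range(runs)])
```

Calling `benchmark("Explain speculative decoding in one paragraph.")` with the server running would give one summary per configuration. Speculative decoding speedups depend on how predictable the output is, so the same prompt should be used for every configuration being compared.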

Thanks!

submitted by /u/abubakkar_s