r/LocalLLaMA · May 30, 2026 · 1 min read

Can't get over 250TPS on RTX5090 with Qwen3.5-4B

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.

I'm building llama.cpp and running on docker windows. I've also tried havenoammo/llama:cuda13-server, and get exactly the same performance so I think my build flags are OK. I've also tested with LM Studio and performance is similar.

I think I should be getting much better performance out of a tiny 4B model on an RTX5090, and have tried everything I can think of, and still there's a bottleneck somewhere.

GPU use is low(ish), around 50%, and CPU is basically idle.

My docker-compose.yml:

llama2: image: havenoammo/llama:cuda13-server container_name: llama-cuda13-3 runtime: nvidia deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] ports: - "8081:8080" volumes: - E:\user\Documents\LM Studio Models\unsloth:/models - ./model2.ini:/app/models.ini environment: - NVIDIA_VISIBLE_DEVICES=all command: > --models-preset /app/models.ini --port 8080 --host 0.0.0.0 -t 8 -n -1 restart: unless-stopped

and models2.ini:

version = 1 [*] n-gpu-layers = -1 batch-size = 4096 ubatch-size = 4096 jinja = true cache-type-k = q8_0 cache-type-v = q8_0 perf = true metrics = true parallel = 4 cont-batching = true kv-unified = true ctx-checkpoints = 8 [qwen3.5-4b] load-on-startup = true model = /models/Qwen3.5-4B-GGUF/Qwen3.5-4B-Q4_K_S.gguf ; mmproj = /models/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf ctx-size = 32000 chat-template-kwargs = {} reasoning = off temp = 1 top-p = 1 top-k = 20 min-p = 0.0 presence-penalty = 2.0 repeat-penalty = 1.0 flash-attn = on

submitted by /u/luckyj
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA