Can't get over 250TPS on RTX5090 with Qwen3.5-4B
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.
I'm building llama.cpp and running on docker windows. I've also tried havenoammo/llama:cuda13-server, and get exactly the same performance so I think my build flags are OK. I've also tested with LM Studio and performance is similar.
I think I should be getting much better performance out of a tiny 4B model on an RTX5090, and have tried everything I can think of, and still there's a bottleneck somewhere.
GPU use is low(ish), around 50%, and CPU is basically idle.
My docker-compose.yml:
llama2: image: havenoammo/llama:cuda13-server container_name: llama-cuda13-3 runtime: nvidia deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] ports: - "8081:8080" volumes: - E:\user\Documents\LM Studio Models\unsloth:/models - ./model2.ini:/app/models.ini environment: - NVIDIA_VISIBLE_DEVICES=all command: > --models-preset /app/models.ini --port 8080 --host 0.0.0.0 -t 8 -n -1 restart: unless-stopped and models2.ini:
version = 1 [*] n-gpu-layers = -1 batch-size = 4096 ubatch-size = 4096 jinja = true cache-type-k = q8_0 cache-type-v = q8_0 perf = true metrics = true parallel = 4 cont-batching = true kv-unified = true ctx-checkpoints = 8 [qwen3.5-4b] load-on-startup = true model = /models/Qwen3.5-4B-GGUF/Qwen3.5-4B-Q4_K_S.gguf ; mmproj = /models/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf ctx-size = 32000 chat-template-kwargs = {} reasoning = off temp = 1 top-p = 1 top-k = 20 min-p = 0.0 presence-penalty = 2.0 repeat-penalty = 1.0 flash-attn = on [link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.