
$1800 in GPU cost: running Qwen/Qwen3.6-27B-FP8 with P2P, 262K context, and BF16 KV cache at 55 tok/s

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey peeps, wanted to share what's possible for folks with an inference-only, single-user use case on about $1700 to $1900 in GPU cost.

Setup: 4x RTX 5060 Ti (16 GB) with P2P

If you are in the US and keep an eye on Facebook Marketplace and places like Slickdeals, you can find used 5060 Ti 16 GB cards for $425 to $475 each.

A giant caveat: this type of configuration is only viable if you're interested strictly in inference.

The vLLM command used:

export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NCCL_P2P_DISABLE=0
export NCCL_CUMEM_ENABLE=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True
# dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --tensor-parallel-size 4 \
  --performance-mode interactivity \
  --trust-remote-code \
  --language-model-only \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144 \
  --kv-cache-dtype bfloat16 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
  --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --enable-prefix-caching
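For a rough sense of why this fits on four 16 GB cards, here is a back-of-the-envelope memory budget. It is only a sketch: it assumes exactly 27B parameters (read off the model name), assumes every weight is stored as 1-byte FP8 (real FP8 checkpoints often keep some layers in higher precision), and treats 16 GB as fully addressable. The GPU count and utilization fraction come from the command above.

```python
# Rough per-GPU memory budget for the serve command above.
NUM_GPUS = 4            # --tensor-parallel-size 4
GPU_MEM_GB = 16.0       # nominal card capacity; usable memory is a bit lower
GPU_MEM_UTIL = 0.92     # --gpu-memory-utilization 0.92
PARAMS_BILLION = 27.0   # ASSUMPTION: parameter count taken from the model name
BYTES_PER_PARAM = 1.0   # ASSUMPTION: all weights stored as FP8 (1 byte each)

# Memory vLLM is allowed to claim on each card.
budget_per_gpu = GPU_MEM_GB * GPU_MEM_UTIL

# Tensor parallelism splits the weights roughly evenly across the cards.
weights_per_gpu = PARAMS_BILLION * BYTES_PER_PARAM / NUM_GPUS

# Whatever remains is shared by KV cache, activations, CUDA graphs, etc.
left_per_gpu = budget_per_gpu - weights_per_gpu

print(f"budget per GPU:  {budget_per_gpu:.2f} GB")
print(f"weights per GPU: {weights_per_gpu:.2f} GB")
print(f"left per GPU:    {left_per_gpu:.2f} GB ({left_per_gpu * NUM_GPUS:.2f} GB total)")
```

Under those assumptions, a bit under 8 GB per card is left for everything that is not weights. Whether a full 262K-token BF16 KV cache fits in that space depends on the model's attention layout, which this sketch does not model.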

Benchmark command:

vllm bench serve \
  --backend vllm \
  --base-url http://localhost:8080 \
  --endpoint /v1/completions \
  --model /data/models/Qwen/Qwen3.6-27B-FP8 \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 1024 \
  --num-prompts 40 \
  --max-concurrency 1 \
  --num-warmups 5 \
  --ignore-eos \
  --seed 1234 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --save-result \
  --result-filename qwen36_c1_4k.json

============ Serving Benchmark Result ============
Successful requests:                     40
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  735.75
Total input tokens:                      163840
Total generated tokens:                  40960
Request throughput (req/s):              0.05
Output token throughput (tok/s):         55.67
Peak output token throughput (tok/s):    25.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          278.36
---------------Time to First Token----------------
Mean TTFT (ms):                          4226.91
Median TTFT (ms):                        4315.47
P99 TTFT (ms):                           4320.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.85
Median TPOT (ms):                        13.44
P99 TPOT (ms):                           25.61
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.91
Median ITL (ms):                         40.84
P99 ITL (ms):                            41.59
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18393.49
Median E2EL (ms):                        17991.18
P99 E2EL (ms):                           30508.70
---------------Speculative Decoding---------------
Acceptance rate (%):                     65.25
Acceptance length:                       2.96
Drafts:                                  13853
Draft tokens:                            41559
Accepted tokens:                         27116
Per-position acceptance (%):
  Position 0:                            78.29
  Position 1:                            64.14
  Position 2:                            53.31
==================================================
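The headline numbers can be re-derived from the raw counts in the report, which is a handy sanity check when comparing runs. The sketch below uses only values copied from the output above. The formulas are my reading of how these metrics relate (for example, acceptance length as one target token plus the accepted draft tokens per draft), not something taken from the vLLM source.

```python
# Raw values copied from the benchmark report above.
duration_s = 735.75
input_tokens = 163840
output_tokens = 40960
drafts = 13853
draft_tokens = 41559
accepted_tokens = 27116
mean_tpot_ms = 13.85
mean_itl_ms = 40.91

# Throughput over the whole run (prefill time included).
output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s

# Speculative decoding: share of drafted tokens that were accepted.
acceptance_rate = 100 * accepted_tokens / draft_tokens

# ASSUMPTION: each draft step yields 1 verified token plus the accepted drafts.
acceptance_length = 1 + accepted_tokens / drafts

# Decode-only speed implied by the mean time per output token.
decode_tps = 1000 / mean_tpot_ms

# Ratio of inter-token latency to per-token time.
itl_over_tpot = mean_itl_ms / mean_tpot_ms

print(f"output tok/s:      {output_tps:.2f}")
print(f"total tok/s:       {total_tps:.2f}")
print(f"acceptance rate:   {acceptance_rate:.2f} %")
print(f"acceptance length: {acceptance_length:.2f}")
print(f"decode-only tok/s: {decode_tps:.1f}")
print(f"ITL / TPOT:        {itl_over_tpot:.2f}")
```

Two things stand out. The 55.67 tok/s figure includes the roughly 4.2 s of prefill per request; the mean TPOT alone works out to about 72 tok/s during decode. And the ITL-to-TPOT ratio lands close to the acceptance length, which would be consistent with each decode step streaming about three tokens at once, though that interpretation is a guess from the numbers.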

Note: I forgot I had --max-num-seqs set to 4, but I benchmarked with a concurrency of 1.

submitted by /u/joorklee