Two old RTX 2080 Ti with 22GB VRAM each: Qwen3.6 27B at 38 tokens/s with f16 KV cache
Mirrored from r/LocalLLaMA.
PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (I hate noise).

Just wanted to share my current setup, since it might help some users out there. This is my exact config: my two extremely old 2080 Ti GPUs were upgraded in China to 22GB VRAM each, and on eBay I bought an NVLink bridge (I do not recommend buying it, as it makes no measurable difference).

The quantisation I run is IQ4_XS. If I set the KV cache to q8_0, the model sometimes loops during long coding sessions, which is why I run the KV cache at f16 and have never had that problem since. I use hauhaucs' uncensored Qwen3.6 model in IQ4 imatrix quants. You can also forget about MTP, as with these cards you are compute bound, not bandwidth bound.

The absolute biggest boost came from --split-mode tensor, which took me from 14 tokens/s to 38 tokens/s. I think without the power limit we would get 45 tokens/s.

Btw, this is a < 1k USD setup running at 400W peak at the wall, and it works great with Hermes and opencode. The jinja template I use is this one: [link]
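For anyone wanting to reproduce this, here is a minimal sketch of how the pieces could fit together with a llama.cpp-style llama-server build. The model filename, context size, and port are placeholders, not from the post; --split-mode tensor is quoted from the author (mainline llama.cpp documents none/layer/row, so check your build's --help), and the nvidia-smi power-limit commands are standard but require root.

```bash
# Cap both cards at 150W, as the post describes (requires root).
sudo nvidia-smi -pm 1          # persistence mode so the limit sticks
sudo nvidia-smi -i 0 -pl 150
sudo nvidia-smi -i 1 -pl 150

# Launch llama-server with the settings from the post:
#   IQ4_XS weights, f16 KV cache (q8_0 caused looping in long sessions),
#   tensor split mode (the 14 -> 38 t/s change), jinja chat template.
# Model filename, context size, template path, and port are placeholders.
llama-server \
  -m ./Qwen3.6-27B-IQ4_XS.gguf \
  -ngl 99 \
  --split-mode tensor \
  --cache-type-k f16 \
  --cache-type-v f16 \
  -c 32768 \
  --jinja \
  --chat-template-file ./template.jinja \
  --host 0.0.0.0 --port 8080
```

If your build rejects `tensor`, `--split-mode row` is the closest mainline option: it splits individual tensors across the GPUs instead of assigning whole layers, which is what makes both cards compute in parallel rather than taking turns.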