Combined RTX5080 & 4060 for inference ?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| Hey, I currently use my RTX 4060 8G for inference with Qwen 3.6-35B-A3B Q8 (q8 for everything weight,value,key) max 60k context per agent (for quality over speed, with CPU &DDR4 offloading) but :
In a few months I plan to add a 2nd GPU to the rig by moving the 4060 from it's current PCIE 5-x16 to PCIE 3-x16 and adding the new GPU on the PCIE 5-x16 slot. My budget for the upgrade (GPU + new powersupply) is in the 1500-2000€ but I'd be much more comfortable in the lower half of that range. TLDRI'm thinking of :
Questions: A. Does anyone use a comparable setup (very fast 16GB card + slower 8GB) and could tell me their stats with Qwen 27B specifying split type, MTP used or not, quants & context size please ? Its certain the bottleneck will be the 4060, but I'm uncertain how badly it will be. B. Even if you don't have one, do you think the proposed setup would work well for llama.cpp (or vllm) ? If not what would you recommend instead ? C. Even if your setup is not exactly comparable, but you have multiple GPUs, do you use llama.cpp or vllm : C.1. when using only one session at a time (no subagents) ? C.2. when hosting your own subagents (maybe only one running at a time still, but there's more KV to hold) ? D. On splitting weights between 2 cards there are 2 ways to do it, either layer or tensor. Layer is slower but does not depend on PCIE speed and tensor split can be quicker with good PCIE speed. Any tips and tricks from people having done this with some really asymmetrical GPUs ? E. For those that have 24GB VRAM total, what quantization of weights, key values do you use for QW3.6 27B and how much context do you manage to have with it ? F. For those that have R9700, are the real performance really that bad ? Only ~30% better pp & 50% better tg with R9700 than with my 300$ 4060 ? Or is it a pb with benchmarks being old (newer versions ROCM...) or performance being much better on recent models ? More details
(Also, AMD is apparently known for not working super well with VR so not really .
Benchmarks in question on older llama models :[link] [comments] |
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.