r/LocalLLaMA · · 4 min read

Combining an RTX 5080 and a 4060 for inference?

Hey, I currently run inference on my RTX 4060 8GB with Qwen 3.6-35B-A3B at Q8 (Q8 for everything: weights, K cache and V cache), with a maximum of 60k context per agent (quality over speed, with offloading to CPU and DDR4 RAM), but:

  1. I only get ~100 pp (prompt processing) and 20 tg (token generation) tokens/s at best, while the context is still small, so I'd like to increase this speed. (Q4 weights only got me to ~30 tg, so I preferred to keep the quality.)
  2. I'd like to move to Qwen 27B (at least Q4-Q6) for more quality, with at least 20 tg but hopefully 30-40+.
  3. I also play PCVR games, which are very demanding and can't use multiple GPUs, so I need one big GPU, not several small ones.
  4. My motherboard (Asus ProArt B660-CREATOR D4) only has 2 usable PCIe slots: PCIe 5.0 x16 and PCIe 3.0 x16 (technically there's a third, PCIe 3.0 x1, but it doesn't seem worth it). Apparently PCIe 3.0 x16 is equivalent in speed to PCIe 4.0 x8.
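
For anyone wanting to check the slot equivalence in point 4, here is a rough Python sketch of the usual link-bandwidth arithmetic. It assumes the published per-lane transfer rates and 128b/130b encoding for PCIe 3.0 to 5.0, ignores protocol overhead, and takes the lane counts at face value. A slot can be physically x16 while being wired for fewer lanes, and a card only uses the lanes it has itself, so the board manual and GPU spec sheet are worth checking before trusting these numbers.

```python
# Approximate one-direction PCIe link bandwidth.
# Transfer rates are in GT/s per lane; PCIe 3.0, 4.0 and 5.0 all use
# 128b/130b line encoding. Protocol overhead is ignored here.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Theoretical bandwidth in GB/s for a given generation and lane count."""
    bits_per_second = GT_PER_S[gen] * 1e9 * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    # The lane counts below are the ones quoted in the post (assumed, not verified).
    for gen, lanes in [(5, 16), (3, 16), (4, 8), (3, 1)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

On paper, PCIe 3.0 x16 and PCIe 4.0 x8 come out identical, which matches the claim above as long as the slot and the card really negotiate those widths.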

In a few months I plan to add a 2nd GPU to the rig by moving the 4060 from its current PCIe 5.0 x16 slot to the PCIe 3.0 x16 slot and putting the new GPU in the PCIe 5.0 x16 slot.

My budget for the upgrade (GPU + new power supply) is in the 1500-2000€ range, but I'd be much more comfortable in the lower half of that range.

TLDR

I'm thinking of:

  • RTX 5080 on PCIe 5.0 x16 + RTX 4060 on PCIe 3.0 x16
  • Using only the 5080 in games.
  • Using both with llama.cpp or vLLM, splitting by tensor (if that's faster for me, otherwise by layer) across the two cards to get 24GB of VRAM in total.
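
To make the llama.cpp option concrete, here is a small Python sketch that assembles a `llama-server` command for a two-GPU split. The flags used (`-m`, `-ngl`, `--split-mode`, `--tensor-split`, `--main-gpu`, `-c`, `--cache-type-k`, `--cache-type-v`) are existing llama.cpp options, but the model filename is hypothetical, the 16:8 ratio is only a starting guess based on VRAM size, and which card ends up as device 0 depends on the driver's device ordering. Quantized V cache also needs flash attention enabled, and how that is switched on differs between llama.cpp builds, so it is left out here.

```python
import shlex


def llama_server_args(model_path, vram_gb, split_mode="layer", ctx=60000):
    """Build a llama-server argument list for a multi-GPU split.

    vram_gb: VRAM per GPU in device order, e.g. [16, 8]. The values are
             passed as proportions to --tensor-split.
    split_mode: "layer" (whole layers per GPU) or "row" (tensors split
                across GPUs, which leans more on PCIe bandwidth).
    """
    if split_mode not in ("layer", "row"):
        raise ValueError("split_mode must be 'layer' or 'row'")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                      # offload as many layers as possible
        "--split-mode", split_mode,
        "--tensor-split", ",".join(str(v) for v in vram_gb),
        "--main-gpu", "0",                 # assumed to be the 5080
        "-c", str(ctx),
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
    ]


if __name__ == "__main__":
    # Hypothetical model file; replace with the real GGUF path.
    args = llama_server_args("qwen-27b-q4_k_m.gguf", [16, 8], split_mode="layer")
    print(shlex.join(args))
```

Running it once with `split_mode="layer"` and once with `split_mode="row"` gives two commands to benchmark against each other on the actual hardware, which is probably the only reliable way to settle the tensor vs layer question for an asymmetric pair.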

Questions:

A. Does anyone use a comparable setup (a very fast 16GB card + a slower 8GB one) and could you share your numbers with Qwen 27B, specifying split type, whether MTP is used, quants and context size, please? It's certain the bottleneck will be the 4060, but I'm not sure how bad it will be.
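
As a way to frame how much the 4060 might hold things back, here is a back-of-the-envelope Python sketch for a layer split. It assumes a dense model, assumes token generation is limited purely by reading the weights from VRAM, and ignores KV cache reads, compute and PCIe transfers, so it gives a ceiling rather than a prediction. The bandwidth figures (960 GB/s for the 5080, 272 GB/s for the 4060) are spec-sheet values as I remember them, and the 16.4 GB weight size is an assumed figure for a 27B model at roughly Q4.

```python
def tg_ceiling(split_gb, bandwidth_gb_s):
    """Tokens/s ceiling for a dense model split by layers across GPUs.

    With a layer split the GPUs process each token one after the other,
    so the time per token is the sum of (weights on GPU / GPU bandwidth).
    """
    seconds_per_token = sum(gb / bw for gb, bw in zip(split_gb, bandwidth_gb_s))
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    BW_5080, BW_4060 = 960.0, 272.0   # GB/s, assumed spec-sheet values
    WEIGHTS_GB = 16.4                 # assumed size of 27B at ~Q4

    # 0.0 GB on the 4060 is a reference point only: the whole model
    # plus KV cache would not actually fit on a 16GB card.
    for on_4060 in (0.0, 2.0, 5.4):
        split = [WEIGHTS_GB - on_4060, on_4060]
        ceiling = tg_ceiling(split, [BW_5080, BW_4060])
        print(f"{on_4060:.1f} GB on the 4060 -> at most {ceiling:.1f} tok/s")
```

The point of the sketch is that every GB placed on the slower card costs about 3.5x as much time per token as a GB on the faster one, so putting as little as possible on the 4060 (via the split ratio) should matter more than the split being proportional to VRAM.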

B. Even if you don't have one, do you think the proposed setup would work well with llama.cpp (or vLLM)? If not, what would you recommend instead?

C. Even if your setup is not exactly comparable, if you have multiple GPUs, do you use llama.cpp or vLLM:

C.1. when using only one session at a time (no subagents)?

C.2. when hosting your own subagents (maybe still only one running at a time, but with more KV cache to hold)?

D. There are 2 ways to split weights between 2 cards: by layer or by tensor. Layer split is slower but does not depend much on PCIe speed, while tensor split can be quicker given good PCIe speed. Any tips and tricks from people who have done this with really asymmetrical GPUs?

E. For those who have 24GB of VRAM in total, what quantization do you use for the weights and the KV cache with Qwen 3.6 27B, and how much context do you manage to fit?

F. For those who have an R9700: is the real-world performance really that bad? Only ~30% better pp and 50% better tg than my $300 4060? Or is the problem that the benchmarks are old (newer ROCm versions...), or that performance is much better on recent models?

More details

  • At first I thought I'd replace the 4060 with an R9700 AI Pro, because I really would have liked 32GB of VRAM to be comfortable with Qwen 27B Q8 and a bit more future-proof, but I looked at llama.cpp benchmarks on old Llama models (links at the bottom of the post) and I was super disappointed (see image):
  • I can apparently only expect ~30% better pp and 50% better tg with the R9700, or the same pp and 2.6x faster tg with the 7900 XTX.
    • Given the price tag (I'm in Europe), such a weak improvement on the R9700 really does not seem worth it. So many people have been touting their purchase of this card lately, but the price vs performance really does not seem to be there according to those benchmarks??
    • The picture is better for the 7900 XTX (much faster tg, slightly slower pp than the R9700), but it's starting to get old, I'd have to find a used one that is neither a scam nor in bad condition, it has less VRAM and it's less future-proof.

(Also, AMD is apparently known for not working super well with VR so not really .

  • Looking at RTX numbers, of course the 5090 destroys everything (I was still a bit disappointed that it's only ~4x better than my current 4060 given the price difference...), but it's way out of budget.
  • The RTX 5080 looks like an amazing contender. 16GB alone would not allow me to run Qwen 27B at all, but it seems possible to split the model between 2 cards, so by just keeping my 4060 I'd have 24GB in total, which should be enough for Q4-Q6 27B, I think. Maybe by the time I buy, the rumored SUPER version with 24GB of VRAM will be out, which would be perfect, but otherwise this seems enough for my use case.
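
A quick way to sanity-check the "24GB should be enough for Q4-Q6" estimate is to compute the weight size from bits per weight. This Python sketch is only a rough guide: the bits-per-weight values are approximate averages for llama.cpp GGUF quants (real files vary because different tensors get different quant types), it treats 1 GB as 10^9 bytes while VRAM sizes are closer to GiB, and it says nothing about the KV cache or compute buffers that must fit in the leftover space.

```python
# Approximate average bits per weight for common GGUF quant types.
# These are rough figures; actual files differ by model and quant mix.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def headroom_gb(params_billion: float, quant: str, vram_gb: float) -> float:
    """VRAM left for KV cache and buffers; negative means it does not fit."""
    return vram_gb - weights_gb(params_billion, quant)


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = weights_gb(27, quant)
        for vram in (16, 24, 32):
            left = headroom_gb(27, quant, vram)
            verdict = "fits" if left > 0 else "does not fit"
            print(f"27B {quant}: {size:.1f} GB weights, {vram}GB VRAM: "
                  f"{verdict} ({left:+.1f} GB)")
```

Under these assumptions Q4 leaves several GB of a 24GB pool for context, Q6 leaves very little, and Q8 needs the 32GB class of card, so the usable context at Q6 is the number to ask people about.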

Benchmarks in question on older llama models :

submitted by /u/dry3ss