I picked up a 7900 XTX earlier. It runs qwen3.6-27b fine, but not to my liking: its compute performance is quite unstable for me. With MTP the decode speed can reach 40-60 t/s, but prefill is just too slow. Whether I used ROCm or Vulkan, the prefill speed varied between 300 t/s and 500 t/s, even with very long prompts.
I've been itching to try an ultra-budget 24GB setup using dual 3060s, and I managed to snag a second 3060 at a reasonable price in the last few days. So I took out the 7900 XTX, installed the 3060s, and began testing.
Test Configuration
- Test Platform: i7 4770k + Gigabyte GA-Z87MX-D3H
  - Quite an ancient platform, in use for over a decade. Interestingly, it supports SLI by splitting PCIe 3.0 x16 into two PCIe 3.0 x8 links when both slots are used. Newer motherboards don't seem to offer such a split, but many offer one full-speed PCIe 5.0 x16 slot plus one PCIe 4.0 x4 slot. PCIe 4.0 x4 has the same theoretical bandwidth as PCIe 3.0 x8 (about 7.9 GB/s per direction), so this old platform is on par with newer ones in terms of PCIe bottleneck.
  - The monitor is plugged into the motherboard and driven by the iGPU.
- OS: Kubuntu 24.04
- CUDA: 13.2
- Models:
  - unsloth/Qwen3.6-27B-MTP-GGUF
  - unsloth/Qwen3.6-27B-GGUF
- Quantization: Qwen3.6-27B-Q4_K_S.gguf
- Software: llama.cpp 5/25/2026 master, self-compiled with CUDA support (official pre-compiled Linux CUDA binaries are not available for download).
  - Prerequisite installation: `sudo apt install nvidia-cuda-toolkit`
- Settings (detailed config at the end of the post):
  - Tensor parallel: `-sm tensor -ts 1,1`
    - `-sm tensor` cannot be enabled at the same time as `-ctk` and `-ctv`. This means KV cache quantization cannot be used, limiting the context window to around 64k. I usually need a 160k context, so this is a bit frustrating.
  - MTP: `--spec-type draft-mtp --spec-draft-n-max 1`
    - `--spec-draft-n-max 2` can be unstable due to transient VRAM peaks causing OOM. Thanks u/laul_pogan for pointing this out.
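The PCIe equivalence mentioned for the test platform can be checked with a little arithmetic. This is a minimal sketch that uses the published per-lane signalling rates and 128b/130b line encoding; it ignores protocol overhead, so real-world transfer rates will be somewhat lower. The function name `pcie_bandwidth_gbs` is made up for this example.

```python
# Per-lane raw rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GEN = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (protocol overhead ignored)."""
    gt_per_s, efficiency = PCIE_GEN[gen]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen, lanes in [(3, 16), (3, 8), (4, 4), (5, 16)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

Under these assumptions PCIe 3.0 x8 and PCIe 4.0 x4 come out identical, which is the basis for treating the old board's x8/x8 split as comparable to a modern x4 secondary slot.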
Test Result
    2.16.262.271 I slot print_timing: id 0 | task 701 | prompt eval time = 3056.70 ms / 1394 tokens ( 2.19 ms per token, 456.05 tokens per second)
    2.16.262.276 I slot print_timing: id 0 | task 701 | eval time = 22538.95 ms / 975 tokens ( 23.12 ms per token, 43.26 tokens per second)
    2.16.262.277 I slot print_timing: id 0 | task 701 | total time = 25595.65 ms / 2369 tokens
    2.16.262.291 I slot print_timing: id 0 | task 701 | graphs reused = 1016
    2.16.262.292 I slot print_timing: id 0 | task 701 | draft acceptance = 0.77618 ( 593 accepted / 764 generated)
    2.16.262.310 I statistics draft-mtp: #calls(b,g,a) = 10 1038 1038, #gen drafts = 1038, #acc drafts = 959, #gen tokens = 2076, #acc tokens = 1792, dur(b,g,a) = 0.018, 8380.839, 3.772 ms
    2.16.263.267 I slot release: id 0 | task 701 | stop processing: n_tokens = 12343, truncated = 0
The initial peak speeds reached pp 600+ t/s and tg 50 t/s. At an actual context length of 12k, prompt processing (pp) hits 456.05 t/s, and text generation (tg) is at 43.26 t/s. This vastly exceeded my expectations. While it doesn't match the maximum peak speed of the 7900 XTX, the speed is incredibly stable, and the GPU utilization stays pegged at 100% for long durations. I have to say, CUDA is simply much more mature.
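For anyone who wants to pull these numbers out of their own logs, here is a rough sketch that recomputes the pp/tg rates and the draft acceptance from the raw milliseconds and token counts. The regular expressions are written against the log excerpt quoted in this post only; the log format of other llama.cpp builds may differ, and the helper names are invented for this example.

```python
import re

# Excerpt of the log lines quoted in the post (timestamps and slot ids trimmed).
LOG = """
prompt eval time = 3056.70 ms / 1394 tokens ( 2.19 ms per token, 456.05 tokens per second)
eval time = 22538.95 ms / 975 tokens ( 23.12 ms per token, 43.26 tokens per second)
draft acceptance = 0.77618 ( 593 accepted / 764 generated)
"""

TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
DRAFT = re.compile(r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)")


def throughput(log: str) -> dict[str, float]:
    """Tokens per second per phase, recomputed from elapsed ms and token counts."""
    rates = {}
    for phase, ms, tokens in TIMING.findall(log):
        rates[phase] = int(tokens) / (float(ms) / 1000.0)
    return rates


def draft_acceptance(log: str) -> float:
    """Fraction of drafted tokens that were accepted."""
    accepted, generated = map(int, DRAFT.search(log).groups())
    return accepted / generated


if __name__ == "__main__":
    for phase, rate in throughput(LOG).items():
        print(f"{phase}: {rate:.2f} t/s")
    print(f"draft acceptance: {draft_acceptance(LOG):.5f}")
```

The recomputed values agree with what the server prints, so the "prompt eval" rate is the pp figure and the "eval" rate is the tg figure used throughout this post.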
BTW, with MTP off, the context can be extended to 96k. The pp speed remains at 600+ t/s and the tg speed drops to 31 t/s, which is still quite decent.
| Scenario | Context Window | Prefill (pp) | Generation (tg) |
|---|---|---|---|
| MTP Initial Peak | 64k | 620 t/s | 50 t/s |
| MTP @ 32k | 64k | 482 t/s | 36.36 t/s |
| No MTP Initial Peak | 96k | 620 t/s | 31 t/s |
| No MTP @ 20k | 96k | 605 t/s | 29.10 t/s |
| No MTP @ 50k | 96k | 438 t/s | 26.59 t/s |
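To make the table easier to compare, the relative gains and losses can be computed directly from its rows. This is just a small sketch over the numbers above; the `ratio` helper is made up for this example and nothing here is measured separately.

```python
# scenario -> (prefill t/s, generation t/s), copied from the results table.
RESULTS = {
    "MTP Initial Peak": (620, 50.0),
    "MTP @ 32k": (482, 36.36),
    "No MTP Initial Peak": (620, 31.0),
    "No MTP @ 20k": (605, 29.10),
    "No MTP @ 50k": (438, 26.59),
}

PP, TG = 0, 1  # column indices into the tuples above


def ratio(a: str, b: str, column: int) -> float:
    """Value of scenario `a` relative to scenario `b` for the given column."""
    return RESULTS[a][column] / RESULTS[b][column]


if __name__ == "__main__":
    print(f"MTP tg gain at peak: {ratio('MTP Initial Peak', 'No MTP Initial Peak', TG):.2f}x")
    print(f"No-MTP pp at 50k vs peak: {ratio('No MTP @ 50k', 'No MTP Initial Peak', PP):.2f}x")
    print(f"No-MTP tg at 50k vs peak: {ratio('No MTP @ 50k', 'No MTP Initial Peak', TG):.2f}x")
```

By these numbers MTP gives roughly a 1.6x generation speedup at the start of a session, while a 50k-deep context costs about 30% of prefill speed and about 15% of generation speed without MTP.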
Conclusion
Cons
- SPLIT_MODE_TENSOR currently cannot be used alongside KV cache quantization, making 24GB feel a bit tight. However, this is definitely not a niche demand; simple Q8 quantization of the KV cache could roughly double the context, to about 128k with MTP or 192k without. The future looks promising.
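A back-of-the-envelope check on the "double the context" estimate, as a sketch under two assumptions: the VRAM available for the KV cache stays fixed, and KV cache size grows linearly with context length. It uses the ggml `q8_0` layout of 32 int8 values plus one 2-byte scale per block (34 bytes per 32 elements) against 2 bytes per element for f16. The function name is invented for this example.

```python
F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = 34 / 32  # 32 int8 values plus one 2-byte scale per block


def scaled_context(ctx_f16: int, bytes_per_elem: float = Q8_0_BYTES_PER_ELEM) -> int:
    """Context that fits in the same KV-cache memory, assuming size is linear in context."""
    return int(ctx_f16 * F16_BYTES_PER_ELEM / bytes_per_elem)


if __name__ == "__main__":
    for ctx in (64000, 96000):
        print(f"f16 {ctx} -> q8_0 {scaled_context(ctx)}")
```

Because of the per-block scale, the gain is closer to 1.88x than a full 2x, so the realistic targets would land slightly under 128k and 192k.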
Pros
- Incredible value for money. Depending on where you are, two 3060s can cost as little as $400.
- The CUDA ecosystem is mature. GPU utilization stays stable at 100% for long stretches, and once compiled, it works flawlessly without needing constant troubleshooting. Peace of mind.
- The 3060 has a slim form factor, with short single- or dual-fan variants available, making it compatible with most ATX and mATX motherboards and cases without any hassle.
Inferences
- Using dual 16GB cards that are slightly faster (e.g., 4060 Ti, 5060 Ti) will probably yield even better results, though the price-to-performance ratio will drop. Again, CUDA just offers better utilization. Getting 32GB this way should be much faster than, e.g., the crippled AI Pro R9700, and still cost less.
Other Notes
- I also gave vLLM a brief try, but it seems poorly optimized for VRAM-constrained scenarios and kept hitting OOM no matter what. Plus, vLLM takes too long to start up, making debugging a pain, so I stopped messing with it.
Appendix
Detailed Configuration:
    # For the no-MTP runs: use --ctx-size 96000 and remove the --spec-type line.
    --no-mmproj-offload \
    -dev CUDA0,CUDA1 -sm tensor -ts 1,1 \
    --fit off \
    --host 0.0.0.0 --port "$PORT" \
    -t 0 -ngl 99 -np 1 \
    --kv-unified --flash-attn on --ctx-size 64000 \
    --spec-type draft-mtp --spec-draft-n-max 1 \
    -rea on \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
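Since the MTP and no-MTP runs differ only in context size and the speculative-decoding flags, switching between them can be scripted. The sketch below only reuses flags from the configuration above; the `llama-server` binary name, the `-m` model argument, the model path, and the function name are assumptions added for illustration, since the post lists the flags without the leading command.

```python
import shlex


def build_server_args(model_path: str, port: int, mtp: bool = True) -> list[str]:
    """Assemble the flag list from the post for the MTP or no-MTP profile."""
    ctx = 64000 if mtp else 96000  # the two context sizes tested in the post
    args = [
        "llama-server", "-m", model_path,  # assumed binary and model flag
        "--no-mmproj-offload",
        "-dev", "CUDA0,CUDA1", "-sm", "tensor", "-ts", "1,1",
        "--fit", "off",
        "--host", "0.0.0.0", "--port", str(port),
        "-t", "0", "-ngl", "99", "-np", "1",
        "--kv-unified", "--flash-attn", "on", "--ctx-size", str(ctx),
    ]
    if mtp:
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "1"]
    args += [
        "-rea", "on",
        "--temp", "0.6", "--top-k", "20", "--top-p", "0.95",
        "--min-p", "0.0", "--repeat-penalty", "1.0", "--presence-penalty", "0.0",
    ]
    return args


if __name__ == "__main__":
    # Hypothetical model path; print the command instead of running it.
    print(shlex.join(build_server_args("Qwen3.6-27B-Q4_K_S.gguf", 8080, mtp=False)))
```

The returned list can be passed to `subprocess.run` as-is, which avoids the shell quoting and line-continuation pitfalls of a hand-edited command.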