r/LocalLLaMA · May 22, 2026 · 5 min read

Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)

#gpu

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Like Read original ↗

Scrambling to max StrixHalo (+NVLink dual eGPU 3090 mod)

https://preview.redd.it/kz66mxzseq2h1.jpg?width=4096&format=pjpg&auto=webp&s=da98623808c4bde0dc79b239c8cf8930c5572769

https://preview.redd.it/ocsigi0veq2h1.jpg?width=4096&format=pjpg&auto=webp&s=eb4b053e46e434b2c54de7fff6c584e01c80ea5e

This pic is not representing bench setup, just happily captured while I figured out running same model over 3 GPUs. Halo is always busy, 3090s are waiting Halo does his job.

In short.

1. Strix halo alone (124GB UMA VRAM) is already nice but adding 1 or 2 eGPUs is pretty good for running the recently popular 27B or 31B dense models.

2. The native bandwidth limit of eGPUs can be mitigated. I tried scrambling a 2slot NVLink (cheaper than 3 slots) setup with a simple cooling mod on 3090s. You might experience up to several times better PP/s and TG/s on small densed models, depending on the situation, and it can be useful in multi coding agents scenarios.

3. Depending on KVcache types in vLLM, not only max context length and concurrent requests change but speed differs a lot in longer context. It might look good at beginning but not promising longer run.

4. For power efficiency, 27B dense models get better PP/s and TG/s per watt on eGPU. But for 122B, running on Strix halo alone via llama cpp showed better power efficiency than combined 3 GPUs.

5. NVLink does not do anything on llama.cpp's layer split, I have tried recent -sm tensor, gaining Tg/s was 30%ish but pp/s down performance was too big, so I stopped, and continue to vLLM on dual 3090.

I was getting a bit frustrated by the relatively slow PP/s on 27B, 31B densed models of my Bosgame M5 Strix Halo, So I decided to do some scrambling to overcome it. Recently, these dense models are getting much more attention than 70B+ MoE models. To run them better I bought single 3090 via local second hand market, after I saw improvement, then quickly moved to dual egpu setup via both nvme pcie 4x4.

I was hesitated to try NVLink since no gurantee on my eGPU case, and 3 slot NVLink was too expensive(600USD+). Still I wanted to see if I could improve the eGPU's PHB speed which has to go through CPU.
But most 3090 cards including mine are 3 slot thick, so I end up buying a 2slot bridge for around $250 including custom fees.
For this, I removed the 3 fan shroud on the top 3090 and roughly attached 120mm fans with a 3D printed side blow duct to make it fit. Surprisingly, the temperature of this modded 3090 actually stays lower than the unmodded one on bottom.

Test Environment:

Fedora 43
llama cpp: Strix halo performance power mode, build 9221.
- 122B test was split by -sm layer using rocm7.2.3 and cuda.
- 27B test used rocm 7.2.3 as baseline. (Comparing rocm 7.2.3 and vulkan radv, rocm has better pp/s and vulkan has better tg/s). Benchmarks were repeated only 2 times.
- Note: Since MTP is not fully implemented in llama cpp benchmarks yet, I borrowed the code_python MTP metrics (-pp/s% and +tg/s%) from kyuz0's strix halo toolbox for the 27B and 122B (using 35B A3B Moe stats) to plot simulated MTP lines. (https://kyuz0.github.io/amd-strix-halo-toolboxes/mtp.html)
vLLM: Nightly build. 3090s are power limited to 230W each.
vLLM benchmarks followed the Club 3090 direction:
- Narrative: "Write a detailed 800-word essay explaining transformer attention." (max_tokens=1000)
- Code: "Write a Python implementation of quicksort with comments explaining each step." (max_tokens=800)
- Sampling: temp=0.6, top_p=0.95, top_k=20, presence_penalty=0.0, enable_thinking=false. Three warmups and five measured runs.
- Since Club 3090 doesn't have benchmarks based on context depth, I added those tests.

Benched vLLM models - Qwen 3.6 27B

Recipe	Quantization	KV cache	Context	Concurrency	Drafter
docker-compose-dual (small, INT4 Standard)	AutoRound INT4	fp8_e5m2	131K	4 (total ~524K)	MTP=3
turbo (High-Concurrency)	AutoRound INT4	TQ3 (3-bit)	262K	4 (total ~1048K)	MTP=3
mixed-bf16 (Precision,kinda Q6 feeling)	Mixed (INT4+8)	bfloat16	110K	2 (total ~220K)	MTP=3
mixed-fp8 (Sweet Spot)	Mixed (INT4+8)	fp8_e5m2	131K	2 (total ~262K)	MTP=2
autoround INT8 (Largest)	AutoRound INT8	fp8_e5m2	115K	1 (total ~115K)	MTP=3

Mixed bf16, Mixed fp8, Autoround INT8 recipes are small edited from Club 3090's recipe to look for better than Q4 level of quantization.
(I noticed MTP 2 on mixed-fp8 recipe while I am writing, too much work again to fix, so, keep it mind some different condition)

Benched vLLM models - Qwen 3.6 27B

Recipe	KV cache	Context	Concurrency	Drafter
awq-bf16 (pure AWQ)	bf16	262K	262K × 1, 131K × 2, 65K × 4	MTP=4
awq_autoround (hybrid awq)	bf16	262K	262K × 1, 131K × 2, 65K × 4	MTP=4
int8 (larger context)	INT8	340K ~ 392K	262K × 1, 170K × 2, 98K × 4	MTP=4
docker-compose-bf16 (default)	bf16	60K	60K × 1	MTP=4

Awq_autoround recipe is also small edited from original.

Results:

Triple : dual 3090 + Strix halo

122B Q4 K XL unsloth, q8_0, Strix Halo vs Triple

https://preview.redd.it/k3owfjdupq2h1.png?width=1600&format=png&auto=webp&s=0ac542116870087ebdbeeb959ab7bb6e398b802b

https://preview.redd.it/avlcn0hpoq2h1.png?width=1600&format=png&auto=webp&s=a824f6b42c48e2b4e3ae7690a36b473ca8d8c81c

Strix halo (llama cpp 27B MTP Q6 K XL unsloth, 25GB including mmproj)
vs Dual 3090, Qwen3.6-27B-Mixed-AutoRound Minachist 28.9GB)
I chose these quants since considerably good enough quality and size wise close

https://preview.redd.it/gl5xz5ufqq2h1.png?width=1600&format=png&auto=webp&s=4f14f93ffacd94fbb68c6bb52f462012fad0882f

https://preview.redd.it/n93cgeshqq2h1.png?width=1600&format=png&auto=webp&s=98d219e97e13137db627d66d84124aae84275a74

Power efficiency
Rough calculation, but for 27B dense models, the eGPU setup has better power efficiency. However, when running the 122B model, Strix halo alone running on llama cpp was actually more power efficient.

https://preview.redd.it/s2ryohacsq2h1.png?width=1600&format=png&auto=webp&s=e0764be736283bb211e52ed67110b0b9e28fc8ad

https://preview.redd.it/8xdltx0esq2h1.png?width=1600&format=png&auto=webp&s=2d0d2a8b637aae66c5c2511c95e2b1c6baae8ae5

NVLink on / off

Tested NVLink on vs off. As concurrency and context go up, NVLink defends the bandwidth bottleneck pretty well.

BF16 cache senario

https://preview.redd.it/92qm9owysq2h1.png?width=1600&format=png&auto=webp&s=af40d019a444877c1d7128b30dbc5b0d80837c66

https://preview.redd.it/6zqs4g80tq2h1.png?width=1600&format=png&auto=webp&s=4951dc402159bd64d8959ebdf5fe1f42c8b5d9e2

fp8 cache case.

https://preview.redd.it/yzcgl1wjtq2h1.png?width=1600&format=png&auto=webp&s=6b6e547721a6daeb480423b5928c5a30cdf98e51

https://preview.redd.it/zopa2nlktq2h1.png?width=1600&format=png&auto=webp&s=25f05e0a183ae75627f2ae1071ea9318f91dfe0a

INT4 quant's fp8 senario

https://preview.redd.it/6um96q5qtq2h1.png?width=1600&format=png&auto=webp&s=463dfd330cd6f783ab9d6e446f58dc15be568326

https://preview.redd.it/e4j0sj3stq2h1.png?width=1600&format=png&auto=webp&s=4655627f234372ea7d4c847aaaca9faeb2080f7b

Gemma4 31B's case
Gemma-4-31B-it-AutoRound-AWQ, mattbucci, BF16 cache

https://preview.redd.it/rey8p3zytq2h1.png?width=1600&format=png&auto=webp&s=aa573c264af1e3fed6a87ec0837bca32066116b3

https://preview.redd.it/wera6hiztq2h1.png?width=1600&format=png&auto=webp&s=d8c92a6abffcbd0d866c17a7d3ecf2a19764a47c

This shows differences based on quantization and KV cache types. You can see how much max context length and speed fluctuate just by changing the cache type.
on Amphere card, TQ3 was pretty bad to keep Tg/s despite it can give more context amount..

https://preview.redd.it/j6y2cg6nvq2h1.png?width=1164&format=png&auto=webp&s=52eef18357c23d2341444e3e7e873902837fd87d

https://preview.redd.it/jb917qmovq2h1.png?width=1164&format=png&auto=webp&s=e94a60d752d0ad6bf28c070015a15c1cb37a0759

Code vs Narrative MTP

When concurrency is 1, code generation is always faster than narrative. But as you can see, when concurrency is 2 and it goes into deeper context, code speed drops and gets reversed by narrative. Seems like a weird load happens when concurrent requests and long context combine.

https://preview.redd.it/pcw1duwdwq2h1.png?width=1600&format=png&auto=webp&s=f6366e31b70af3d3d3361288320b9ebba4cda5c8

Huge thanks to
Club 3090 (https://github.com/noonghunna/club-3090/tree/master),
kyuz0's toolbox (https://github.com/kyuz0/amd-strix-halo-toolboxes), and DasDigitaleMomentum's distrobox (https://github.com/DasDigitaleMomentum/strix-halo-cuda-combined-toolbox)

submitted by /u/Rattling33
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA