Heterogeneous GPU Weighting & Layer Splitting
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is what I worked on today, with a local LLM of course. So if I didn't write the code, did I really work on it? Who cares. It was my idea and I simply asked it to implement it. I downloaded the main branch, which is totally broken on Windows by the way (I had to remove vision and MLX support; by default it basically only compiles for Darwin for some reason), and then changed how layers are redistributed across GPUs to minimize bottlenecks.
Before:
RTX 5090: Good
RTX 3090: OK (handicapped by its VRAM shortage)
RTX 5090+3090: OK, with more VRAM, but basically as slow as the 3090 alone. The 5090 was taking a nap while the 3090 worked.
After:
RTX 5090+3090: Faster than the 5090 alone, and I get to take advantage of the glorious VRAM on the 3090 in a way that doesn't handicap the 5090. Details:
Custom Heterogeneous GPU Support -- Design Differs from ollama/main
This document systematically compares our custom implementation against the current public ollama/main branch, organized by subsystem. All line references are against the main branch at the point of divergence.
1. findBestFit(): Compute Power Weighting
In main, findBestFit() uses GPU free memory verbatim, with no compute weighting:
```go
for _, gl := range ml.ByPerformance(gpus) {
	var high float32 = 1
	var low float32 = 0
	bestAssignments := greedyFit(layers, gl, high, requestedLayers)
}
```
At capacity=1.0, each GPU's effective capacity = freeMemory. A 3090 (24 GB) and 5090 (32 GB) are assigned based purely on VRAM capacity. The sequential greedy algorithm fills the weaker GPU first (starting from len(gpus) - 1), then spills the remainder to the stronger GPU.
Our additions: Compute raw power per GPU (SMCount * ClockMHz), fall back to ComputeMajor*100+ComputeMinor if SMCount/ClockMHz reports uniform values, then compute the capacity multiplier formula:
powerShare[i] = rawPower[i] / totalRawPower
computeCapacity[i] = powerShare[i] * computeBoost + (1 - powerShare[i])
FreeMemory is scaled by computeCapacity before greedyFit runs:
gl[i].FreeMemory = uint64(float64(gpus[i].FreeMemory) * computeCapacity[i])
Effect: The 5090 receives layers proportional to compute power, not just VRAM.
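The weighting above can be sketched as a small standalone program. This is only an illustration of the formula, not the patched code: the `GPU` struct, the `rawPower` and `computeCapacity` helper names, the equal-share behavior when total power is zero, and the SM counts and clocks (approximate published specs) are all assumptions made for the example.

```go
package main

import "fmt"

// GPU is a hypothetical stand-in for the device info described in the post.
type GPU struct {
	Name         string
	SMCount      int
	ClockMHz     int
	ComputeMajor int
	ComputeMinor int
	FreeMemory   uint64 // bytes
}

// rawPower returns SMCount*ClockMHz per GPU, falling back to
// ComputeMajor*100+ComputeMinor when every GPU reports the same value.
func rawPower(gpus []GPU) []float64 {
	power := make([]float64, len(gpus))
	uniform := true
	for i, g := range gpus {
		power[i] = float64(g.SMCount) * float64(g.ClockMHz)
		if power[i] != power[0] {
			uniform = false
		}
	}
	if uniform {
		for i, g := range gpus {
			power[i] = float64(g.ComputeMajor*100 + g.ComputeMinor)
		}
	}
	return power
}

// computeCapacity applies powerShare*boost + (1 - powerShare) to each GPU.
func computeCapacity(gpus []GPU, boost float64) []float64 {
	power := rawPower(gpus)
	var total float64
	for _, p := range power {
		total += p
	}
	out := make([]float64, len(gpus))
	for i, p := range power {
		share := 1.0 / float64(len(gpus)) // assumption: equal shares if total is zero
		if total > 0 {
			share = p / total
		}
		out[i] = share*boost + (1 - share)
	}
	return out
}

func main() {
	const GiB = 1 << 30
	gpus := []GPU{
		{"RTX 5090", 170, 2407, 12, 0, 32 * GiB}, // illustrative specs
		{"RTX 3090", 82, 1695, 8, 6, 24 * GiB},   // illustrative specs
	}
	for i, c := range computeCapacity(gpus, 1.75) {
		scaled := uint64(float64(gpus[i].FreeMemory) * c)
		fmt.Printf("%s: capacity %.3f, scaled FreeMemory %.1f GiB\n",
			gpus[i].Name, c, float64(scaled)/GiB)
	}
}
```

Note that at boost 1.0 every multiplier collapses to 1, which matches the main-branch behavior, and that a multiplier above 1 makes the scaled FreeMemory larger than the physical free memory, so the real VRAM limit still has to be enforced elsewhere.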
2. greedyFit(): Iteration Direction
THIS IS THE SINGLE MOST IMPACTFUL CHANGE.
In main, greedyFit starts from the weakest GPU and fills upward:
```go
device := len(gpus) - 1 // Start from WEAK (smallest VRAM)
for {
	device-- // Move toward strongest (index 0)
}
```
Layers are packed into the slowest GPU first, then spill over.
Custom reverses the direction:
```go
device := 0 // Start from STRONG (largest VRAM, strongest compute)
for {
	device++ // Move toward weak (spills to slower GPUs)
}
```
Layers are packed into the strongest GPU first, then spill to weaker ones. Combined effect: main's VRAM-only greedy fit fills the 3090 with heavy layers and spills the remainder to the 5090. Ours does the opposite. At computeBoost > 1.0, layers pile onto the 5090 until it hits its physical VRAM ceiling.
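A minimal sketch of the strongest-first direction, assuming GPUs are already sorted with the strongest at index 0. The function name `greedyFitStrongestFirst`, the flat slices of sizes and capacities, and the `-1` marker for layers that fit nowhere are hypothetical simplifications; the real greedyFit works on richer types.

```go
package main

import "fmt"

// greedyFitStrongestFirst walks the layers in order and places each one on
// the current device, starting at index 0 (assumed strongest). When a layer
// no longer fits, it advances to the next (weaker) device. It returns the
// device index per layer, or -1 when no remaining device had room.
func greedyFitStrongestFirst(layerSizes []uint64, capacity []uint64) []int {
	assign := make([]int, len(layerSizes))
	free := append([]uint64(nil), capacity...) // copy so the caller's slice is untouched
	device := 0
	for i, size := range layerSizes {
		for device < len(free) && free[device] < size {
			device++ // spill toward weaker GPUs
		}
		if device == len(free) {
			assign[i] = -1 // nothing left with room
			continue
		}
		free[device] -= size
		assign[i] = device
	}
	return assign
}

func main() {
	layers := []uint64{1, 1, 1, 1, 1, 1, 1, 1} // eight equally sized layers (arbitrary units)
	capacity := []uint64{5, 4}                 // index 0 = strongest GPU
	fmt.Println(greedyFitStrongestFirst(layers, capacity))
}
```

Reversing the loop (start at `len(capacity) - 1` and decrement) gives the weakest-first packing described for main.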
3. createLayout(): protectOutputLayer()
NEW: Forces the output layer onto the strongest GPU by compute tier (ComputeMajor/Minor) with SMCount * ClockMHz as tiebreaker. Prevents the output layer (the most expensive single operation) from landing on a slower GPU.
Main has no equivalent.
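The selection rule described here (compute tier first, SMCount * ClockMHz as tiebreaker) could look roughly like the sketch below. The `GPU` struct and the lowercase `isBetterCompute` and `strongestGPU` helpers are hypothetical names for illustration, not the signatures in the patch.

```go
package main

import "fmt"

// GPU is a hypothetical stand-in for the device info described in the post.
type GPU struct {
	Name         string
	ComputeMajor int
	ComputeMinor int
	SMCount      int
	ClockMHz     int
}

// isBetterCompute reports whether a outranks b: higher compute capability
// wins, and SMCount*ClockMHz breaks ties within the same capability.
func isBetterCompute(a, b GPU) bool {
	if a.ComputeMajor != b.ComputeMajor {
		return a.ComputeMajor > b.ComputeMajor
	}
	if a.ComputeMinor != b.ComputeMinor {
		return a.ComputeMinor > b.ComputeMinor
	}
	return a.SMCount*a.ClockMHz > b.SMCount*b.ClockMHz
}

// strongestGPU returns the index of the GPU that should hold the output
// layer, or -1 for an empty slice.
func strongestGPU(gpus []GPU) int {
	best := -1
	for i := range gpus {
		if best == -1 || isBetterCompute(gpus[i], gpus[best]) {
			best = i
		}
	}
	return best
}

func main() {
	gpus := []GPU{
		{"RTX 3090", 8, 6, 82, 1695},  // illustrative specs
		{"RTX 5090", 12, 0, 170, 2407}, // illustrative specs
	}
	fmt.Println("output layer goes to:", gpus[strongestGPU(gpus)].Name)
}
```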
4. createLayout(): redistributeHeavyLayers()
NEW: Enabled when computeBoost > 1.0. Moves FFN-heavy layers from the weakest to the strongest GPU.
Algorithm:
1. Compute the per-GPU compute weight from the layers assigned to it.
2. Add the output layer's compute cost (weighted x2).
3. Calculate the target imbalance = strongestRawPower / (weakestRawPower + 1).
4. Compare the current imbalance against the target.
5. If imbalance < target * 0.9, move the largest FFN layers from the weakest to the strongest GPU, one at a time.
6. Stop when the imbalance reaches the target or the strongest GPU is full.
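One way to read the algorithm above, as a sketch under assumptions: the "current imbalance" is taken to be the strongest GPU's compute weight divided by the weakest GPU's, the output layer is assumed to already sit on the strongest GPU, and each layer carries a single FFN weight. The `Layer` type and `redistributeHeavy` function are hypothetical and only illustrate the loop shape.

```go
package main

import "fmt"

// Layer is a hypothetical record: which GPU holds it, its VRAM size, and an
// FFN compute weight used to rank "heavy" layers.
type Layer struct {
	GPU     int
	Size    uint64
	FFNCost float64
}

// redistributeHeavy moves the heaviest layers from weak to strong until the
// weight ratio reaches the target or the strong GPU runs out of room.
// It returns the number of layers moved.
func redistributeHeavy(layers []Layer, strong, weak int,
	strongRaw, weakRaw float64, strongFree uint64, outputCost float64) int {

	weight := make(map[int]float64)
	for _, l := range layers {
		weight[l.GPU] += l.FFNCost // step 1
	}
	weight[strong] += 2 * outputCost    // step 2 (assumes output layer is on strong)
	target := strongRaw / (weakRaw + 1) // step 3

	// Steps 4 and 5: only act when clearly below the target.
	if weight[weak] <= 0 || weight[strong]/weight[weak] >= target*0.9 {
		return 0
	}
	moved := 0
	for weight[weak] > 0 && weight[strong]/weight[weak] < target {
		best := -1 // heaviest layer still on the weak GPU
		for i, l := range layers {
			if l.GPU == weak && (best == -1 || l.FFNCost > layers[best].FFNCost) {
				best = i
			}
		}
		if best == -1 || layers[best].Size > strongFree {
			break // step 6: nothing left to move, or strong GPU is full
		}
		layers[best].GPU = strong
		strongFree -= layers[best].Size
		weight[weak] -= layers[best].FFNCost
		weight[strong] += layers[best].FFNCost
		moved++
	}
	return moved
}

func main() {
	layers := make([]Layer, 0, 10)
	for i := 0; i < 10; i++ {
		gpu := 1 // weak
		if i < 4 {
			gpu = 0 // strong
		}
		layers = append(layers, Layer{GPU: gpu, Size: 1, FFNCost: 1})
	}
	moved := redistributeHeavy(layers, 0, 1, 300, 99, 8, 1)
	fmt.Println("layers moved to the strong GPU:", moved)
}
```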
5. New Helper Functions
All four functions are NEW in ml/device.go:
- GPUComputeCost(): Returns a tiered cost weight (0.5 to 1.6) reflecting how much value each GB of VRAM provides on that compute capability tier.
- BestGPUForPCIe(): Returns the GPU most able to absorb a single-GPU workload.
- IsBetterCompute(): Comparison logic for compute tiers.
- HighestComputeTier(): Utility to identify the most capable hardware.
6. GPUMinimumGraphOverhead()
NEW: Tiered graph overhead reservation per GPU since compute graphs cannot be split across GPUs in CUDA.
| Compute Tier | Reservation | Architecture |
|---|---|---|
| ComputeMajor >= 10 | 6 GB | Blackwell |
| ComputeMajor >= 8 | 4 GB | Ampere/Ada/Hopper |
| ComputeMajor < 8 | 2 GB | Turing and older |
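The table maps directly onto a small switch. The sketch below uses a hypothetical `minimumGraphOverhead` helper that takes only the compute major version and treats the table's GB as GiB; the real function's signature and units may differ.

```go
package main

import "fmt"

const GiB = uint64(1) << 30

// minimumGraphOverhead returns the per-GPU graph reservation in bytes for a
// given compute capability major version, following the tiers in the table.
func minimumGraphOverhead(computeMajor int) uint64 {
	switch {
	case computeMajor >= 10:
		return 6 * GiB
	case computeMajor >= 8:
		return 4 * GiB
	default:
		return 2 * GiB
	}
}

func main() {
	for _, major := range []int{7, 8, 12} {
		fmt.Printf("ComputeMajor %d -> %d GiB reserved\n",
			major, minimumGraphOverhead(major)/GiB)
	}
}
```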
7. Feature Comparison Summary
| Feature | Main Branch | Custom |
|---|---|---|
| Layer packing direction | Weakest-first | Strongest-first |
| Compute power weighting | None | PowerShare * Boost + (1-PowerShare) |
| OLLAMA_SCHED_COMPUTE_BOOST | No | Yes (1.0-2.0) |
| Output layer placement | Anywhere | Forced to strongest |
| FFN-heavy redistribution | None | Enabled when boost > 1.0 |
| Compute tier awareness | No | Tiered (2/4/6 GB) |
| GPUComputeCost() | No | Yes |
| BestGPUForPCIe() | No | Yes |
| ByComputePower sort | No | Yes |
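For reference, reading OLLAMA_SCHED_COMPUTE_BOOST might look like the sketch below. The post only gives the variable name and the 1.0-2.0 range; the `parseComputeBoost` helper, the default of 1.0 for unset or invalid values, and the clamping of out-of-range values are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
)

// parseComputeBoost converts the raw environment value into a boost factor.
// Unset, unparsable, or NaN input falls back to 1.0 (main-branch behavior);
// other values are clamped to the 1.0-2.0 range.
func parseComputeBoost(raw string) float64 {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || math.IsNaN(v) {
		return 1.0
	}
	if v < 1.0 {
		return 1.0
	}
	if v > 2.0 {
		return 2.0
	}
	return v
}

func main() {
	boost := parseComputeBoost(os.Getenv("OLLAMA_SCHED_COMPUTE_BOOST"))
	fmt.Println("computeBoost =", boost)
}
```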
8. Resulting Behavior Differences
At computeBoost=1.0 (main branch behavior):
* 3090 gets ~60% of layers (slowest GPU fills first).
* 5090 gets ~40% (absorbs overflow).
* Pipeline stall: 5090 waits for 3090.

At computeBoost=1.75 (custom behavior):
* 5090 gets ~68% of layers (strongest-first, compute-weighted).
* 3090 gets ~32% (overflow from 5090).
* Output layer always on 5090.
* For models under 32 GB: all layers on 5090, 3090 idles (clean break).