r/LocalLLaMA · · 4 min read

Heterogeneous GPU Weighting & Layer Splitting

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

This is what I worked on today. With a local LLM, of course. So if I didn't write the code, did I really work on it? Who cares. It was my idea and I simply asked it to implement it. I downloaded the main branch, which is totally broken on Windows by the way (I had to remove vision and MLX support; by default it basically only compiles for Darwin for some reason), and then changed the weight redistribution logic to minimize bottlenecks.

Before:

RTX 5090: Good

RTX 3090: OK (handicapped by the VRAM shortage)

RTX 5090+3090: OK, except with more VRAM. But basically as slow as the 3090 alone: the 5090 was taking a nap while the 3090 worked.

After:

RTX 5090+3090: Faster than the 5090 alone, and I get to take advantage of the glorious VRAM on the 3090 in a way that doesn't handicap the 5090. Details:

Custom Heterogeneous GPU Support -- Design Differences from ollama/main

This document systematically compares our custom implementation against the current public ollama/main branch, organized by subsystem. All line references are against the main branch at the point of divergence.


1. findBestFit(): Compute Power Weighting

In main, findBestFit() uses GPU free memory verbatim, with no compute weighting:

    for _, gl := range ml.ByPerformance(gpus) {
        var high float32 = 1
        var low float32 = 0
        bestAssignments := greedyFit(layers, gl, high, requestedLayers)
    }

At capacity = 1.0 (the high value passed to greedyFit), each GPU's effective capacity equals its free memory, so a 3090 (24 GB) and a 5090 (32 GB) are assigned layers based purely on VRAM. The sequential greedy algorithm fills the weaker GPU first (starting from index len(gpus) - 1), then spills the remainder to the stronger GPU.

Our additions: compute raw power per GPU as SMCount * ClockMHz, falling back to ComputeMajor*100 + ComputeMinor if SMCount and ClockMHz report uniform values across GPUs, then derive a capacity multiplier:

powerShare[i] = rawPower[i] / totalRawPower
computeCapacity[i] = powerShare[i] * computeBoost + (1 - powerShare[i])

FreeMemory is scaled by computeCapacity before greedyFit runs:

gl[i].FreeMemory = uint64(float64(gpus[i].FreeMemory) * computeCapacity[i])

Effect: the 5090's share of layers now reflects its compute power, not just its VRAM.
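
To make the multiplier concrete, here is a minimal Go sketch of the formula above. The `gpu` struct, the `computeCapacity` helper, and the SM counts and clocks are all hypothetical and chosen for round numbers; none of it is taken from the actual patch.

```go
package main

import "fmt"

// gpu holds only the fields this sketch needs. The field names mirror the
// ones mentioned in the post, but the struct itself is hypothetical.
type gpu struct {
	Name       string
	SMCount    int
	ClockMHz   int
	FreeMemory uint64 // bytes
}

// computeCapacity returns the per-GPU multiplier
// powerShare*boost + (1 - powerShare), using SMCount*ClockMHz as raw power.
func computeCapacity(gpus []gpu, boost float64) []float64 {
	raw := make([]float64, len(gpus))
	var total float64
	for i, g := range gpus {
		raw[i] = float64(g.SMCount) * float64(g.ClockMHz)
		total += raw[i]
	}
	out := make([]float64, len(gpus))
	for i := range gpus {
		if total == 0 {
			out[i] = 1 // no usable power data: leave FreeMemory unscaled
			continue
		}
		share := raw[i] / total
		out[i] = share*boost + (1 - share)
	}
	return out
}

func main() {
	const gib = 1 << 30
	// Illustrative numbers only, not measured values.
	gpus := []gpu{
		{"fast", 160, 2400, 32 * gib},
		{"slow", 80, 1600, 24 * gib},
	}
	caps := computeCapacity(gpus, 1.75)
	for i, g := range gpus {
		scaled := uint64(float64(g.FreeMemory) * caps[i])
		fmt.Printf("%s: capacity=%.4f scaled=%.1f GiB\n", g.Name, caps[i], float64(scaled)/gib)
	}
}
```

With these made-up inputs the power shares are 0.75 and 0.25, so the multipliers come out as 1.5625 and 1.1875, and the scaled budgets as 50 GiB and 28.5 GiB. The scaled figure can exceed physical VRAM, so it is best read as a weighting budget for greedyFit rather than real memory.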


2. greedyFit(): Iteration Direction

THIS IS THE SINGLE MOST IMPACTFUL CHANGE.

In main, greedyFit starts from the weakest GPU and fills upward:

    device := len(gpus) - 1 // Start from WEAK (smallest VRAM)
    for {
        device-- // Move toward strongest (index 0)
    }

Layers are packed into the slowest GPU first, then spill over.

The custom build reverses the direction:

    device := 0 // Start from STRONG (largest VRAM, strongest compute)
    for {
        device++ // Move toward weak (spills to slower GPUs)
    }

Layers are packed into the strongest GPU first, then spill to weaker ones. Combined effect: main's VRAM-only greedy fill loads the 3090 with heavy layers and spills the rest to the 5090. Ours does the opposite. At computeBoost > 1.0, layers pile onto the 5090 until it hits its physical VRAM ceiling.
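
The packing direction can be illustrated with a stripped-down Go sketch. `greedyPack` is a hypothetical stand-in, not the real greedyFit: it ignores the capacity factor and requestedLayers, and simply walks the devices in the order given, with index 0 assumed to be the strongest GPU.

```go
package main

import "fmt"

// greedyPack assigns layers (sizes in arbitrary units) to devices in index
// order, moving on to the next device once the current one cannot fit a
// layer. capacities[0] is assumed to be the strongest GPU. It returns the
// device index for each layer, or -1 for layers that fit nowhere.
func greedyPack(layers []uint64, capacities []uint64) []int {
	assign := make([]int, len(layers))
	free := append([]uint64(nil), capacities...) // copy so the input is untouched
	device := 0
	for i, size := range layers {
		for device < len(free) && free[device] < size {
			device++ // spill toward weaker GPUs
		}
		if device == len(free) {
			assign[i] = -1 // out of devices
			continue
		}
		free[device] -= size
		assign[i] = device
	}
	return assign
}

func main() {
	layers := []uint64{10, 10, 10, 10, 10}
	// Strongest-first order: 32-unit GPU at index 0, 24-unit GPU at index 1.
	fmt.Println(greedyPack(layers, []uint64{32, 24}))
}
```

With five 10-unit layers, the first three land on device 0 and the last two spill to device 1. Passing the capacities in the opposite order gives the weakest-first fill described for main.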


3. createLayout(): protectOutputLayer()

NEW: Forces the output layer onto the strongest GPU by compute tier (ComputeMajor/Minor) with SMCount * ClockMHz as tiebreaker. Prevents the output layer (the most expensive single operation) from landing on a slower GPU.

Main has no equivalent.
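
A rough Go sketch of the selection rule described above: compare compute tier first, then use SMCount * ClockMHz to break ties. The `device` struct and the `isBetter` and `strongest` helpers are hypothetical names for illustration, and the SM counts and clocks are placeholders.

```go
package main

import "fmt"

// device carries just the fields the comparison needs (hypothetical type).
type device struct {
	Name         string
	ComputeMajor int
	ComputeMinor int
	SMCount      int
	ClockMHz     int
}

// isBetter reports whether a outranks b: higher compute capability wins,
// and SMCount*ClockMHz breaks ties within the same capability.
func isBetter(a, b device) bool {
	if a.ComputeMajor != b.ComputeMajor {
		return a.ComputeMajor > b.ComputeMajor
	}
	if a.ComputeMinor != b.ComputeMinor {
		return a.ComputeMinor > b.ComputeMinor
	}
	return a.SMCount*a.ClockMHz > b.SMCount*b.ClockMHz
}

// strongest returns the index of the best device, or -1 for an empty slice.
func strongest(devs []device) int {
	best := -1
	for i := range devs {
		if best < 0 || isBetter(devs[i], devs[best]) {
			best = i
		}
	}
	return best
}

func main() {
	devs := []device{
		{"RTX 3090", 8, 6, 80, 1600},   // compute capability 8.6; SM/clock illustrative
		{"RTX 5090", 12, 0, 160, 2400}, // compute capability 12.0; SM/clock illustrative
	}
	fmt.Println("output layer goes to:", devs[strongest(devs)].Name)
}
```

Under this rule the output layer would be pinned to the 5090, whichever GPU has more free VRAM at the time.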


4. createLayout(): redistributeHeavyLayers()

NEW: Enabled when computeBoost > 1.0. Moves FFN-heavy layers from the weakest GPU to the strongest.

Algorithm:
  1. Compute the per-GPU compute weight from the layers assigned to it.
  2. Add the output layer's compute cost (weighted x2).
  3. Calculate target imbalance = strongestRawPower / (weakestRawPower + 1).
  4. Compare the current imbalance against the target.
  5. If imbalance < target * 0.9, move the largest FFN layers from the weakest GPU to the strongest, one at a time.
  6. Stop when the imbalance reaches the target or the strongest GPU is full.
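
The steps above can be sketched roughly as follows. This is a guess at the shape of the loop, not the patch itself: the `layer` and `gpuState` types, the `rebalance` name, and the definition of imbalance as strong weight divided by weak weight are all assumptions. It also assumes the output layer already sits on the strongest GPU, and it ignores layer ordering.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// layer and gpuState are hypothetical types for this sketch only.
type layer struct {
	Cost float64 // relative compute weight of the layer
	Size uint64  // VRAM footprint, arbitrary units
}

type gpuState struct {
	RawPower float64 // e.g. SMCount * ClockMHz
	Free     uint64  // remaining VRAM, same units as layer.Size
	Layers   []layer
}

// weight sums the compute cost of the layers currently on a GPU.
func weight(g *gpuState) float64 {
	var w float64
	for _, l := range g.Layers {
		w += l.Cost
	}
	return w
}

// rebalance moves the costliest layers from weak to strong and returns how
// many layers moved. outputCost is counted twice on strong.
func rebalance(strong, weak *gpuState, outputCost float64) int {
	target := strong.RawPower / (weak.RawPower + 1)
	imbalance := func() float64 {
		ww := weight(weak)
		if ww == 0 {
			return math.Inf(1) // nothing left on the weak GPU
		}
		return (weight(strong) + 2*outputCost) / ww
	}
	if imbalance() >= target*0.9 {
		return 0 // already close enough to the target
	}
	// Largest layers first.
	sort.Slice(weak.Layers, func(i, j int) bool {
		return weak.Layers[i].Cost > weak.Layers[j].Cost
	})
	moved := 0
	for len(weak.Layers) > 0 && imbalance() < target {
		l := weak.Layers[0]
		if strong.Free < l.Size {
			break // strongest GPU is full
		}
		weak.Layers = weak.Layers[1:]
		weak.Free += l.Size
		strong.Layers = append(strong.Layers, l)
		strong.Free -= l.Size
		moved++
	}
	return moved
}

func main() {
	strong := &gpuState{RawPower: 384000, Free: 8, Layers: []layer{{1, 2}, {1, 2}, {1, 2}, {1, 2}}}
	weak := &gpuState{RawPower: 128000, Free: 0, Layers: []layer{{1, 2}, {3, 2}, {2, 2}, {1, 2}}}
	moved := rebalance(strong, weak, 1)
	fmt.Println("moved:", moved, "left on weak:", len(weak.Layers))
}
```

With these made-up numbers the target is just under 3.0. The loop moves the cost-3 and cost-2 layers, then stops once the ratio reaches 5.5, leaving two light layers on the weak GPU.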


5. New Helper Functions

All four functions are NEW in ml/device.go:

  • GPUComputeCost(): Returns a tiered cost weight (0.5 to 1.6) reflecting how much value each GB of VRAM provides on that compute capability tier.
  • BestGPUForPCIe(): Returns the GPU most able to absorb a single-GPU workload.
  • IsBetterCompute(): Comparison logic for compute tiers.
  • HighestComputeTier(): Utility to identify the most capable hardware.

6. GPUMinimumGraphOverhead()

NEW: Tiered graph overhead reservation per GPU since compute graphs cannot be split across GPUs in CUDA.

Compute Tier          Reservation   Architecture
ComputeMajor >= 10    6 GB          Blackwell
ComputeMajor >= 8     4 GB          Ampere/Ada/Hopper
ComputeMajor < 8      2 GB          Turing and older
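
As a sketch, the tier table maps onto a simple switch. The function name `minimumGraphOverhead` and the reading of "GB" as GiB are assumptions; the real GPUMinimumGraphOverhead() signature is not shown in the post.

```go
package main

import "fmt"

// minimumGraphOverhead returns the graph reservation in bytes for a given
// CUDA compute capability major version, following the tier table above.
// Treating "GB" as GiB is an assumption of this sketch.
func minimumGraphOverhead(computeMajor int) uint64 {
	const gib = 1 << 30
	switch {
	case computeMajor >= 10:
		return 6 * gib
	case computeMajor >= 8:
		return 4 * gib
	default:
		return 2 * gib
	}
}

func main() {
	for _, major := range []int{12, 8, 7} {
		fmt.Printf("ComputeMajor %d -> %d GiB reserved\n", major, minimumGraphOverhead(major)>>30)
	}
}
```

An RTX 5090 (compute capability 12.0) lands in the 6 GB tier and an RTX 3090 (8.6) in the 4 GB tier.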

7. Feature Comparison Summary

Feature                       Main Branch      Custom
Layer packing direction       Weakest-first    Strongest-first
Compute power weighting       None             PowerShare * Boost + (1-PowerShare)
OLLAMA_SCHED_COMPUTE_BOOST    No               Yes (1.0-2.0)
Output layer placement        Anywhere         Forced to strongest
FFN-heavy redistribution      None             Enabled when boost > 1.0
Compute tier awareness        No               Tiered (2/4/6 GB)
GPUComputeCost()              No               Yes
BestGPUForPCIe()              No               Yes
ByComputePower sort           No               Yes
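
For reference, reading the OLLAMA_SCHED_COMPUTE_BOOST knob could look something like this. The post only gives the 1.0-2.0 range. The `parseComputeBoost` helper, the clamping, and the fallback to 1.0 on bad input are assumptions of this sketch, not confirmed behavior.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
)

// parseComputeBoost turns the raw environment value into a boost factor.
// Unset or unparsable values fall back to 1.0 (assumed default), and
// out-of-range values are clamped to [1.0, 2.0] (assumed policy).
func parseComputeBoost(raw string) float64 {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || math.IsNaN(v) {
		return 1.0
	}
	return math.Min(2.0, math.Max(1.0, v))
}

func main() {
	boost := parseComputeBoost(os.Getenv("OLLAMA_SCHED_COMPUTE_BOOST"))
	fmt.Printf("computeBoost=%.2f\n", boost)
}
```

Falling back to 1.0 keeps the weighting neutral when the variable is missing, since the multiplier formula reduces to 1 for every GPU at that value.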

8. Resulting Behavior Differences

At computeBoost=1.0 (main branch behavior):

  • 3090 gets ~60% of layers (slowest GPU fills first).
  • 5090 gets ~40% (absorbs overflow).
  • Pipeline stall: 5090 waits for 3090.

At computeBoost=1.75 (custom behavior):

  • 5090 gets ~68% of layers (strongest-first, compute-weighted).
  • 3090 gets ~32% (overflow from 5090).
  • Output layer always on 5090.
  • For models under 32 GB: all layers on 5090, 3090 idles (clean break).

submitted by /u/comperr