DiffusionGemma 26B A4B results on my 5090

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

# DiffusionGemma 26B A4B — Tuning Results
(Note: these are my tuning results, but DeepSeek assisted in generating the testing scripts and reports.)

https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

## System

- **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3
- **Build**: `llama.cpp` PR #24423, GCC-15, Ninja, ccache
- **Flash Attention**: Auto-disabled on SM120, which limits max context
- **Models**: `unsloth/diffusiongemma-26B-A4B-it-GGUF`

## Models

| Quant | File | Size |
|-------|------|------|
| Q6_K | `diffusiongemma-26B-A4B-it-Q6_K.gguf` | 22 GB |
| Q4_K_M | `diffusiongemma-26B-A4B-it-Q4_K_M.gguf` | 16 GB |

## Max Stable Context

| Quant | Formula | Max ctx | -n limit | VRAM limit |
|-------|---------|---------|----------|------------|
| Q6_K | 16 blocks × 256 + 2048 | 6,144 | -n 4096 | 22 GB model + ~10 GB buffers |
| Q4_K_M | 32 blocks × 256 + 2048 | 10,240 | -n 8192 | 16 GB model + ~14 GB buffers |

Context is limited by compute buffer size: Flash Attention is auto-disabled on the RTX 5090 (SM120), so full attention scales as O(n²) in memory. The model itself supports up to 262k context; 64k is achievable with Flash Attention enabled.
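The Formula column reads as context = blocks × 256 + 2048, with `-n` equal to blocks × 256. Here is a minimal Python sketch of that arithmetic, assuming the 256-token block and the fixed 2048-token reserve behave exactly as the table implies. The function name is made up for illustration, and treating the 2048 as prompt headroom is my reading, not something the post states.

```python
BLOCK_TOKENS = 256      # tokens per diffusion block, from the Formula column
RESERVED_TOKENS = 2048  # fixed extra context in the Formula column (assumed prompt headroom)


def context_for_generation(n_tokens: int) -> int:
    """Return the context size implied by a given -n value."""
    if n_tokens <= 0 or n_tokens % BLOCK_TOKENS != 0:
        raise ValueError("-n is assumed to be a positive multiple of 256")
    blocks = n_tokens // BLOCK_TOKENS
    return blocks * BLOCK_TOKENS + RESERVED_TOKENS


print(context_for_generation(4096))  # Q6_K limit from the table
print(context_for_generation(8192))  # Q4_K_M limit from the table
```

The same function reproduces the `ctx=` values quoted next to each `-n` in the throughput tables.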

## Best Parameters

| Parameter | Q6_K | Q4_K_M |
|-----------|------|--------|
| `--diffusion-eb-t-max` | 0.4 | 0.3 |
| `--diffusion-eb-t-min` | 0.1 | 0.05 |
| `--diffusion-eb-max-steps` | auto (48) | 20 |
| `--diffusion-eb-entropy-bound` | 0.1 (default) | 0.1 (default) |
| `--diffusion-eb-confidence` | 0.005 (default) | 0.005 (default) |
| `--diffusion-eb-stability` | 1 (default) | 1 (default) |
| `-ub` / `-b` | auto-derived from -n | auto-derived from -n |

### Optimal invocations

**Q6_K fastest:**

```bash
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q6_K.gguf \
  -ngl 99 -n 2048 \
  --diffusion-eb-t-max 0.4 --diffusion-eb-t-min 0.1
```

**Q4_K_M fastest:**

```bash
./build/bin/llama-diffusion-cli \
  -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 -n 8192 \
  --diffusion-eb-max-steps 20 \
  --diffusion-eb-t-max 0.3 --diffusion-eb-t-min 0.05
```

## Speed Comparison

### Multi-block throughput (long prompt, 2048 token generation)

| Context | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|---------|-------------|------------|----------------|--------------|
| -n 2048 (ctx=4096) | 180 tok/s | **213 tok/s** | 174 tok/s | **244 tok/s** |
| -n 3072 (ctx=5120) | 183 tok/s | **209 tok/s** | 175 tok/s | **245 tok/s** |
| -n 8192 (ctx=10240) | n/a | n/a | 175 tok/s | **252 tok/s** |

### Short-prompt (single block, 256 tokens)

| Metric | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|--------|-------------|------------|----------------|--------------|
| Throughput | 523 tok/s | 523 tok/s | 456 tok/s | **545 tok/s** |
| Steps per block | 6 | 6 | 8 | **6** |

### Speedup over default

| Quant | -n 2048 | -n 3072 | -n 8192 |
|-------|---------|---------|---------|
| Q6_K | **+18%** | **+14%** | n/a |
| Q4_K_M | **+40%** | **+40%** | **+44%** |
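The speedup percentages follow directly from the multi-block throughput numbers. As a quick check, here is a small Python sketch that recomputes them. The figures are copied from the table above, while the dictionary layout and function name are just illustrative choices.

```python
# Throughput in tok/s, copied from the multi-block table: (default, tuned).
RESULTS = {
    ("Q6_K", 2048): (180, 213),
    ("Q6_K", 3072): (183, 209),
    ("Q4_K_M", 2048): (174, 244),
    ("Q4_K_M", 3072): (175, 245),
    ("Q4_K_M", 8192): (175, 252),
}


def speedup_percent(default: float, tuned: float) -> float:
    """Relative gain of tuned over default, in percent."""
    return (tuned - default) / default * 100.0


for (quant, n), (default, tuned) in RESULTS.items():
    print(f"{quant:7s} -n {n}: +{speedup_percent(default, tuned):.0f}%")
```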

## Parameter Impact Analysis

### Temperature range (t-max / t-min): biggest lever

Lower temperature makes the model less exploratory, so the canvas converges in fewer denoising steps. Effect is consistent across both quantizations.

| t-max / t-min | Q6_K steps/blk | Q6_K tok/s | Q4_K_M steps/blk | Q4_K_M tok/s |
|---------------|----------------|------------|-------------------|--------------|
| 0.8 / 0.4 (default) | 15.8 | 180 | 18.0 | 174 |
| 0.6 / 0.2 | 14.8 | 192 | 16.9 | 188 |
| 0.4 / 0.1 | **13.0** | **213** | 13.2 | 221 |
| 0.3 / 0.05 | 13.5 | 199 | **12.6** | **230** |
| 0.2 / 0.05 | 12.0* | 223* | 15.0* | 260* |

\* Single-block or partial generation: quality degraded, speed inflated.
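The best setting per quant is the fastest row in the sweep, once the asterisked rows are dropped because their speed is inflated by incomplete generation. A minimal Python sketch of that selection follows. The numbers are copied from the sweep table, and the tuple layout and function name are assumptions made for the example.

```python
# (t_max, t_min, tok/s, degraded) rows copied from the temperature sweep table.
SWEEP = {
    "Q6_K": [
        (0.8, 0.4, 180, False),
        (0.6, 0.2, 192, False),
        (0.4, 0.1, 213, False),
        (0.3, 0.05, 199, False),
        (0.2, 0.05, 223, True),   # asterisked: partial generation
    ],
    "Q4_K_M": [
        (0.8, 0.4, 174, False),
        (0.6, 0.2, 188, False),
        (0.4, 0.1, 221, False),
        (0.3, 0.05, 230, False),
        (0.2, 0.05, 260, True),   # asterisked: partial generation
    ],
}


def best_setting(rows):
    """Fastest row among those that completed multi-block generation."""
    valid = [row for row in rows if not row[3]]
    return max(valid, key=lambda row: row[2])


for quant, rows in SWEEP.items():
    t_max, t_min, tok_s, _ = best_setting(rows)
    print(f"{quant}: t-max {t_max}, t-min {t_min} ({tok_s} tok/s)")
```

Without the filter, the 0.2 / 0.05 row would win for both quants, which is exactly the misleading result the footnote warns about.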

Going too cold (t-max below 0.25) kills multi-block generation: the model becomes too deterministic to produce diverse tokens for subsequent blocks.

### EB max-steps: Q4_K_M only

Capping the maximum denoising steps per block helps Q4_K_M but not Q6_K. The smaller model converges faster, so a hard cap at 20 shaves off ~1.2 steps/block without hurting quality.

| max-steps | Q4_K_M steps/blk | Q4_K_M tok/s |
|-----------|-------------------|--------------|
| auto (48) | 12.6 | 230 |
| 24 | 12.0 | 236 |
| **20** | **11.4** | **244** |
| 18 | 12.2 | 235 |
| 16 | 12.8 | 228 |
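To rerun a sweep like this, one option is to generate one command line per cap value and execute them one at a time. The Python sketch below only builds the commands, using flags that appear in this post. The `/path/to/` prefix is a placeholder, `-n 2048` is my guess at the sweep setting (the 244 tok/s figure matches the `-n 2048` row), and the helper name is hypothetical.

```python
import shlex

# Binary and model paths as written in the invocations in this post;
# /path/to/ is a placeholder to replace with a real location.
CLI = "./build/bin/llama-diffusion-cli"
MODEL = "/path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"


def sweep_commands(caps, n_tokens=2048, t_max=0.3, t_min=0.05):
    """Build one argv list per max-steps cap; None leaves the cap on auto."""
    commands = []
    for cap in caps:
        argv = [
            CLI, "-m", MODEL, "-ngl", "99", "-n", str(n_tokens),
            "--diffusion-eb-t-max", str(t_max),
            "--diffusion-eb-t-min", str(t_min),
        ]
        if cap is not None:
            argv += ["--diffusion-eb-max-steps", str(cap)]
        commands.append(argv)
    return commands


for argv in sweep_commands([None, 24, 20, 18, 16]):
    print(shlex.join(argv))  # copy-pasteable shell command
```

Parsing throughput out of the tool's output is left out on purpose, since the post does not show what that output looks like.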

### Entropy-bound: stick with default

| entropy-bound | Q6_K tok/s | Q4_K_M tok/s | Effect |
|---------------|------------|---------------|--------|
| 0.05 | 152 | 216 | Too selective → more steps |
| **0.1 (default)** | **180** | **230** | Sweet spot |
| 0.15 | n/a | 240 | Slight improvement on Q4 |
| 0.2 | 158 | 233 | Too noisy → more steps |

### Batch size: auto is optimal

| -ub / -b | Q6_K tok/s | Notes |
|----------|------------|-------|
| auto (4096) | **213** | Derived from -n / ctx |
| 512 | 203 | Smaller = less parallelism |
| 8192 | 213 | Larger = no benefit |

## Key Findings

- **Q4_K_M is the better choice**: about 67% more context (10,240 vs 6,144 tokens) and 18% faster generation (252 vs 213 tok/s, taking the best tuned result for each).
- **Temperature is everything**: lowering t-max from 0.8→0.3 and t-min from 0.4→0.05 accounts for most of the speedup (174 → 230 tok/s on Q4_K_M; the max-steps cap adds the remaining 230 → 244). The rest of the EB params are already well-tuned at defaults.
- **Bigger context doesn't slow down Q4_K_M**: speed actually *improves* at larger context (252 tok/s at -n 8192 vs 244 at -n 2048), possibly because the larger batch gives the entropy-bound sampler better signal.
- **Flash Attention is the blocker for 64k**: once SM120 support lands in llama.cpp, the compute buffer bottleneck should ease, putting 64k (and possibly more of DiffusionGemma's 262k context) within reach on a single RTX 5090.
submitted by /u/giveen