DiffusionGemma 26B A4B results on my 5090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
# DiffusionGemma 26B A4B — Tuning Results
(note: these are my tuning results but Deepseek assisted in generation of testing scripts and reports)
https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF
## System

- **GPU**: RTX 5090 (32 GB VRAM), CUDA 13.3
- **Build**: `llama.cpp` PR #24423, GCC-15, Ninja, ccache
- **Flash Attention**: Auto-disabled on SM120, which limits max context
- **Models**: `unsloth/diffusiongemma-26B-A4B-it-GGUF`

## Models
| Quant | File | Size |
|-------|------|------|
| Q6_K | `diffusiongemma-26B-A4B-it-Q6_K.gguf` | 22 GB |
| Q4_K_M | `diffusiongemma-26B-A4B-it-Q4_K_M.gguf` | 16 GB |

## Max Stable Context
| Quant | Formula | Max ctx | -n limit | VRAM limit |
|-------|---------|---------|----------|------------|
| Q6_K | 16 blocks × 256 + 2048 | 6,144 | -n 4096 | 22 GB model + ~10 GB buffers |
| Q4_K_M | 32 blocks × 256 + 2048 | 10,240 | -n 8192 | 16 GB model + ~14 GB buffers |

Context is limited by compute buffer size: Flash Attention is auto-disabled on the RTX 5090 (SM120), so full attention memory scales as O(n²). The model itself supports up to 262k context; 64k is achievable with Flash Attention enabled.
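The Formula column can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are mine, and reading the fixed 2048 term as prompt headroom is an assumption, not something the post states.

```python
# Constants read off the "Formula" column; their interpretation is an assumption.
BLOCK_TOKENS = 256    # tokens per diffusion block
FIXED_TOKENS = 2048   # constant term, assumed to be headroom for the prompt


def max_context(blocks: int) -> int:
    """Context size for a given number of generation blocks."""
    return blocks * BLOCK_TOKENS + FIXED_TOKENS


def generation_limit(blocks: int) -> int:
    """Matching -n value: the generated tokens only, without the fixed term."""
    return blocks * BLOCK_TOKENS


for quant, blocks in (("Q6_K", 16), ("Q4_K_M", 32)):
    print(f"{quant}: ctx={max_context(blocks)}, -n {generation_limit(blocks)}")
```

Both rows of the table come out of the same two functions, so the only per-quant input is how many 256-token blocks fit next to the model in VRAM.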
## Best Parameters

| Parameter | Q6_K | Q4_K_M |
|-----------|------|--------|
| `--diffusion-eb-t-max` | 0.4 | 0.3 |
| `--diffusion-eb-t-min` | 0.1 | 0.05 |
| `--diffusion-eb-max-steps` | auto (48) | 20 |
| `--diffusion-eb-entropy-bound` | 0.1 (default) | 0.1 (default) |
| `--diffusion-eb-confidence` | 0.005 (default) | 0.005 (default) |
| `--diffusion-eb-stability` | 1 (default) | 1 (default) |
| `-ub` / `-b` | auto-derived from -n | auto-derived from -n |

## Optimal invocations
**Q6_K fastest:**

    ./build/bin/llama-diffusion-cli \
      -m /path/to/diffusiongemma-26B-A4B-it-Q6_K.gguf \
      -ngl 99 -n 2048 \
      --diffusion-eb-t-max 0.4 --diffusion-eb-t-min 0.1

**Q4_K_M fastest:**

    ./build/bin/llama-diffusion-cli \
      -m /path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
      -ngl 99 -n 8192 \
      --diffusion-eb-max-steps 20 \
      --diffusion-eb-t-max 0.3 --diffusion-eb-t-min 0.05

## Speed Comparison
### Multi-block throughput (long prompt, 2048 token generation)

| Context | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|---------|-------------|------------|----------------|--------------|
| -n 2048 (ctx=4096) | 180 tok/s | **213 tok/s** | 174 tok/s | **244 tok/s** |
| -n 3072 (ctx=5120) | 183 tok/s | **209 tok/s** | 175 tok/s | **245 tok/s** |
| -n 8192 (ctx=10240) | n/a | n/a | 175 tok/s | **252 tok/s** |

### Short-prompt (single block, 256 tokens)
| Metric | Q6_K default | Q6_K tuned | Q4_K_M default | Q4_K_M tuned |
|--------|-------------|------------|----------------|--------------|
| Throughput | 523 tok/s | 523 tok/s | 456 tok/s | **545 tok/s** |
| Steps per block | 6 | 6 | 8 | **6** |

### Speedup over default
| Quant | -n 2048 | -n 3072 | -n 8192 |
|-------|---------|---------|---------|
| Q6_K | **+18%** | **+14%** | n/a |
| Q4_K_M | **+40%** | **+40%** | **+44%** |

## Parameter Impact Analysis
### Temperature range (t-max / t-min): biggest lever
Lower temperature makes the model less exploratory, so the canvas converges in fewer denoising steps. Effect is consistent across both quantizations.
| t-max / t-min | Q6_K steps/blk | Q6_K tok/s | Q4_K_M steps/blk | Q4_K_M tok/s |
|---------------|----------------|------------|-------------------|--------------|
| 0.8 / 0.4 (default) | 15.8 | 180 | 18.0 | 174 |
| 0.6 / 0.2 | 14.8 | 192 | 16.9 | 188 |
| 0.4 / 0.1 | **13.0** | **213** | 13.2 | 221 |
| 0.3 / 0.05 | 13.5 | 199 | **12.6** | **230** |
| 0.2 / 0.05 | 12.0\* | 223\* | 15.0\* | 260\* |

\* Single-block or partial generation: quality degraded, speed inflated.
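If each denoising step costs roughly the same time, throughput should be inversely proportional to steps per block, so steps/blk × tok/s should stay about constant within a quant. The sketch below runs that sanity check on the unstarred rows of the table. The fixed-cost-per-step model is my assumption, not something measured in the post, and the helper names are made up for illustration.

```python
# (steps per block, tok/s) pairs copied from the unstarred rows of the table.
# The 0.2 / 0.05 row is excluded because those runs did not complete normally.
RUNS = {
    "Q6_K": [(15.8, 180), (14.8, 192), (13.0, 213), (13.5, 199)],
    "Q4_K_M": [(18.0, 174), (16.9, 188), (13.2, 221), (12.6, 230)],
}


def step_rate(steps_per_block: float, tok_per_s: float) -> float:
    """Product of steps/block and tok/s; constant if time per step is fixed."""
    return steps_per_block * tok_per_s


def spread(quant: str) -> float:
    """Ratio of the largest to the smallest product for one quant."""
    products = [step_rate(steps, tps) for steps, tps in RUNS[quant]]
    return max(products) / min(products)


for quant in RUNS:
    print(f"{quant}: products vary by a factor of {spread(quant):.2f}")
```

For both quants the products stay within about 10% of each other, which fits the reading that the speedup comes mostly from needing fewer steps rather than from each step getting cheaper.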
Going too cold (t-max below 0.25) kills multi-block generation: the model becomes too deterministic to produce diverse tokens for subsequent blocks.

### EB max-steps: Q4_K_M only
Capping the maximum denoising steps per block helps Q4_K_M but not Q6_K. The smaller quant converges faster, so a hard cap at 20 shaves off ~1.2 steps/block without hurting quality.
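Picking the cap is just a lookup over the sweep results. Here is a small sketch using the Q4_K_M numbers from this section's table. The dictionary layout and function names are my own, not taken from the original test scripts.

```python
# Q4_K_M sweep copied from this section's table: cap -> (steps/block, tok/s).
SWEEP = {
    "auto": (12.6, 230),
    "24": (12.0, 236),
    "20": (11.4, 244),
    "18": (12.2, 235),
    "16": (12.8, 228),
}


def best_cap(sweep: dict) -> str:
    """Cap with the highest measured throughput."""
    return max(sweep, key=lambda cap: sweep[cap][1])


def gain_over_auto(sweep: dict, cap: str) -> tuple:
    """(steps/block saved, throughput gain in percent) relative to 'auto'."""
    auto_steps, auto_tps = sweep["auto"]
    steps, tps = sweep[cap]
    return round(auto_steps - steps, 1), round(100 * (tps / auto_tps - 1), 1)


cap = best_cap(SWEEP)
saved, gain = gain_over_auto(SWEEP, cap)
print(f"best cap: {cap}, saves {saved} steps/block, +{gain}% tok/s")
```

Caps below 20 land back at or above the uncapped step count in this sweep, so the cap only helps in a narrow window.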
| max-steps | Q4_K_M steps/blk | Q4_K_M tok/s |
|-----------|-------------------|--------------|
| auto (48) | 12.6 | 230 |
| 24 | 12.0 | 236 |
| **20** | **11.4** | **244** |
| 18 | 12.2 | 235 |
| 16 | 12.8 | 228 |

### Entropy-bound: stick with default
| entropy-bound | Q6_K tok/s | Q4_K_M tok/s | Effect |
|---------------|------------|---------------|--------|
| 0.05 | 152 | 216 | Too selective → more steps |
| **0.1 (default)** | **180** | **230** | Sweet spot |
| 0.15 | n/a | 240 | Slight improvement on Q4 |
| 0.2 | 158 | 233 | Too noisy → more steps |

### Batch size: auto is optimal
| -ub / -b | Q6_K tok/s | Notes |
|----------|------------|-------|
| auto (4096) | **213** | Derived from -n / ctx |
| 512 | 203 | Smaller = less parallelism |
| 8192 | 213 | Larger = no benefit |

## Key Findings
- **Q4_K_M is the better choice**: about 67% more context (10,240 vs 6,144 tokens) and 18% faster generation (252 vs 213 tok/s, each quant's best tuned result).
- **Temperature is everything**: lowering t-max from 0.8→0.3 and t-min from 0.4→0.05 accounts for virtually all the speedup. The rest of the EB params are already well-tuned at defaults.
- **Bigger context doesn't slow down Q4_K_M**: speed actually *improves* at larger context (252 tok/s at -n 8192 vs 244 at -n 2048). The larger batch gives the entropy-bound sampler better signal.
- **Flash Attention is the blocker for 64k**: once SM120 support lands in llama.cpp, the compute buffer bottleneck goes away and DiffusionGemma's full 262k context should be reachable on a single RTX 5090.
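To re-run the temperature comparison behind these findings, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands; it does not run them. The binary path, model path and every flag are copied from the invocations in this post, while the function and variable names are hypothetical and not from the original test scripts.

```python
import shlex

# Paths copied from the post's invocations; adjust for your own build and model.
CLI = "./build/bin/llama-diffusion-cli"
MODEL = "/path/to/diffusiongemma-26B-A4B-it-Q4_K_M.gguf"

# (t-max, t-min) pairs from the temperature table, default first.
TEMPERATURE_PAIRS = [(0.8, 0.4), (0.6, 0.2), (0.4, 0.1), (0.3, 0.05)]


def build_command(t_max, t_min, n_tokens=2048, max_steps=None):
    """Return the argv list for one run, using only flags shown in the post."""
    argv = [
        CLI, "-m", MODEL, "-ngl", "99", "-n", str(n_tokens),
        "--diffusion-eb-t-max", str(t_max),
        "--diffusion-eb-t-min", str(t_min),
    ]
    if max_steps is not None:
        # Leave the flag out entirely to keep the automatic step limit.
        argv += ["--diffusion-eb-max-steps", str(max_steps)]
    return argv


for t_max, t_min in TEMPERATURE_PAIRS:
    print(shlex.join(build_command(t_max, t_min)))
```

Keeping the commands as argv lists means the same function could later be handed to `subprocess.run` for timing, with the printed form serving as a dry run first.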