r/LocalLLaMA

Can we stop dunking on DiffusionGemma and hack it instead?

DiffusionGemma only came out last week, and everyone is already complaining that their "naive" inference hallucinates too much. There are papers out there already trying to solve the problem, so I had an AI compile a table of methods that could keep dLLMs from being dead in the water (Mercury already did similar things, but on the proprietary side). So grill me if the AI output below is not enough to get llama.cpp/vLLM, or whatever agents, to start doing their jobs on accelerating inference by 3x.

Legend: ⚙️ = Drop-in (prompt/config today) | 🛠️ = Wrapper (orchestration/validation/retrieval) | 🔧 = Decoder (custom sampler/runtime for largest gains).

Tier 0: Foundational Official Settings (Must-Use Baseline – Fixes ~80% of Complaints)

| # | Method | Type | Concise Action | Expected Benefit (vs Naive 256-Token Rendering) | Citation Cluster |
|---|---|---|---|---|---|
| 1 | Entropy-Bounded Sampler + Adaptive Stopping | ⚙️ Drop-in | Commit lowest-entropy tokens until accumulated entropy exceeds bound (0.1); stop when argmax stable (2+ steps) and mean entropy < 0.005 | Prevents premature termination/over-refinement hallucinations; dynamic steps by task complexity; 2–3× effective speedup; core path to match Qwen-level quality | Google model card & HF config (2026); Ben-Hamu et al. (EB-Sampler, NeurIPS 2025, arXiv:2505.24857) |
| 2 | Canvas Cap + Task-Tuned Entropy | ⚙️ Drop-in | Keep 256-token canvas but set max_new_tokens short for tool calls (64–128); lower bound (0.03–0.05) for tools/deterministic, higher (0.15–0.2) for factual/reasoning | Reduces noise/waste on short structured outputs; deterministic tool selection; preserves candidate diversity to cut premature hallucination and improve reasoning | Google serving examples (2026); EB-Sampler family + hallucination-mode papers (2026) |
| 3 | Thinking Mode + Clean History | ⚙️ Drop-in | Add enable_thinking=True for reasoning/tool selection; retain only final (non-thinking) response in multi-turn history | Strongly boosts tool choice, argument discovery, instruction following, and reasoning; prevents context pollution in agents (key gap vs Qwen) | Google model card (2026): “Function calling works best in thinking mode”; best-practices note |

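
To make row 1 concrete, here is a rough sketch of an entropy-bounded commit rule plus the adaptive stopping check. Everything in it is an assumption for illustration: the function names, the list-of-lists probability format, and the reading of "argmax stable (2+ steps)" as "the last two argmax snapshots are identical". It is a simplified version of the idea, not DiffusionGemma's actual sampler and not the EB-Sampler reference code. Only the thresholds (0.1, 2 steps, 0.005) come from the table.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_commits(dists, masked, bound=0.1):
    """Pick masked positions to commit this step, lowest entropy first.

    Positions are added while the running entropy sum stays within `bound`.
    At least one position is always committed so decoding makes progress.
    """
    ranked = sorted(masked, key=lambda i: entropy(dists[i]))
    chosen, total = [], 0.0
    for i in ranked:
        h = entropy(dists[i])
        if chosen and total + h > bound:
            break
        chosen.append(i)
        total += h
    return chosen

def should_stop(argmax_history, dists, stable_steps=2, mean_entropy_max=0.005):
    """Stop once the last `stable_steps` argmax snapshots agree and mean entropy is tiny."""
    recent = argmax_history[-stable_steps:]
    if len(recent) < stable_steps or any(r != recent[0] for r in recent):
        return False
    mean_h = sum(entropy(d) for d in dists) / len(dists)
    return mean_h < mean_entropy_max
```

A real decoder loop would call `select_commits` on the still-masked positions after every forward pass and check `should_stop` before spending another step. Row 2's task tuning would then just be a different `bound` per task type.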
Tier 1: High-ROI Workflow & Structured Output (Wrappers – Critical for Tool Use & Agents)

| # | Method | Type | Concise Action | Expected Benefit (vs Naive 256-Token Rendering) | Citation Cluster |
|---|---|---|---|---|---|
| 4 | S³ Schema Scaffolding | ⚙️ Drop-in / 🛠️ Wrapper | Pre-fill correct JSON/function skeleton (braces, keys, enums, punctuation) in output context; model fills values only | Exploits bidirectional global refinement for +65% structural adherence, +48% fidelity, –17% hallucination; near-perfect JSON/tool syntax (closes major gap to Qwen) | Xiong et al. (Self-Adaptive Schema Scaffolding, ~arXiv:2507.04504, 2025); structured-output diffusion works |
| 5 | Rich Schemas + Validate-Before-Execute + Draft-Serialize Split | 🛠️ Wrapper | Use verbose semantic tool descriptions; always parse/validate before execution or history append; use DiffusionGemma for planning, specialist for final serialization | Addresses symbolic brittleness, indirect requests, and schema drift; separates reasoning from exact syntax; prevents malformed execution in agents | Google function-calling guide (2026); agentic dLLM papers (2025–2026 cluster) |
| 6 | Faithful Mode + Mid-Denoising Retrieval (SARDI-style) | 🛠️ Wrapper | For factual/tool-grounded/reasoning tasks: raise budget (60–80 steps), trigger retrieval from low-confidence tentative tokens during denoising | Counters dLLM-specific failures (premature termination, incomplete denoising, context intrusion); improves factuality, reasoning, and multi-hop agent performance at high throughput | “Lost in Diffusion” analyses (2026); SARDI-style retrieval-during-denoising papers (2025–2026) |
| 7 | Never Stream Raw Denoising States | 🛠️ Wrapper | Show only final converged/committed spans to users; reserve streamer for debugging only | Prevents UX erosion and false perception of hallucination from garbled intermediates before convergence | Google HF inference notebook (2026) |

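
For the "validate before execute" part of row 5, a wrapper can be as small as the sketch below. The tool name `get_weather`, the `WEATHER_SCHEMA` layout, and the `{"name": ..., "arguments": ...}` call format are all made up for illustration. They are not DiffusionGemma's actual function-calling format, so adapt them to whatever your runtime emits. It only uses the standard-library `json` module.

```python
import json

# Hypothetical tool schema: argument name -> (expected type, allowed values or None).
WEATHER_SCHEMA = {
    "city": (str, None),
    "unit": (str, {"celsius", "fahrenheit"}),
}

def validate_tool_call(raw, tool_name, schema):
    """Return (args, None) if `raw` is a well-formed call, else (None, error message)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("name") != tool_name:
        return None, "wrong or missing tool name"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None, "arguments must be an object"
    if set(args) != set(schema):
        return None, f"argument keys {sorted(args)} do not match schema"
    for key, (typ, allowed) in schema.items():
        if not isinstance(args[key], typ):
            return None, f"{key} has wrong type"
        if allowed is not None and args[key] not in allowed:
            return None, f"{key} must be one of {sorted(allowed)}"
    return args, None
```

The point is the ordering: only when the error is `None` does the call get executed or appended to the agent history. Anything else goes back for a retry or a remask pass instead of polluting the context.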
Tier 2: Advanced Sampling, Caching & Constraints (Decoder Upgrades – Highest ROI for Closing Gap to Qwen/SOTA)

| # | Method | Type | Concise Action | Expected Benefit (vs Naive 256-Token Rendering) | Citation Cluster |
|---|---|---|---|---|---|
| 8 | KLASS / Confidence-Aware Commit | 🔧 Decoder | Replace default commit with token-level KL divergence (or full confidence-profile selection) between timesteps to identify stable tokens | Superior stability detection vs raw entropy; 2–2.78× wall-clock speedup + reasoning quality gains over greedy diffusion | Kim et al. (KLASS-style, NeurIPS Spotlight 2025, arXiv:2511.05664); BACD/CadLLM/Prophet cluster (2026) |
| 9 | Fast-dLLM Family (Approximate KV + Parallel Decoding) | 🔧 Decoder | Port block-wise approximate KV cache + confidence-aware parallel unmasking (Fast-dLLM or v2) | Solves bidirectional KV-cache problem; up to 27.6× throughput with <1–2% accuracy loss; enables practical multi-canvas use while maintaining quality | Wu et al. (Fast-dLLM, arXiv:2505.22618, ICLR 2026 & v2) |
| 10 | SureLock / dKV-Cache / d²Cache Family | 🔧 Decoder | Lock converged tokens (skip Q/FFN while allowing attention); use delayed conditional or attention-aware KV selection; compress redundant masks | 30–50% FLOP reduction or 2–12× effective speedup; critical for quantized long-context efficiency and agent stability | Oba et al. (SureLock-style, ICLR 2026); Ma/Hu/Liu (dKV-Cache, FreeCache, d²Cache, Elastic-dLLM cluster, 2025–2026) |
| 11 | CFG / Constrained Discrete Diffusion (CDD) | 🔧 Decoder | Reject updates violating context-free grammar/regex during sampling (additive infilling or dynamic programming for max-probability valid strings) | Near-100% syntactic correctness for JSON/tool calls/code (~30% median overhead); vastly superior to prompting/scaffolding alone; closes tool-use gap to SOTA | Cardei et al. (Constrained Discrete Diffusion, arXiv:2503.09790, 2025); Mündler et al. (CFG variants, arXiv:2508.10111, ICLR 2026); DINGO-style methods |
| 12 | Remask / Review-Remask-Refine (R3/CORE) | 🔧 Decoder | On malformed/suspect spans (bad JSON field, code tail, factual error), reset only that span to [MASK] and re-denoise (avoid overwriting corrupted context) | Strong for exact token-level repair in tool calls, code, JSON, and multi-turn agents; prevents error propagation and improves reasoning consistency | Mounier et al. (Review, Remask, Refine (R3), arXiv:2507.08018, ICML 2025); CORE cluster (2026) |

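
The span-level repair in row 12 is easy to picture as a small loop: find a suspect span, reset only those positions to the mask token, and re-denoise. The sketch below assumes tokens are plain strings and that `find_bad_span` and `denoise` are callables you supply yourself. Both are hypothetical stand-ins for a real detector and a real dLLM forward pass, and this is not the R3 paper's algorithm.

```python
MASK = "[MASK]"

def remask_span(tokens, start, end, mask=MASK):
    """Return a copy of `tokens` with positions start..end-1 reset to the mask token."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    return tokens[:start] + [mask] * (end - start) + tokens[end:]

def repair(tokens, find_bad_span, denoise, max_rounds=3):
    """Remask and re-denoise suspect spans until none remain or rounds run out.

    `find_bad_span(tokens)` returns (start, end) or None.
    `denoise(tokens)` returns a new token list with masked positions filled.
    Returns (tokens, ok) where ok says whether the final output passed the check.
    """
    for _ in range(max_rounds):
        span = find_bad_span(tokens)
        if span is None:
            return tokens, True
        tokens = denoise(remask_span(tokens, *span))
    return tokens, find_bad_span(tokens) is None
```

Everything outside the span stays as fixed context for the re-denoise, which is what separates this from just regenerating the whole canvas.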
Tier 3: Variable-Length, Self-Verification & Advanced Factuality (Decoder/Wrapper – For Complex Agents & Reasoning)

| # | Method | Type | Concise Action | Expected Benefit (vs Naive 256-Token Rendering) | Citation Cluster |
|---|---|---|---|---|---|
| 13 | DAEDAL / Length-Aware Dynamic Canvas + DyStruct | 🔧 Decoder | Start short; dynamically expand via early EOS/confidence or Bayesian block partitioning (Chinese Restaurant Process); crop after first denoising step when length distribution is clear | Avoids full 256-canvas cost on short tool calls; adaptive structure for unpredictable agent outputs; reduces forced-length hallucinations and improves efficiency | DAEDAL/Length-Aware Cropping/DyStruct/LR-DLLM cluster (2025–2026); Block Diffusion extensions (Arriola et al., arXiv:2503.09573, ICLR 2025 Oral) |
| 14 | S2D2 / BlockBatch / Self-Rewarding SMC + Prophet Early-Answer | 🔧 Decoder / 🛠️ Wrapper | Same model for large-block draft + small-block (AR-like) verification; multi-branch/trajectory sampling with confidence reweighting; early-commit when answer known in initial steps | Self-speculation reduces NFEs (up to 4–6× speedup); multi-particle improves quality/reliability on hard reasoning/tool/agent prompts; cuts unnecessary refinement | S2D2, BlockBatch, TCCF, AsyncLane, Self-Rewarding SMC, Prophet cluster (2025–2026); Block Diffusion (Arriola et al., 2025) |
| 15 | TDGNet-Style Trajectory Hallucination Detector + SARDI Retrieval | 🔧 Decoder / 🛠️ Wrapper | Score full denoising trajectory (evolving attention-graph dynamics) rather than only final output; reject unstable trajectories; trigger retrieval from tentative tokens during denoising | Treats factuality as trajectory property (not endpoint); stronger detector + diffusion-native retrieval for multi-hop QA, reasoning, and agentic reliability; closes gap to SOTA like DeepSeek/GLM | TDGNet & trajectory detectors (2026 cluster); SARDI-style papers (2025–2026); aligns with R3/Remask philosophy |

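
Row 13's "start short, crop or expand" idea can be approximated with a simple resize rule driven by per-position EOS probability. This is only loosely in the spirit of the length-aware papers and not any of their actual algorithms. The `eos_probs` input, the 0.9 threshold, and the growth step of 32 are assumptions I picked for the sketch. Only the 256-token cap comes from the table.

```python
def resize_canvas(eos_probs, canvas_len, max_len=256, grow=32, eos_min=0.9):
    """Return the canvas length to use for the next denoising step.

    `eos_probs[i]` is the (hypothetical) model probability that position i is EOS.
    If some position is confidently EOS, crop right after the first such position.
    Otherwise the output probably needs more room, so grow up to `max_len`.
    """
    for i, p in enumerate(eos_probs[:canvas_len]):
        if p >= eos_min:
            return i + 1
    return min(canvas_len + grow, max_len)
```

For a short tool call the first pass usually puts a confident EOS early, so the canvas shrinks and later steps stop paying for padding positions.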
submitted by /u/TomLucidor