r/LocalLLaMA · June 14, 2026 · 5 min read

Can we stop dunking on DiffusionGemma and hack it instead?

DiffusionGemma only came out last week, and everyone is already complaining that their "naive" inference hallucinates too much. There are already papers that try to solve this problem, so I had an AI compile a table of methods that could keep dLLMs from being dead in the water (Mercury already did similar things, but in the proprietary scene). Grill me if the AI output is not enough to get llama.cpp, vLLM, or whatever agents to start doing their jobs and accelerate inference by 3x.

Legend: ⚙️ = Drop-in (prompt/config today) | 🛠️ = Wrapper (orchestration/validation/retrieval) | 🔧 = Decoder (custom sampler/runtime for largest gains).

| # | Method | Type | Concise Action | Expected Benefit (vs Naive 256-Token Rendering) | Citation Cluster |
|---|---|---|---|---|---|
| | Tier 0: Foundational Official Settings (Must-Use Baseline – Fixes ~80% of Complaints) | | | | |
| 1 | Entropy-Bounded Sampler + Adaptive Stopping | ⚙️ Drop-in | Commit lowest-entropy tokens until accumulated entropy exceeds bound (0.1); stop when argmax stable (2+ steps) and mean entropy < 0.005 | Prevents premature termination/over-refinement hallucinations; dynamic steps by task complexity; 2–3× effective speedup; core path to match Qwen-level quality | Google model card & HF config (2026); Ben-Hamu et al. (EB-Sampler, NeurIPS 2025, arXiv:2505.24857) |
| 2 | Canvas Cap + Task-Tuned Entropy | ⚙️ Drop-in | Keep 256-token canvas but set `max_new_tokens` short for tool calls (64–128); lower bound (0.03–0.05) for tools/deterministic, higher (0.15–0.2) for factual/reasoning | Reduces noise/waste on short structured outputs; deterministic tool selection; preserves candidate diversity to cut premature hallucination and improve reasoning | Google serving examples (2026); EB-Sampler family + hallucination-mode papers (2026) |
| 3 | Thinking Mode + Clean History | ⚙️ Drop-in | Add `enable_thinking=True` for reasoning/tool selection; retain only final (non-thinking) response in multi-turn history | Strongly boosts tool choice, argument discovery, instruction following, and reasoning; prevents context pollution in agents (key gap vs Qwen) | Google model card (2026): “Function calling works best in thinking mode”; best-practices note |
| | Tier 1: High-ROI Workflow & Structured Output (Wrappers – Critical for Tool Use & Agents) | | | | |
| 4 | S³ Schema Scaffolding | ⚙️ Drop-in / 🛠️ Wrapper | Pre-fill correct JSON/function skeleton (braces, keys, enums, punctuation) in output context; model fills values only | Exploits bidirectional global refinement for +65% structural adherence, +48% fidelity, –17% hallucination; near-perfect JSON/tool syntax (closes major gap to Qwen) | Xiong et al. (Self-Adaptive Schema Scaffolding, ~arXiv:2507.04504, 2025); structured-output diffusion works |
| 5 | Rich Schemas + Validate-Before-Execute + Draft-Serialize Split | 🛠️ Wrapper | Use verbose semantic tool descriptions; always parse/validate before execution or history append; use DiffusionGemma for planning, specialist for final serialization | Addresses symbolic brittleness, indirect requests, and schema drift; separates reasoning from exact syntax; prevents malformed execution in agents | Google function-calling guide (2026); agentic dLLM papers (2025–2026 cluster) |
| 6 | Faithful Mode + Mid-Denoising Retrieval (SARDI-style) | 🛠️ Wrapper | For factual/tool-grounded/reasoning tasks: raise budget (60–80 steps), trigger retrieval from low-confidence tentative tokens during denoising | Counters dLLM-specific failures (premature termination, incomplete denoising, context intrusion); improves factuality, reasoning, and multi-hop agent performance at high throughput | “Lost in Diffusion” analyses (2026); SARDI-style retrieval-during-denoising papers (2025–2026) |
| 7 | Never Stream Raw Denoising States | 🛠️ Wrapper | Show only final converged/committed spans to users; reserve streamer for debugging only | Prevents UX erosion and false perception of hallucination from garbled intermediates before convergence | Google HF inference notebook (2026) |
| | Tier 2: Advanced Sampling, Caching & Constraints (Decoder Upgrades – Highest ROI for Closing Gap to Qwen/SOTA) | | | | |
| 8 | KLASS / Confidence-Aware Commit | 🔧 Decoder | Replace default commit with token-level KL divergence (or full confidence-profile selection) between timesteps to identify stable tokens | Superior stability detection vs raw entropy; 2–2.78× wall-clock speedup + reasoning quality gains over greedy diffusion | Kim et al. (KLASS-style, NeurIPS Spotlight 2025, arXiv:2511.05664); BACD/CadLLM/Prophet cluster (2026) |
| 9 | Fast-dLLM Family (Approximate KV + Parallel Decoding) | 🔧 Decoder | Port block-wise approximate KV cache + confidence-aware parallel unmasking (Fast-dLLM or v2) | Solves bidirectional KV-cache problem; up to 27.6× throughput with <1–2% accuracy loss; enables practical multi-canvas use while maintaining quality | Wu et al. (Fast-dLLM, arXiv:2505.22618, ICLR 2026 & v2) |
| 10 | SureLock / dKV-Cache / d²Cache Family | 🔧 Decoder | Lock converged tokens (skip Q/FFN while allowing attention); use delayed conditional or attention-aware KV selection; compress redundant masks | 30–50% FLOP reduction or 2–12× effective speedup; critical for quantized long-context efficiency and agent stability | Oba et al. (SureLock-style, ICLR 2026); Ma/Hu/Liu (dKV-Cache, FreeCache, d²Cache, Elastic-dLLM cluster, 2025–2026) |
| 11 | CFG / Constrained Discrete Diffusion (CDD) | 🔧 Decoder | Reject updates violating context-free grammar/regex during sampling (additive infilling or dynamic programming for max-probability valid strings) | Near-100% syntactic correctness for JSON/tool calls/code (~30% median overhead); vastly superior to prompting/scaffolding alone; closes tool-use gap to SOTA | Cardei et al. (Constrained Discrete Diffusion, arXiv:2503.09790, 2025); Mündler et al. (CFG variants, arXiv:2508.10111, ICLR 2026); DINGO-style methods |
| 12 | Remask / Review-Remask-Refine (R3/CORE) | 🔧 Decoder | On malformed/suspect spans (bad JSON field, code tail, factual error), reset only that span to [MASK] and re-denoise (avoid overwriting corrupted context) | Strong for exact token-level repair in tool calls, code, JSON, and multi-turn agents; prevents error propagation and improves reasoning consistency | Mounier et al. (Review, Remask, Refine (R3), arXiv:2507.08018, ICML 2025); CORE cluster (2026) |
| | Tier 3: Variable-Length, Self-Verification & Advanced Factuality (Decoder/Wrapper – For Complex Agents & Reasoning) | | | | |
| 13 | DAEDAL / Length-Aware Dynamic Canvas + DyStruct | 🔧 Decoder | Start short; dynamically expand via early EOS/confidence or Bayesian block partitioning (Chinese Restaurant Process); crop after first denoising step when length distribution is clear | Avoids full 256-canvas cost on short tool calls; adaptive structure for unpredictable agent outputs; reduces forced-length hallucinations and improves efficiency | DAEDAL/Length-Aware Cropping/DyStruct/LR-DLLM cluster (2025–2026); Block Diffusion extensions (Arriola et al., arXiv:2503.09573, ICLR 2025 Oral) |
| 14 | S2D2 / BlockBatch / Self-Rewarding SMC + Prophet Early-Answer | 🔧 Decoder / 🛠️ Wrapper | Same model for large-block draft + small-block (AR-like) verification; multi-branch/trajectory sampling with confidence reweighting; early-commit when answer known in initial steps | Self-speculation reduces NFEs (up to 4–6× speedup); multi-particle improves quality/reliability on hard reasoning/tool/agent prompts; cuts unnecessary refinement | S2D2, BlockBatch, TCCF, AsyncLane, Self-Rewarding SMC, Prophet cluster (2025–2026); Block Diffusion (Arriola et al., 2025) |
| 15 | TDGNet-Style Trajectory Hallucination Detector + SARDI Retrieval | 🔧 Decoder / 🛠️ Wrapper | Score full denoising trajectory (evolving attention-graph dynamics) rather than only final output; reject unstable trajectories; trigger retrieval from tentative tokens during denoising | Treats factuality as trajectory property (not endpoint); stronger detector + diffusion-native retrieval for multi-hop QA, reasoning, and agentic reliability; closes gap to SOTA like DeepSeek/GLM | TDGNet & trajectory detectors (2026 cluster); SARDI-style papers (2025–2026); aligns with R3/Remask philosophy |
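
To make row 1 concrete, here is a minimal pure-Python sketch of one reading of the entropy-bounded commit rule and the adaptive stopping check. It is not DiffusionGemma's or the EB-Sampler paper's actual implementation. The function names, the "always commit at least one token" rule, and the exact meaning of "argmax stable for 2+ steps" are my assumptions; only the thresholds (0.1, 2 steps, 0.005) come from the table.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a single token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_commits(dists, masked, bound=0.1):
    """Choose which masked positions to commit in this denoising step.

    Positions are visited from lowest to highest entropy and committed
    until adding the next one would push the accumulated entropy past
    `bound`. At least one position is always committed (assumption) so
    decoding cannot stall.
    """
    ranked = sorted(masked, key=lambda i: entropy(dists[i]))
    chosen, total = [], 0.0
    for i in ranked:
        h = entropy(dists[i])
        if chosen and total + h > bound:
            break
        chosen.append(i)
        total += h
    return chosen

def should_stop(argmax_history, dists, stable_steps=2, mean_bound=0.005):
    """Stop when the last `stable_steps` argmax snapshots are identical
    and the mean per-position entropy is below `mean_bound`."""
    if len(argmax_history) < stable_steps:
        return False
    recent = argmax_history[-stable_steps:]
    stable = all(snapshot == recent[0] for snapshot in recent)
    mean_h = sum(entropy(d) for d in dists) / len(dists)
    return stable and mean_h < mean_bound
```

In a real decoder, `dists` would come from the model's softmax over the canvas at each step, and the bound would be tuned per task as row 2 suggests (lower for tool calls, higher for reasoning).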
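
Rows 4 and 5 (schema scaffolding and validate-before-execute) can be prototyped as a wrapper with nothing but the standard library. This is a rough sketch under assumptions: the `get_weather` tool, its schema layout, and the literal `[MASK]` placeholder are made up for illustration, and how a scaffold is actually pre-filled into the model's canvas depends on the runtime.

```python
import json

# Hypothetical tool schema: argument name -> (expected type, allowed values or None).
WEATHER_TOOL = {
    "name": "get_weather",
    "args": {"city": (str, None), "unit": (str, {"celsius", "fahrenheit"})},
}

def build_scaffold(tool, mask="[MASK]"):
    """Pre-fill the JSON skeleton so the model only has to fill in values."""
    args = {key: mask for key in tool["args"]}
    return json.dumps({"name": tool["name"], "arguments": args})

def validate_call(raw, tool):
    """Return (call, None) if `raw` is a valid call, else (None, error)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        return None, "wrong or missing tool name"
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(tool["args"]):
        return None, "argument keys do not match schema"
    for key, (typ, allowed) in tool["args"].items():
        if not isinstance(args[key], typ):
            return None, f"{key}: expected {typ.__name__}"
        if allowed is not None and args[key] not in allowed:
            return None, f"{key}: value not in enum"
    return call, None
```

The idea is that only calls for which `validate_call` returns no error get executed or appended to the conversation history; everything else is retried or repaired first.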

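The remask idea in row 12 is simple enough to show with a toy loop: reset only the suspect span to `[MASK]` and re-denoise, leaving the rest of the canvas alone. This is just a sketch of the control flow, not the R3 or CORE algorithm. `find_bad_span` and `denoise` are hypothetical callables standing in for a validator and the model's denoising pass.

```python
MASK = "[MASK]"

def remask_span(tokens, start, end):
    """Return a copy of `tokens` with positions [start, end) reset to MASK."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    return tokens[:start] + [MASK] * (end - start) + tokens[end:]

def repair(tokens, find_bad_span, denoise, max_rounds=3):
    """Remask and re-denoise suspect spans until none remain.

    `find_bad_span(tokens)` returns a (start, end) pair or None.
    `denoise(tokens)` returns a new token list with the masks filled.
    Gives up after `max_rounds` and returns the latest attempt.
    """
    for _ in range(max_rounds):
        span = find_bad_span(tokens)
        if span is None:
            return tokens
        tokens = denoise(remask_span(tokens, *span))
    return tokens
```

A validator like the JSON check from rows 4 and 5 could serve as `find_bad_span` by mapping a failing field back to its token positions; that mapping is left out here.
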
submitted by /u/TomLucidor