r/LocalLLaMA · May 31, 2026 · 2 min read

Qwen3.6-35B vs Gemma4-26B on 7900 XTX

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Ran a fair comparison between Qwen3.6-35B-A3B and Gemma4-26B-A4B on my Radeon 7900 XTX. Both reasoning-enabled at matching 32K budgets, no output caps, six generic real-world prompts (meeting notes, incident postmortem, log triage to JSON, code review, a build-vs-buy decision, a creative prompt).

TL;DR: the model with the slower decoder won the wall clock. Qwen’s MTP makes it ~1.65x faster at emitting tokens (130 vs 78 tok/s), but it generates ~2x as many tokens to answer the same prompt, most going to internal reasoning. Net result: Gemma is ~20% faster end to end. Aggregate across all six: Qwen 118.8s vs Gemma 95.6s.

Setup:
Ryzen 9600X, Sapphire NITRO+ 7900 XTX 24GB, 96GB DDR5-6800
ROCm 7.2.3, HIP gfx1100, llama.cpp build 9425, GGML_HIP=ON, ROCWMMA_FATTN=OFF
Qwen3.6-35B-A3B: IQ4_XS-Q8nextn hybrid MTP (~20GB), draft-n-max 3
Gemma4-26B-A4B: UD-Q4_K_XL (~17GB), no MTP

Key findings:
Qwen generated 14,811 tokens across the six workloads vs Gemma’s 7,386 — about 2x. It also spends a higher fraction of that on thinking (74% vs 57% aggregate).

Per-workload wall clock:
meeting-notes: Qwen 12.2s vs Gemma 10.8s (Gemma)
incident-postmortem: Qwen 28.2s vs Gemma 21.6s (Gemma)
log-triage-json: Qwen 10.4s vs Gemma 9.0s (Gemma)
code-review: Qwen 20.6s vs Gemma 23.1s (Qwen — the one task where both reasoned least)
build-vs-buy: Qwen 33.1s vs Gemma 21.5s (Gemma)
creative-spark: Qwen 14.4s vs Gemma 9.5s (Gemma)

The MTP question: on pure decode, MTP delivers 130 vs 78 tok/s, accept rates 41-62% (52.5% overall). But MTP only speeds up how fast you emit tokens, not how many. With thinking on, token count is the bottleneck, so the decode edge mostly evaporates. Measured as useful content chars/sec the two are basically tied (~137 vs ~130).

Quality: genuinely close, with interesting splits. On the code review Gemma caught a missing-param TypeError that Qwen missed. On build-vs-buy they gave opposite, but both defensible, answers (Qwen: managed Algolia; Gemma: just use Postgres, don’t touch Elasticsearch). On the strict-JSON task Qwen followed “no prose” and emitted bare JSON; Gemma wrapped it in a code fence. Neither hallucinated.

My conclusion: use both. Throughput-bound batch work to Qwen (decode speed compounds across many sequential requests, and it follows strict output formats). Latency-sensitive single requests to Gemma (it was ~20% faster wall clock here despite the slower decoder). The takeaway a spec sheet won’t give you: a faster decoder doesn’t mean faster responses. Tokens-to-answer beats tokens-per-second once reasoning is in the loop.

Full benchmark details and raw prompt/output pairs: Qwen3.6-35B vs Gemma4-26B: Real Workload Benchmarks on Radeon 7900 XTX · kmarble.dev
Raw data available on request.

submitted by /u/IvGranite
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA