r/LocalLLaMA · June 8, 2026 · 1 min read

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

#model-release #gpu

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Like Read original ↗

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

These last few weeks have been godsend for 24GB (and below) gpu poor peeps.

Killer models released (Gemma 4 / Qwen 3.6)
Free intelligence via QAT
Bonus speed via MTP

We're at the tipping point where GPU poor (24gb and below) people are actually NOT poor any more.

I was already happy with Gemma 4 31b running at 40tok/s but now its 70-80tok/s

Its not a wonder 3090 prices are increasing.

For ref:
- limit=1, OSL=192, concurrency 1, temp=1.0/top_k=64/top_p=0.95, ctx=40960, q8_0 KV cache, parallel=1
- For the 12b, did test for both TEXT only as well as mmproj multimodal. Same speedup increase.
(Im TOTALLY Loving the fact that you can actually TALK to the model, and its a split second before it starts generating a response. No TTS yet though)

• Hardware
- CPU: Intel Core i9-13900H, 14 cores / 20 threads
- RAM: 62 GiB system RAM, 8 GiB swap
- GPU: NVIDIA GeForce RTX 3090, 24 GiB VRAM
- Driver/CUDA: NVIDIA driver 595.71.05, CUDA 13.2
- OS/kernel: Ubuntu 24.04-ish, Linux 6.17.0-35-generic

Startup config: llama-server \ -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 4 \ --parallel 1 \ --ctx-size 40960 \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --spec-draft-ngl all \ --spec-draft-type-k q8_0 \ --spec-draft-type-v q8_0 \

submitted by /u/LeatherRub7248
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA