r/LocalLLaMA · · 1 min read

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

These last few weeks have been godsend for 24GB (and below) gpu poor peeps.

  1. Killer models released (Gemma 4 / Qwen 3.6)
  2. Free intelligence via QAT
  3. Bonus speed via MTP

We're at the tipping point where GPU poor (24gb and below) people are actually NOT poor any more.

I was already happy with Gemma 4 31b running at 40tok/s but now its 70-80tok/s

Its not a wonder 3090 prices are increasing.

For ref:
- limit=1, OSL=192, concurrency 1, temp=1.0/top_k=64/top_p=0.95, ctx=40960, q8_0 KV cache, parallel=1
- For the 12b, did test for both TEXT only as well as mmproj multimodal. Same speedup increase.
(Im TOTALLY Loving the fact that you can actually TALK to the model, and its a split second before it starts generating a response. No TTS yet though)

• Hardware
- CPU: Intel Core i9-13900H, 14 cores / 20 threads
- RAM: 62 GiB system RAM, 8 GiB swap
- GPU: NVIDIA GeForce RTX 3090, 24 GiB VRAM
- Driver/CUDA: NVIDIA driver 595.71.05, CUDA 13.2
- OS/kernel: Ubuntu 24.04-ish, Linux 6.17.0-35-generic

Startup config: llama-server \ -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \ --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \ --spec-type draft-mtp \ --spec-draft-n-max 4 \ --parallel 1 \ --ctx-size 40960 \ --temp 1.0 \ --top-p 0.95 \ --top-k 64 \ --spec-draft-ngl all \ --spec-draft-type-k q8_0 \ --spec-draft-type-v q8_0 \ 
submitted by /u/LeatherRub7248
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA