120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result!
I ran llama.cpp patched with the Gemma 4 MTP PR, loading Unsloth's gemma-4-12B-it-qat-GGUF quant as the main model and Google's gemma-4-12B-it-qat-q4_0-unquantized-assistant as the QAT assistant / draft model. I converted the assistant to GGUF with llama.cpp's convert_hf_to_gguf.py and uploaded it to HuggingFace as gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF. With this setup, I was able to achieve 120 tok/s with mtp-bench.py!
Before we start, here are my PC specs:
OS: CachyOS
GPU: RTX 4070 Super 12GB (iGPU as main GPU)
CPU: AMD Ryzen 7 9700X
RAM: 32GB DDR5-6000

Here's my llama.cpp command:
    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --ctx-size 131072 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64

For comparison, here are my mtp-bench.py benchmark results without MTP:
    ❯ ./mtp-bench.py
    code_python       pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9
    code_cpp          pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0
    explain_concept   pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9
    summarize         pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9
    qa_factual        pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.9
    translation       pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0
    creative_short    pred= 192 draft= 0 acc= 0 rate=n/a tok/s=60.0
    stepwise_math     pred= 192 draft= 0 acc= 0 rate=n/a tok/s=59.8
    long_code_review  pred= 192 draft= 0 acc= 0 rate=n/a tok/s=57.6

    Aggregate:
    {
      "n_requests": 9,
      "total_predicted": 1728,
      "total_draft": 0,
      "total_draft_accepted": 0,
      "aggregate_accept_rate": null,
      "wall_s_total": 30.2
    }

Here are my mtp-bench.py benchmark results with MTP:
    ❯ ./mtp-bench.py
    code_python       pred= 192 draft= 172 acc= 133 rate=0.773 tok/s=130.5
    code_cpp          pred= 192 draft= 187 acc= 128 rate=0.684 tok/s=120.4
    explain_concept   pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=105.7
    summarize         pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=133.5
    qa_factual        pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=107.2
    translation       pred= 192 draft= 175 acc= 132 rate=0.754 tok/s=128.6
    creative_short    pred= 192 draft= 240 acc= 110 rate=0.458 tok/s=94.0
    stepwise_math     pred= 192 draft= 165 acc= 135 rate=0.818 tok/s=135.7
    long_code_review  pred= 192 draft= 197 acc= 125 rate=0.634 tok/s=111.7

    Aggregate:
    {
      "n_requests": 9,
      "total_predicted": 1728,
      "total_draft": 1727,
      "total_draft_accepted": 1136,
      "aggregate_accept_rate": 0.6578,
      "wall_s_total": 15.66
    }

To achieve this, all you need is a 12GB NVIDIA GPU and enough free VRAM to fit Gemma 4 12B + assistant entirely in GPU memory. With CachyOS and my dGPU set as the secondary GPU, I get pretty much 100% free VRAM. On Windows, or if using your dGPU as your main GPU, you will probably lose 500MB+ of VRAM to the OS and driver, so you might need to lower the context size, or it might simply not work. You'll probably need to do some testing 😄
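As a sanity check on the two benchmark runs above, here is a small Python sketch that recomputes the headline numbers from the figures in this post. All values are copied from the mtp-bench.py output; the variable names are my own, and the wall-clock throughput is only a rough figure since `wall_s_total` presumably includes prompt processing and request overhead as well as generation.

```python
# Per-prompt generation speed (tok/s), copied from the two runs in this post.
baseline_tok_s = [59.9, 60.0, 59.9, 59.9, 59.9, 60.0, 60.0, 59.8, 57.6]
mtp_tok_s = [130.5, 120.4, 105.7, 133.5, 107.2, 128.6, 94.0, 135.7, 111.7]

# Per-prompt drafted / accepted token counts from the MTP run.
drafted = [172, 187, 213, 168, 210, 175, 240, 165, 197]
accepted = [133, 128, 119, 134, 120, 132, 110, 135, 125]

# Aggregate values from the "Aggregate" JSON of each run.
total_predicted = 1728
baseline_wall_s = 30.2
mtp_wall_s = 15.66


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mean_baseline = mean(baseline_tok_s)
mean_mtp = mean(mtp_tok_s)
speedup_per_prompt = mean_mtp / mean_baseline

# Acceptance rate: accepted draft tokens divided by drafted tokens.
accept_rate = sum(accepted) / sum(drafted)

# End-to-end throughput and speedup based on total wall-clock time.
wall_baseline_tok_s = total_predicted / baseline_wall_s
wall_mtp_tok_s = total_predicted / mtp_wall_s
speedup_wall = baseline_wall_s / mtp_wall_s

print(f"mean tok/s without MTP: {mean_baseline:.1f}")
print(f"mean tok/s with MTP:    {mean_mtp:.1f}")
print(f"per-prompt speedup:     {speedup_per_prompt:.2f}x")
print(f"draft acceptance rate:  {accept_rate:.4f}")
print(f"wall-clock speedup:     {speedup_wall:.2f}x")
```

The per-prompt mean with MTP comes out at about 118.6 tok/s, which is where the rounded 120 tok/s in the title comes from. That is close to a 2x speedup over the roughly 59.7 tok/s baseline. The recomputed acceptance rate matches the reported 0.6578.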
Here are step-by-step instructions to get this working:
1. Clone llama.cpp

    git clone https://github.com/ggml-org/llama.cpp.git
    cd llama.cpp

2. Fetch and switch to the Gemma 4 MTP PR branch

    git fetch origin pull/23398/head:gemma4-mtp
    git checkout gemma4-mtp

3. Build with CUDA support for NVIDIA GPUs

    cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
    cmake --build build --config Release -j$(nproc)

4. Download Unsloth's Gemma 4 12B QAT here: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

5. Download the GGUF conversion of Google's Gemma 4 assistant / draft model here: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

6. Load the models with llama-server

    llama-server \
      -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
      --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
      --spec-type draft-mtp \
      --spec-draft-n-max 4 \
      --ctx-size 131072 \
      --temp 1.0 \
      --top-p 0.95 \
      --top-k 64

Cheers 😄
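Once the server from step 6 is running, one quick way to confirm that MTP drafting is active is to send a single request and look at the timing stats that come back. Below is a minimal Python sketch using only the standard library. It assumes llama-server is listening on its default address (127.0.0.1:8080) and that the `/completion` response carries a `timings` object with `predicted_n`, `predicted_per_second`, `draft_n` and `draft_n_accepted` fields. Those field names may differ on the PR branch, so treat them as assumptions. The function names and the 192-token default (chosen to mirror the `pred= 192` in the benchmark output) are also my own and are not taken from mtp-bench.py.

```python
import json
import urllib.request

# Assumption: llama-server was started without --host/--port overrides.
SERVER_URL = "http://127.0.0.1:8080"


def build_request(prompt, n_predict=192):
    """Build the JSON body for llama-server's /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict}


def summarize_timings(response):
    """Extract speed and draft stats from a /completion response dict.

    Missing fields are tolerated, since the exact keys are an assumption.
    """
    timings = response.get("timings", {})
    draft_n = timings.get("draft_n", 0)
    draft_accepted = timings.get("draft_n_accepted", 0)
    return {
        "predicted": timings.get("predicted_n"),
        "tok_s": timings.get("predicted_per_second"),
        "draft": draft_n,
        "accepted": draft_accepted,
        # No drafted tokens means MTP was not used for this request.
        "accept_rate": draft_accepted / draft_n if draft_n else None,
    }


def complete(prompt, n_predict=192, server_url=SERVER_URL):
    """POST one prompt to the running server and return the parsed JSON."""
    body = json.dumps(build_request(prompt, n_predict)).encode("utf-8")
    request = urllib.request.Request(
        server_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server up, `summarize_timings(complete("Write a Python function that reverses a string."))` should report a non-zero `draft` count if the draft model is actually being used. If `accept_rate` comes back as `None`, the request ran without speculative decoding.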