Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me. It is a major regression compared to the standard Q5_K_L version, which worked without issue.
I know the general consensus is that Qwen is for coding and Gemma is for creatives. But I can tell you for a fact that I code very well with the regular Q5_K_L version. When factoring in prompt structure, edits, and specific coding languages, I was able to generate 2,300 solid lines of code on a project (fully debugged, architecturally sound, and tested) . Additionally, I was able to generate 10,000 lines of story writing on a generic prompt about a samurai. Speed is not everything.
The main problem with this QAT model is that it constantly questions itself during generation. I tried using it for coding in my custom VS Code extension, writing stories, and real use cases, but the results are completely inconsistent despite hitting a solid 60 tokens a second.
The core failure point shows up right in the server startup logs:
W load: control-looking token: 50 '<|tool_response|>' was not control-type; this is probably a bug in the model. its type will be overridden
Because the model misconfigures and overrides its own tool response tags before it even starts processing, structured function execution is broken. If you rely on agent workflows or developer extensions, save your time and stick to the regular quants.
I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me. It is a major regression compared to the standard Q5_K_L version, which worked without issue.
I know the general consensus is that Qwen is for coding and Gemma is for creatives. But I can tell you for a fact that I code very well with the regular Q5_K_L version. When factoring in prompt structure, edits, and specific coding languages, I was able to generate 2,300 solid lines of code on a project. Additionally, I was able to generate 10,000 lines of story writing on a generic prompt about a samurai. Speed is not everything.
The main problem with this QAT model is that it constantly questions itself during generation. I tried using it for coding in my custom VS Code extension, writing stories, and real use cases, but the results are completely inconsistent despite hitting a solid 60 tokens a second.
To rule out any backend or hardware misconfiguration, here is the continuous startup block from my server logs showing the exact GPU detection, thread assignment, context allocation, and the native template auto-match:
0.00.074.191 I - CUDA0 : NVIDIA GeForce RTX 4080 SUPER (16375 MiB, 15061 MiB free) 0.00.074.205 I - CPU : 12th Gen Intel(R) Core(TM) i7-12700KF (98097 MiB, 86472 MiB free) 0.00.074.254 I system_info: n_threads = 12 (n_threads_batch = 12) / 20 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 0.00.074.293 I srv init: using 19 threads for HTTP server 0.00.080.574 I srv load_model: loading model 'E:\models\gemma-4-12B-it-qat-UD-Q4_K_XL.gguf' 0.01.205.117 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.205.496 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.242.092 W load: special_eog_ids contains '<|tool_response|>', removing '</s>' token from EOG list 0.03.279.202 W llama_context: n_ctx_seq (32768) < n_ctx_train (262144) -- the full capacity of the model will not be utilized 0.03.370.810 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768 0.03.370.887 I srv load_model: prompt cache is enabled, size limit: 8192 MiB 4.07.196.023 I srv params_from_: Chat format: peg-gemma4 The hardware lines prove the 4080 Super is utilized cleanly and thread execution matches the i7-12700KF topology correctly. The server successfully initialized the 32768 context size and auto-detected the proper native peg-gemma4 chat layout from the model metadata on its own.
This completely isolates the broken tool calling to the token bug shown in the warnings. The model is misconfiguring and overriding its own tool response tags before it even starts processing, breaking structured function execution. If you rely on agent workflows or developer extensions, save your time and stick to the regular quants.
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.