Benchmark & Reality Check on Gemma 4 12B: Great model, but your local settings are probably breaking it (Fix inside)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I completed a Python bug hunting benchmark with Gemma 4 12B. I used the Unsloth Dynamic Q5 GGUF model. The model has good capabilities. Default settings in LM Studio disable the reasoning.
Fix the LM Studio reasoning configuration. LM Studio looks for Qwen tokens. Gemma 4 uses different tokens. Change your settings with these steps.
• Open your inference settings.
• Add this text to the first line of your Jinja template: {%- set enable_thinking = true %}
• Set the start token to <|channel>thought
• Set the end token to <channel|>
Change your sampling parameters. Do not decrease the temperature. Low temperature hurts the reasoning quality. Use the official Google parameters.
• Set temperature to 1.0
• Set top_p to 0.95
• Set top_k to 64
Benchmark results and data. The model rewrote spatial loops correctly. The model replaced slow loops with a BallTree algorithm. The small size creates a limit for the model.
- Qwen 35B q4 k xl found 14 bugs.
- Gemma 4 12B q5 k xl found 6 bugs.
Better than 26B run I had. Probably need to find the better jinja file for it to work.
Configure your backend correctly to get the correct performance.
[link] [comments]
More from r/LocalLLaMA
-
OpenLumara - A different kind of AI agent, written from scratch, not vibecoded. Extremely token-efficient, super small system prompt, made for local models. Everything is modular.
Jun 5
-
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
Jun 5
-
dots.tts 2B🎙️ SOTA TTS from RedNote
Jun 5
-
Don’t act like y’all ain’t thinking it. I’m just saying the quiet part out loud. /s
Jun 5
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.