r/LocalLLaMA · June 9, 2026 · 2 min read

Gemma 4 31B's competence surprised me

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm just getting started using local LLMs for code. I'm not interested in vibe coding, but I am hoping to increase my productivity in the publish-or-perish world of academia. My existing code from past projects is a mess, and LLMs often fail to understand it because I work with niche models, don't comment much, and sometimes have misleading variable names that LLMs over-index on (if I redesign things as I learn new information, I might not rename variables as I change their use). So, I'm moving at a very deliberate pace as I try to integrate local LLMs into my coding workflow.

In an early test of local models' ability to simply explain how some code implemented a model described in a paper, the Qwen 3.6 models had standout performance. So, on a test project expanding some old messy code from my dissertation, I was really surprised to find that Gemma 4 31b substantially outperformed Qwen 3.6 (both the 27b model and the 35b a3b), and Opus 4.7 assessed its performance as essentially on par with its own. This repo explains the project in detail.

My main takeaway was that Gemma 4 31b is stellar at actually understanding how the parts of my code fit together: it knows how changing one thing affects other parts of the code. The Qwen 3.6 models felt overzealous; they often rewrote the file I gave them with modification plans and requested access outside of the working directory. Qwen 3.6 27b did spot an improvement that could be made to my code that was overlooked by both Gemma and Opus, but it was in a subcomponent that wasn't being used in the notebooks I provided, and the improvement was entirely local: it didn't involve understanding how a change in one place required a change somewhere else.

This is all anecdotal, and I didn't begin this intending to make a post. Some models got slightly different prompts than others, but the performance difference was so contrary to my expectations that I had to post. I'm interested in hearing whether others have had similar experiences. Does anyone know what benchmarks might track the sort of capabilities I'm looking for in a model? Most benchmarks seem to show Qwen outperforming Gemma. I did see that SciCode is one benchmark where Gemma beats Qwen, and I'm wondering if that's a benchmark I should index on in the future. Idk if I'm describing this well or looking for the right things in these models, so I'm interested in hearing others' thoughts.

submitted by /u/The_Paradoxy