GLM 5.2 Q1_S vs Qwen 27B Q8
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TL;DR: GLM-5.2 Q1_S beats Qwen 3.6 27B Q8, with preserve thinking on. Edit: I originally said both ran with the KV cache at Q8; in fact GLM ran with K and V at Q8, while Qwen ran with the KV cache at full FP16.

Disclaimer: This is a hobby/amateur comparison with n=1, so go easy on it. I just thought it would be fun to share.

The Context and The Task

Some time ago there were quite a few discussions on what's better: a lower quant of a larger model, or a higher quant of a smaller model. We got quite a few benchmarks and in-house tests, which were mostly consistent: the larger model at a lower quant was better. Nowadays I often see claims that anything lower than Q3 is 'braindead' regardless of the actual size. I've also noticed some comments belittling people who share how they've managed to run huge models on their consumer-grade hardware, just because it was a low quant.

So, I did a little test: beloved Qwen 27B at Q8 vs 'braindead' GLM 5.2 at Q1_S. The Q1_S is the smallest quant I could find, but I really wouldn't be able to run Q2 anyway. My hardware is 2 x RTX 3090 with 24 GB VRAM each (power-limited to 200 W) and 192 GB of DDR5 RAM. I run Qwen at ~60 tps generation, and GLM at ~6 tps at low context, dropping to ~3 tps nearing 100k context.

I picked a simple tech stack and clear instructions, so that there would be as little variance due to instruction ambiguity as possible. Both models were run under the pi harness, with the exact same config and prompts. The instruction was to build a simple 3D game in Three.js (HTML/CSS/JS); the full content is attached at the end.

This is the second attempt at this test. The first one was not documented and used a different tech stack, but the results were practically the same.

Qwen 3.6 27B

It was quick, that's for sure: just a couple of minutes and ~20k tokens. But it failed to build a working product. After I told it to fix it, the result was 'working' but still not playable; it took another 2 prompts to make it 'done'. So in total: 1 initial prompt + 3 follow-ups, with a total of ~42k tokens.
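To put the generation speeds quoted above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a constant decode speed and ignores prompt processing, which is a simplification of this sketch rather than something measured in the test, and the helper name `generation_minutes` is made up for illustration. It applies the same ~20k-token budget to each of the quoted speeds.

```python
def generation_minutes(tokens: int, tokens_per_second: float) -> float:
    """Estimate wall-clock minutes to generate `tokens` at a constant decode speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second / 60


# Speeds are the approximate figures quoted in the post.
# Treating them as constant over a whole run is an assumption of this sketch.
speeds = {
    "Qwen 27B Q8 at ~60 tps": 60.0,
    "GLM Q1_S at ~6 tps (low context)": 6.0,
    "GLM Q1_S at ~3 tps (near 100k context)": 3.0,
}

for label, tps in speeds.items():
    minutes = generation_minutes(20_000, tps)
    print(f"{label}: {minutes:.1f} min for 20k tokens")
```

Under these assumptions the same 20k tokens take roughly 5 to 6 minutes at Qwen's speed and somewhere between one and two hours at GLM's, so the speed gap alone is 10x to 20x before any difference in how much each model thinks.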
https://i.redd.it/xz4zwgclk6ah1.gif

GLM 5.2 Q1_S

Damn, it was slow. It took a couple of hours and 75k tokens, but it was really impressive to watch it lay out its thought process, and it did think A LOT: planning and reassessing all its assumptions as well as the already-designed code, and improving it over many iterations in the thinking process alone. I've been using Qwen as my daily local driver for the last two months, but the GLM thinking traces are something very different and much more impressive. It built the game in one shot, 100% correct, with a proper 'premium' feel to the product. It's also the only one that managed to add sound to the game, which was omitted even by full GLM and Opus in later attempts.

https://i.redd.it/8tcd6q2mk6ah1.gif

GLM 5.2 full precision and Opus

Just for fun, I also passed the same instruction to two other models: Opus, and GLM in full precision via OpenRouter, under the same pi harness. Both were impressive, but only a tiny bit better than the local attempts visually, if at all. GLM in full precision was especially disappointing: it inverted the direction of the control keys, which makes the game really hard to play, even though the Q1_S version got it right.

BUT: GLM 5.2 in FP did it in only 11k tokens, including system prompts. I don't know whether that's the API provider's default limit on thinking tokens, or whether the heavy quantization made the Q1_S overthink a lot, but either way the Q1 really held up well against the FP.

The GLM full precision output: https://i.redd.it/fta2d8ymk6ah1.gif

The only other attempt that felt genuinely better was Opus 4.8, but under Claude Code instead of pi. I won't provide more details on these, as they're beside the point of this post.

LLM as a judge: Code Quality

I passed all the outputs to Opus 4.8 and GPT 5.5 for them to rate code quality and instruction following; the results are as follows.
Opus rating:
Qwen: Code Quality 7.5 | Instruction Following 9.0 | Stretch/Polish 8.5 | Overall 8.3

GPT 5.5 rating:
Qwen: Code Quality 6.4 | Instruction Following 7.0 | Stretch/Polish 7.0 | Overall 6.7

Both agree that Q1_S did the best job. That is almost certainly thanks to how much it thought, but it nonetheless proves that Q1 is a capable quant after all. You just need to find the proper use cases for it, as I wouldn't suggest using it as a real-time agentic backend.

Full instruction prompt: