r/LocalLLaMA · June 29, 2026 · 6 min read

GLM 5.2 Q1_S vs Qwen 27B Q8

TL;DR: GLM 5.2 Q1_S beats Qwen 3.6 27B Q8, both run at KV Q8

edit: GLM ran with the K and V caches at Q8; Qwen ran with the KV cache at full FP16 and with preserve thinking on.

Disclaimer: This is a hobby/amateur comparison with n=1, so go easy on it. I just thought it would be fun to share.

The Context and The Task

Some time ago there were quite a few discussions on what's better: a lower quant of a larger model, or a higher quant of a smaller model. We got quite a few benchmarks and in-house tests, which were mostly consistent — the larger model at a lower quant was better.

Nowadays I often see claims of anything lower than Q3 being 'braindead' regardless of the actual size. I've also noticed some comments belittling people who share how they've managed to run huge models on their consumer-grade hardware, just because it was a low quant.

So, I did a little test. Beloved Qwen 27B at Q8 vs 'braindead' GLM 5.2 at Q1_S. The Q1_S is the smallest quant I could find, but I really wouldn't be able to run Q2 anyway.

My hardware is 2 x RTX 3090, 24 GB VRAM each (power-limited to 200 W), and 192 GB DDR5 RAM. I run Qwen at ~60 tps generation, and GLM at ~6 tps at low context, dropping to ~3 tps near 100k context.

I picked a simple tech stack and clear instructions, so that there would be as little variance due to instruction ambiguity as possible.

Both models were run under the pi harness, with the exact same config and prompts. The instruction was to build a simple 3D game in Three.js (HTML/CSS/JS); the full content is attached at the end.

This is the second attempt at this test. The first one was not documented and used a different tech stack, but the results were practically the same.

Qwen 3.6 27B

It went quick, that's for sure — just a couple of minutes and ~20k tokens. But it failed to build a working product. After instructing it to fix it, it was 'working' but still not playable; it required another 2 prompts to make it 'done'. So in total: 1 initial + 3 follow-ups, with a total of ~42k tokens.

https://i.redd.it/xz4zwgclk6ah1.gif

GLM 5.2 Q1_S

Damn, it was slow. It took a couple of hours and 75k tokens, but it was really impressive to see it lay out its thought process, and it did think A LOT. Planning and reassessing all its assumptions as well as the already-designed code, improving it over many iterations in the thinking process alone. I've been using Qwen as my daily local driver for the last two months, but the GLM thinking traces are something very different and much more impressive.

It built it in one shot, 100% correct, with a proper 'premium' feel to the product. It's also the only one that managed to add sound to the game, which was omitted even by full GLM and Opus in later attempts.

https://i.redd.it/8tcd6q2mk6ah1.gif
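
For a rough sense of the speed gap, here is a back-of-the-envelope sketch that turns the token counts and generation speeds quoted in this post into wall-clock time. It assumes every counted token was generated at the quoted decode speed, which is probably pessimistic: the totals may include prompt tokens, and those are processed much faster. The function name is made up for illustration.

```python
def generation_time_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to produce `tokens` at a constant decode speed (a simplification)."""
    return tokens / tokens_per_second


# Figures quoted in the post. Treating all of them as generated tokens
# is an assumption, not something the harness reported.
qwen_first_pass = generation_time_seconds(20_000, 60)  # ~20k tokens at ~60 tps
glm_fast = generation_time_seconds(75_000, 6)          # whole run at the low-context speed
glm_slow = generation_time_seconds(75_000, 3)          # whole run at the ~100k-context speed

print(f"Qwen first pass: about {qwen_first_pass / 60:.1f} min")
print(f"GLM Q1_S: about {glm_fast / 3600:.1f} to {glm_slow / 3600:.1f} h")
```

Under those assumptions the Qwen first pass lands in the minutes range and the GLM run in the hours range, which is the same order of magnitude as what I saw in practice.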

GLM 5.2 full precision and Opus

Just for fun, I also passed the same instruction to two other models: Opus, and GLM in full precision via OpenRouter, under the same pi harness. Both were impressive, but only a tiny bit better than the local attempts visually, if at all. GLM in full precision was the bigger disappointment: it inverted the direction of the control keys, which makes the game really hard to play, even though the Q1_S version got it right. BUT GLM 5.2 in FP did it in only 11k tokens, including system prompts. I don't know if this is the API provider's default restriction on thinking tokens, or if the heavy quantization made the Q1_S model overthink a lot, but the Q1 really held up well against the FP.

The GLM full precision output:

https://i.redd.it/fta2d8ymk6ah1.gif

The only other attempt that felt genuinely better was Opus 4.8, but that was under Claude Code instead of pi. I won't provide more details on these, as they're beside the point of this post.

LLM as a judge: Code Quality

I passed all the outputs to Opus 4.8 and GPT 5.5 for them to rate code quality and instruction following; the results are as follows.

Opus rating:

GPT 5.5 rating:

Both agree that Q1_S did the best job. That is almost certainly thanks to how much it thinks, but it nonetheless shows that Q1 is a capable quant after all. You just need to find the proper use cases for it, as I wouldn't suggest using it as a real-time agentic backend.

Full instruction prompt:

Build a 3D arena game as a SINGLE self-contained .html file.

STACK (mandatory):
- Three.js loaded from a CDN (one <script> tag). No other JS libraries, no build step.
- All HTML, CSS, and JS in this one file. It must run by opening it directly in a browser.

CORE SPEC (mandatory — implement all of this exactly):
1. A flat ground plane forming a bounded arena. The player cannot leave its bounds.
2. A player object on the ground. WASD moves it (camera-relative); movement has momentum, not instant stop/start.
3. A third-person camera that smoothly follows behind the player.
4. Collectible glowing orbs spawn at random positions. Touching one collects it (+10 score) and spawns a new one.
5. Enemy objects spawn at the arena edges and move toward the player. Contact with the player costs 1 life.
6. Player starts with 3 lives. A HUD shows score and lives at all times.
7. At 0 lives: a game-over screen showing final score, with a key press to restart.
8. Difficulty ramps over time (enemies spawn faster and/or move faster).

STRETCH (strongly encouraged — you will be judged on this):
Beyond the core, make it feel PREMIUM. Lighting, shadows, particles, juice, smooth camera, satisfying feedback, polished HUD, atmosphere. Add depth or complexity if it improves the experience. Aim to genuinely impress — this is evaluated on visual quality and feel, not just correctness.

RULES:
- Implement the full core before adding stretch features.
- Output the complete, ready-to-run .html file.

submitted by /u/SnooPaintings8639