Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output.
Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying. It handles repo-level reasoning, produces decent code. But yeah it makes more mistakes than frontier models. Expected.
What I didn't expect was that the 3-critic pipeline I built for frontier models turns out to be a great fit here. Critics catch the extra mistakes. Harness handles the retry overhead without breaking flow. The output after critics have done their work is good enough that I can't really tell the difference from a frontier run in terms of final quality. The path is just noisier.
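To make the loop concrete, here is a minimal sketch of a critic-gated retry loop like the one described above. It is not the actual harness from the post: `Verdict`, `run_with_critics`, `stub_generate`, and `make_critic` are hypothetical names, and the critics are stubs standing in for real code review, test review, and Playwright e2e runs. The only assumptions taken from the post are that each critic judges the candidate with fresh context (here, it sees nothing but the candidate) and that output is accepted only when all three pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    critic: str
    passed: bool
    feedback: str = ""


# A critic sees only the candidate output, never the generator's history,
# which is one simple way to model "fresh context".
Critic = Callable[[str], Verdict]
# The generator receives the task plus feedback from the previous failed attempt.
Generator = Callable[[str, List[str]], str]


def run_with_critics(
    task: str,
    generate: Generator,
    critics: List[Critic],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (accepted_output, attempts_used); output is None if never accepted."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        failures = [v for v in (c(candidate) for c in critics) if not v.passed]
        if not failures:
            return candidate, attempt
        # Only the failing critics' notes are fed into the next attempt.
        feedback = [f"{v.critic}: {v.feedback}" for v in failures]
    return None, max_attempts


# Hypothetical stubs: a "model" that only gets it right once it has feedback,
# and critics that reject anything containing the word "buggy".
def stub_generate(task: str, feedback: List[str]) -> str:
    return "fixed code" if feedback else "buggy code"


def make_critic(name: str) -> Critic:
    def critic(candidate: str) -> Verdict:
        ok = "buggy" not in candidate
        return Verdict(name, ok, "" if ok else "found a bug")
    return critic


critics = [make_critic(n) for n in ("code_review", "test_review", "playwright_e2e")]
result, attempts = run_with_critics("add login form", stub_generate, critics)
print(result, attempts)
```

With a weaker model, the expectation is that `attempts` goes up on average while the accepted `result` looks the same, which is the "noisier path, same final quality" effect the post describes.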
One caveat though: the plan this run is executing was written by GLM5.2, not Qwen3.6. My guess is the optimal split is frontier for planning + Qwen3.6 for execution: a strong model where reasoning matters most, and a cheap model for high-volume implementation where the harness catches errors.
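A rough sketch of that split, assuming the plan is plain text with one step per line. `plan_then_execute`, `stub_planner`, and `stub_executor` are hypothetical, and the two callables are stand-ins for whatever clients actually talk to the planning and execution models; no real model API is called here.

```python
from typing import Callable, List

# Each "model" is just a function from prompt text to response text.
Model = Callable[[str], str]


def plan_then_execute(goal: str, planner: Model, executor: Model) -> List[str]:
    """Ask the strong model for a plan once, then run each step on the cheap model."""
    plan = planner(goal)
    # Assumption: one step per non-empty line of the plan.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(step) for step in steps]


# Stubs in place of the frontier planner and the local executor.
def stub_planner(goal: str) -> str:
    return "1. write model\n\n2. write tests"


def stub_executor(step: str) -> str:
    return f"done: {step}"


outputs = plan_then_execute("add login form", stub_planner, stub_executor)
print(outputs)
```

In a real setup each executor call would presumably go through the critic pipeline before its output is accepted, so the planner is called once per task while the cheaper model absorbs the retries.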