Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I wanted to know how much of a coding agent's performance comes from the model and how much comes from the harness, so I vibed a setup that lets me test multiple agentic harness/model combinations on the same task. All the images above come from the same model, each with a different harness. I'm still working on automated, metric-based evaluation to replace subjective opinion. Things I noticed that are not visible in the images: