r/LocalLLaMA · · 1 min read

Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B

I wanted to know how much of a coding agent's performance came from the model and how much came from the harness, so I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. ALl the images above all come from the same model, but with a different harness.

Still working on getting automated/metric evaluation instead of subjective opinion.

Things I noticed not present in the images:

  1. Opencode can search the internet by default. This made it's results way better on some tasks. Eg the 3d printer explainer page it listed specific filament temperatures etc.
  2. On webdev, opencode delivered really good results. You can't interact with them from here, but it made cool interactive widgets that worked really well.
  3. The model really struggles with Github Copilot. It generally takes half a dozen tries to write a file. It keeps mucking up copilots file editing tools. Doesn't have this issue with other harnesses. Claude code, pi and opencode all take 4 LLM requests to create the pelican.svg. Github copilot takes 13! It tries the edit tool, it tries bash, it tries the edit tool again. Whatever tool schema they use, in my tests the LLM really struggles. This makes it really slow as it has to regenerate the same diffs again and again.
  4. Qwen3-vl-4 looped endlessly in OpenCode, couldn't even write a the pelican.svg file to disk.
submitted by /u/sdfgeoff
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA