r/LocalLLaMA · 2 min read

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks


For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090.

The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke.

Setup:

- RTX 3090, 24GB VRAM

- Qwen3.6-27B at Q6_K (~22GB on-GPU), 32k effective context

- Ollama as the inference engine

- Multi-agent orchestrator with structured-JSON plans, plan-approval modal, auto-review pass after sub-agent completion

- Tested across 47 multi-step coding workflows over two real repos

What worked (the reasoning layer):

- Plan generation. Qwen3.6 generated multi-step plans roughly as well as Claude on these tasks. Slightly more conservative (fewer unsolicited "let me also refactor X" steps), but coherent and schema-valid ~95% of the time after a few prompt tweaks. The remaining 5% were fixable with one re-prompt.

- Memory extraction. Mem0-style fact extraction every 6 turns worked fine. Qwen pulled out the same kinds of facts Claude does ("user prefers no comments unless they explain a 'why'") and stored them cleanly in Qdrant.

- Auto-review of sub-agent output. A second Qwen instance reviewing the first one's code caught roughly 60% of the bugs Claude's review caught on the same set. Less savage. Still useful and free.
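
A rough sketch of the "validate, then re-prompt once" loop from the plan-generation bullet. The plan shape (`steps`, `id`, `description`), `validate_plan`, and the `generate` callable are all hypothetical stand-ins, not the orchestrator's actual format:

```python
import json
from typing import Callable

# Hypothetical plan shape: {"steps": [{"id": int, "description": str}, ...]}
REQUIRED_STEP_FIELDS = {"id": int, "description": str}


def validate_plan(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the plan is usable."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["top level must be an object"]
    steps = plan.get("steps")
    if not isinstance(steps, list) or not steps:
        return ["'steps' must be a non-empty list"]
    problems = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            problems.append(f"step {i} is not an object")
            continue
        for field, expected in REQUIRED_STEP_FIELDS.items():
            if not isinstance(step.get(field), expected):
                problems.append(f"step {i}: '{field}' missing or not {expected.__name__}")
    return problems


def plan_with_one_retry(generate: Callable[[str], str], prompt: str) -> dict:
    """Ask for a plan; on schema errors, re-prompt once with the errors attached."""
    raw = generate(prompt)
    problems = validate_plan(raw)
    if problems:
        retry_prompt = prompt + "\n\nYour last plan was rejected:\n- " + "\n- ".join(problems)
        raw = generate(retry_prompt)
        problems = validate_plan(raw)
    if problems:
        raise ValueError(f"plan still invalid after one re-prompt: {problems}")
    return json.loads(raw)
```

Feeding the concrete validation errors back into the retry prompt is a design choice here, on the assumption that a model fixes a named mistake more reliably than a blind retry.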

Where it broke:

- Tool-call reliability. Qwen3.6's JSON tool-call output had a ~12% format error rate across the 47 tasks. Claude was ~0.5% on the same workload. The errors weren't malformed JSON; they were wrong field names, wrong types, and hallucinated tool signatures. Outlines / strict-output mode reduced it but didn't kill it.

- Long-context drift. Past ~14k tokens of accumulated session context, Qwen started misremembering decisions it had made earlier ("you said use Postgres"; no, I said the opposite). I ended up with a hard practical limit of ~12k tokens, followed by an aggressive summarize-and-reset.

- Cascade-failure handling. When a sub-agent failed, Claude's planner usually noticed and re-planned. Qwen sometimes just generated downstream steps assuming the sub-agent had succeeded. Three cascading hallucinations in 47 runs. Not catastrophic with plan gating in place. Would be catastrophic without.
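
For the long-context bullet, the ~12k-token limit with summarize-and-reset could be enforced along these lines. This is a sketch under assumptions: the 4-characters-per-token estimate is a crude placeholder for the model's real tokenizer, and `summarize` is a hypothetical callable (in practice probably another model call):

```python
from typing import Callable

TOKEN_BUDGET = 12_000  # the practical limit reported above


def rough_tokens(text: str) -> int:
    # Crude assumption: ~4 characters per token. Swap in the real tokenizer.
    return max(1, len(text) // 4)


def append_with_reset(
    history: list[str],
    message: str,
    summarize: Callable[[list[str]], str],
    budget: int = TOKEN_BUDGET,
    keep_recent: int = 4,
) -> list[str]:
    """Append a message; if over budget, collapse older turns into one summary.

    keep_recent must be at least 1.
    """
    history = history + [message]
    if sum(rough_tokens(m) for m in history) <= budget:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    if not older:
        return history  # nothing left to compress
    return ["[summary] " + summarize(older)] + recent
```

Keeping the last few turns verbatim and summarizing only the older ones is one option; decisions such as "use Postgres" would need to survive in the summary for this to help with the drift described above.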

The contrarian take: Qwen3.6-27B is a viable REASONING layer for local multi-agent systems today. It is NOT a viable execution layer. Run plans through it; gate every tool call.
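
A minimal sketch of what "gate every tool call" can mean in code: check each model-proposed call against a registry before anything executes, covering the three error classes seen above (hallucinated tools, wrong field names, wrong types). The tool names and the `name`/`arguments` call shape are hypothetical:

```python
import json

# Hypothetical registry: tool name -> {argument name: expected Python type}
TOOLS = {
    "write_file": {"path": str, "content": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def check_tool_call(raw: str, tools: dict = TOOLS) -> list[str]:
    """Return reasons to reject the call; an empty list means it may proceed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool: {name!r}"]  # hallucinated tool
    if not isinstance(args, dict):
        return ["'arguments' must be an object"]
    spec = tools[name]
    errors = [f"unexpected field: {k}" for k in args if k not in spec]
    for field, expected in spec.items():
        if field not in args:
            errors.append(f"missing field: {field}")
        elif type(args[field]) is not expected:  # strict: "30" is not an int
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors
```

A call with `"timeout": "30"` instead of `"timeout_s": 30` parses as perfectly valid JSON, which is why a JSON-only check would miss the kind of error reported here.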

Practical implication: if you're building local-only agents, you need (1) structured-output enforcement at the tool-call boundary (outlines, lm-format-enforcer, or your inference engine's grammar mode), (2) plan-approval gating so the 12% format errors don't reach actual file writes, (3) re-plan-on-failure logic in the orchestrator, because the model itself can't be trusted to do it.
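
Point (3) could look something like this: the orchestrator, not the model, inspects each step's result and throws away the downstream steps when one fails. `StepResult`, `run_step`, and `replan` are hypothetical names for illustration only:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    ok: bool
    detail: str = ""


def execute_plan(
    steps: list[str],
    run_step: Callable[[str], StepResult],
    replan: Callable[[list[str], str, str], list[str]],
    max_replans: int = 2,
) -> list[str]:
    """Run steps in order; on failure, drop the remaining steps and re-plan.

    Returns the steps that completed. Raises if the re-plan budget runs out.
    """
    done: list[str] = []
    queue = list(steps)
    replans = 0
    while queue:
        step = queue.pop(0)
        result = run_step(step)
        if result.ok:
            done.append(step)
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up after {replans} re-plans; last failure: {step}")
        replans += 1
        # Downstream steps assumed this one succeeded, so discard them
        # and ask the planner for a fresh tail given what actually happened.
        queue = replan(done, step, result.detail)
    return done
```

The re-plan cap is an assumption on my part; without some bound, a step that always fails would loop forever.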

The 12% tool-call gap is the metric to close. Once Qwen3.6 (or the next local model) hits ~2% on this, the case for cloud reasoning in agent loops gets weaker fast.

Disclosure: the orchestrator I tested this on is OpenYabby (openyabby.com). I built it. Tested honestly because I genuinely wanted to know if I could stop paying Anthropic.

submitted by /u/Interesting-Sock3940