r/MachineLearning · May 20, 2026 · 3 min read

Under 2% quality gap but 10x cost difference: tested 5 models on identical tool calling tasks [D]

I've been running a file management agent built on MCP for a few months. It handles module renames, import updates, validation scaffolding, and test execution. A typical session is 60 to 120 tool calls. The whole thing ran on Opus 4.7, and I never thought to question that until I looked at my April bill.

So I set up a comparison. Eight refactoring tasks on a 15k line Python project, same MCP tools, same system prompt, same repo state, five models. Tasks were things like "rename this module and fix all imports" and "add input validation to these 12 endpoints." Routine cleanup, nothing requiring deep architectural thought.

The metric I cared about was first attempt tool call success: did the model produce a valid function call that executed without a parse error on the first try? On the expensive end, Opus 4.7 hit roughly 98 to 99 percent across a bit over 500 calls and cost close to $15 for all eight tasks. GPT 5 was similar quality for around $11.
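For anyone who wants to reproduce the metric, here is a minimal sketch of how first-attempt success could be tallied from a call log. The `ToolCallAttempt` record and its field names are made up for illustration, and "the arguments parse as a JSON object" is a stand-in assumption for "executed without a parse error". This is not the harness behind the numbers above.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCallAttempt:
    """One logged attempt at a tool call (hypothetical log format)."""
    call_id: str        # identifies the logical call; retries share the id
    attempt: int        # 1 for the first try, 2 and up for retries
    raw_arguments: str  # argument string exactly as the model emitted it


def parses_cleanly(raw_arguments: str) -> bool:
    """Assumed proxy for 'no parse error': arguments decode to a JSON object."""
    try:
        return isinstance(json.loads(raw_arguments), dict)
    except json.JSONDecodeError:
        return False


def first_attempt_success_rate(attempts: list) -> float:
    """Share of first attempts that parse; retries are ignored on purpose."""
    first_tries = [a for a in attempts if a.attempt == 1]
    if not first_tries:
        return 0.0
    ok = sum(parses_cleanly(a.raw_arguments) for a in first_tries)
    return ok / len(first_tries)
```

Counting only `attempt == 1` is the point of the metric: a call that succeeds on its second try still counts as a first-attempt failure.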

The cheaper tier surprised me. Sonnet 4.6 landed somewhere around 96 percent for about $4. DeepSeek V4 Pro was in the same neighborhood for under $2. And Tencent Hunyuan Hy3 preview came in within a couple of points of Opus for under $1.50. Under two percentage points separating the priciest model from the cheapest, on tasks where a failed call just gets retried anyway.
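To put rough numbers on "a failed call just gets retried anyway", here is a back-of-envelope sketch. It assumes every attempt costs the same and that retries succeed independently with the same probability as the first try, so the expected number of attempts per call is 1/p. Both are simplifying assumptions, the function names are mine, and the inputs are only the approximate figures quoted above.

```python
def cost_ratio(expensive_usd: float, cheap_usd: float) -> float:
    """How many times more the expensive run cost."""
    return expensive_usd / cheap_usd


def retry_adjusted_cost(cost_usd: float, first_try_success: float) -> float:
    """Scale a run's cost by the expected attempts per call (1/p).

    Assumes independent retries with a constant success probability,
    which real agent loops will only approximate.
    """
    if not 0.0 < first_try_success <= 1.0:
        raise ValueError("success rate must be in (0, 1]")
    return cost_usd / first_try_success


if __name__ == "__main__":
    # $15 vs. the $1.50 upper bound quoted for the cheapest model
    print(f"raw cost ratio: {cost_ratio(15.0, 1.5):.1f}x")
    # 0.985 is the midpoint of "98 to 99 percent"; 0.96 is the Sonnet figure
    print(f"Opus, retry-adjusted:   ${retry_adjusted_cost(15.0, 0.985):.2f}")
    print(f"Sonnet, retry-adjusted: ${retry_adjusted_cost(4.0, 0.96):.2f}")
```

Under these assumptions a couple of points of first-try success moves the bill by a few percent, which is small next to the gap in list price.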

I'll be honest, the results were anticlimactic. I expected a bigger reliability gap. I actually spent half a day debugging what I thought was a quality issue with one of the MoE models before realizing I'd misconfigured the tool call schema in my system prompt. Every call was producing malformed JSON and I blamed the model. Classic.
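A cheap guard against that kind of mistake is to check a few sample calls against the tool definition before blaming the model. The sketch below is stdlib only and covers a tiny subset of JSON Schema (required fields, unknown fields, a handful of types). The `rename_module` tool and its fields are hypothetical, not the actual tools from my setup.

```python
import json

# Hypothetical tool definition in JSON Schema style, for illustration only.
RENAME_TOOL = {
    "name": "rename_module",
    "parameters": {
        "type": "object",
        "properties": {
            "old_path": {"type": "string"},
            "new_path": {"type": "string"},
        },
        "required": ["old_path", "new_path"],
    },
}

# Only the types this sketch understands; a real validator covers far more.
_PY_TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def check_call(tool: dict, raw_arguments: str) -> list:
    """Return a list of problems; an empty list means the call looks fine."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    params = tool["parameters"]
    problems = []
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

If every model fails the same check in the same way, the schema or the prompt is the more likely culprit.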

Hy3 preview is a 295B-parameter MoE with 21B active per token, so full BF16 weights are around 590GB. The official deployment path is vLLM or SGLang on something like eight H200-class GPUs, which is not exactly homelab territory. But the 4-bit quantized weights land around 165GB, which just fits in unified memory on Apple Silicon. I picked up a refurbished M2 Ultra Mac Studio with 192GB for around $5,500 and installed the community MLX port from Hugging Face. Several hours of fiddling with conda environments later, I had it generating. Throughput sits around 5 to 12 tokens per second depending on context length. Sounds slow, but agent loops spend most of their wall clock time waiting on tool execution, so in practice the model is rarely the bottleneck.
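The memory arithmetic is easy to check. A sketch, taking 1GB as 10^9 bytes and counting weights only (no KV cache, no activations); the helper names are mine:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB: parameters times bits, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Invert the formula: how many bits per weight a given file size implies."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    print(weight_size_gb(295, 16))                        # BF16: 590.0 GB
    print(weight_size_gb(295, 4))                         # pure 4-bit: 147.5 GB
    print(round(effective_bits_per_weight(165, 295), 2))  # bits implied by 165 GB
```

Pure 4-bit storage would be 147.5GB, so a roughly 165GB file works out to about 4.5 bits per weight. My assumption is that the difference is quantization scales plus some tensors kept at higher precision, which is typical for quantized checkpoints, but I have not verified it for this build.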

My orchestrator now routes routine file ops and straightforward refactors to the local model or DeepSeek over API depending on whether I need faster generation. Anything that fails two retries or touches cross module boundaries gets escalated to Opus in the cloud. Daily spend dropped from somewhere in the neighborhood of $40 to around $9, and that number keeps shrinking as I shift more work to the local box where marginal cost is electricity.
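In code, the routing rule looks roughly like the sketch below. The backend names, the `Task` fields, and the `attempt` callback are placeholders rather than my actual orchestrator, and I am reading "fails two retries" as one first try plus two retries on the cheap backend before escalating.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Task:
    description: str
    crosses_modules: bool = False        # touches cross-module boundaries
    needs_fast_generation: bool = False  # prefer the API over the slower local box


def cheap_backend(task: Task) -> str:
    """Pick between the two cheap options (names are placeholders)."""
    return "deepseek-api" if task.needs_fast_generation else "local-mlx"


def run_with_escalation(
    task: Task,
    attempt: Callable[[str, Task], bool],
    max_retries: int = 2,
) -> Tuple[str, bool]:
    """Try the cheap backend first, then escalate.

    `attempt(backend, task)` runs the task once and returns True on success.
    Returns the backend that made the final attempt and whether it succeeded.
    """
    if not task.crosses_modules:
        backend = cheap_backend(task)
        for _ in range(1 + max_retries):  # first try plus retries
            if attempt(backend, task):
                return backend, True
    # Cross-module work, or the cheap backend ran out of retries.
    return "opus", attempt("opus", task)
```

Keeping the decision in one function makes it easy to log which path each task took and to tune `max_retries` later.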

The one clear failure was a nested decorator refactor. It involved three layers of wrapper interaction, and the model needed to hold complex state across many reasoning steps. It just looped, burning tokens without converging, until escalation kicked in and Opus nailed it first try. I've seen this consistently since: anything requiring sustained reasoning across unfamiliar patterns or debugging subtle type mismatches still wants the expensive model.
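One way to catch this kind of looping earlier than a retry counter would is to watch for the same tool call recurring in the recent history. This is a hypothetical sketch of that idea, not something my orchestrator does today; the window and threshold are arbitrary starting points.

```python
import json
from collections import Counter


def looks_stuck(calls: list, window: int = 10, max_repeats: int = 3) -> bool:
    """Return True if an identical (tool, arguments) call recurs too often.

    `calls` is a list of (tool_name, arguments_dict) tuples, oldest first.
    Only the last `window` calls are inspected.
    """
    recent = calls[-window:]
    # Serialize arguments with sorted keys so key order does not matter.
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in recent
    )
    return any(n >= max_repeats for n in counts.values())
```

It only catches exact repeats. A model that loops with slightly different arguments each time would slip through, so a token budget per task is still worth keeping as a backstop.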

Per OpenRouter's public rankings, Hy3 preview was #1 by tool call volume after launch, which tracks with my experience that function calling feels like a primary design goal. I'd like to try the 8-bit MLX quantization once someone publishes a clean build, mainly to see if the cross-file reasoning weakness narrows at higher precision. Still iterating on the escalation heuristic too. Retry count alone misses cases where the model is confidently wrong rather than obviously failing, and I haven't found a clean signal for that yet.

submitted by /u/Top-Cardiologist1011