r/LocalLLaMA · 1 min read

Agent Execution Tax: new procurement metric for browser agent benchmarks?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

One model paid a 22.9% Agent Execution Tax (wasted inference divided by productive inference). The same model that looked cheapest per token cost 2.3x more per successful task. We ran 720 browser agent tasks across four models on the WebVoyager benchmark. Open-weight models held their own against Gemini 2.5 Flash.
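A minimal sketch of the ratio described above, assuming "wasted" and "productive" inference are measured in the same unit (tokens here) and that "wasted" covers calls that did not advance the task, such as parse retries. The function name and the input numbers are hypothetical; the inputs are picked only so the ratio lands on the quoted 22.9%, and they are not benchmark data.

```python
def agent_execution_tax(wasted: float, productive: float) -> float:
    """Ratio of wasted to productive inference (both in the same unit)."""
    if productive <= 0:
        raise ValueError("productive inference must be positive")
    return wasted / productive


# Illustrative inputs only (hypothetical token counts, not from the benchmark).
tax = agent_execution_tax(wasted=229_000, productive=1_000_000)
print(f"Agent Execution Tax: {tax:.1%}")
```

Note that the denominator is productive inference rather than total inference, so the tax can exceed 100% if a run wastes more than it uses productively.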

Highlights:

- MiniMax M2.5: 2.3x cheaper per successful task than Gemini

- GLM-5: highest accuracy (57.1%), strongest on structured data

- Kimi K2.5: 0% parse retries across 852 calls (Gemini was 18.6%)

What surprised us: open-weight models are now winning agent benchmarks not because they got smarter but because they're more reliable per call.

Token pricing comparisons are misleading once retries compound.
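To make the compounding concrete, here is a rough sketch of cost per successful task. It assumes retries are independent, so each useful call needs 1 / (1 - retry_rate) attempts on average. The two model profiles and every number in them are hypothetical, chosen only to show the effect; they are not the benchmark's figures.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    price_per_call: float  # hypothetical dollars per model call
    calls_per_task: float  # useful calls needed per task
    retry_rate: float      # fraction of attempts that must be retried
    success_rate: float    # fraction of tasks completed successfully


def cost_per_successful_task(run: ModelRun) -> float:
    if not 0 <= run.retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    if not 0 < run.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # Expected attempts per task under independent retries (geometric model).
    attempts_per_task = run.calls_per_task / (1.0 - run.retry_rate)
    cost_per_task = attempts_per_task * run.price_per_call
    # Failed tasks still cost money, so spread the spend over successes only.
    return cost_per_task / run.success_rate


# Hypothetical profiles: A is cheaper per call, B is more reliable.
model_a = ModelRun(price_per_call=0.0010, calls_per_task=10,
                   retry_rate=0.20, success_rate=0.40)
model_b = ModelRun(price_per_call=0.0012, calls_per_task=10,
                   retry_rate=0.00, success_rate=0.60)

for name, run in (("A", model_a), ("B", model_b)):
    print(f"model {name}: ${cost_per_successful_task(run):.5f} per successful task")
```

In this toy setup the model that is cheaper per call ends up more expensive per successful task, because both the retry rate and the failure rate multiply its effective spend.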

Full benchmark + reproducibility steps in the link

submitted by /u/ogandrea