Apex-Testing: real-world, real repos, agentic coding benchmark (Update)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
BIG Apex-Testing update! https://www.apex-testing.org/
The Real-World Agentic Coding benchmark has been updated (about 95% complete) with all recent models! It is based on 65-70 actual private GitHub repos made specifically to test the agentic coding capabilities of models.
For those seeing the project for the first time, here's the excerpt from the website:
"What is APEX Testing?
Every week there's a new model that's "the best ever." Every provider promises 10x performance at a fraction of the cost. Benchmarks get cherry-picked, their demos get curated, influencers get paid and people keep falling for it.
APEX exists because I got tired of the hype and the intentional benchmaxxing. Models get dropped into real codebases with real bugs and real feature requests, and they have to figure it out like a developer would. 70 tasks across 8 categories, all based on work you'd actually encounter on the job. You get to see what actually works and what's just marketing."
Metrics currently included:
- Avg Cost
- Avg Time
- Scoring broken down by category and difficulty
- ELO-based Leaderboard (see details on the website)
- Model comparison
- Various metrics (included in the website)
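For readers unfamiliar with Elo-style rankings, here is a minimal sketch of the textbook Elo update for one head-to-head comparison between two models. It is only an illustration: the function names, the K-factor of 32, and the 400-point scale are assumptions taken from the classic chess formula, not from APEX, and the leaderboard's actual method (described on the website) may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A did better on the task, 0.0 if B did, 0.5 for a tie.
    k (hypothetical value) controls how far a single result moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; A wins the task.
    print(update_elo(1500.0, 1500.0, 1.0))
```

With equal starting ratings the expected score is 0.5, so a win moves the winner up by half of K and the loser down by the same amount.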
There are still a few things that need to be brought up to speed such as:
- Qwen3.7 Max is currently incomplete in its run (roughly 40/70 repo tasks done)
- Qwen3.6 local models must be added (I will add them in the coming days at BF16)
- DeepSeek v4 Pro and Flash are currently incomplete in their runs
- Ideally I'd like to also add Qwen3.5 397B BF16 (Q4_K_XL is added and complete)
I will probably open up some kind of donation strictly for that, or if anyone has OpenRouter tokens to spare, I'd appreciate it. Otherwise, I'll probably only update models selectively moving forward. That applies to API costs only; local models that fit in my VRAM will be added for sure. Please don't take this as any sort of pressure; it's only for those who are interested and able to help.