r/MachineLearning · June 25, 2026 · 3 min read

I stopped trusting model benchmarks and started running my own eval set, here is what changed [D]

Three things broke my faith in published benchmarks recently.

One, Kimi K2.7 Code shipped with reported gains of +21.8 percent on Kimi Code Bench v2, +11 percent on Program Bench, and +31.5 percent on MLS Bench Lite. All three are Moonshot's own benchmarks. None of the results were submitted to DeepSWE, which is the one independent coding benchmark that actually produces a meaningful spread between models. When a vendor reports gains on benchmarks it designed and controls, the gains are real, but the question they answer is "are we better at our own test," not "are we better at your workload."

Two, GLM-5.2 hit 51 on the Artificial Analysis Intelligence Index, which is third-party, but the model parameters are self-reported. The index is good for relative ranking within the Artificial Analysis methodology. It is not a prediction of how the model performs on the specific distribution of inputs my product sends it.

Three, Seed 2.1 just landed and the official information is thin. No clear public eval yet, no third-party leaderboard entries I could find. So for now "Seed 2.1 is good" is not a claim I can verify either way.

What I did was build a small eval set from real production traffic, about 240 tasks sampled across our actual usage distribution, frozen so it does not drift. Every model I consider has to run all 240 and I record pass rate, latency, token cost, and a subjective quality score from the person who owns that task area. It is not as rigorous as a published benchmark and it is definitely smaller, but it has one property the published ones do not, which is that it is my distribution.
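For anyone wanting to replicate the bookkeeping, here is a minimal sketch of what one per-task record and a per-model summary could look like. The field names, the `TaskResult` and `summarize` helpers, and the 1 to 5 quality scale are all assumptions for illustration, not the schema actually used above.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class TaskResult:
    """One model's result on one eval task (hypothetical schema)."""
    task_id: str
    model: str
    passed: bool
    latency_s: float
    cost_usd: float
    quality: int  # subjective score from the task-area owner; 1-5 scale is an assumption


def summarize(results):
    """Aggregate one model's results over the full eval set."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "n": len(results),
        "pass_rate": sum(r.passed for r in results) / len(results),
        "median_latency_s": median([r.latency_s for r in results]),
        "total_cost_usd": sum(r.cost_usd for r in results),
        "mean_quality": mean([r.quality for r in results]),
    }
```

Calling `summarize` on the 240 records for each candidate gives one comparable row per model. Median latency is used here instead of the mean so a few slow outliers do not dominate, which is a design choice rather than a requirement.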

The implementation detail that mattered more than I expected was removing provider variance from the run itself. I route every candidate model through GPTProto so each one gets the exact same 240 prompts in the same order, and the cost and latency come back in one log schema instead of five dashboards. A homegrown shim would do the same job. The point is not the product; it is that a fair comparison only works when everything except the model is held constant.
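A homegrown shim along those lines can be quite small. The sketch below is an assumption-laden illustration, not GPTProto's API: `call_model` is a hypothetical adapter you would write per provider, returning the output text and the cost, and `check` is a hypothetical per-task pass/fail function.

```python
import time


def run_eval(models, tasks, call_model, check):
    """Run every model over the same tasks in the same order.

    call_model(model, prompt) -> (output_text, cost_usd)   # hypothetical adapter
    check(task, output_text)  -> bool                      # hypothetical grader
    Returns a flat list of log rows that all share one schema.
    """
    rows = []
    for model in models:
        for task in tasks:  # identical prompts, identical order, for every model
            start = time.perf_counter()
            output, cost_usd = call_model(model, task["prompt"])
            latency_s = time.perf_counter() - start
            rows.append({
                "model": model,
                "task_id": task["id"],
                "passed": check(task, output),
                "latency_s": latency_s,
                "cost_usd": cost_usd,
            })
    return rows
```

Because the loop, the timer, and the row format live in one place, the only thing that varies between runs is the model name passed to the adapter. Latency measured this way is wall-clock time from the caller's side, so it includes network overhead.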

The results have been humbling. The model that wins on our set is not always the one at the top of the public leaderboard, and the gap between first and second place on our set is much smaller than the gap the press releases imply. We also caught one model that benchmarked great but had a nasty failure mode on our long tail of edge-case prompts, one that would have been a production incident if we had shipped it.

I am not saying public benchmarks are useless. They are useful for narrowing the field. But the decision of which model to actually put in front of users should be made on your own data, and the eval set has to be frozen and versioned or it will quietly become "things the current model is good at" and stop measuring anything.
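One way to enforce "frozen and versioned" is to fingerprint the eval set and refuse to run if the fingerprint changes. This is a minimal sketch under the assumption that tasks are JSON-serializable dicts; `eval_set_fingerprint` and `assert_frozen` are hypothetical helper names.

```python
import hashlib
import json


def eval_set_fingerprint(tasks):
    """Return a SHA-256 hex digest of the task list in canonical JSON form.

    Keys are sorted so dict key order does not matter, but the order of
    tasks in the list does, since run order is part of what is frozen.
    """
    canonical = json.dumps(
        tasks, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(tasks, expected_fingerprint):
    """Raise if the eval set no longer matches the recorded fingerprint."""
    actual = eval_set_fingerprint(tasks)
    if actual != expected_fingerprint:
        raise RuntimeError(
            f"eval set changed: expected {expected_fingerprint}, got {actual}"
        )
```

Recording the fingerprint next to every result makes it obvious when two runs were scored against different versions of the set. If a task does need to change, the idea would be to bump the version and store the new fingerprint deliberately, not to edit in place.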

submitted by /u/Additional-Engine402