r/MachineLearning · June 1, 2026 · 1 min read

How much of MLE-Bench's gains are the algorithm vs. better models + more search? [R]

MLE-Bench scores have jumped from 30% to 80% over the last two years.
But how much of that is real algorithmic progress, as opposed to better base models, shifts in the problem definition, and overfitting?

Turns out: not much. Once you hold the step budget and the base models fixed, and then test on a different set of tasks, the two-year-old AIDE algorithm matches modern agent/evolutionary search systems.

The figure is from FML-Bench, a new automated ML research benchmark that unifies the code-editing agent, the step definition, and the val/test split, so that what gets benchmarked is the algorithmic efficiency (search/memory) of the agents.

paper link: https://arxiv.org/pdf/2605.17373

[Figure: test improvement and pairwise win-rate]
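
For readers unfamiliar with the second metric, here is a minimal sketch of one common way to compute a pairwise win-rate across tasks. It is an assumption, not the paper's definition: it treats a higher test score as better, counts ties as half a win, and only compares tasks both agents ran. The names `pairwise_win_rate`, `win_rate_table`, `agent_x`, and `agent_y` are hypothetical, and the numbers are made up for illustration.

```python
from itertools import combinations


def pairwise_win_rate(scores_a, scores_b):
    """Fraction of shared tasks where A beats B (ties count as half a win).

    Assumes each argument maps task name -> test score, higher is better.
    """
    tasks = scores_a.keys() & scores_b.keys()
    if not tasks:
        raise ValueError("agents share no tasks")
    wins = 0.0
    for task in tasks:
        if scores_a[task] > scores_b[task]:
            wins += 1.0
        elif scores_a[task] == scores_b[task]:
            wins += 0.5
    return wins / len(tasks)


def win_rate_table(results):
    """Build {(a, b): win rate of a over b} for every ordered pair of agents."""
    table = {}
    for a, b in combinations(sorted(results), 2):
        rate = pairwise_win_rate(results[a], results[b])
        table[(a, b)] = rate
        table[(b, a)] = 1.0 - rate  # valid because ties are split evenly
    return table


# Made-up scores for two hypothetical agents run with the same step budget.
toy_results = {
    "agent_x": {"task_1": 0.10, "task_2": 0.05, "task_3": 0.00},
    "agent_y": {"task_1": 0.08, "task_2": 0.05, "task_3": 0.02},
}
table = win_rate_table(toy_results)
```

In this toy case `agent_x` wins one task, ties one, and loses one, so each agent's win-rate against the other is 0.5. The comparison only means something if both agents were given the same budget and base model, which is the control the post is arguing for.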

submitted by /u/Educational_Strain_3