How much of MLE-Bench's gains are the algorithm vs. better models + more search? [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
MLE-Bench scores have jumped from 30% to 80% over the last two years. How much of that gain comes from the agent algorithm itself? Turns out: not much. Once you control for the step budget and the underlying models, and then test on a different set of tasks, the two-year-old AIDE algorithm matches modern agent and evolutionary search systems. Figure from FML-Bench, a new automated ML research benchmark that unifies the code-editing agent, the step definition, and the val/test split, and tries to benchmark the algorithmic efficiency (search/memory) of the agents. Paper link: https://arxiv.org/pdf/2605.17373
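To make "control for the same step budget and models" concrete, here is a toy sketch of that kind of comparison. It is not the FML-Bench harness or the AIDE algorithm. Every name in it (`Candidate`, `evaluate`, `random_search`, `greedy_search`, `run`) is hypothetical, and the "ML solution" is just a single number `x`. The point is only the protocol: each strategy gets the same evaluator and the same number of evaluation steps, picks its best candidate by validation score, and is compared on a held-out test score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    x: float           # stand-in for a full ML solution (assumption)
    val_score: float   # used for selection during search
    test_score: float  # held out, only read when reporting


def evaluate(x: float, rng: random.Random) -> Candidate:
    """Toy evaluator. One call counts as one step of the budget."""
    true_quality = -(x - 0.3) ** 2                   # made-up objective
    noisy_val = true_quality + rng.gauss(0.0, 0.01)  # validation is noisy
    return Candidate(x=x, val_score=noisy_val, test_score=true_quality)


Strategy = Callable[[Optional[Candidate], random.Random], float]


def random_search(best: Optional[Candidate], rng: random.Random) -> float:
    """Ignore history and sample a fresh candidate every step."""
    return rng.uniform(0.0, 1.0)


def greedy_search(best: Optional[Candidate], rng: random.Random) -> float:
    """Perturb the best-so-far candidate, clipped to [0, 1]."""
    if best is None:
        return rng.uniform(0.0, 1.0)
    return min(1.0, max(0.0, best.x + rng.gauss(0.0, 0.1)))


def run(strategy: Strategy, step_budget: int, seed: int) -> Tuple[Candidate, List[Candidate]]:
    """Run one strategy for exactly `step_budget` evaluations."""
    rng = random.Random(seed)
    history: List[Candidate] = []
    best: Optional[Candidate] = None
    for _ in range(step_budget):
        cand = evaluate(strategy(best, rng), rng)
        history.append(cand)
        # Selection only ever looks at the validation score.
        if best is None or cand.val_score > best.val_score:
            best = cand
    assert best is not None, "step_budget must be at least 1"
    return best, history


if __name__ == "__main__":
    for name, strategy in [("random", random_search), ("greedy", greedy_search)]:
        best, history = run(strategy, step_budget=20, seed=0)
        print(name, "steps:", len(history), "test:", round(best.test_score, 4))
```

With the budget and evaluator held fixed like this, any remaining gap in test score is attributable to the search strategy rather than to extra steps or a stronger underlying model, which is the kind of isolation the post describes.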