Hugging Face Daily Papers · 4 min read

Forecasting Downstream Performance of LLMs With Proxy Metrics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.18607

Published on May 18, 2026 · Submitted by Arkil Patel on May 22, 2026
Authors: Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau
Organization: McGill NLP Group
Code: https://github.com/McGill-NLP/proxy-metrics

AI-generated summary

Proxy metrics based on token-level statistics from expert-written solutions provide more reliable model performance forecasting than traditional loss-based methods across multiple development stages.

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) for cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) for pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
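As a rough illustration of the idea in the abstract, the sketch below computes the three named token-level statistics (entropy, top-k hit, and expert token rank) at each position of an expert-written solution, averages them, and then scores a metric against downstream accuracy with Spearman rank correlation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of the mean as the aggregator, the tie handling, and all toy numbers are hypothetical and do not come from the paper or its repository.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_stats(logits: Sequence[float], expert_token: int, k: int) -> Dict[str, float]:
    """Statistics for one position of an expert-written solution."""
    probs = softmax(logits)
    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means the expert token is the top prediction.
    # Assumption: ties are resolved in favour of the expert token.
    expert_prob = probs[expert_token]
    rank = 1 + sum(1 for p in probs if p > expert_prob)
    return {
        "entropy": entropy,
        "top_k_hit": 1.0 if rank <= k else 0.0,
        "expert_rank": float(rank),
    }


def aggregate(per_token: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each statistic over positions (the mean is an assumed choice)."""
    n = len(per_token)
    return {key: sum(s[key] for s in per_token) / n for key in per_token[0]}


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy input: a 4-token vocabulary and a 3-token expert solution.
toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [-1.0, 3.0, 0.2, 0.1]]
toy_expert_tokens = [0, 2, 2]
stats = [token_stats(l, t, k=2) for l, t in zip(toy_logits, toy_expert_tokens)]
summary = aggregate(stats)

# Made-up scores for five hypothetical models (NOT results from the paper).
downstream_accuracy = [0.12, 0.35, 0.41, 0.58, 0.73]
proxy_score = [0.20, 0.31, 0.29, 0.55, 0.80]
rho_proxy = spearman_rho(proxy_score, downstream_accuracy)
```

In practice the logits would come from a forward pass of the candidate model over the tokenized expert solution, with one logit vector per position. A metric where lower is better (such as cross-entropy loss or mean expert rank) would be negated before being passed to `spearman_rho`, so that a higher correlation always means better agreement with downstream accuracy.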

Community

Paper submitter about 12 hours ago

Cross-entropy loss is a poor predictor of how models perform on downstream tasks (esp. reasoning). We propose something better: proxy metrics computed over expert reasoning traces.
