Hugging Face · 8 min read

Community Evals: Because we're done trusting black-box leaderboards over the community



Published February 4, 2026

banner

TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges prove that the results can be reproduced.

Evaluation is broken

Let's be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet, based on usage reports, some models that ace these benchmarks still can't reliably browse the web, write production code, or handle multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.

There is a second gap within the reported scores themselves: multiple sources report different results. Model cards, papers, and evaluation platforms rarely agree, so the community lacks a single source of truth.

What We're Shipping

Decentralized and transparent evaluation reporting.

We are taking evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and letting the entire community openly report benchmark scores. We'll start with a shortlist of 4 benchmarks and expand over time to the most relevant ones.

For Benchmarks: Dataset repos can now register as benchmarks (MMLU-Pro, GPQA, HLE are already live). They automatically aggregate reported results from across the Hub and display leaderboards in the dataset card. The benchmark defines the eval spec via eval.yaml, based on the Inspect AI format, so anyone can reproduce it. The reported results need to align with the task definition.

benchmark image
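As a rough illustration, an eval spec in the Inspect AI style might look like the sketch below. The field names here are assumptions for illustration, not the official schema; check the docs for the real format.

```yaml
# Hypothetical eval.yaml sketch -- field names are illustrative,
# not the official schema.
name: my-benchmark
dataset: example-org/my-benchmark   # dataset repo that hosts the benchmark
split: test
solver: multiple_choice             # Inspect AI-style solver for the task
scorer: choice                      # Inspect AI-style scorer
metrics:
  - accuracy
```

Because the spec lives alongside the dataset, anyone can run the same task definition and check that a reported result aligns with it.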

For Models: Eval scores live in .eval_results/*.yaml in the model repo. They appear on the model card and are fed into benchmark datasets. Both the model author's results and open pull requests for results are aggregated. Model authors will be able to close score PRs and hide results.
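For instance, a single result file in .eval_results/ might record a score roughly as follows. The field names are an illustrative assumption, not the documented schema:

```yaml
# Hypothetical .eval_results/ entry -- illustrative schema only.
benchmark: example-org/my-benchmark   # benchmark dataset repo
metric: accuracy
value: 0.78
source: https://example.com/paper     # paper, model card, or eval logs
date: 2026-02-04
```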

For the Community: Any user can submit evaluation results for any model via a PR. Results are shown as "community" without waiting for model authors to merge or close them. Submitters can link to sources such as a paper, model card, third-party evaluation platform, or Inspect eval logs, and scores can be discussed like any PR. Since the Hub is Git-based, there is a full history of when evals were added and changed. Example sources are shown below.

model image

To learn more about evaluation results, check out the docs.

Model scores in the Hub

Why This Matters

Decentralizing evaluation will expose scores that already exist across the community in sources like model cards and papers. Once exposed, the community can build on top of them to aggregate, track, and understand scores across the field. All scores will also be available via Hub APIs, making it easy to build curated leaderboards, dashboards, and more.
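As a sketch of what building on exposed scores could look like, the snippet below parses eval-result YAML and keeps the best reported value per benchmark. The schema (`results`, `benchmark`, `metric`, `value`) is an assumption for illustration, not the official format:

```python
# Sketch: aggregate reported scores per benchmark from eval-result YAML.
# The schema used here is an illustrative assumption, not the official format.
import yaml

raw = """
results:
  - benchmark: example-org/bench-a
    metric: accuracy
    value: 0.78
  - benchmark: example-org/bench-a
    metric: accuracy
    value: 0.81
  - benchmark: example-org/bench-b
    metric: accuracy
    value: 0.52
"""

def best_scores(doc: str) -> dict:
    """Return the highest reported value for each benchmark."""
    data = yaml.safe_load(doc)
    best = {}
    for r in data["results"]:
        name = r["benchmark"]
        best[name] = max(best.get(name, 0.0), r["value"])
    return best

print(best_scores(raw))
# → {'example-org/bench-a': 0.81, 'example-org/bench-b': 0.52}
```

The same kind of aggregation could be run over files fetched from model repos, since results live as plain YAML files in each repo.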

Community evals do not replace existing benchmarks; leaderboards and closed evals with published results are still crucial. However, we believe it's important to contribute to the field with open eval results based on reproducible eval specs.

This won't solve benchmark saturation or close the benchmark-reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.

Above all, we hope to make the Hub an active place to build and share reproducible benchmarks, particularly for new tasks and domains that still challenge SOTA models.

Get Started

Read the docs: Learn more about evaluation results in the docs.

Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo.

Check out the scores on the benchmark dataset.

Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist.

The feature is in beta. We're building in the open. Feedback welcome.


Community


Great initiative, aggregating multiple signals is the way to go!

Although this measure doesn't solve every problem in current evaluation, it is a very good step toward decentralization and mobilizing the community to build together.

Will there be integration with the existing Hugging Face lighteval?

Article author Feb 9

hi @naufalso ! Lighteval now supports inspect-ai as a backend, so everything supported by inspect is integrated in lighteval 🔥

Amazing


This is a very important and timely initiative. It’s easy to get lost in the sea of leaderboards, each with its own format and reporting style. The Inspect AI log format brings much‑needed standardization, and having Hugging Face host evaluation logs is a real game changer. One reason many valuable benchmarks fade away is that original contributors often lack the resources to continuously maintain leaderboards. The Community Evals initiative has tremendous potential to address this gap, and I truly appreciate the effort behind it.

We’re hoping to include our planning benchmark, ACPBench, as part of this ecosystem—it's fully compatible with Inspect AI, the evaluation scripts are available on our GitHub.

References

  • ACPBench: Reasoning About Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, AAAI 2025
  • ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, ICLR 2026
