Hugging Face · 8 min read

Community Evals: Because we're done trusting black-box leaderboards over the community



Published February 4, 2026

banner

TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges prove that the results can be reproduced.

Evaluation is broken

Let's be real about where we are with evals in 2026. MMLU is saturated above 91%. GSM8K hit 94%+. HumanEval is conquered. Yet, based on usage reports, some models that ace these benchmarks still can't reliably browse the web, write production code, or handle multi-step tasks without hallucinating. There is a clear gap between benchmark scores and real-world performance.

There is a second gap within the reported scores themselves: multiple sources report different results. Model cards, papers, and evaluation platforms rarely agree, so the community lacks a single source of truth.

What We're Shipping

Decentralized and transparent evaluation reporting.

We are taking evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and letting the entire community openly report benchmark scores. We'll start with a shortlist of 4 benchmarks and expand over time to the most relevant ones.

For Benchmarks: Dataset repos can now register as benchmarks (MMLU-Pro, GPQA, HLE are already live). They automatically aggregate reported results from across the Hub and display leaderboards in the dataset card. The benchmark defines the eval spec via eval.yaml, based on the Inspect AI format, so anyone can reproduce it. The reported results need to align with the task definition.

benchmark image
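As a rough illustration, an eval spec in the Inspect AI style might look like the sketch below. The field names here are assumptions for illustration, not the official schema; check the docs for the real format.

```yaml
# Hypothetical eval.yaml sketch -- field names are illustrative,
# not the official schema.
name: my-benchmark
dataset: example-org/my-benchmark   # dataset repo that hosts the benchmark
split: test
solver: multiple_choice             # Inspect AI-style solver for the task
scorer: choice                      # Inspect AI-style scorer
metrics:
  - accuracy
```

Because the spec lives alongside the dataset, anyone can run the same task definition and check that a reported result aligns with it.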

For Models: Eval scores live in .eval_results/*.yaml in the model repo. They appear on the model card and are fed into benchmark datasets. Both the model author's results and open pull requests for results are aggregated. Model authors will be able to close score PRs and hide results.
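For instance, a single result file in .eval_results/ might record a score roughly as follows. The field names are an illustrative assumption, not the documented schema:

```yaml
# Hypothetical .eval_results/ entry -- illustrative schema only.
benchmark: example-org/my-benchmark   # benchmark dataset repo
metric: accuracy
value: 0.78
source: https://example.com/paper     # paper, model card, or eval logs
date: 2026-02-04
```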

For the Community: Any user can submit evaluation results for any model via a PR. Results are shown as "community" without waiting for model authors to merge or close them. Submitters can link to sources such as a paper, model card, third-party evaluation platform, or Inspect eval logs, and scores can be discussed like any PR. Since the Hub is Git-based, there is a full history of when evals were added and changed. Example sources are shown below.

model image

To learn more about evaluation results, check out the docs.

Model scores in the Hub

Why This Matters

Decentralizing evaluation will expose scores that already exist across the community in sources like model cards and papers. Once exposed, the community can build on top of them to aggregate, track, and understand scores across the field. All scores will also be available via Hub APIs, making it easy to build curated leaderboards, dashboards, and more.
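As a sketch of what building on exposed scores could look like, the snippet below parses eval-result YAML and keeps the best reported value per benchmark. The schema (`results`, `benchmark`, `metric`, `value`) is an assumption for illustration, not the official format:

```python
# Sketch: aggregate reported scores per benchmark from eval-result YAML.
# The schema used here is an illustrative assumption, not the official format.
import yaml

raw = """
results:
  - benchmark: example-org/bench-a
    metric: accuracy
    value: 0.78
  - benchmark: example-org/bench-a
    metric: accuracy
    value: 0.81
  - benchmark: example-org/bench-b
    metric: accuracy
    value: 0.52
"""

def best_scores(doc: str) -> dict:
    """Return the highest reported value for each benchmark."""
    data = yaml.safe_load(doc)
    best = {}
    for r in data["results"]:
        name = r["benchmark"]
        best[name] = max(best.get(name, 0.0), r["value"])
    return best

print(best_scores(raw))
# → {'example-org/bench-a': 0.81, 'example-org/bench-b': 0.52}
```

The same kind of aggregation could be run over files fetched from model repos, since results live as plain YAML files in each repo.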

Community evals do not replace existing benchmarks; leaderboards and closed evals with published results are still crucial. However, we believe it's important to contribute to the field with open eval results based on reproducible eval specs.

This won't solve benchmark saturation or close the benchmark-reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.

Above all, we hope to make the Hub an active place to build and share reproducible benchmarks, particularly for new tasks and domains that still challenge SOTA models.

Get Started

Read the docs: Learn more about evaluation results in the docs.

Add eval results: Publish the evals you conducted as YAML files in .eval_results/ on any model repo.

Check out the scores on the benchmark dataset.

Register a new benchmark: Add eval.yaml to your dataset repo and contact us to be included in the shortlist.

The feature is in beta. We're building in the open. Feedback welcome.


Community


Great initiative, aggregating multiple signals is the way to go!

Although this measure doesn't solve every problem in current evaluation, it is a very good step toward decentralization and mobilizing the community to build together.

Will there be integration with the existing Hugging Face lighteval?

Article author Feb 9

hi @naufalso ! Lighteval now supports inspect-ai as a backend, so everything supported by inspect is integrated in lighteval 🔥

Amazing


This is a very important and timely initiative. It’s easy to get lost in the sea of leaderboards, each with its own format and reporting style. The Inspect AI log format brings much‑needed standardization, and having Hugging Face host evaluation logs is a real game changer. One reason many valuable benchmarks fade away is that original contributors often lack the resources to continuously maintain leaderboards. The Community Evals initiative has tremendous potential to address this gap, and I truly appreciate the effort behind it.

We’re hoping to include our planning benchmark, ACPBench, as part of this ecosystem—it's fully compatible with Inspect AI, the evaluation scripts are available on our GitHub.

References

  • ACPBench: Reasoning About Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, AAAI 2025
  • ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, ICLR 2026
