Great initiative! Aggregating multiple signals is the way to go.
Although this measure does not solve the problems with current evaluation, it is a very good step toward decentralization and toward mobilizing the community to build evaluations together.
Will there be an integration with Hugging Face's existing lighteval (https://github.com/huggingface/lighteval)?
Hi @naufalso! Lighteval now supports inspect-ai as a backend, so everything supported by inspect is integrated in lighteval 🔥
This is a very important and timely initiative. It’s easy to get lost in the sea of leaderboards, each with its own format and reporting style. The Inspect AI log format brings much-needed standardization, and having Hugging Face host evaluation logs is a real game changer. One reason many valuable benchmarks fade away is that original contributors often lack the resources to continuously maintain leaderboards. The Community Evals initiative has tremendous potential to address this gap, and I truly appreciate the effort behind it.
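We’re hoping to include our planning benchmark, ACPBench (https://ibm.github.io/ACPBench/index.html), as part of this ecosystem. It is fully compatible with Inspect AI, and the evaluation scripts (https://github.com/IBM/ACPBench/blob/main/GettingStarted.md) are available on our GitHub.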
References
- ACPBench: Reasoning About Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, AAAI 2025
- ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning, Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi, ICLR 2026