Hugging Face Daily Papers · 6 min read

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.08518


Published on May 8 · Submitted by Dhaval Patel on May 14
Authors: Dhaval Patel, Chathurangi Shyalika, Suryanarayana Reddy Yarrabothula, Ling Yue, Shuxin Lin, Nianjun Zhou, James Rayfield
Abstract

Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on AssetOpsBench. We combine final rank sheets, a 300-submission server log, 149 team registrations, best-submission exports, the organizer winners report, the companion system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r = 0.69) but negatively in execution (r = -0.13), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, one term in the official composite is numerically almost inert: combined on a 0-1 scale with 0-100 percentage scores, it contributes at most 0.05 points per track, and rescaling it would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails (response selection, contamination cleanup, fallback, and context control) rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
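To make the third finding concrete, here is a minimal Python sketch of the scale-mismatch effect. The composite form, the 0.05 weight, and the two teams' numbers are illustrative assumptions, not the official CODS 2025 scoring formula.

# Minimal sketch (assumed formula): two 0-100 track scores averaged,
# plus a small-weighted auxiliary term that lives on a 0-1 scale.
def composite(planning_pct, execution_pct, aux, aux_scale=1.0):
    # With aux on its raw 0-1 scale, the 0.05 weight caps its contribution
    # at 0.05 points; aux_scale=100.0 puts it on the same 0-100 scale as
    # the track scores, where it can contribute up to 5 points.
    return 0.5 * planning_pct + 0.5 * execution_pct + 0.05 * aux_scale * aux

# Two hypothetical teams: nearly tied on track scores, far apart on aux.
team_a = dict(planning_pct=72.73, execution_pct=63.64, aux=0.10)
team_b = dict(planning_pct=72.00, execution_pct=63.64, aux=0.90)

for scale in (1.0, 100.0):
    a = composite(**team_a, aux_scale=scale)
    b = composite(**team_b, aux_scale=scale)
    print(f"aux_scale={scale:>5}: A={a:.2f}  B={b:.2f}  leader={'A' if a > b else 'B'}")

# Raw 0-1 aux: A leads (about 68.2 vs 67.9); the auxiliary term is almost inert.
# Rescaled aux: B overtakes A (about 68.7 vs 72.3); rescaling swaps the leaders.

On the raw 0-1 scale the auxiliary term can shift the composite by at most 0.05 points, so the near-tied percentage scores decide the order; rescaled to 0-100, the same term contributes up to 5 points and flips the ranking, which is the abstract's point about the composite being scale-sensitive.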


Get this paper in your agent:

hf papers read 2605.08518
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

No models, datasets, or Spaces currently link this paper. Cite arxiv.org/abs/2605.08518 in a model, dataset, or Space README.md to link it from this page.

Collections including this paper: 1

