Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
Abstract
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on AssetOpsBench. We combine final rank sheets, a 300-submission server log, 149 team registrations, best-submission exports, the organizer winners report, the companion system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r = 0.69) but negatively in execution (r = -0.13), with several systems at 45.45% public execution reaching 63.64% on the hidden set. Third, one term of the official composite is numerically almost inert: combined on a 0-1 scale with 0-100 percentage scores, it contributes at most 0.05 points per track, and rescaling it would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails (response selection, contamination cleanup, fallback, and context control) rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
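The third finding is an arithmetic point about mixed scales: a 0-1 term added to 0-100 percentage scores can be numerically inert, yet decisive once rescaled. The paper does not reproduce the exact composite formula here, so the following is a minimal sketch under an assumed additive form (mean of percentage track scores plus a weight w = 0.05 on the 0-1 term); the team names, scores, and weight are hypothetical, chosen only to reproduce the "at most 0.05 points, rank flip after rescaling" behavior described in the abstract.

```python
# Hypothetical illustration of the scale-mismatch effect in a composite score.
# Assumption (not the official CODS 2025 formula): composite = mean of track
# scores (0-100) + w * term, where the extra term lies on a 0-1 scale.

def composite(track_scores, term, w=0.05):
    """Mismatched scales: the 0-1 term adds at most w = 0.05 points."""
    return sum(track_scores) / len(track_scores) + w * term

def composite_rescaled(track_scores, term, w=0.05):
    """Same composite after rescaling the term to 0-100, so it is
    commensurate with the percentage track scores."""
    return sum(track_scores) / len(track_scores) + w * (term * 100)

# Two hypothetical teams: A leads slightly on percentage scores,
# B dominates the 0-1 term.
teams = {
    "A": dict(scores=[72.73, 63.64], term=0.10),
    "B": dict(scores=[72.73, 62.50], term=0.90),
}

for name, t in teams.items():
    print(name,
          round(composite(t["scores"], t["term"]), 3),
          round(composite_rescaled(t["scores"], t["term"]), 3))
# Mismatched: A ~68.190 > B ~67.660 -> the term is almost inert
# Rescaled:   A ~68.685 < B ~72.115 -> the top two teams swap
```

Under the mismatched scales the term can move a composite by at most 0.05 points, far below the gaps between percentage scores, which is consistent with the abstract's observation that rescaling alone would reorder the top two teams.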