Hugging Face Daily Papers · May 26, 2026 · 6 min read

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arXiv:2605.23491

Published on May 22 · Submitted by Zhangyi HU on May 26

Authors: Zhangyi Hu, Chenhui Liu, Tian Huang, Jindong Li, Yang Yang, Jiemin Wu, Zining Zhong, Menglin Yang, Yutao Yue

AI-generated summary

CoSPlay is a GT-free framework that jointly improves code generation and unit test quality through cooperative self-play, achieving competitive performance without ground-truth unit tests.

Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.
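The two selection signals described in the abstract, bidirectional pass counts from the Code-UT execution matrix and the output-consensus tie-break, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`execution_matrix`, `pass_counts`, `select_by_consensus`), the representation of a UT as an `(input, expected_output)` pair, the absolute-value task, and the probe inputs are all hypothetical. The abstract does not specify the pruning thresholds or how codes and UTs are regenerated, so those steps are left out.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: a code candidate is a callable,
# a unit test (UT) is an (input, expected_output) pair.
Code = Callable[[int], int]
UnitTest = Tuple[int, int]


def execution_matrix(codes: Sequence[Code], uts: Sequence[UnitTest]) -> List[List[bool]]:
    """matrix[i][j] is True when code i passes UT j; exceptions count as failures."""
    matrix = []
    for code in codes:
        row = []
        for x, expected in uts:
            try:
                row.append(code(x) == expected)
            except Exception:
                row.append(False)
        matrix.append(row)
    return matrix


def pass_counts(matrix: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Row sums score each code; column sums score each UT (the bidirectional signal)."""
    code_counts = [sum(row) for row in matrix]
    ut_counts = [sum(col) for col in zip(*matrix)]
    return code_counts, ut_counts


def select_by_consensus(codes: Sequence[Code], tied: Sequence[int],
                        probe_inputs: Sequence[int]) -> int:
    """Among tied codes, return one index from the largest group with identical outputs."""
    clusters = defaultdict(list)
    for i in tied:
        signature = []
        for x in probe_inputs:
            try:
                signature.append(repr(codes[i](x)))
            except Exception as exc:
                signature.append("error:" + type(exc).__name__)
        clusters[tuple(signature)].append(i)
    largest = max(clusters.values(), key=len)
    return largest[0]


# Toy task (an assumption for illustration): compute the absolute value.
codes = [
    abs,                             # correct
    lambda x: x if x >= 0 else -x,   # correct, written differently
    lambda x: x,                     # wrong for negative inputs
    lambda x: -x,                    # wrong for positive inputs
]
# The last UT is deliberately wrong and happens to agree with the wrong code `-x`.
uts = [(3, 3), (0, 0), (-2, 2), (5, -5)]

matrix = execution_matrix(codes, uts)
code_counts, ut_counts = pass_counts(matrix)

best = max(code_counts)
tied = [i for i, count in enumerate(code_counts) if count == best]
chosen = select_by_consensus(codes, tied, probe_inputs=[-4, 0, 7])

print(code_counts, ut_counts, tied, chosen)
```

In this toy run the wrong UT `(5, -5)` lets the wrong code `-x` tie with the two correct codes on pass count, which is the "spuriously coupled" failure the abstract mentions. The column sums expose that UT as an outlier (only one code passes it), and the consensus step breaks the tie in favor of the two codes that agree on the probe inputs.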


Community

Sanae-Kochiya-2003

Paper author · Paper submitter · about 10 hours ago

Data & log: https://huggingface.co/datasets/yomi017/CosPlay
Code: https://github.com/sanae-ai/cosplay

avahal

about 5 hours ago

CoSPlay's idea of letting code candidates and self-generated UTs co-evolve at test time is a neat workaround for the GT bottleneck. My main question is how it handles UT signals that are biased by the initial code pool or by the problem distribution, which could nudge the loop toward brittle solutions rather than robust generalization. The arxivlens breakdown (https://arxivlens.com/PaperView/Details/cosplay-cooperative-self-play-at-test-time-with-self-generated-code-and-unit-test-7381-f94593ff) helped me parse the method details, especially the execution matrix part that ties pass counts to pruning and updating. A focused ablation on UT noise or skewed input distributions would help confirm the robustness of the co-evolution claim in non-ideal settings.
