Hugging Face Daily Papers · June 1, 2026 · 3 min read

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.31433

Published on May 29

Submitted by Pasquale Minervini on Jun 1

Authors: Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

AI-generated summary

SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision.

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
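
The loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the function names, the keyword-matching grader, and in particular the `frontier_reward` shape are assumptions chosen for illustration, and multi-turn retrieval and the policy-gradient update are omitted. It only shows how a Challenger, a Solver, and a frozen rubric-writing judge might be wired together in one step.

```python
from typing import Callable, Dict, List, Sequence

Rubric = List[str]  # toy stand-in: each item is a keyword the answer should mention


def grade(rubric: Rubric, answer: str) -> float:
    """Fraction of rubric items found in the answer (stand-in for an LLM judge)."""
    if not rubric:
        return 0.0
    hits = sum(1 for item in rubric if item.lower() in answer.lower())
    return hits / len(rubric)


def frontier_reward(solver_scores: Sequence[float]) -> float:
    """ASSUMPTION: reward the Challenger most when the Solver's mean score is
    near 0.5, i.e. the task is neither trivial nor impossible. The abstract
    does not specify the actual reward."""
    mean = sum(solver_scores) / len(solver_scores)
    return 1.0 - abs(2.0 * mean - 1.0)


def self_play_step(
    document: str,
    challenger: Callable[[str], str],          # document -> task
    solver: Callable[[str, int], str],         # (task, sample index) -> answer
    write_rubric: Callable[[str, str], Rubric],  # frozen judge: (document, task) -> rubric
    n_samples: int = 4,
) -> Dict[str, object]:
    """Run one hypothetical self-play step and return the rewards for both policies."""
    task = challenger(document)
    rubric = write_rubric(document, task)  # rubric is grounded in the source document
    answers = [solver(task, i) for i in range(n_samples)]
    scores = [grade(rubric, a) for a in answers]
    return {
        "task": task,
        "solver_rewards": scores,
        "challenger_reward": frontier_reward(scores),
    }


if __name__ == "__main__":
    # Stub policies so the sketch runs without any model.
    result = self_play_step(
        document="Paris is the capital of France.",
        challenger=lambda doc: "What is the capital of France?",
        solver=lambda task, i: "Paris" if i % 2 == 0 else "Paris, France",
        write_rubric=lambda doc, task: ["paris", "france"],
    )
    print(result)
```

In a real system each callable would wrap a language model, and the returned rewards would feed whatever reinforcement-learning update is used for the two policies.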

Project page: https://edinburghnlp.github.io/scope/
