Hugging Face Daily Papers · 6 min read

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task–data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. 
Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.

Project page: https://linasnasvytis.com/core-reasoning/
Code: https://github.com/LinasNas/core-reasoning
arxiv:2605.28742

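Published on May 27, 2026 · Submitted by Linas Nasvytis on Jun 8, 2026
Authors: Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan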

Abstract

Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-parametric approaches.

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
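
The abstract describes CORE only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that traces are contrasted within the same problem, that a reward of at least 0.5 counts as success, and that insights are prepended to the prompt as a bullet list. All names (`Trace`, `contrastive_pairs`, `reflect`, `build_prompt`, `toy_summarize`) are hypothetical, and `toy_summarize` is a string template standing in for the language-model call that would actually write an insight.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Trace:
    problem: str    # problem identifier or text
    reasoning: str  # the model's reasoning trace for one rollout
    reward: float   # verifiable reward; assumed 1.0 = solved, 0.0 = failed


def contrastive_pairs(
    traces: List[Trace], threshold: float = 0.5
) -> List[Tuple[Trace, Trace]]:
    """Pair each successful attempt with each failed attempt at the same problem."""
    by_problem: Dict[str, List[Trace]] = {}
    for trace in traces:
        by_problem.setdefault(trace.problem, []).append(trace)
    pairs: List[Tuple[Trace, Trace]] = []
    for attempts in by_problem.values():
        wins = [t for t in attempts if t.reward >= threshold]
        losses = [t for t in attempts if t.reward < threshold]
        pairs.extend((w, l) for w in wins for l in losses)
    return pairs


def reflect(
    pairs: List[Tuple[Trace, Trace]],
    summarize: Callable[[Trace, Trace], str],
    max_insights: int = 5,
) -> List[str]:
    """Turn (success, failure) pairs into a short, de-duplicated insight list."""
    insights: List[str] = []
    for win, loss in pairs:
        if len(insights) >= max_insights:
            break
        insight = summarize(win, loss).strip()
        if insight and insight not in insights:
            insights.append(insight)
    return insights


def build_prompt(problem: str, insights: List[str]) -> str:
    """Prepend stored insights to a new problem; no weights are updated."""
    if not insights:
        return f"Problem: {problem}"
    bullets = "\n".join(f"- {i}" for i in insights)
    return f"Insights from earlier attempts:\n{bullets}\n\nProblem: {problem}"


def toy_summarize(win: Trace, loss: Trace) -> str:
    # Placeholder for an LLM call that describes what the success did differently.
    return f"Do '{win.reasoning}' rather than '{loss.reasoning}'."


traces = [
    Trace("p1", "check units first", 1.0),
    Trace("p1", "guess the answer", 0.0),
    Trace("p2", "guess the answer", 0.0),  # no success on p2, so no contrast
]
insights = reflect(contrastive_pairs(traces), toy_summarize)
prompt = build_prompt("p3", insights)
```

In this sketch the learned knowledge is the `insights` list of short strings, which is what would keep the prompt small compared with reusing whole stored traces. A problem with only failures (or only successes) yields no pairs and therefore no insight.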

Get this paper in your agent:

hf papers read 2605.28742
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
