CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan
Stanford University
Published: 2026-05-27 · arXiv: 2605.28742
Project page: https://linasnasvytis.com/core-reasoning/
Code: https://github.com/LinasNas/core-reasoning
Abstract
Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-parametric approaches.
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g., RLVR) and non-parametric (e.g., prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves performance gains comparable to or greater than those of each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
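The abstract describes CORE only at a high level. The Python sketch below is a rough illustration of the general idea of contrastive reflection and is not the authors' implementation. It does the following:

- samples several rollouts for one problem;
- scores each rollout with a verifiable reward;
- when at least one rollout succeeds and one fails, asks a reflector to turn that contrast into a short insight;
- stores the insight and prepends it to future prompts.

Several pieces are hypothetical assumptions: the functions `solve`, `verify`, and `reflect`, the `InsightMemory` class, the rule that pairs the first success with the first failure, and the prompt-prefix format. Deterministic toy stubs stand in for real language models.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import itertools


@dataclass
class Attempt:
    trace: str      # the model's reasoning trace for one rollout
    correct: bool   # outcome of the verifiable reward


@dataclass
class InsightMemory:
    """Hypothetical store of short natural-language insights (not raw traces)."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        # Skip empty or duplicate insights to keep the context compact.
        if insight and insight not in self.insights:
            self.insights.append(insight)

    def as_prompt_prefix(self) -> str:
        # Render stored insights as a bulleted block prepended to the task prompt.
        if not self.insights:
            return ""
        bullets = "\n".join(f"- {s}" for s in self.insights)
        return f"Insights from past attempts:\n{bullets}\n\n"


def contrastive_reflection_step(
    problem: str,
    solve: Callable[[str], str],          # hypothetical policy LM: prompt -> trace
    verify: Callable[[str, str], bool],   # hypothetical verifier: (problem, trace) -> reward
    reflect: Callable[[str, str], str],   # hypothetical reflector: (success, failure) -> insight
    memory: InsightMemory,
    k: int = 4,                           # number of rollouts sampled for this problem
) -> List[Attempt]:
    prompt = memory.as_prompt_prefix() + problem
    attempts = []
    for _ in range(k):
        trace = solve(prompt)
        attempts.append(Attempt(trace, verify(problem, trace)))
    successes = [a for a in attempts if a.correct]
    failures = [a for a in attempts if not a.correct]
    # A contrast exists only when rollouts disagree; pair one success with one failure.
    if successes and failures:
        memory.add(reflect(successes[0].trace, failures[0].trace))
    return attempts


# Toy usage: deterministic stubs in place of real language models.
outputs = itertools.cycle(["17 * 3 = 51", "17 * 3 = 41"])

def toy_solve(prompt: str) -> str:
    return next(outputs)

def toy_verify(problem: str, trace: str) -> bool:
    return trace.endswith("51")

def toy_reflect(good: str, bad: str) -> str:
    return f"Recheck the final multiplication: '{good}' passed, '{bad}' failed."

memory = InsightMemory()
attempts = contrastive_reflection_step(
    "What is 17 * 3?", toy_solve, toy_verify, toy_reflect, memory
)
print(sum(a.correct for a in attempts), len(memory.insights))  # prints "2 1"
print(memory.as_prompt_prefix())
```

In this sketch the memory keeps only the distilled insight rather than the full traces. That keeps the prompt prefix short: it grows by at most one line per problem.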
Community
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g., RLVR) and non-parametric (e.g., prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, CORE achieves the strongest performance in most task–data regimes. Finally, we highlight how CORE is substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
