Hugging Face Daily Papers · 3 min read

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Folding redundant reasoning chains via introspective preference learning for efficient LRM inference. (Accepted by ICML 2026)
arxiv:2606.03503

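Published on Jun 2 · Submitted by Yuzhe Gu on Jun 4
Authors: Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen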

Abstract

ThoughtFold addresses over-thinking in large reasoning models by using fine-grained preference learning to identify and eliminate redundant explorations in chain-of-thought reasoning processes.

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced, which leads to over-thinking in LRMs. Previous attempts to resolve this issue mainly assign a higher advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
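The abstract does not spell out the masked preference optimization objective, so the following is only a minimal sketch of what such a loss could look like, not the authors' formulation. It assumes a DPO-style log-sigmoid loss in which a 0/1 mask excludes the prefix shared by both continuations, so that only the diverging tokens (a direct "bridge" versus a redundant exploration) contribute. The function names, the `beta` value, and all log-probabilities below are hypothetical and chosen purely for illustration.

```python
import math


def masked_logprob(token_logprobs, mask):
    """Sum per-token log-probabilities over positions where mask == 1."""
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


def masked_preference_loss(policy_chosen, ref_chosen, chosen_mask,
                           policy_rejected, ref_rejected, rejected_mask,
                           beta=0.1):
    """DPO-style loss restricted to masked tokens (an illustrative assumption).

    'chosen' is the concise continuation and 'rejected' the redundant one;
    masked-out tokens (e.g. the shared prefix) do not affect the loss.
    """
    chosen_margin = (masked_logprob(policy_chosen, chosen_mask)
                     - masked_logprob(ref_chosen, chosen_mask))
    rejected_margin = (masked_logprob(policy_rejected, rejected_mask)
                       - masked_logprob(ref_rejected, rejected_mask))
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits), computed in a stable form.
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical per-token log-probs; position 0 is a shared prefix token.
prefix_mask = [0, 1, 1]

# Policy identical to the reference: no preference signal yet.
neutral_loss = masked_preference_loss(
    [-0.2, -0.4, -0.6], [-0.2, -0.4, -0.6], prefix_mask,
    [-0.2, -0.9, -0.8], [-0.2, -0.9, -0.8], prefix_mask,
)

# Policy now favors the concise continuation and disfavors the redundant one.
folded_loss = masked_preference_loss(
    [-0.2, -0.1, -0.3], [-0.2, -0.4, -0.6], prefix_mask,
    [-0.2, -1.5, -1.2], [-0.2, -0.9, -0.8], prefix_mask,
)

print(neutral_loss, folded_loss)
```

In this sketch the loss falls below its neutral value once the policy raises the likelihood of the concise continuation relative to the reference while lowering that of the redundant one. Changing the log-probability of a masked-out prefix token leaves the loss unchanged, which is the intended effect of the mask here.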

Community

Paper submitter about 6 hours ago

Folding redundant reasoning chains via introspective preference learning for efficient LRM inference. (Accepted by ICML 2026)
