Hugging Face Daily Papers · June 4, 2026 · 3 min read

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Folding redundant reasoning chains via introspective preference learning for efficient LRM inference. (Accepted by ICML 2026)

Code: https://github.com/ziyanliux/ThoughtFold

arxiv:2606.03503

Published on Jun 2 · Submitted by Yuzhe Gu on Jun 4 · Intern Large Models

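Authors: Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen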

Abstract

ThoughtFold addresses over-thinking in large reasoning models by using fine-grained preference learning to identify and eliminate redundant explorations in chain-of-thought reasoning processes.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains-of-Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced, which leads to over-thinking in LRMs. Previous attempts to resolve this issue mainly assign a larger advantage to shorter trajectories, yet their learning signals remain outcome-based and cannot reduce the memorization of redundant explorations within long CoTs. We therefore propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly improves efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
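
The abstract does not spell out the objective, so the following is only a rough sketch of how a "masked preference optimization" step over folded sub-trajectories might look. It assumes a DPO-style loss in which a token mask restricts which positions contribute to the log-probability sums; that choice, along with every name here (`fold_candidates`, `masked_logp`, `masked_dpo_loss`, `beta`), is a hypothetical illustration and not the authors' implementation. The released ThoughtFold code may differ substantially.

```python
import math
from itertools import combinations


def fold_candidates(segments, redundant_idx):
    """Build candidate sub-trajectories by dropping every non-empty
    subset of the segments flagged as redundant (hypothetical helper)."""
    candidates = []
    for k in range(1, len(redundant_idx) + 1):
        for drop in combinations(redundant_idx, k):
            kept = [s for i, s in enumerate(segments) if i not in drop]
            candidates.append(kept)
    return candidates


def masked_logp(token_logps, mask):
    """Sum per-token log-probabilities only where the mask is truthy."""
    return sum(lp for lp, m in zip(token_logps, mask) if m)


def masked_dpo_loss(pol_chosen, ref_chosen, mask_chosen,
                    pol_rejected, ref_rejected, mask_rejected, beta=0.1):
    """DPO-style loss, -log(sigmoid(margin)), computed on masked tokens.

    Assumption: 'chosen' is a folded (concise) trajectory and 'rejected'
    is the original one containing redundant exploration.
    """
    chosen_gain = (masked_logp(pol_chosen, mask_chosen)
                   - masked_logp(ref_chosen, mask_chosen))
    rejected_gain = (masked_logp(pol_rejected, mask_rejected)
                     - masked_logp(ref_rejected, mask_rejected))
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the mask is what makes the signal fine-grained: setting it to 1 only on the tokens of the redundant segments (in the rejected trajectory) and on the tokens that bridge the surrounding segments (in the chosen one) would concentrate the penalty on the exploration itself rather than on the whole outcome-correct trajectory.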