Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo
arXiv: 2606.13106
Published on Jun 11, 2026 · Submitted by fan on Jun 12, 2026
Abstract
A switchable latent reasoning framework uses explicit boundary tokens to enable trainable and interpretable latent reasoning through recurrent hidden states.
Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
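The switching behaviour described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `step` and `embed` interfaces, the scripted toy model, and the rule that a latent step feeds the last hidden state back as the next input (in the style of Coconut) are all assumptions made for illustration. Only the `<swi>` and `</swi>` token names come from the paper.

```python
from typing import Callable, List, Tuple

Vector = List[float]
SWI_OPEN, SWI_CLOSE, EOS = "<swi>", "</swi>", "<eos>"

# Hypothetical interface: given every input vector so far, return the last
# hidden state and the greedily chosen next token.
StepFn = Callable[[List[Vector]], Tuple[Vector, str]]


def switchable_decode(
    step: StepFn,
    embed: Callable[[str], Vector],
    prompt: List[str],
    max_steps: int = 16,
) -> Tuple[List[str], int]:
    """Return (discrete tokens emitted, number of latent steps taken)."""
    inputs = [embed(tok) for tok in prompt]
    emitted: List[str] = []
    latent_steps = 0
    in_latent = False
    for _ in range(max_steps):
        hidden, token = step(inputs)
        if in_latent and token != SWI_CLOSE:
            # Latent mode (assumed rule): feed the hidden state back as the
            # next input instead of committing to a discrete token.
            inputs.append(hidden)
            latent_steps += 1
            continue
        # Discrete decision: an ordinary token, including both boundaries.
        emitted.append(token)
        inputs.append(embed(token))
        if token == SWI_OPEN:
            in_latent = True
        elif token == SWI_CLOSE:
            in_latent = False
        elif token == EOS:
            break
    return emitted, latent_steps


def toy_embed(token: str) -> Vector:
    # Hypothetical 1-d "embedding": just the token's length.
    return [float(len(token))]


# Scripted stand-in for a model: one proposed token per step after the prompt.
SCRIPT = [SWI_OPEN, "pad", "pad", SWI_CLOSE, "4", EOS]


def toy_step(inputs: List[Vector]) -> Tuple[Vector, str]:
    # Assumes a one-token prompt, so the script index is len(inputs) - 1.
    hidden = [sum(v[0] for v in inputs)]
    return hidden, SCRIPT[len(inputs) - 1]


tokens, n_latent = switchable_decode(toy_step, toy_embed, ["q"])
print(tokens, n_latent)
```

In this sketch the only entries in `tokens` are discrete choices, boundary tokens included, which is the property the abstract relies on when it says the policy ratio is well-defined at every decision point. The two steps taken between the boundaries leave no token behind and show up only in `n_latent`.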
Community
This paper introduces a pair of learned boundary tokens (<swi>/</swi>) that mark where latent reasoning begins and ends, making hidden-state-recurrence latent CoT both trainable with standard on-policy RL (GRPO) and open to direct mechanistic probing and causal intervention. It reaches 79.3% on MATH-500, well above same-scale Coconut-style baselines, and analysis confirms the latent step performs causally necessary computation rather than acting as an inert placeholder.