Hugging Face Daily Papers · 5 min read

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.13106

Published on Jun 11 · Submitted by fan on Jun 12
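Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo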

Abstract

A switchable latent reasoning framework uses explicit boundary tokens to enable trainable and interpretable latent reasoning through recurrent hidden states.

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
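
To make the switching idea concrete, here is a minimal toy sketch of a decode loop with a latent mode. It is not the SWITCH implementation: the tiny recurrent "model" (`EMBED`, `W_REC`, `W_OUT`, `step`), the vocabulary, and the masking rules are all hypothetical stand-ins. The sketch assumes that in visible mode the next input is the embedding of the sampled token, that in latent mode the hidden state is fed back as the next input, and that the only discrete decision inside the latent block is whether to emit `</swi>`.

```python
import math
import random

# Hypothetical toy vocabulary; only <swi> and </swi> come from the abstract.
VOCAB = ["<swi>", "</swi>", "x", "y", "<eos>"]
SWI, END_SWI, EOS = 0, 1, 4
DIM = 4

# Hypothetical random parameters standing in for a trained model.
_init = random.Random(0)
EMBED = [[_init.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
W_REC = [[_init.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_OUT = [[_init.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def step(state, inp):
    """One toy recurrent step: returns (new hidden state, next-token logits)."""
    pre = matvec(W_REC, [s + i for s, i in zip(state, inp)])
    hidden = [math.tanh(p) for p in pre]
    return hidden, matvec(W_OUT, hidden)


def sample(logits, banned, rng):
    """Sample a token id from softmax(logits), with banned ids given zero weight."""
    weights = [0.0 if i in banned else math.exp(l) for i, l in enumerate(logits)]
    return rng.choices(range(len(logits)), weights=weights)[0]


def generate(max_steps=30, seed=0):
    rng = random.Random(seed)
    state, inp, latent, trace = [0.0] * DIM, EMBED[2], False, []
    for _ in range(max_steps):
        state, logits = step(state, inp)
        if latent:
            # Assumption: inside the block the only decision is "exit or continue".
            tok = sample(logits, {SWI, EOS}, rng)
            if tok == END_SWI:
                latent = False
                trace.append(VOCAB[tok])
                inp = EMBED[tok]
            else:
                trace.append("<latent>")  # no visible token is emitted
                inp = state               # hidden-state recurrence
        else:
            tok = sample(logits, {END_SWI}, rng)
            trace.append(VOCAB[tok])
            if tok == EOS:
                break
            latent = tok == SWI           # <swi> switches to latent mode
            inp = EMBED[tok]
    return trace


if __name__ == "__main__":
    print(" ".join(generate(seed=1)))
```

Because `<swi>` and `</swi>` are sampled like any other token, each boundary has an ordinary token probability, which is the property the abstract relies on for on-policy RL.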

Community

Paper author Paper submitter about 7 hours ago

This paper introduces a pair of learned boundary tokens (<swi>/</swi>) that mark where latent reasoning begins and ends, making hidden-state-recurrence latent CoT both trainable with standard on-policy RL (GRPO) and open to direct mechanistic probing and causal intervention. It reaches 79.3% on MATH-500, well above same-scale Coconut-style baselines, and analysis confirms the latent step performs causally necessary computation rather than acting as an inert placeholder.
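
For readers unfamiliar with GRPO, the following is a rough sketch of the standard group-relative advantage and clipped policy ratio, applied only at discrete decision points. It is not the paper's Switch-GRPO objective, which additionally propagates gradients through the recurrent latent computation. The function names, the use of `None` to mark latent steps, the token-level averaging, and the omission of a KL term are all simplifying assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped surrogate for one discrete token decision."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(rollouts, rewards, clip=0.2):
    """Average clipped surrogate over all discrete decisions in a group.

    rollouts: one list per rollout; each entry is a (logp_new, logp_old)
    pair for a discrete token (including <swi> and </swi>), or None for a
    latent step, which has no token probability (assumption of this sketch).
    """
    advantages = group_advantages(rewards)
    total, count = 0.0, 0
    for steps, adv in zip(rollouts, advantages):
        for entry in steps:
            if entry is None:  # latent step: skipped in the ratio
                continue
            logp_new, logp_old = entry
            total += clipped_term(logp_new, logp_old, adv, clip)
            count += 1
    return total / max(count, 1)
```

In this simplified form, the ratio is defined wherever a discrete token was sampled, so the entry and exit tokens contribute terms like any visible token.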
