Hugging Face Daily Papers · May 22, 2026 · 7 min read

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.22138


Published on May 21

Submitted by Mingkai Deng on May 22

SAILING Lab (CMU & MBZUAI)


Authors:

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

Abstract

AI-generated summary: Efficient agentic reasoning requires decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution), enabling controlled planning that reduces token usage while maintaining performance.

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end in the expectation that planning will emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue that efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II), which grounds deliberation in future-state prediction via a world model; self-regulation (System III), which decides when and how deeply to plan via a learned configurator; and reactive execution (System I), which handles fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), which realizes both as distinct stages within an LLM's chain-of-thought, with the LLM itself serving as the world model. We explore two instantiations, both trained with supervised learning followed by reinforcement learning (RL): recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from the traces of pretrained reasoning LLMs (v1.0). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120–355B and 685B–1T parameter systems, respectively, while v1.0-30B uses 25.8–95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases the average planning horizon by 22.8% while planning frequency grows by only 2.0%, showing that it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.
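The token-efficiency claim is stated as a relative reduction. As a quick sanity check of what such a range implies, the sketch below assumes the usual definition, reduction = (1 − ours / baseline) × 100, which the abstract does not spell out. The function names and the 2,000 vs. 8,000 token example are hypothetical and for illustration only.

```python
def relative_token_reduction(ours: float, baseline: float) -> float:
    """Percent of reasoning tokens saved versus a baseline: (1 - ours / baseline) * 100."""
    if baseline <= 0:
        raise ValueError("baseline token count must be positive")
    return (1.0 - ours / baseline) * 100.0


def baseline_multiple(reduction_pct: float) -> float:
    """How many times more tokens the baseline spends, given a percent reduction."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Illustrative counts only (not from the paper): 2,000 vs. 8,000 tokens.
print(f"{relative_token_reduction(2_000, 8_000):.1f}% fewer tokens")  # 75.0% fewer tokens

# The endpoints of the reported 25.8-95.3% range, read as multiples.
for pct in (25.8, 95.3):
    print(f"{pct}% fewer -> baseline spends {baseline_multiple(pct):.2f}x as many")
```

Read that way, the reported 25.8–95.3% range means comparable agentic LLMs spend roughly 1.35× to 21× as many reasoning tokens as v1.0-30B.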


Community

mingkaid (paper submitter)

Efficient reasoning is not about shorter chain-of-thought, but about better allocation of simulation (i.e., knowing when to imagine possible futures and when to act directly).

Current adaptive-reasoning approaches (effort knobs, token budgets in Opus 4.7 and GPT-5.5) control how much the model thinks. SR²AM asks a more structural question: what kind of thinking should the model do at each step?

We decompose agentic deliberation into three systems:

System I (reactive execution): fast, pattern-based reasoning and action for familiar situations
System II (simulative reasoning): predicting future states through a world model and evaluating consequences before committing. This is what separates planning from a longer chain-of-thought
System III (self-regulation): a learned configurator that autonomously decides when to simulate, how far ahead, and when to skip planning entirely
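To make the division of labor concrete, here is a toy control loop. It is an illustrative sketch under loudly stated assumptions, not the SR²AM implementation: in SR²AM all three systems are stages inside one LLM's chain-of-thought and the configurator is learned, whereas here the world model is an exact transition function on a number line, System III is a hand-written rule, and the task, `ACTIONS`, `configure`, `simulate`, and every other name are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

ACTIONS = (-1, 1, 3)  # hypothetical toy action set


@dataclass(frozen=True)
class State:
    pos: int
    goal: int
    pits: frozenset  # positions that end the episode if entered


def world_model(s: State, a: int) -> State:
    """Predict the next state. In SR²AM the LLM plays this role; here it is exact."""
    return State(s.pos + a, s.goal, s.pits)


def react(s: State) -> int:
    """System I: a fixed pattern (biggest step toward the goal), no lookahead."""
    gap = s.goal - s.pos
    if gap >= 3:
        return 3
    return 1 if gap > 0 else -1


def rollout_value(s: State, seq: tuple) -> float:
    """Score an imagined action sequence: -inf for a pit, else closer and shorter is better."""
    cur = s
    for i, a in enumerate(seq, start=1):
        cur = world_model(cur, a)
        if cur.pos in cur.pits:
            return float("-inf")
        if cur.pos == cur.goal:
            return -0.01 * i
    return -abs(cur.goal - cur.pos) - 0.01 * len(seq)


def simulate(s: State, horizon: int) -> int:
    """System II: imagine every action sequence of length `horizon` and
    commit to the first action of the best one (fall back to System I)."""
    best_first, best_val = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        v = rollout_value(s, seq)
        if v > best_val:
            best_first, best_val = seq[0], v
    return best_first if best_first is not None else react(s)


def configure(s: State, max_horizon: int = 3) -> int:
    """System III: hand-written stand-in for a learned configurator. Returns 0
    (skip planning) when no pit lies within one maximal step ahead, else a horizon."""
    nearby = sum(1 for p in s.pits if s.pos < p <= s.pos + max(ACTIONS))
    return 0 if nearby == 0 else min(max_horizon, nearby + 1)


def run_episode(s: State, use_configurator: bool = True, max_steps: int = 20):
    trace = []  # (position, horizon chosen, action taken)
    for _ in range(max_steps):
        if s.pos == s.goal or s.pos in s.pits:
            break
        h = configure(s) if use_configurator else 0
        a = react(s) if h == 0 else simulate(s, h)
        trace.append((s.pos, h, a))
        s = world_model(s, a)  # toy: the real environment matches the model
    return s, trace


start = State(pos=0, goal=10, pits=frozenset({6}))
final, trace = run_episode(start)
print(final.pos, trace)  # 10 [(0, 0, 3), (3, 2, 1), (4, 2, 3), (7, 0, 3)]
final, _ = run_episode(start, use_configurator=False)
print(final.pos)  # 6: the purely reactive agent walks into the pit
```

In this toy, lookahead is paid for only at positions 3 and 4, where a pit is within reach, and the agent acts reactively everywhere else; with planning disabled, the same reactive rule steps straight into the pit.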

Last year, in our companion paper SiRA, we showed that simulative reasoning yields up to 124% improvement over reactive baselines — and that strong reasoning models (o1, o3-mini) fail as planners without this structure.

SR²AM adds the self-regulation layer. With it, RL teaches the model to plan further ahead (+22.8% horizon) rather than more often (+2% frequency). In terms of performance, our 30B model is competitive with DeepSeek-V3.2 (685B) and Kimi-K2.5 (1T) while using 26–95% fewer reasoning tokens.
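The horizon-versus-frequency result separates two quantities that a single "more thinking" number would blur. The sketch below assumes natural definitions that the post does not spell out: frequency is the share of steps that invoke planning, and horizon is the mean lookahead depth over the steps that do plan. The traces are made up to illustrate the pattern and are not SR²AM data.

```python
def planning_stats(episodes: list[list[int]]) -> tuple[float, float]:
    """Return (frequency, mean horizon) for a batch of episodes.

    Each episode is a list of per-step lookahead depths; 0 means the step
    acted reactively. Frequency is the share of steps that planned; horizon
    is the mean depth over those planning steps only (assumed definitions).
    """
    steps = [h for ep in episodes for h in ep]
    if not steps:
        raise ValueError("no steps to score")
    planned = [h for h in steps if h > 0]
    frequency = len(planned) / len(steps)
    horizon = sum(planned) / len(planned) if planned else 0.0
    return frequency, horizon


def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0


# Hypothetical traces (not from the paper): the same steps plan, but deeper after RL.
before_rl = [[0, 2, 0, 2, 0], [2, 0, 0, 2, 0]]
after_rl = [[0, 3, 0, 2, 0], [3, 0, 0, 2, 0]]

f0, h0 = planning_stats(before_rl)
f1, h1 = planning_stats(after_rl)
print(f"frequency {relative_change(f0, f1):+.1f}%, horizon {relative_change(h0, h1):+.1f}%")
# frequency +0.0%, horizon +25.0%
```

A frequency-only measure would report no change on these traces, while the horizon metric captures the shift; this is the shape of the effect reported for RL (+22.8% horizon, +2.0% frequency).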

This is a prototype using language-based world models. Stay tuned for our next steps on multimodal and physical world models.

The concept of a configurator, which decides when and how deeply to engage a reasoning process, is not specific to planning; we expect it to extend to learning and adaptation as well.

📄 SR²AM: https://arxiv.org/abs/2605.22138
📄 SiRA: https://arxiv.org/abs/2507.23773
🌐 Project: https://sailing-lab.github.io/sr2am-self-regulated-planning
💻 Code: https://github.com/sailing-lab/sr2am

🤗 SR²AM-v0.1-8B: https://huggingface.co/sailing-lab/SR2AM-v0.1-8B
🤗 SR²AM-v1.0-30B: https://huggingface.co/sailing-lab/SR2AM-v1.0-30B


Models citing this paper 2


Collections including this paper 1
