Hugging Face Daily Papers · 7 min read

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30052

Published on May 28, 2026 · Submitted by Parsa Mazaheri on May 29, 2026
Authors: Parsa Mazaheri

AI-generated summary

RePoT improves upon one-shot Program-of-Thought by enabling deterministic verified replay and recovery through environment interaction, achieving higher success rates across multiple models and benchmarks.

Abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
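The replay-then-repair loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the corridor environment, the `step`, `verified_replay`, and `repot` names, and the `greedy_repair` stand-in for the single LLM repair call are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Toy environment (hypothetical): a 1-D corridor with cells 0..size-1.
# Actions are "L" and "R"; stepping outside the corridor is invalid.
State = int
Action = str


def step(state: State, action: Action, size: int) -> Optional[State]:
    """Deterministic transition; returns None when the action is invalid."""
    if action == "R" and state + 1 < size:
        return state + 1
    if action == "L" and state - 1 >= 0:
        return state - 1
    return None


@dataclass
class Replay:
    prefix: List[Action]      # longest verified prefix of the plan
    state: State              # checkpoint state reached after that prefix
    failed_at: Optional[int]  # index of the first invalid action, None if none


def verified_replay(plan: List[Action], start: State, size: int) -> Replay:
    """Walk the plan through the environment up to its first invalid transition."""
    state = start
    for i, action in enumerate(plan):
        nxt = step(state, action, size)
        if nxt is None:
            return Replay(plan[:i], state, i)
        state = nxt
    return Replay(list(plan), state, None)


def repot(
    plan: List[Action],
    start: State,
    goal: State,
    size: int,
    repair_fn: Callable[[State, State], List[Action]],
) -> Tuple[List[Action], bool, int]:
    """Return (final_plan, solved, extra_calls), using at most one repair call."""
    first = verified_replay(plan, start, size)
    if first.failed_at is None and first.state == goal:
        return first.prefix, True, 0
    # Single repair call: resume from the checkpoint state after the verified prefix.
    suffix = repair_fn(first.state, goal)
    candidate = first.prefix + suffix
    second = verified_replay(candidate, start, size)
    return candidate, second.failed_at is None and second.state == goal, 1


def greedy_repair(state: State, goal: State) -> List[Action]:
    """Stand-in for the LLM repair call (hypothetical): walk straight to the goal."""
    return ["R"] * (goal - state) if goal >= state else ["L"] * (state - goal)


if __name__ == "__main__":
    # The fifth action ("L" from cell 0) is invalid, so replay stops at index 4.
    plan = ["R", "R", "L", "L", "L", "R"]
    final_plan, solved, extra_calls = repot(plan, start=0, goal=4, size=5,
                                            repair_fn=greedy_repair)
    print(final_plan, solved, extra_calls)
```

In this sketch the repair function sees only the checkpoint state and the goal, which mirrors the abstract's claim that checkpoint information is the recovery signal; a real system would pass that state to the model in a prompt.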

Community

Paper submitter 1 day ago

TL;DR: When a large reasoning model writes a Python program to solve a multi-step puzzle, a single wrong primitive step silently invalidates the entire plan, even if 29 of 30 steps were right. RePoT treats the program as a checkpoint, not a final answer: it runs the emitted plan through a deterministic verifier, stops at the first invalid transition, and asks the model for one repair call that resumes from the verified prefix. No fine-tuning, no rollout-time search.

Results on PuzzleZoo-775

  • Gains of about +3 to +11 pp over vanilla Program-of-Thought across four closed-model configurations (gpt-5.4-mini ± reasoning, gemini-3.5-flash, claude-sonnet-4.6), peaking at 96.9% vs 86.3% on gpt-5.4-mini-medium.
  • Replicates on PlanBench Blocksworld and on four open-weights models (Qwen3.6-35B-A3B, Gemma-4-26B-A4B-it, gpt-oss-20b, Nemotron-3-Nano-30B-A3B).
  • Costs at most one extra LLM call on the ~14% of problems where PoT fails.
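As a quick sanity check on the numbers above, the cost and the peak gain can be worked out directly. The function names here are made up for illustration; the inputs (a ~14% PoT failure rate, at most one extra call per failure, and 96.9% vs 86.3%) come from the post.

```python
def expected_llm_calls(pot_failure_rate: float, extra_calls_on_failure: int = 1) -> float:
    """Upper bound on average LLM calls per problem: one PoT call plus repair calls."""
    return 1.0 + pot_failure_rate * extra_calls_on_failure


def gain_pp(repot_accuracy: float, pot_accuracy: float) -> float:
    """Accuracy gain in percentage points, rounded to one decimal place."""
    return round(repot_accuracy - pot_accuracy, 1)


print(f"{expected_llm_calls(0.14):.2f}")  # prints 1.14
print(gain_pp(96.9, 86.3))                # prints 10.6
```

So under these figures RePoT averages at most about 1.14 LLM calls per problem, and the peak configuration gains 10.6 pp, consistent with the upper end of the reported +3 to +11 pp range.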

We also release Derail-550, the first benchmark to fix the failure point across recovery methods, so that cross-method comparisons become causal rather than correlational. Finding: the trusted checkpoint state does the recovery work, not any specific prefix tail.

🤗 Dataset: https://huggingface.co/datasets/parsa-mz/puzzlezoo
💻 Code: https://github.com/parsa-mz/RePot

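Librarian Bot (automated comment): the following similar papers were recommended by the Semantic Scholar API.

  • Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents (2026), https://huggingface.co/papers/2605.23574
  • RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement (2026), https://huggingface.co/papers/2605.09730
  • SPEAR: Code-Augmented Agentic Prompt Optimization (2026), https://huggingface.co/papers/2605.26275
  • ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning (2026), https://huggingface.co/papers/2605.05737
  • The Detection-Extraction Gap: Models Know the Answer Before They Can Say It (2026), https://huggingface.co/papers/2604.06613
  • Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use (2026), https://huggingface.co/papers/2605.26037
  • Verifier-Guided Code Translation via Meta-Step Decoding (2026), https://huggingface.co/papers/2605.17626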


Get this paper in your agent:

hf papers read 2605.30052
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

