A Verifiable Search Is Not a Learnable Chain-of-Thought
Author: Harsh Patel
Published: 2026-06-20
Project page: https://nemotron.harshpatel.live
Code: https://github.com/harshpatel1692/search-not-learnable
Abstract
Training models on chain-of-thought demonstrations fails for tasks requiring backtracking search because the forward derivation cannot be faithfully imitated, demonstrating a fundamental limitation in learning search procedures through demonstration.
It is tempting to assume that any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows that the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each produced by a deterministic generator; the public and hidden splits share generators, so held-out data serves as a proxy for test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a LoRA of rank at most 32 over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic tasks transfer at >= 0.99 and an 8-bit boolean task at 0.68. Cryptarithm does not: distilling its backtracking search stays at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic correctly on 97-100% of lines and ranks the correct cipher in its top eight on 71%; what it cannot do is carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step, but the step's verdicts become unconditional templates, correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting. A controlled intervention isolates the cause: revealing the cipher key, which turns the derivation into a forward one, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search: precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification. The 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.
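The contrast between verifying a cipher key and searching for one can be made concrete with a minimal sketch. It assumes a classic word-addition cryptarithm, which may differ from the challenge's actual generator and puzzle format; `verify` and `solve` are hypothetical names for illustration and are not taken from the paper's code.

```python
def word_value(word, key):
    """Read a word as a base-10 number under a letter-to-digit key."""
    return int("".join(str(key[ch]) for ch in word))


def verify(addends, total, key):
    """Forward check: once the key is known, this is plain arithmetic."""
    words = addends + [total]
    if len(set(key.values())) != len(key):      # digits must be distinct
        return False
    if any(key[w[0]] == 0 for w in words):      # no leading zeros
        return False
    return sum(word_value(w, key) for w in addends) == word_value(total, key)


def solve(addends, total):
    """Backtracking search: try digits letter by letter, undo on failure.

    Deliberately unpruned, so it is only practical for small puzzles.
    """
    letters = sorted(set("".join(addends) + total))
    key, used = {}, set()

    def backtrack(i):
        if i == len(letters):
            # Every letter has a digit; keep the key only if it checks out.
            return dict(key) if verify(addends, total, key) else None
        for digit in range(10):
            if digit in used:
                continue
            key[letters[i]] = digit
            used.add(digit)
            found = backtrack(i + 1)
            if found is not None:
                return found
            # Dead end: retract this guess before trying the next digit.
            del key[letters[i]]
            used.discard(digit)
        return None

    return backtrack(0)


if __name__ == "__main__":
    # TO + GO = OUT has the single solution 21 + 81 = 102.
    print(solve(["TO", "GO"], "OUT"))
```

In this sketch `verify` is a short left-to-right computation, while `solve` makes guesses that are mostly retracted. A transcript of those retracted guesses is the kind of trace the abstract reports does not distill.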
Community
A research study built on the NVIDIA Nemotron Model Reasoning Challenge.
Cite arxiv.org/abs/2606.21884 in a model, dataset, or Space README.md to link it from this page.