Abstract
AI-generated summary
Temporal composition of sub-policies in MDPs with time-varying rewards enables optimal policy recovery through generalized Dijkstra search, which inspires a dynamic latent routing method for language model fine-tuning that outperforms traditional supervised approaches.
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
Community
Humans live in a continuous world, yet think in discrete language. LLMs live in a tokenized world, yet think in continuous representations. So is discrete language merely a communication artifact?
Our answer: no. Language is not just discrete; it is compositional. That structure lets agents both act and learn by composition. “Open the door, then enter the room” composes two policies into a new one. “A narwhal is a whale with a horn” teaches an unfamiliar concept by combining ones we already know.
We turn this intuition into an RL theorem: with General Dijkstra Search (GDS), shorter optimal sub-policies can be concatenated, much like words in language, to form globally optimal goal-reaching policies. Instead of relearning every situation from scratch, the agent can reuse learned sub-policies and search over their possible compositions as goals change.
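To make the composition concrete, here is a minimal, self-contained sketch (not the paper's implementation): treat subgoals as graph nodes, the cost of the learned sub-policy between two subgoals as an edge weight, and recover a goal-reaching policy by running Dijkstra and concatenating the sub-policies along the shortest path. The names `compose_policies` and `subpolicy_cost` are illustrative assumptions.

```python
import heapq

def compose_policies(subgoals, subpolicy_cost, start, goal):
    """Dijkstra over subgoals: edge weights are the costs of optimal
    sub-policies; the returned legs, concatenated in order, form a
    goal-reaching policy of minimal total cost."""
    dist, prev = {start: 0.0}, {}
    frontier = [(0.0, start)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in subgoals:
            if v == u:
                continue
            c = subpolicy_cost(u, v)  # cost of the learned u -> v sub-policy
            if c is None:
                continue  # no sub-policy available for this leg
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(frontier, (d + c, v))
    legs, node = [], goal  # walk predecessors back from the goal
    while node != start:
        legs.append((prev[node], node))
        node = prev[node]
    return list(reversed(legs))

# Toy check: two short sub-policies (cost 1 each) beat one direct policy (cost 3).
costs = {("start", "door"): 1.0, ("door", "room"): 1.0, ("start", "room"): 3.0}
print(compose_policies(["start", "door", "room"],
                       lambda u, v: costs.get((u, v)), "start", "room"))
# -> [('start', 'door'), ('door', 'room')]
```

When the goal changes, only the search reruns; the sub-policy costs are reused, which is the cheapness the comment's "open the door, then enter the room" example points at.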
Inspired by this, Dynamic Latent Routing (DLR) lets an LLM learn its own inner monologue. Each latent code acts like a small steering signal for a sub-policy inside the model. DLR searches for useful codes, trains the model to reuse them, and lets codes compose into longer thoughts. Across low-data fine-tuning settings, DLR matches or outperforms SFT, with learned codes and n-grams taking on distinct causal roles.
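To make "search, select, update" concrete, here is a hedged toy sketch. `route_step`, the `codebook`, and the greedy argmin selection are illustrative assumptions, not DLR's actual procedure, which jointly trains codes, routing, and model parameters in one stage.

```python
import torch

def route_step(hidden, codebook, loss_fn):
    """Search: score every code as an additive steering vector on the hidden
    state. Select: keep the code that minimizes the loss. (The 'update' step,
    training codes and weights jointly, would follow in the real method.)"""
    losses = torch.stack([loss_fn(hidden + code) for code in codebook])
    best = int(torch.argmin(losses))
    return best, hidden + codebook[best]

# Toy usage: 4 random codes steering an 8-dim hidden state toward a target.
torch.manual_seed(0)
target = torch.randn(8)
codebook = torch.randn(4, 8)
hidden = torch.zeros(8)
for step in range(3):  # codes compose across steps, like tokens in a monologue
    idx, hidden = route_step(hidden, codebook, lambda h: (h - target).norm())
    print(f"step {step}: code {idx}, distance {(hidden - target).norm():.3f}")
```

The loop mirrors the GDS picture above: each selected code is a leg of a composed policy, and repeated routing steps concatenate legs into a longer latent "thought".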