Hugging Face Daily Papers · 4 min read

Dynamic Latent Routing

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.14323

Dynamic Latent Routing

Published on May 14 · Submitted by Fangyuan Yu on May 15

Authors: Fangyuan Yu, Xin Su, Amir Abdullah
Abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

AI-generated summary

Temporal composition of sub-policies in MDPs with time-varying rewards enables optimal policy recovery through generalized Dijkstra search, which inspires a dynamic latent routing method for language model fine-tuning that outperforms traditional supervised approaches.
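As a rough illustration of the "search, select, update" principle, here is a hypothetical toy in PyTorch, not the paper's objective or architecture: enumerate candidate latent codes (search), keep the lowest-loss one (select), then train the model and the chosen code jointly (update).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy of a "search, select, update" loop; the paper's
# actual single-stage procedure and objectives are defined in the paper.
torch.manual_seed(0)
model = nn.Linear(4 + 8, 1)                   # consumes input concat code
codebook = nn.Parameter(torch.randn(16, 8))   # 16 candidate latent codes
opt = torch.optim.Adam([*model.parameters(), codebook], lr=1e-2)

x, y = torch.randn(32, 4), torch.randn(32, 1)
for step in range(100):
    with torch.no_grad():                     # search: score every code
        losses = torch.stack([
            F.mse_loss(model(torch.cat([x, c.expand(32, -1)], dim=-1)), y)
            for c in codebook
        ])
    k = losses.argmin().item()                # select: best-scoring code
    code = codebook[k].expand(32, -1)
    loss = F.mse_loss(model(torch.cat([x, code], dim=-1)), y)
    opt.zero_grad()
    loss.backward()                           # update: model + code jointly
    opt.step()
```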

Community

Paper submitter · about 21 hours ago

Humans live in a continuous world, yet think in discrete language. LLMs live in a tokenized world, yet think in continuous representations. So is discrete language merely a communication artifact?

Our answer: no. Language is not just discrete; it is compositional. That structure lets agents both act and learn by composition. “Open the door, then enter the room” composes two policies into a new one. “A narwhal is a whale with a horn” teaches an unfamiliar concept by combining ones we already know.

We turn this intuition into an RL theorem: with General Dijkstra Search, shorter policies can be concatenated, much like language, to form optimal goal-reaching policies. Instead of learning by reconsidering every situation from scratch, the agent can reuse learned sub-policies and search over their possible compositions as goals change.
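A minimal sketch of the composition idea, not the paper's GDS algorithm itself: if each learned sub-policy reliably moves the agent between two abstract states at a known cost, the cheapest goal-reaching composition falls out of an ordinary Dijkstra search over a graph whose edges are sub-policies. All names below are hypothetical.

```python
import heapq

def cheapest_composition(sub_policies, start, goal):
    """Dijkstra over abstract states, where each edge is a learned
    sub-policy (src, dst, cost, name). Returns the cheapest sequence
    of sub-policy names reaching `goal` from `start`."""
    graph = {}
    for src, dst, cost, name in sub_policies:
        graph.setdefault(src, []).append((cost, dst, name))

    frontier = [(0.0, start, [])]   # (total_cost, state, composed_plan)
    best = {}
    while frontier:
        total, state, plan = heapq.heappop(frontier)
        if state == goal:
            return total, plan      # first pop of goal is optimal
        if state in best and best[state] <= total:
            continue
        best[state] = total
        for cost, nxt, name in graph.get(state, []):
            heapq.heappush(frontier, (total + cost, nxt, plan + [name]))
    return float("inf"), None

# "Open the door, then enter the room" as a composition of two policies.
policies = [
    ("hallway", "door_open", 1.0, "open_door"),
    ("door_open", "room", 1.0, "enter_room"),
    ("hallway", "room", 5.0, "break_wall"),  # a costlier alternative
]
print(cheapest_composition(policies, "hallway", "room"))
# -> (2.0, ['open_door', 'enter_room'])
```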

Inspired by this, Dynamic Latent Routing lets an LLM learn its own inner monologue. Each code acts like a small steering signal for a sub-policy inside the model. DLR searches for useful codes, trains the model to reuse them, and lets codes compose into longer thoughts. Across low-data fine-tuning settings, DLR matches or outperforms SFT, with learned codes and n-grams taking on distinct causal roles.
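The comment does not pin down the architecture, so the following is a hypothetical PyTorch sketch of what "codes as steering signals" could look like: a learned codebook, a router that scores codes from the hidden state, and the selected code added back into the hidden state. It is illustrative only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentRouter(nn.Module):
    """Hypothetical sketch of discrete latent routing: a small codebook
    of steering vectors plus a router that picks one code per step."""
    def __init__(self, hidden_dim: int, num_codes: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, hidden_dim)  # learned codes
        self.router = nn.Linear(hidden_dim, num_codes)       # routing policy

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) hidden state at the current step.
        logits = self.router(h)      # score each code
        idx = logits.argmax(dim=-1)  # hard "select" step
        steer = self.codebook(idx)   # chosen steering vector
        return h + steer             # nudge the hidden state

router = LatentRouter(hidden_dim=768)
h = torch.randn(2, 768)
print(router(h).shape)  # torch.Size([2, 768])
```

Note that a hard argmax like this is not differentiable through the selection itself; presumably the paper's single-stage "dynamic search" training is what replaces this naive select step.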



