Trust Region Q Adjoint Matching
Authors: Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin
Organization: KAIST AI
Published: 2026-05-26
arXiv: 2605.27079
Project page: https://yonghdong.github.io/blog/trqam/
Code: https://github.com/yonghdong/trqam
Abstract
Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies.
Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be expressed as a closed-form function of λ. As a result, our method can precisely control the deviation from pretrained flow policies, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline at 46%.
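To make the "projected dual descent" step concrete, here is a minimal Python sketch of the generic technique for a constraint of the form KL(λ) ≤ ε. It is not the authors' implementation: the function `toy_path_kl`, its constant `c`, the step size, and the lower bound `lam_min` are all hypothetical stand-ins, since the paper's closed-form KL expression is not reproduced on this page.

```python
def toy_path_kl(lam: float, c: float = 2.0) -> float:
    """Hypothetical stand-in for a closed-form path-space KL as a function of lam.

    Chosen only because it decreases monotonically as lam grows; the paper's
    actual expression is not used here.
    """
    return c / lam**2


def projected_dual_descent(kl_fn, eps, lam0=1.0, lr=0.5, lam_min=1e-3, steps=500):
    """Gradient descent on the dual variable lam, projected onto lam >= lam_min.

    For the Lagrangian  E[Q] - lam * (KL - eps),  the derivative of the dual
    function with respect to lam is (eps - KL(lam)), so a descent step raises
    lam when the KL budget is exceeded and lowers it when there is slack.
    """
    lam = lam0
    for _ in range(steps):
        dual_grad = eps - kl_fn(lam)
        lam = max(lam_min, lam - lr * dual_grad)  # descent step, then projection
    return lam


# With the toy KL above and a budget of eps = 0.5, the constraint is tight
# at lam = sqrt(c / eps) = 2.
lam_star = projected_dual_descent(toy_path_kl, eps=0.5)
```

The projection keeps λ away from zero, where the toy KL (and the implied deviation from the pretrained policy) would blow up. In a real training loop the KL value would come from the learned model rather than a fixed formula, and the update would typically run once per gradient step rather than to convergence.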
Community
TRQAM internalizes the trust region as a scalar λ inside the flow-policy sampling SDE: an exact Girsanov path-space KL identity (Thm 1) makes the KL budget structurally enforceable via dual descent. 68% overall offline-RL success rate vs 46% for the strongest baseline on 50 OGBench tasks. 👇 blog & code
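For readers unfamiliar with the Girsanov identity mentioned in this comment, the standard textbook form for two SDEs that share a diffusion coefficient is sketched below. The notation (drift b, noise scale σ_t, control u, time horizon [0, 1]) is assumed here for illustration and is not taken from the paper; Theorem 1 of the paper gives the exact statement in terms of λ, which is not reproduced here.

```latex
% Reference (pretrained) process and controlled (fine-tuned) process,
% with the same initial law and the same diffusion coefficient \sigma_t:
\begin{aligned}
\mathbb{P}^{\mathrm{ref}}:&\quad dX_t = b(X_t,t)\,dt + \sigma_t\,dW_t,\\
\mathbb{P}^{u}:&\quad dX_t = \bigl(b(X_t,t) + \sigma_t\,u(X_t,t)\bigr)\,dt + \sigma_t\,dW_t.
\end{aligned}

% Girsanov's theorem (under an integrability condition such as Novikov's) gives
D_{\mathrm{KL}}\bigl(\mathbb{P}^{u}\,\|\,\mathbb{P}^{\mathrm{ref}}\bigr)
  = \mathbb{E}_{\mathbb{P}^{u}}\!\left[\frac{1}{2}\int_0^1 \lVert u(X_t,t)\rVert^2\,dt\right].
```

In this form, replacing u by u/λ divides the integrand by λ² pointwise, which is one way a single scalar can act as a trust-region knob. The expectation is also taken under a different path measure, so the total KL need not scale exactly as 1/λ², and whether λ enters TRQAM in this way is an assumption of the sketch.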
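This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning (2026): https://huggingface.co/papers/2605.06156
* Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow (2026): https://huggingface.co/papers/2605.07727
* Discrete Flow Matching for Offline-to-Online Reinforcement Learning (2026): https://huggingface.co/papers/2605.12379
* Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities (2026): https://huggingface.co/papers/2605.05812
* Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies (2026): https://huggingface.co/papers/2606.01151
* Aligning Flow Map Policies with Optimal Q-Guidance (2026): https://huggingface.co/papers/2605.12416
* Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy (2026): https://huggingface.co/papers/2605.13435
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend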
Cite arxiv.org/abs/2605.27079 in a model, dataset, or Space README.md to link it from this page.