Hugging Face Daily Papers · 6 min read

Trust Region Q Adjoint Matching

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.27079

Published on May 26, 2026 · Submitted by Yonghoon Dong on Jun 5, 2026
Authors: Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

Abstract

Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies.

Off-policy reinforcement learning of pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter λ in the SOC dynamics and show theoretically that the path-space KL can be written as a closed-form function of λ. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
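To make the "projected dual descent" idea concrete, here is a minimal toy sketch. It is not the authors' algorithm, and the KL formula below is not the paper's closed form: it assumes a one-dimensional Gaussian base policy N(0, 1) and a linear reward c·a, for which the KL-regularized optimum is N(c/λ, 1) and the KL to the base policy is (c/λ)²/2. All names (`toy_kl`, `projected_dual_descent`, `eps`, `lam_min`, `lr`) are hypothetical.

```python
def toy_kl(lam: float, c: float = 1.0) -> float:
    """Toy closed-form KL as a function of lam (an assumption for illustration).

    For base policy N(0, 1) and reward c * a, the KL-regularized optimum is
    N(c / lam, 1), so KL(N(c/lam, 1) || N(0, 1)) = (c / lam)^2 / 2.
    """
    return 0.5 * (c / lam) ** 2


def projected_dual_descent(kl_fn, eps, lam0=1.0, lr=0.5, lam_min=1e-3, steps=2000):
    """Minimize the Lagrangian dual over lam >= lam_min.

    The dual gradient with respect to lam is (eps - KL(lam)): when the KL
    exceeds the budget eps, the descent step increases lam, which pulls the
    policy back toward the base policy. The max(...) is the projection step.
    """
    lam = lam0
    for _ in range(steps):
        grad = eps - kl_fn(lam)
        lam = max(lam_min, lam - lr * grad)
    return lam


eps = 0.125  # hypothetical KL budget
lam_star = projected_dual_descent(toy_kl, eps)
# For these toy values the iterate approaches c / sqrt(2 * eps) = 2.0,
# where the KL equals the budget.
print(lam_star, toy_kl(lam_star))
```

Because the KL is an explicit function of λ in this toy setting, the budget can be checked and enforced directly from λ, without estimating the divergence from samples. That is the property the abstract attributes to TRQAM's closed-form result, shown here only by analogy.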

Community

Yonghoon Dong (paper author and submitter):

TRQAM internalizes the trust region as a scalar λ inside the flow-policy sampling SDE. An exact Girsanov path-space KL identity (Thm 1) makes the KL budget structurally enforceable via dual descent. 68% vs 46% on 50 OGBench tasks. Blog: https://yonghdong.github.io/blog/trqam/ · Code: https://github.com/yonghdong/trqam
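For context, path-space KL statements of this kind usually rest on the standard Girsanov identity for two SDEs that share a diffusion coefficient and an initial distribution. The symbols b, σ, u and the unit time horizon below are notation assumed here for illustration; the paper's Theorem 1, which expresses the KL as a closed-form function of λ, is not reproduced.

```latex
\begin{aligned}
\text{reference:}\quad & dX_t = b(X_t, t)\,dt + \sigma(t)\,dW_t, \\
\text{controlled:}\quad & dX_t^{u} = \bigl(b(X_t^{u}, t) + \sigma(t)\,u(X_t^{u}, t)\bigr)\,dt + \sigma(t)\,dW_t, \\
& \mathrm{KL}\bigl(\mathbb{P}^{u} \,\|\, \mathbb{P}\bigr)
  = \mathbb{E}_{\mathbb{P}^{u}}\!\left[\tfrac{1}{2}\int_0^1 \bigl\|u(X_t^{u}, t)\bigr\|^2\,dt\right].
\end{aligned}
```

The identity holds under the usual integrability conditions on u (for example Novikov's condition), so bounding the expected control energy bounds the deviation between the two path measures.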

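Librarian Bot (automated message): the following similar papers were recommended by the Semantic Scholar API.

* Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning (2026): https://huggingface.co/papers/2605.06156
* Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow (2026): https://huggingface.co/papers/2605.07727
* Discrete Flow Matching for Offline-to-Online Reinforcement Learning (2026): https://huggingface.co/papers/2605.12379
* Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities (2026): https://huggingface.co/papers/2605.05812
* Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies (2026): https://huggingface.co/papers/2606.01151
* Aligning Flow Map Policies with Optimal Q-Guidance (2026): https://huggingface.co/papers/2605.12416
* Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy (2026): https://huggingface.co/papers/2605.13435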
