Hugging Face Daily Papers · June 10, 2026 · 4 min read

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Like Read original ↗

QGF (Q-Guided Flow) is a test-time reinforcement learning method that utilizes critic gradients to guide pre-trained flow-based policies toward higher-value actions, enhancing stability and scalability in continuous control.</p>\n","updatedAt":"2026-06-10T02:07:52.852Z","author":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","fullname":"taesiri","name":"taesiri","type":"user","isPro":true,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":314,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9198866486549377},"editors":["taesiri"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.11087","authors":[{"_id":"6a28c6e0e7d78ea7587e534c","name":"Zhiyuan Zhou","hidden":false},{"_id":"6a28c6e0e7d78ea7587e534d","name":"Andy Peng","hidden":false},{"_id":"6a28c6e0e7d78ea7587e534e","name":"Charles Xu","hidden":false},{"_id":"6a28c6e0e7d78ea7587e534f","name":"Qiyang Li","hidden":false},{"_id":"6a28c6e0e7d78ea7587e5350","name":"Tobias Springenberg","hidden":false},{"_id":"6a28c6e0e7d78ea7587e5351","name":"Kevin Frans","hidden":false},{"_id":"6a28c6e0e7d78ea7587e5352","name":"Sergey Levine","hidden":false}],"publishedAt":"2026-06-09T00:00:00.000Z","submittedOnDailyAt":"2026-06-10T00:00:00.000Z","title":"Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning","submittedOnDailyBy":{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user","name":"taesiri"},"summary":"Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.","upvotes":1,"discussionId":"6a28c6e1e7d78ea7587e5353","projectPage":"https://q-guided-flow.github.io/","githubRepo":"https://github.com/zhouzypaul/qgf","githubRepoAddedBy":"user","ai_summary":"QGF is an RL algorithm that improves policies at test time by using a value gradient to guide a pre-trained flow policy, avoiding training-time instability while maintaining competitive performance.","ai_keywords":["diffusion models","flow models","imitation learning","reinforcement learning","policy improvement","behavioral cloning","value function critic","test-time RL","Q-Guided Flow","action generation","policy optimization"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":17},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6039478ab3ecf716b1a5fd4d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6039478ab3ecf716b1a5fd4d/_Thy4E7taiSYBLKxEKJbT.jpeg","isPro":true,"fullname":"taesiri","user":"taesiri","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.11087.md"}">

Papers

arxiv:2606.11087

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Published on Jun 9

· Submitted by

taesiri on Jun 10

Upvote

Authors:

Abstract

QGF is an RL algorithm that improves policies at test time by using a value gradient to guide a pre-trained flow policy, avoiding training-time instability while maintaining competitive performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

View arXiv page View PDF Project page GitHub 17 Add to collection

Community

taesiri

Paper submitter about 15 hours ago

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2606.11087

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.11087 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.11087 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.11087 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

No comments yet. Sign in and be the first to say something.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Abstract

Community

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 0

Discussion (0)

More from Hugging Face Daily Papers