Hugging Face Daily Papers · 5 min read

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.31455


Published on May 29, 2026 · Submitted by mj on Jun 1, 2026
Authors: Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu

Abstract

AI-generated summary: DRIFT is a framework that combines offline trajectories with importance-weighted supervised fine-tuning to achieve multi-turn interactive learning efficiency and performance comparable to reinforcement learning.

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive, because full correction trajectories must be generated at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To resolve this dilemma, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
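
The equivalence the abstract refers to can be sketched with the standard closed-form solution of a KL-regularized objective. This is the textbook argument, not necessarily the paper's exact derivation: the symbols $\beta$ (regularization strength), $R(\tau)$ (trajectory return), $Z$ (normalizer) and $\pi_\theta$ (trainable policy) are notation assumed here, and each trajectory $\tau$ is treated as a single sample from the fixed reference policy $\pi_{\mathrm{ref}}$.

```latex
% KL-regularized RL objective over whole trajectories
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big)

% Its maximizer is the reference policy reweighted by exponentiated return
\pi^{*}(\tau) \;=\; \frac{1}{Z} \, \pi_{\mathrm{ref}}(\tau) \,
  \exp\!\big( R(\tau) / \beta \big),
\qquad
Z \;=\; \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\big[ \exp\!\big( R(\tau) / \beta \big) \big]

% Projecting \pi^{*} onto the parametric policy is a weighted log-likelihood
% problem whose expectation is taken under \pi_{\mathrm{ref}} only
\min_{\theta} \; \mathrm{KL}\big( \pi^{*} \,\|\, \pi_{\theta} \big)
\;\;\Longleftrightarrow\;\;
\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}
  \left[ \frac{\exp\!\big( R(\tau) / \beta \big)}{Z} \, \log \pi_{\theta}(\tau) \right]
```

Because the final expectation is over $\pi_{\mathrm{ref}}$ rather than the policy being trained, trajectories can be sampled once and reused, which is what allows rollout to be decoupled from optimization.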

Community

Paper author Paper submitter about 4 hours ago

As large language models are used more like interactive assistants, they need to do more than answer once: when a user points out a mistake, they should rethink and improve rather than repeat the same error. Training this behavior is difficult. One approach, reinforcement learning, rewards the model through trial and error and can work well, but it is costly because the model must repeatedly generate full back-and-forth conversations during training. A cheaper approach trains on fixed examples, but it often fails to teach the model how to recover after feedback. We propose DRIFT, a method that first collects example conversations from a fixed model, scores them by whether they reach the right answer and how quickly they do so, and then gives more training influence to the better conversations. This lets the model learn from simple feedback such as “Incorrect, please try again” without generating new conversations at every training step. Across math and general reasoning tasks, DRIFT learns stronger correction behavior while keeping training much closer in cost and simplicity to standard example-based training.
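
The recipe described above (collect conversations from a fixed model, score them, weight the better ones more heavily) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the DRIFT repository: the function names, the per-turn `discount` used to reward faster corrections, the temperature `beta`, and the self-normalized exponential weighting.

```python
import math


def trajectory_return(solved_at_turn, discount=0.5):
    """Hypothetical score: 0 if never solved, else discount ** (turn - 1).

    Solving on the first turn gives 1.0; each extra correction turn
    multiplies the score by `discount`.
    """
    if solved_at_turn is None:
        return 0.0
    return discount ** (solved_at_turn - 1)


def importance_weights(returns, beta=0.5):
    """Self-normalized weights proportional to exp(R / beta).

    Subtracting the maximum return before exponentiating avoids overflow
    and leaves the normalized weights unchanged.
    """
    top = max(returns)
    unnormalized = [math.exp((r - top) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(weights, log_probs):
    """Weighted negative log-likelihood over a batch of trajectories.

    log_probs[i] stands for the summed log-probability of the model's own
    tokens in trajectory i under the policy being trained.
    """
    return -sum(w * lp for w, lp in zip(weights, log_probs))


# Three offline conversations: solved on turn 1, solved on turn 3, never solved.
returns = [trajectory_return(1), trajectory_return(3), trajectory_return(None)]
weights = importance_weights(returns)
# Placeholder log-probabilities; in practice these come from the model.
loss = weighted_sft_loss(weights, [-12.0, -30.0, -25.0])
```

In this sketch the weights depend only on the stored returns, so they can be computed once after the rollout phase and reused for every training step.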


