DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
Jian Mu, Tianyi Lin, Chengwei Qin, Zhongxiang Dai, Yao Shu
Published on May 29, 2026
· Submitted by mj on Jun 1, 2026
Abstract
DRIFT is a framework that applies importance-weighted supervised fine-tuning to offline trajectories, aiming for multi-turn performance comparable to reinforcement learning at a training cost close to standard supervised fine-tuning.
AI-generated summary
Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive because it must generate full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To address this, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
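One standard way to see the equivalence the abstract refers to is the closed-form solution of KL-regularized RL. The sketch below uses notation that is assumed here rather than taken from the paper: \tau is a multi-turn trajectory for prompt x, R(\tau) is its return, \beta > 0 is the KL coefficient, and \pi_{\mathrm{ref}} is the fixed reference policy. The paper's exact objective and weights may differ.

```latex
\begin{align}
J(\pi) &= \mathbb{E}_{\tau \sim \pi(\cdot \mid x)}\big[R(\tau)\big]
        - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big), \\
\pi^{*}(\tau \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(\tau \mid x)\, \exp\!\big(R(\tau)/\beta\big),
        \qquad Z(x) = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\exp\!\big(R(\tau)/\beta\big)\big], \\
\mathrm{KL}\big(\pi^{*} \,\|\, \pi_\theta\big) &= -\,\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\, w(\tau)\, \log \pi_\theta(\tau \mid x) \big] + \text{const},
        \qquad w(\tau) = \frac{\exp\!\big(R(\tau)/\beta\big)}{Z(x)}.
\end{align}
```

Minimizing the last line over \theta is a supervised log-likelihood objective on trajectories sampled once from \pi_{\mathrm{ref}}, with each trajectory weighted by w(\tau), so no fresh rollouts from \pi_\theta are needed during optimization.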
Community
As large language models are used more like interactive assistants, they need to do more than answer once: when a user points out a mistake, they should rethink and improve rather than repeat the same error. Training this behavior is difficult. One approach, reinforcement learning, rewards the model through trial and error and can work well, but it is costly because the model must repeatedly generate full back-and-forth conversations during training. A cheaper approach trains on fixed examples, but it often fails to teach the model how to recover after feedback. We propose DRIFT, a method that first collects example conversations from a fixed model, scores them by whether they reach the right answer and how quickly they do so, and then gives more training influence to the better conversations. This lets the model learn from simple feedback such as “Incorrect, please try again” without generating new conversations at every training step. Across math and general reasoning tasks, DRIFT learns stronger correction behavior while keeping training much closer in cost and simplicity to standard example-based training.
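The three steps described above (collect conversations, score them by correctness and speed, weight the better ones more heavily) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` class, the per-turn discount `gamma`, the temperature `beta`, and the exponential weighting are all hypothetical choices, and the log-probabilities are fixed numbers standing in for what a trainable model would produce.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One offline conversation sampled from the fixed reference model (hypothetical structure)."""
    solved: bool           # did any attempt reach the right answer?
    turns_used: int        # attempts made; 1 means solved without feedback
    token_logprobs: list   # log-probs of the assistant tokens under the policy being trained


def trajectory_return(traj, gamma=0.5):
    """Assumed scoring rule: 1 for a first-try success, multiplied by gamma per extra turn, 0 on failure."""
    if not traj.solved:
        return 0.0
    return gamma ** (traj.turns_used - 1)


def importance_weights(returns, beta=0.5):
    """Self-normalized weights proportional to exp(R / beta); the max is subtracted for numerical stability."""
    m = max(returns)
    unnorm = [math.exp((r - m) / beta) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_sft_loss(trajs, weights):
    """Weighted negative log-likelihood summed over each trajectory's assistant tokens."""
    return -sum(w * sum(t.token_logprobs) for t, w in zip(trajs, weights))


# Toy batch: a first-try success, a success after two rounds of feedback, and a failure.
batch = [
    Trajectory(solved=True, turns_used=1, token_logprobs=[-0.2, -0.1]),
    Trajectory(solved=True, turns_used=3, token_logprobs=[-0.5, -0.4, -0.3]),
    Trajectory(solved=False, turns_used=3, token_logprobs=[-0.9, -0.8, -0.7]),
]
returns = [trajectory_return(t) for t in batch]
weights = importance_weights(returns)
loss = weighted_sft_loss(batch, weights)
```

Because the conversations and their weights are computed once, a training loop built this way would only need to re-evaluate token log-probabilities at each step, which is what keeps the cost close to ordinary fine-tuning on fixed examples.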
Cite arxiv.org/abs/2605.31455 in a model, dataset, or Space README.md to link it from this page.