Hugging Face Daily Papers · 6 min read

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.12070

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Published on May 12 · Submitted by guanzhong on May 13
Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies (snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption) and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.

AI-generated summary

Asynchronous reinforcement learning in large language models faces challenges with PPO-style corrections due to delayed updates and missing historical logits, which are addressed through exact and approximate correction methods including snapshot tracking and revised PPO-EWMA techniques.
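The decomposition described in the abstract can be sketched numerically. The following is a minimal illustration, not the ROLL implementation: the function names, thresholds, and EWMA fallback shape are assumptions for exposition. Given per-token log-probabilities from the inference engine, from the training-side copy of the behavior policy (the "old logits"), and from the current policy, the total importance ratio factors into a discrepancy term and a staleness term; PPO-style clipping applies only to the staleness factor, and an EWMA over policy versions can stand in for the old log-probs when they are missing, in the spirit of PPO-EWMA.

```python
import math

def decoupled_ratio(logp_infer, logp_old_train, logp_cur,
                    clip_eps=0.2, disc_mask_thresh=5.0):
    """Per-token importance weights, factored into two terms.

    discrepancy = pi_old_train / pi_infer  (training-inference mismatch,
                                            same behavior-policy version)
    staleness   = pi_cur / pi_old_train    (historical -> current policy)

    Thresholds are illustrative; the point is that clipping acts only on
    the staleness factor while masking acts only on the discrepancy
    factor, so the two corrections no longer share one threshold.
    """
    weights = []
    for li, lo, lc in zip(logp_infer, logp_old_train, logp_cur):
        discrepancy = math.exp(lo - li)
        staleness = math.exp(lc - lo)
        # PPO-style clip on the staleness factor only.
        staleness = min(max(staleness, 1.0 - clip_eps), 1.0 + clip_eps)
        # Mask tokens whose training/inference mismatch is implausibly large.
        keep = abs(lo - li) < disc_mask_thresh
        weights.append(discrepancy * staleness if keep else 0.0)
    return weights

def ewma_old_logp(prev_ewma, logp_new, beta=0.95):
    """Approximate missing old log-probs with an EWMA over policy versions."""
    if prev_ewma is None:
        return list(logp_new)
    return [beta * p + (1.0 - beta) * n for p, n in zip(prev_ewma, logp_new)]
```

For example, a token whose inference-side and training-side log-probs differ by 0.1 gets a discrepancy factor of exp(-0.1), while its staleness factor is clipped to the [1 - eps, 1 + eps] band before the two are multiplied.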


Get this paper in your agent:

hf papers read 2605.12070
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0

