Enjoy! :)</p>\n","updatedAt":"2026-06-18T10:55:59.259Z","author":{"_id":"6788cbfd6cdca10c9bb3dea5","avatarUrl":"/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg","fullname":"Minjae Lee","name":"FuriosaMJLEE","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5418548583984375},"editors":["FuriosaMJLEE"],"editorAvatarUrls":["/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg"],"reactions":[],"isReport":false}},{"id":"6a33d79d4b66cffcdddfc54c","author":{"_id":"6788cbfd6cdca10c9bb3dea5","avatarUrl":"/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg","fullname":"Minjae Lee","name":"FuriosaMJLEE","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false},"createdAt":"2026-06-18T11:33:49.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"https://github.com/furiosa-ai/EfficientRollout","html":"<p><a href=\"https://github.com/furiosa-ai/EfficientRollout\" rel=\"nofollow\">https://github.com/furiosa-ai/EfficientRollout</a></p>\n","updatedAt":"2026-06-18T11:33:49.812Z","author":{"_id":"6788cbfd6cdca10c9bb3dea5","avatarUrl":"/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg","fullname":"Minjae 
Lee","name":"FuriosaMJLEE","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":1,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.46558836102485657},"editors":["FuriosaMJLEE"],"editorAvatarUrls":["/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg"],"reactions":[],"isReport":false}},{"id":"6a33dc7036f2190cf3573763","author":{"_id":"6960eca92f7ad9b043b5cbe0","avatarUrl":"/avatars/e68dcc7fd04f143d849d40414866e633.svg","fullname":"Noah","name":"noahml","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":0,"isUserFollowing":false},"createdAt":"2026-06-18T11:54:24.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Neat paper. RL rollout latency is such a headache, especially with long-tailed generations slowing everything down. The idea of using a quantized version of the target model as a self-drafter seems like a clever way to avoid the maintenance of keeping a separate model in sync with an evolving policy.\n\nHow much performance do you lose by using a quantized drafter compared to a full-precision one, or is the speedup in the memory-bound regime worth the tradeoff?\n\nI made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:\nhttps://researchpod.app/episode/fa395e35-3eb7-4724-8132-8fe7fd7926d1","html":"<p>Neat paper. RL rollout latency is such a headache, especially with long-tailed generations slowing everything down. 
The idea of using a quantized version of the target model as a self-drafter seems like a clever way to avoid the maintenance of keeping a separate model in sync with an evolving policy.</p>\n<p>How much performance do you lose by using a quantized drafter compared to a full-precision one, or is the speedup in the memory-bound regime worth the tradeoff?</p>\n<p>I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:<br><a href=\"https://researchpod.app/episode/fa395e35-3eb7-4724-8132-8fe7fd7926d1\" rel=\"nofollow\">https://researchpod.app/episode/fa395e35-3eb7-4724-8132-8fe7fd7926d1</a></p>\n","updatedAt":"2026-06-18T11:54:24.989Z","author":{"_id":"6960eca92f7ad9b043b5cbe0","avatarUrl":"/avatars/e68dcc7fd04f143d849d40414866e633.svg","fullname":"Noah","name":"noahml","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":0,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9285455942153931},"editors":["noahml"],"editorAvatarUrls":["/avatars/e68dcc7fd04f143d849d40414866e633.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.18967","authors":[{"_id":"6a33695c59127a45e2c1c610","user":{"_id":"66ab9a9c661901e7adacbba0","avatarUrl":"/avatars/ee146edc35a6ac5e669188af35bf34d0.svg","isPro":false,"fullname":"MinseoKim","user":"minseo25","type":"user","name":"minseo25"},"name":"Minseo Kim","status":"claimed_verified","statusLastChangedAt":"2026-06-18T11:26:36.343Z","hidden":false},{"_id":"6a33695c59127a45e2c1c611","name":"Minjae Lee","hidden":false},{"_id":"6a33695c59127a45e2c1c612","name":"Seunghyuk Oh","hidden":false},{"_id":"6a33695c59127a45e2c1c613","name":"Kevin Galim","hidden":false},{"_id":"6a33695c59127a45e2c1c614","name":"Donghoon Kim","hidden":false},{"_id":"6a33695c59127a45e2c1c615","name":"Coleman Hooper","hidden":false},{"_id":"6a33695c59127a45e2c1c616","name":"Harman 
Singh","hidden":false},{"_id":"6a33695c59127a45e2c1c617","name":"Amir Gholami","hidden":false},{"_id":"6a33695c59127a45e2c1c618","name":"Hyung Il Koo","hidden":false},{"_id":"6a33695c59127a45e2c1c619","name":"Wonjun Kang","hidden":false}],"publishedAt":"2026-06-17T00:00:00.000Z","submittedOnDailyAt":"2026-06-18T00:00:00.000Z","title":"EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts","submittedOnDailyBy":{"_id":"6788cbfd6cdca10c9bb3dea5","avatarUrl":"/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg","isPro":false,"fullname":"Minjae Lee","user":"FuriosaMJLEE","type":"user","name":"FuriosaMJLEE"},"summary":"Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. 
EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.","upvotes":12,"discussionId":"6a33695c59127a45e2c1c61a","githubRepo":"https://github.com/furiosa-ai/EfficientRollout","githubRepoAddedBy":"user","ai_summary":"EfficientRollout is a system-aware self-speculative decoding framework that accelerates reinforcement learning rollouts by adapting drafters to evolving policies and optimizing speculative decoding regimes.","ai_keywords":["reinforcement learning","autoregressive sampling","speculative decoding","rollout generation","self-speculative decoding","drafters","acceptance-aware draft-length adaptation","compute-bound regimes","memory-bound regimes"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":1,"organization":{"_id":"6213a1dcb670cb63a38074a1","name":"furiosa-ai","fullname":"FuriosaAI","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/620b6dc29412b0861cb2474a/Bl7ua2mXSFxk9rVo8vMA8.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"69ae6142cb7650cb6df8787d","avatarUrl":"/avatars/51ecdc68fe98dcc847c152383c673f9b.svg","isPro":false,"fullname":"Minseo 
Kim","user":"minseokim25","type":"user"},{"_id":"66ab9a9c661901e7adacbba0","avatarUrl":"/avatars/ee146edc35a6ac5e669188af35bf34d0.svg","isPro":false,"fullname":"MinseoKim","user":"minseo25","type":"user"},{"_id":"6a308db1114dc7ae7d02c605","avatarUrl":"/avatars/4e2603b671bfe79255e8c45da1ad4d87.svg","isPro":false,"fullname":"Younghyun Kim","user":"yhyunkim","type":"user"},{"_id":"6788cbfd6cdca10c9bb3dea5","avatarUrl":"/avatars/a0084b72d5b1a82d9c8a5aef1a230ffd.svg","isPro":false,"fullname":"Minjae Lee","user":"FuriosaMJLEE","type":"user"},{"_id":"63e1d5247fbb6ae4d4f4cc8e","avatarUrl":"/avatars/8a8f700adf9e8000641c2c2f6bd56080.svg","isPro":false,"fullname":"Wonjun Kang","user":"wjkang","type":"user"},{"_id":"68f131f0f7a26fddda93aa90","avatarUrl":"/avatars/d9a8b8b306f958214f5999b97ca2ffc3.svg","isPro":false,"fullname":"Wonjun Kang","user":"wjkang-furiosa","type":"user"},{"_id":"61ad9c3300d01045fca0ad64","avatarUrl":"/avatars/04c53c2d68d80db1053e5ebadbda5592.svg","isPro":false,"fullname":"Min Jae Lee","user":"mjbooo","type":"user"},{"_id":"62fca4e265ba08da9ccf9474","avatarUrl":"/avatars/38da8e019a1997dee9ea8bd6175b39b5.svg","isPro":false,"fullname":"Donghoon Kim","user":"DonghoonKim","type":"user"},{"_id":"64ae35dc00781825350e880b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ae35dc00781825350e880b/VuON41yUDzNDACbvWAVhz.jpeg","isPro":false,"fullname":"Seunghyuk Oh","user":"JakeOh","type":"user"},{"_id":"64ca6c28cbf85b573123431a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/64ca6c28cbf85b573123431a/jPEJTmrqM7yZIlhhCgt5D.png","isPro":false,"fullname":"Rishi 
Athavale","user":"rishipython","type":"user"},{"_id":"694bd29d43cdb4a594bc080b","avatarUrl":"/avatars/c3c67d0c08703b5a47914152e160a3a4.svg","isPro":false,"fullname":"harmannnn","user":"harmannnn","type":"user"},{"_id":"620783f24e28382272337ba4","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/620783f24e28382272337ba4/zkUveQPNiDfYjgGhuFErj.jpeg","isPro":false,"fullname":"GuoLiangTang","user":"Tommy930","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"6213a1dcb670cb63a38074a1","name":"furiosa-ai","fullname":"FuriosaAI","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/620b6dc29412b0861cb2474a/Bl7ua2mXSFxk9rVo8vMA8.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.18967.md","query":{}}">
EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
Abstract
EfficientRollout is a system-aware self-speculative decoding framework that accelerates reinforcement learning rollouts by adapting drafters to evolving policies and optimizing speculative decoding regimes.
Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e., self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
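To make the three ingredients named in the abstract concrete, here is a minimal Python sketch. It is not taken from the EfficientRollout code. `verify_draft` is the standard speculative-sampling accept/reject rule, which is what keeps the output distribution equal to the target model's. `should_speculate` and `adapt_draft_length` are hypothetical stand-ins for the toggle policy and the draft-length adaptation: the abstract does not give the actual rules, so the batch-size threshold, the grow/shrink conditions, and all function and parameter names are assumptions chosen only for illustration.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Standard speculative-sampling verification for one sequence.

    draft_tokens[i] is the token the drafter sampled at position i;
    draft_probs[i] and target_probs[i] map token -> probability under the
    drafter and the target at that position. Returns the tokens emitted.
    (The bonus token sampled from the target when every draft is accepted
    is omitted to keep the sketch short.)
    """
    out = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(p - q, 0); passing
        # unnormalized weights to choices() renormalizes it implicitly.
        residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
        tokens, weights = zip(*residual.items())
        out.append(rng.choices(tokens, weights=weights)[0])
        break  # everything after the first rejection is discarded
    return out


def should_speculate(active_batch_size, max_batch_for_sd=16):
    """Hypothetical toggle: speculate only once the active batch is small
    enough that decoding is assumed memory-bound. The threshold is made up."""
    return active_batch_size <= max_batch_for_sd


def adapt_draft_length(draft_len, accepted, min_len=1, max_len=8):
    """Hypothetical acceptance-aware rule: draft one more token after a fully
    accepted draft, one fewer when under half of the draft was accepted."""
    if accepted == draft_len:
        return min(draft_len + 1, max_len)
    if accepted * 2 < draft_len:
        return max(draft_len - 1, min_len)
    return draft_len
```

In a rollout loop, one would presumably call `should_speculate` as sequences finish and the active batch shrinks, then feed the number of accepted tokens from each `verify_draft` call into `adapt_draft_length` to set the next drafting budget.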
Community
Neat paper. RL rollout latency is such a headache, especially with long-tailed generations slowing everything down. The idea of using a quantized version of the target model as a self-drafter seems like a clever way to avoid the maintenance of keeping a separate model in sync with an evolving policy.
How much performance do you lose by using a quantized drafter compared to a full-precision one, or is the speedup in the memory-bound regime worth the tradeoff?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/fa395e35-3eb7-4724-8132-8fe7fd7926d1