Draft-OPD: On-Policy Distillation for Speculative Draft Models
Authors: Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng
Published: 2026-05-28
arXiv: 2605.29343
Project page: https://www.haodilei.top/draft-opd/
Code: https://github.com/bingyang-lei/Draft-OPD
Models: https://huggingface.co/collections/bingyang-lei/draft-opd
AI-generated summary
Speculative decoding uses a lightweight draft model to accelerate large language model inference, but supervised fine-tuning of the draft model plateaus because of an offline-to-inference mismatch. Draft-OPD addresses this through on-policy distillation with target-assisted rollouts and error replay.
Abstract
Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
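The abstract's terms "acceptance length", "lossless", and "verification-exposed error positions" can be made concrete with a toy greedy speculative decoding loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `toy_target` and `toy_draft` are hypothetical stand-in models over digits, verification runs token by token for readability (a real system verifies the block in one parallel target pass), and no distillation loss is computed. It only records where the target first rejects a draft token, which is where a replay of drafting would start according to the abstract.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except after token 3."""
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10


def draft_block(draft: NextToken, prefix: List[int], block_size: int) -> List[int]:
    """Roll out block_size tokens under the draft model's own policy."""
    tokens: List[int] = []
    for _ in range(block_size):
        tokens.append(draft(prefix + tokens))
    return tokens


def verify_block(target: NextToken, prefix: List[int],
                 proposed: List[int]) -> Tuple[int, int]:
    """Return (number of accepted draft tokens, target token that follows them)."""
    for i, tok in enumerate(proposed):
        expected = target(prefix + proposed[:i])
        if tok != expected:
            return i, expected  # first rejection: target's token replaces the draft's
    return len(proposed), target(prefix + proposed)  # all accepted: bonus token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, block_size: int):
    out = list(prompt)
    accept_lengths: List[int] = []   # tokens committed per verification step
    error_positions: List[int] = []  # absolute index of each first rejected draft token
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_block(draft, out, block_size)
        n_ok, next_token = verify_block(target, out, proposed)
        if n_ok < len(proposed):
            error_positions.append(len(out) + n_ok)
        out.extend(proposed[:n_ok])
        out.append(next_token)
        accept_lengths.append(n_ok + 1)
    return out[:len(prompt) + max_new_tokens], accept_lengths, error_positions


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding, used as the lossless reference."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


out, accept_lengths, error_positions = speculative_decode(
    toy_target, toy_draft, prompt=[0], max_new_tokens=12, block_size=4)
reference = greedy_decode(toy_target, [0], 12)
print(out == reference, accept_lengths, error_positions)
```

In this sketch the output matches plain greedy decoding with the target, which is the sense in which the speedup is lossless, and the mean of `accept_lengths` is the quantity that the abstract reports as plateauing under SFT. The entries of `error_positions` mark draft-induced states where target feedback disagrees with the drafter; how Draft-OPD turns those states into a training signal is described in the paper itself and is not reproduced here.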