Hugging Face Daily Papers · June 2, 2026 · 4 min read

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Domino is a speculative decoding method that improves parallel drafting by adding lightweight causal correction. It aims to retain the efficiency of block-parallel draft generation while recovering part of the causal dependency modeling lost in fully parallel draft models. Code and models are available at: https://github.com/jianuo-huang/Domino


arxiv:2605.29707


Published on May 28

Submitted by 黄佳诺 on Jun 2

Shanghai Jiao Tong University


Authors:

Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang

Abstract

Domino is a speculative decoding framework that improves LLM inference speed by decoupling causal dependency modeling from autoregressive drafting through a parallel backbone and lightweight causal refinement head, achieving significant speedups in both end-to-end execution and throughput.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
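The draft-refine-verify flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the Domino head's architecture, so the pairwise bias table `PAIR_BIAS`, the helper `refine_block`, the three-token vocabulary, and all numbers below are hypothetical stand-ins chosen only to show the idea. Stage 1 scores every block position independently (as a parallel drafter would), stage 2 adds a cheap prefix-dependent correction, and the standard greedy verification rule keeps the longest draft prefix that matches the target model's own choices.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def refine_block(base_logits: List[List[float]],
                 pair_bias: List[List[float]],
                 last_token: int) -> List[int]:
    """Hypothetical stage 2: walk the block left to right and add a cheap,
    prefix-dependent bias to logits that stage 1 produced in parallel."""
    tokens: List[int] = []
    prev = last_token
    for logits in base_logits:
        corrected = [score + pair_bias[prev][v] for v, score in enumerate(logits)]
        prev = argmax(corrected)
        tokens.append(prev)
    return tokens


def verify_greedy(draft: List[int], target_greedy: List[int]) -> List[int]:
    """Greedy verification: keep the longest draft prefix that agrees with the
    target's greedy tokens, then append the target's token at the first
    mismatch (or its bonus token when the whole draft is accepted)."""
    if len(target_greedy) != len(draft) + 1:
        raise ValueError("target must score every draft position plus one")
    accepted: List[int] = []
    for drafted, wanted in zip(draft, target_greedy):
        if drafted != wanted:
            break
        accepted.append(drafted)
    accepted.append(target_greedy[len(accepted)])
    return accepted


# Assumed toy data: vocabulary of 3 tokens, block of 3 draft positions.
BASE_LOGITS = [[2.0, 1.0, 0.0], [1.0, 1.2, 0.0], [0.5, 0.4, 0.0]]
PAIR_BIAS = [[0.0, 0.0, 1.5],   # after token 0, token 2 becomes more likely
             [0.0, 0.0, 0.0],   # after token 1, no correction
             [0.0, 1.0, 0.0]]   # after token 2, token 1 becomes more likely
LAST_CONTEXT_TOKEN = 1
# Assumed greedy choices of the target model. Only entries up to the first
# mismatch are read, so the same list serves both drafts in this toy case.
TARGET_GREEDY = [0, 2, 1, 0]

parallel_draft = [argmax(logits) for logits in BASE_LOGITS]
refined_draft = refine_block(BASE_LOGITS, PAIR_BIAS, LAST_CONTEXT_TOKEN)
parallel_accepted = verify_greedy(parallel_draft, TARGET_GREEDY)
refined_accepted = verify_greedy(refined_draft, TARGET_GREEDY)

print("parallel draft:", parallel_draft, "->", parallel_accepted)
print("refined draft: ", refined_draft, "->", refined_accepted)
```

In this contrived example the parallel-only draft diverges from the target at the second position, while the corrected draft is accepted in full, so one verification pass yields more tokens. The real trade-off depends on how cheap the correction step is relative to a full autoregressive drafter, which is what the paper measures.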