Hugging Face Daily Papers · 6 min read

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.18394

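Published on Jun 25, 2026 · Submitted by Lanxiang Hu on Jun 26, 2026
Authors: Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang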

Abstract

JetSpec is a speculative decoding framework that combines single-forward-pass drafting with branch-wise causal conditioning, improving LLM inference speed and accepted draft length across math, coding, and chat benchmarks.

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
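
To make the draft-then-verify loop and the scaling trade-off in the abstract concrete, here is a minimal toy sketch of greedy tree verification. It is not JetSpec's implementation: `DraftNode`, `verify_tree`, `estimated_speedup`, and `toy_target` are hypothetical names introduced for illustration, the target model is a stand-in function, and the cost model is a simplified assumption. Real systems typically score all tree nodes in one target forward pass using a tree attention mask; the sketch queries the target step by step only for readability.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class DraftNode:
    """One candidate token in a draft tree (hypothetical structure)."""
    token: int
    children: List["DraftNode"] = field(default_factory=list)


def verify_tree(
    prefix: Sequence[int],
    roots: List[DraftNode],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification: follow the branch that matches the target's choices.

    Returns the tokens emitted in this step: every accepted draft token,
    followed by one token chosen by the target itself.
    """
    emitted: List[int] = []
    candidates = roots
    while True:
        want = target_next(list(prefix) + emitted)
        emitted.append(want)  # the target's own token is always kept
        match: Optional[DraftNode] = next(
            (node for node in candidates if node.token == want), None
        )
        if match is None:  # no branch agrees, or we ran past a leaf
            return emitted
        candidates = match.children


def estimated_speedup(
    tokens_per_step: float, draft_cost: float, verify_cost: float = 1.0
) -> float:
    """Simplified latency model (an assumption, not a formula from the paper).

    Costs are in units of one plain target decoding step, which yields
    exactly one token per unit of time.
    """
    return tokens_per_step / (draft_cost + verify_cost)


def toy_target(context: Sequence[int]) -> int:
    """Stand-in target model: the next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


# Draft tree after the prefix [1, 2]: the branch 3 -> 4 is right, the rest is wasted.
tree = [
    DraftNode(3, [DraftNode(4, [DraftNode(9)]), DraftNode(7)]),
    DraftNode(8),
]
emitted = verify_tree([1, 2], tree, toy_target)
print(emitted)                       # [3, 4, 5]: two draft tokens accepted, one from the target
print(estimated_speedup(5.0, 0.25))  # 4.0
```

Under this toy model, a larger tree only helps if it raises the number of tokens emitted per step faster than it raises `draft_cost`. That is the ceiling the abstract describes: a drafter whose cost grows with tree depth inflates the denominator, while mutually inconsistent branches spend budget without lengthening the accepted path.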
