r/LocalLLaMA

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

We find that speculative decoding can push LLM generation latency to the extreme by co-optimizing drafting cost and drafting quality through causal parallel tree drafting.

JetSpec reaches up to 9.64× end-to-end speedup on MATH-500 and 4.58× on open-ended chat while remaining lossless. With CUDA graph and kernel optimizations, these gains translate to around 1000 tokens per second (TPS) on a single B200 GPU. ⚡️
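For readers new to the term, "lossless" in speculative decoding generally means the output matches what the target model would have produced on its own. Below is a minimal sketch of the generic greedy tree-verification rule, not JetSpec's actual code. The names `verify_greedy`, `toy_target`, `children`, and `tokens` are hypothetical, and the target model is replaced by a toy function. A real system would score all tree nodes in one batched forward pass instead of calling the target sequentially.

```python
from typing import Callable, Dict, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    children: Dict[int, List[int]],
    tokens: Dict[int, int],
    target_next: Callable[[Sequence[int]], int],
    root: int = 0,
) -> List[int]:
    """Return the tokens emitted by one verification step under greedy decoding.

    children[n] lists the child node ids of draft node n; tokens[n] is the
    token drafted at node n. The root is a placeholder for the current prefix.
    """
    out: List[int] = []
    node = root
    while True:
        # The token the target model itself would emit at this position.
        want = target_next(list(prefix) + out)
        # Emitting the target's own choice keeps the output identical to
        # plain greedy decoding, whether or not the draft guessed it.
        out.append(want)
        # Follow the draft branch that guessed correctly, if there is one.
        nxt = next((c for c in children.get(node, []) if tokens[c] == want), None)
        if nxt is None:
            return out
        node = nxt


def toy_target(seq: Sequence[int]) -> int:
    """Hypothetical stand-in for a target model: next token = last token + 1 (mod 10)."""
    return (seq[-1] + 1) % 10


# Hypothetical draft tree: the root has two candidate branches.
children = {0: [1, 2], 1: [3, 4], 3: [5]}
tokens = {1: 2, 2: 5, 3: 3, 4: 7, 5: 9}

accepted = verify_greedy([1], children, tokens, toy_target)
print(accepted)
```

Here two drafted tokens are accepted and the target contributes one more token at the first mismatch, so a single verification step yields three tokens that plain greedy decoding would also have produced.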

Prior speculative decoding (SD) work faces a dilemma:

  1. AR-style draft heads preserve causality for quality, but drafting cost grows with tree depth.
  2. Block-diffusion style heads draft cheaply in one pass, but branches are often scored independently, so deeper paths can become mutually inconsistent.
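The cost side of this dilemma can be illustrated with a back-of-the-envelope model. This is a simplified sketch under stated assumptions, not a formula from the JetSpec authors: one verification pass is assumed to cost the same as one ordinary target decoding step, and each draft pass costs a fixed fraction of it. The function name `estimated_speedup` and all numbers below are hypothetical.

```python
def estimated_speedup(
    tokens_per_step: float,
    draft_passes: int,
    draft_cost_ratio: float,
) -> float:
    """Rough speedup of speculative decoding over plain autoregressive decoding.

    tokens_per_step:  mean tokens emitted per verification step (>= 1).
    draft_passes:     sequential draft forward passes per step.
    draft_cost_ratio: cost of one draft pass relative to one target pass.
    """
    if tokens_per_step < 1:
        raise ValueError("each verification step emits at least one token")
    if draft_passes < 0 or draft_cost_ratio < 0:
        raise ValueError("draft_passes and draft_cost_ratio must be non-negative")
    # One step costs one target pass plus the draft passes, and yields
    # tokens_per_step tokens; the baseline yields one token per target pass.
    return tokens_per_step / (1.0 + draft_passes * draft_cost_ratio)


# Hypothetical numbers: identical acceptance, different drafting cost.
ar_style = estimated_speedup(tokens_per_step=6.0, draft_passes=8, draft_cost_ratio=0.05)
one_pass = estimated_speedup(tokens_per_step=6.0, draft_passes=1, draft_cost_ratio=0.05)
print(f"AR-style draft, depth 8: {ar_style:.2f}x")
print(f"single-pass draft:       {one_pass:.2f}x")
```

Under this model, a single-pass drafter only wins if it keeps `tokens_per_step` close to what a causal drafter achieves, which is the quality side of the dilemma.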

JetSpec achieves this speed by drafting a causality-preserving tree in a single pass. 🚀🌳
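To make "causality-preserving tree" more concrete, here is a minimal sketch of the generic tree attention mask used in tree-based speculative decoding, where each draft node may attend only to itself and its ancestors. It is an illustration under assumptions, not JetSpec's implementation: the post does not describe its mask, and the function name `tree_attention_mask` and the `parent` array layout are hypothetical.

```python
from typing import List


def tree_attention_mask(parent: List[int]) -> List[List[bool]]:
    """Build mask[i][j], True when draft node i may attend to draft node j.

    parent[i] is the index of node i's parent, or -1 when node i hangs
    directly off the committed prefix. Parents must come before children.
    """
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        p = parent[i]
        if p >= i:
            raise ValueError("parents must be listed before their children")
        mask[i][i] = True  # every node sees itself
        if p >= 0:
            # A node inherits everything its parent can see, which is
            # exactly the root-to-parent path.
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return mask


# Hypothetical tree: nodes 0 and 1 are first-level candidates,
# nodes 2 and 3 extend node 0, and node 4 extends node 2.
mask = tree_attention_mask([-1, -1, 0, 0, 2])
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

With such a mask, all nodes can be processed in one forward pass while sibling branches stay invisible to each other, so every root-to-leaf path is scored as if it had been generated left to right.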

Check out our project page for demos and how we built it 👇
https://jetspec-project.github.io/jetspec-web/

💻 Code: https://github.com/hao-ai-lab/JetSpec
🌟 Blog: https://haoailab.com/blogs/parallel-tree-decoding/

JetSpec vs. DFlash and AR baselines.

JetSpec with an inference engine, rendering at around 1000 TPS on average.

End-to-end speedup comparisons.

submitted by /u/No_Yogurtcloset_7050