Hugging Face Daily Papers · 4 min read

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Papers
arxiv:2605.12825

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Published on May 12
· Submitted by
Nguyen Van Chien
on May 14

Abstract

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity through shared KV caches and consensus mechanisms.

AI-generated summary

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
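To make the "exact consensus mechanism" concrete, here is a minimal sketch (not the authors' code) of a lossless draft-and-verify decoding loop in the spirit described above: a parallel head proposes a block of tokens, the autoregressive head checks them in one batched pass, and only the prefix that matches the AR head's own greedy choices is kept. The helper names `draft_block` and `ar_verify` are illustrative placeholders, not functions from the Orthrus repository.

```python
def decode(prompt_ids, draft_block, ar_verify, eos_id, k=8, max_len=256):
    """Greedy generation in parallel blocks with exact AR verification.

    draft_block(ids, k)     -> list of k proposed next tokens (parallel view)
    ar_verify(ids, proposal) -> the AR head's greedy next token after each
                                prefix of the proposal, computed in a single
                                batched forward pass over a shared KV cache

    Because the appended token is always the AR head's own choice given the
    accepted prefix, the output is token-for-token identical to plain
    autoregressive greedy decoding, however bad the proposals are.
    """
    ids = list(prompt_ids)
    while len(ids) < max_len:
        proposal = draft_block(ids, k)
        targets = ar_verify(ids, proposal)
        for p, t in zip(proposal, targets):
            ids.append(t)  # always emit the verified AR token
            if p != t or t == eos_id or len(ids) >= max_len:
                break  # first mismatch (or stop condition) ends the block
        if ids[-1] == eos_id:
            break
    return ids
```

Good proposals let many tokens land per verification pass (hence the reported speedup), while a single mismatch costs only the rest of that block; the worst case degrades to ordinary one-token-per-step decoding, never to wrong output.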

Community

Paper author · Paper submitter · about 23 hours ago (edited about 23 hours ago)

Fast, lossless LLM inference via dual-view diffusion decoding.
Code: https://github.com/chiennv2000/orthrus

The amount of research on dLLMs recently has been pretty inspiring. This is another example of that.

I think diffusion solves a lot of top-level issues with AR models, and I'm praying that it leads us to a better future for the industry.


Get this paper in your agent:

hf papers read 2605.12825
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.12825 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.12825 in a Space README.md to link it from this page.

Collections including this paper 2

