Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Fast, lossless LLM inference via dual-view diffusion decoding.
Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Paper: https://arxiv.org/abs/2605.12825
Code: https://github.com/chiennv2000/orthrus
Abstract
AI-generated summary
Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity through shared KV caches and consensus mechanisms.
We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.
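To make the decoding flow concrete, below is a minimal sketch (plain Python, with hypothetical function names) of how a dual-view consensus loop could operate: the diffusion view drafts a block of tokens in parallel, and the autoregressive view checks the same positions, accepting the longest agreeing prefix so the final output matches pure autoregressive decoding. The paper's actual consensus mechanism, KV-cache sharing, and batching details are not reproduced here; `ar_next_token`, `diffusion_propose`, and the toy models are illustrative assumptions only.

```python
# Minimal sketch of a dual-view consensus decoding loop. All names here
# (ar_next_token, diffusion_propose, the toy models) are illustrative
# assumptions, not the paper's API. In Orthrus the AR check over a drafted
# block would be a single batched forward pass over the shared KV cache;
# here it is written as a plain loop for clarity.

from typing import Callable, List


def consensus_decode(
    prompt: List[int],
    ar_next_token: Callable[[List[int]], int],                 # AR view: next token given context
    diffusion_propose: Callable[[List[int], int], List[int]],  # diffusion view: draft a block in parallel
    block_size: int = 8,
    max_new_tokens: int = 64,
    eos_id: int = 0,
) -> List[int]:
    """Accept the longest prefix of each drafted block that the AR view agrees with,
    so the final sequence is identical to pure autoregressive decoding."""
    out = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        draft = diffusion_propose(out, block_size)   # parallel draft from the diffusion view
        for tok in draft:
            target = ar_next_token(out)              # what the AR view would emit at this position
            if tok != target:                        # first disagreement: keep the AR token, redraft
                out.append(target)
                generated += 1
                break
            out.append(tok)                          # agreement: the drafted token is accepted
            generated += 1
            if tok == eos_id or generated >= max_new_tokens:
                return out
    return out


if __name__ == "__main__":
    # Toy deterministic "models" over integer tokens, just to exercise the loop.
    def toy_ar(ctx: List[int]) -> int:
        return (ctx[-1] + 1) % 50

    def toy_diffusion(ctx: List[int], k: int) -> List[int]:
        block = [(ctx[-1] + i + 1) % 50 for i in range(k)]
        if k >= 3:
            block[2] = (block[2] + 1) % 50           # inject a disagreement to show partial acceptance
        return block

    print(consensus_decode([3], toy_ar, toy_diffusion, block_size=4, max_new_tokens=12))
```

In a scheme like this, the throughput gain depends on how many drafted tokens are accepted per verification step, while the acceptance test itself is what keeps the output lossless.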
Community
The amount of research on dLLMs recently has been pretty inspiring. This is another example of that.
I think diffusion solves a lot of top-level issues with AR models, and I'm praying that it leads us to a better future for the industry.