Hugging Face Daily Papers · 6 min read

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Wan-Streamer v0.1 is a native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds on video in real time, at 25 fps with ~200 ms model-side latency, all within a single Transformer.

arxiv:2606.25041

Published on Jun 23, 2026 · Submitted by Lianghua Huang on Jun 25, 2026
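Authors: Lianghua Huang, Zhifan Wu, Wei Wang, Yupeng Shi, Mengyang Feng, Junjie He, Chenwei Xie, Yu Liu, Jingren Zhou, Ang Wang, Bang Zhang, Baole Ai, Chen Liang, Cheng Yu, Chongyang Zhong, Jinwei Qi, Kai Zhu, Pandeng Li, Peng Zhang, Wenyuan Zhang, Xinhua Cheng, Yitong Huang, Yun Zheng, Zoubin Bi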

Abstract

Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities.

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
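The abstract names block-causal attention and a 160 ms streaming unit at 25 fps without spelling out either. The sketch below is only an illustration under the usual reading of "block-causal": a token may attend to every token in its own block and in earlier blocks, never to later ones. The block sizes, the helper names `frames_per_unit` and `block_causal_mask`, and the three-block layout are assumptions made for this example; the paper's actual token layout and scheduling are not described on this page.

```python
def frames_per_unit(unit_ms: int, fps: int) -> int:
    """Number of video frames covered by one streaming unit."""
    frame_ms = 1000 / fps  # 40 ms per frame at 25 fps
    return round(unit_ms / frame_ms)


def block_causal_mask(block_sizes: list[int]) -> list[list[bool]]:
    """Build a block-causal mask; mask[i][j] is True if token i may attend to token j.

    Attention is unrestricted inside a block and causal across blocks:
    a token sees its own block and all earlier blocks, never later ones.
    """
    # Map every token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# 160 ms streaming unit at 25 fps, both figures taken from the abstract.
frames = frames_per_unit(160, 25)
print(frames)  # 4

# Hypothetical layout: three blocks of two tokens each.
mask = block_causal_mask([2, 2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With this toy layout the first two rows read `xx....`, the next two `xxxx..`, and the last two `xxxxxx`: each new block can be processed as soon as it arrives because it never depends on a block that has not been produced yet.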

Community

Paper submitter about 7 hours ago

Wan-Streamer v0.1 is a native-streaming, end-to-end model that listens, sees, thinks, speaks, and responds on video in real time — at 25 fps with ~200 ms model-side latency, all within a single Transformer.

[Figure: framework]

Comparison with other interactive models / systems:

[Image: latency comparison with other interactive models / systems]

The two groups measure different things. The top group consists of full end-to-end interaction loops: they perceive the user and produce a response, and only Wan-Streamer also outputs video. The bottom group consists of avatar / audio-visual renderers timed at the rendering stage only: their latency excludes the external language model, ASR, and TTS they depend on, so their true user-visible latency is higher than shown. Wan-Streamer is the only end-to-end model that outputs synchronized audio + video, and it does so in under 0.6 s. Numbers are the closest publicly reported figures and mix measurement boundaries; see the paper for exact definitions.
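The point about measurement boundaries can be made concrete with a little arithmetic. The sketch below uses the two figures quoted in the abstract (about 200 ms model-side, 350 ms bidirectional network) for the end-to-end case. The renderer-only numbers are made-up placeholders, not measurements of any system in the comparison, and the helper `total_latency_ms` is a name invented for this example.

```python
# Approximate figures quoted in the abstract.
MODEL_SIDE_MS = 200
NETWORK_ROUND_TRIP_MS = 350


def total_latency_ms(*stage_ms: float) -> float:
    """Sum the latency of every stage that falls inside the measurement boundary."""
    return sum(stage_ms)


# End-to-end boundary: the model plus the network round trip.
wan_total = total_latency_ms(MODEL_SIDE_MS, NETWORK_ROUND_TRIP_MS)
print(wan_total)        # 550
print(wan_total < 600)  # True, i.e. under 0.6 s

# Renderer-only boundary. ALL values below are illustrative placeholders,
# not reported numbers: they only show that a latency timed at the
# rendering stage grows once the external stages are counted.
render_only_ms = 100
external_ms = {"asr": 100, "language_model": 100, "tts": 100}
user_visible_ms = total_latency_ms(render_only_ms, *external_ms.values())
print(user_visible_ms > render_only_ms)  # True
```

Whatever the real per-stage values are, the user-visible latency of a cascaded system is the sum over every stage in the loop, so a figure that covers only the rendering stage is a lower bound.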

[Image: capability coverage across representative systems]

Capability coverage across representative systems. Full-duplex means the system keeps perceiving while it generates, understanding and responding at the same time. Wan-Streamer is the only model that perceives video, outputs synchronized video, runs full-duplex, is end-to-end, and responds within a second; every other system covers only part of this. "~" marks partial support or a figure that is not publicly disclosed.
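As a loose analogy for the full-duplex definition above, the toy sketch below runs a perceiving task and a generating task concurrently, so input chunks keep being consumed while output chunks are emitted. It is not how Wan-Streamer works internally (the abstract describes a single Transformer over interleaved tokens); the task names, the chunk counts, and the use of `asyncio` are all assumptions made purely for illustration.

```python
import asyncio


async def feed(incoming: asyncio.Queue, n_chunks: int) -> None:
    """Stand-in for a microphone/camera: push input chunks, then a None sentinel."""
    for i in range(n_chunks):
        await incoming.put(i)
        await asyncio.sleep(0)  # yield control to the other tasks
    await incoming.put(None)


async def perceive(incoming: asyncio.Queue, log: list) -> None:
    """Consume input chunks until the sentinel arrives."""
    while True:
        chunk = await incoming.get()
        if chunk is None:
            break
        log.append(("in", chunk))


async def generate(n_chunks: int, log: list) -> None:
    """Emit output chunks without waiting for perception to finish."""
    for i in range(n_chunks):
        log.append(("out", i))
        await asyncio.sleep(0)  # yield so perception can keep running


async def run_full_duplex(n_chunks: int = 3) -> list:
    incoming: asyncio.Queue = asyncio.Queue()
    log: list = []
    # All three tasks share one event loop and interleave their steps.
    await asyncio.gather(
        feed(incoming, n_chunks),
        perceive(incoming, log),
        generate(n_chunks, log),
    )
    return log


log = asyncio.run(run_full_duplex())
print(log)
```

In the resulting log the "in" and "out" events are interleaved rather than grouped into separate turns, which is the property the comment calls full-duplex. A half-duplex system would instead finish all perception before producing any output.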

Get this paper in your agent:

hf papers read 2606.25041
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
