StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration
Authors: Linrui Tian, Qi Wang, Bang Zhang
Published: 2026-05-25
Project page: https://humanaigc.github.io/StreamChar_page/
Abstract
AI-generated summary
StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage distillation and maintaining visual consistency through memory mechanisms.
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
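The division of labor described in the abstract can be illustrated with a toy control loop. This is a minimal sketch, not the authors' implementation: every name in it (`orchestrate`, `denoise_chunk`, `StreamState`, `playback_budget_seconds`) is hypothetical, the "orchestrator" and "DiT" are stubs that only manipulate strings, and the pointer is a plain word counter rather than the progress-aware pointer the paper uses during rollout training. It shows only the data flow: a long-horizon state (pointer, sink chunk, history) lives outside the per-chunk generator, and the first chunk is retained as a persistent anchor.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Chunk:
    index: int
    text: str          # transcript words this chunk is meant to speak
    frames: List[str]  # placeholder labels standing in for real video frames


@dataclass
class StreamState:
    pointer: int = 0                 # transcript words consumed so far (assumed bookkeeping)
    sink: Optional[Chunk] = None     # first chunk, kept as a persistent visual anchor
    history: List[Chunk] = field(default_factory=list)


def orchestrate(words: List[str], state: StreamState, words_per_chunk: int) -> List[str]:
    """Stub for the LLM-based orchestrator: choose the next transcript span."""
    return words[state.pointer:state.pointer + words_per_chunk]


def denoise_chunk(index: int, span: List[str], state: StreamState,
                  frames_per_chunk: int) -> Chunk:
    """Stub for the joint audio-video DiT: emit labelled placeholder frames.

    Each label records which chunk served as the anchor (sink) and which
    previous chunk supplied motion context.
    """
    anchor = state.sink.index if state.sink is not None else index
    motion = state.history[-1].index if state.history else None
    frames = [f"chunk{index}-frame{i}-anchor{anchor}-motion{motion}"
              for i in range(frames_per_chunk)]
    return Chunk(index=index, text=" ".join(span), frames=frames)


def stream(transcript: str, words_per_chunk: int = 3,
           frames_per_chunk: int = 4) -> Iterator[Chunk]:
    """Generate chunks one at a time until the transcript is exhausted."""
    words = transcript.split()
    state = StreamState()
    while state.pointer < len(words):
        span = orchestrate(words, state, words_per_chunk)
        chunk = denoise_chunk(len(state.history), span, state, frames_per_chunk)
        if state.sink is None:
            state.sink = chunk       # the first chunk becomes the sink
        state.history.append(chunk)
        state.pointer += len(span)   # advance only by what was actually consumed
        yield chunk


def playback_budget_seconds(frames_per_chunk: int, fps: float) -> float:
    """Wall-clock time a chunk occupies at playback; generation must not exceed it."""
    return frames_per_chunk / fps


if __name__ == "__main__":
    for c in stream("hello there this is a streaming test"):
        print(c.index, repr(c.text), c.frames[0])
```

The budget helper makes the real-time constraint concrete: a chunk of `frames_per_chunk` frames played at `fps` frames per second must be produced in no more than `frames_per_chunk / fps` seconds, otherwise playback stalls. The chunk sizes and frame rate used here are illustrative values, not figures from the paper.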
Community
Audio-video multimodal generation with real-time interactive experience on a single GPU.
Cite arxiv.org/abs/2605.25659 in a model, dataset, or Space README.md to link it from this page.