UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating
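Authors: Jiehui Huang, Yuechen Zhang, Bin Xia, Jiahao Wang, Xu He, Zhenchao Tang, Meng Chu, Xin Tao, Pengfei Wan, Jiaya Jia
Published: 2026-06-19
Project page: https://jackailab.github.io/Projects/UnityShots/
Code: https://github.com/JIA-Lab-research/UnityShots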
Abstract
UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors.
Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
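The fixed-size memory and the cut-type conditioning described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: the abstract does not give the gate's functional form, so the sigmoid fusion in `boundary_gate`, the slow `ltm_rate` update, the weights, and the `CUT_TYPE_MODULATION` table are all hypothetical choices made for illustration. The only properties taken from the abstract are that there are two slots (LTM and STM), that both are updated at each cut by a gate driven by visual cut probability and a beat signal, that their size does not grow with the number of shots, and that a discrete cut type conditions the model through AdaLN.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat_strength: float,
                  w_cut: float = 4.0, w_beat: float = 2.0,
                  bias: float = -3.0) -> float:
    """Fuse a visual cut probability and a beat-tracker signal into a gate in (0, 1).

    The linear-plus-sigmoid form and the weights are illustrative assumptions.
    """
    return sigmoid(w_cut * cut_prob + w_beat * beat_strength + bias)


def blend(old: Vector, new: Vector, g: float) -> Vector:
    """Convex combination: g = 0 keeps `old`, g = 1 replaces it with `new`."""
    return [(1.0 - g) * o + g * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    ltm: Vector  # long-term slot, initialised from the opening shot
    stm: Vector  # short-term slot, tracks the tail of the preceding shot

    def update(self, shot_tail: Vector, gate: float, ltm_rate: float = 0.1) -> None:
        """Update both slots at a cut; slot sizes never change."""
        self.stm = blend(self.stm, shot_tail, gate)
        # Assumption: the LTM slot drifts much more slowly so it stays anchored.
        self.ltm = blend(self.ltm, shot_tail, ltm_rate * gate)


# Hypothetical cut-type table: each type maps to an AdaLN (scale, shift) pair.
# In a trained model these values would be learned, not hand-written.
CUT_TYPE_MODULATION: Dict[str, Tuple[float, float]] = {
    "hard_cut": (0.5, 0.1),
    "dissolve": (0.0, 0.0),
}


def adaln(x: Vector, cut_type: str, eps: float = 1e-5) -> Vector:
    """Layer-normalise x, then apply the cut type's scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale, shift = CUT_TYPE_MODULATION[cut_type]
    return [(v - mean) / math.sqrt(var + eps) * (1.0 + scale) + shift for v in x]


if __name__ == "__main__":
    memory = ShotMemory(ltm=[1.0, 0.0, 0.0], stm=[1.0, 0.0, 0.0])
    # Each entry: (tail feature of the finished shot, cut probability, beat strength).
    shots = [([0.8, 0.2, 0.0], 0.9, 0.7), ([0.1, 0.9, 0.3], 0.2, 0.1)]
    for tail, cut_prob, beat in shots:
        memory.update(tail, boundary_gate(cut_prob, beat))
    print(memory.ltm, memory.stm)
```

However many shots are processed, `memory.ltm` and `memory.stm` keep the same length, which is the contrast the abstract draws with memory banks that grow linearly. Swapping the `cut_type` string passed to `adaln` at inference time is one way to read the "control knob over transition strength".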
Community
Generating a coherent multi-shot video requires structured cross-shot memory: subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3. The video stream maintains two fixed-size slots — a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail — both updated by a boundary-conditioned gate at every cut. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six cultural regions and 13 languages. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots achieves performance comparable to strong open-source and closed-source baselines on cross-shot coherence and audio-video quality.
Cite arxiv.org/abs/2606.21661 in a model README.md to link it from this page.
Cite arxiv.org/abs/2606.21661 in a Space README.md to link it from this page.