WorldKV: Efficient World Memory with World Retrieval and Compression
Authors: Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim
Organization: KAIST AI
Published: 2026-05-21 (arXiv: 2605.22718)
Abstract
WorldKV enables persistent world generation in video diffusion models by retrieving and compressing key-value cache chunks to maintain consistency while improving throughput.
AI-generated summary
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
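The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`compress_chunk`, `retrieve_chunks`), the `(x, y, yaw)` pose format, the pose distance, and the rule "prune the tokens whose keys are most similar to an anchor key" are all assumptions chosen for illustration, and plain Python lists stand in for real KV tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two key vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Hypothetical World Compression step.

    Scores each token by its highest key-key similarity to any anchor-frame
    key, then keeps the least redundant `keep_ratio` fraction of tokens.
    keep_ratio=0.5 mirrors the "halving per-chunk storage" in the abstract.
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in keys]
    n_keep = max(1, int(len(keys) * keep_ratio))
    least_redundant = sorted(range(len(keys)), key=lambda i: redundancy[i])
    kept = sorted(least_redundant[:n_keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept]


def retrieve_chunks(store, query_pose, top_k=2):
    """Hypothetical World Retrieval step.

    `store` is a list of dicts with a 'pose' (x, y, yaw) recorded when the
    chunk was evicted. Chunks are ranked by an assumed pose distance
    (planar distance plus wrapped yaw difference) to the current camera.
    """
    def distance(p, q):
        dyaw = abs((p[2] - q[2] + math.pi) % (2 * math.pi) - math.pi)
        return math.hypot(p[0] - q[0], p[1] - q[1]) + dyaw

    return sorted(store, key=lambda c: distance(c["pose"], query_pose))[:top_k]


# Toy usage: four tokens with 2-D keys and a single anchor key.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
values = ["a", "b", "c", "d"]
kept_keys, kept_values = compress_chunk(keys, values, anchor_keys=[[1.0, 0.0]])

# Toy usage: three evicted chunks, queried from the origin.
store = [
    {"id": 0, "pose": (0.0, 0.0, 0.0)},
    {"id": 1, "pose": (10.0, 0.0, 0.0)},
    {"id": 2, "pose": (0.5, 0.0, 0.1)},
]
retrieved_ids = [c["id"] for c in retrieve_chunks(store, (0.0, 0.0, 0.0))]
```

In a real pipeline the retrieved chunks would be concatenated with the sliding-window keys and values before attention. The abstract states only that they are inserted back into the native attention window without re-encoding, so that step is left out here.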