Latent Spatial Memory for Video World Models
Authors: Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang
Organization: Microsoft Research
Published: 2026-06-08
arXiv: 2606.09828
Project page: https://microsoft.github.io/LatentSpatialMemory/
Code: https://github.com/microsoft/LatentSpatialMemory
Abstract
Latent spatial memory for video world models stores 3D scene information directly in diffusion latent space, eliminating pixel-space reconstruction overhead and achieving faster generation with reduced memory usage.
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and a 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
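The two operations named in the abstract, depth-guided back-projection and latent-space warping, can be illustrated with a minimal NumPy sketch. This is not Mirage's implementation: the function names, the pinhole intrinsics `K` expressed at latent-grid resolution, the per-token depth map, and the nearest-depth z-buffer are all assumptions made here for illustration, since the abstract does not specify these details.

```python
import numpy as np

def backproject_latents(latents, depth, K, cam_to_world):
    """Lift an (H, W, C) latent grid to world-space points (hypothetical helper).

    Assumes one depth value per latent token and a 4x4 camera-to-world pose.
    """
    H, W, C = latents.shape
    # Token-centre coordinates on the latent grid.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)    # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    pts_world = (pts_h @ cam_to_world.T)[:, :3]
    return pts_world, latents.reshape(-1, C)

def warp_to_view(points, feats, K, world_to_cam, H, W):
    """Splat stored tokens into a target latent grid (hypothetical helper).

    Where several tokens land in one cell, the nearest one is kept.
    Returns the warped grid and a mask of cells that received a token.
    """
    C = feats.shape[1]
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (pts_h @ world_to_cam.T)[:, :3]
    z = pts_cam[:, 2]
    front = z > 1e-6                         # drop tokens behind the camera
    proj = pts_cam[front] @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    col = np.floor(uv[:, 0]).astype(int)
    row = np.floor(uv[:, 1]).astype(int)
    inside = (col >= 0) & (col < W) & (row >= 0) & (row < H)
    flat = row[inside] * W + col[inside]
    z_in = z[front][inside]
    f_in = feats[front][inside]
    # Sort by cell, then by depth, and keep the first (nearest) token per cell.
    order = np.lexsort((z_in, flat))
    cells, first = np.unique(flat[order], return_index=True)
    winners = order[first]
    out = np.zeros((H * W, C), dtype=feats.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[cells] = f_in[winners]
    mask[cells] = True
    return out.reshape(H, W, C), mask.reshape(H, W)
```

In this sketch the memory is just the pair `(points, feats)`; querying it for a new camera never decodes to pixels or re-encodes through a VAE. Cells where `mask` is false are disoccluded regions that a generative model would have to fill in.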
Community
Latent Spatial Memory stores persistent 3D scene content directly as latent tokens in video world models.