0.99) analogous to those observed in LLMs.","html":"<p>Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute – following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.</p>\n","updatedAt":"2026-06-01T16:06:56.615Z","author":{"_id":"6471c12a0c2b5fdaf1f07c45","avatarUrl":"/avatars/6783067212d24d1e716a8b4c64df61b4.svg","fullname":"Stefan Baumann","name":"stefan-baumann","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":11,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.9080444574356079},"editors":["stefan-baumann"],"editorAvatarUrls":["/avatars/6783067212d24d1e716a8b4c64df61b4.svg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.31535","authors":[{"_id":"6a1dacad808ddbc3c7d43990","name":"Ulrich Prestel","hidden":false},{"_id":"6a1dacad808ddbc3c7d43991","name":"Stefan Andreas Baumann","hidden":false},{"_id":"6a1dacad808ddbc3c7d43992","name":"Nick Stracke","hidden":false},{"_id":"6a1dacad808ddbc3c7d43993","name":"Björn Ommer","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/6471c12a0c2b5fdaf1f07c45/9nMOeRcYQHxcFrsn3tJxg.png","https://cdn-uploads.huggingface.co/production/uploads/6471c12a0c2b5fdaf1f07c45/bY_tUsDfLoKqPoknAK4xx.png","https://cdn-uploads.huggingface.co/production/uploads/6471c12a0c2b5fdaf1f07c45/o_8-QB7U8XKX46YE7J7ck.png","https://cdn-uploads.huggingface.co/production/uploads/6471c12a0c2b5fdaf1f07c45/Uxemb78UY1uq1-ltPXEj7.png"],"publishedAt":"2026-05-29T00:00:00.000Z","submittedOnDailyAt":"2026-06-01T00:00:00.000Z","title":"RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video","submittedOnDailyBy":{"_id":"6471c12a0c2b5fdaf1f07c45","avatarUrl":"/avatars/6783067212d24d1e716a8b4c64df61b4.svg","isPro":false,"fullname":"Stefan Baumann","user":"stefan-baumann","type":"user","name":"stefan-baumann"},"summary":"Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder","upvotes":3,"discussionId":"6a1dacad808ddbc3c7d43994","projectPage":"https://compvis.github.io/rayder/","githubRepo":"https://github.com/CompVis/rayder","githubRepoAddedBy":"user","ai_summary":"RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.","ai_keywords":["novel view synthesis","self-supervised learning","feed-forward transformer","camera estimation","scene reconstruction","rendering","dynamic state","power-law scaling","zero-shot open-set performance"],"githubStars":1,"organization":{"_id":"62cfeeb73c54a34d508b82a9","name":"CompVis","fullname":"CompVis","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/1657794102363-5e3aec01f55e2b62848a5217.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"6471c12a0c2b5fdaf1f07c45","avatarUrl":"/avatars/6783067212d24d1e716a8b4c64df61b4.svg","isPro":false,"fullname":"Stefan Baumann","user":"stefan-baumann","type":"user"},{"_id":"6425e56785f26ab94af19797","avatarUrl":"/avatars/555fafac9b555da7be1a112fac9a0cbf.svg","isPro":false,"fullname":"Ulrich Prestel","user":"upr","type":"user"},{"_id":"63c1699e40a26dd2db32400d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/63c1699e40a26dd2db32400d/3N0-Zp8igv8-52mXAdiiq.jpeg","isPro":false,"fullname":"Chroma","user":"Chroma111","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"62cfeeb73c54a34d508b82a9","name":"CompVis","fullname":"CompVis","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/1657794102363-5e3aec01f55e2b62848a5217.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.31535.md"}">
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
Abstract
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.
AI-generated summary
Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
Community
Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute – following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.
Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images
Cite arxiv.org/abs/2605.31535 in a dataset README.md to link it from this page.
Cite arxiv.org/abs/2605.31535 in a Space README.md to link it from this page.
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.