Hugging Face Daily Papers · June 1, 2026 · 4 min read

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.31535

Published on May 29 · Submitted by Stefan Baumann on Jun 1 · CompVis

Authors: Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

Abstract

AI-generated summary: RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis, enabling stable training on real-world video through dynamic state absorption and demonstrating clean scaling behavior.

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder

arXiv: arxiv.org/abs/2605.31535 · GitHub: https://github.com/CompVis/rayder

Community

stefan-baumann

Paper submitter about 6 hours ago

Self-supervised novel view synthesis methods are fundamentally data-limited: they require static-scene training data, which is scarce. RayDer removes this bottleneck by enabling stable training on general, dynamic real-world video. By consolidating three separate networks into one unified transformer, introducing dynamic state prediction with dropout, and improving pose learning through autoregressive training, RayDer's performance scales predictably with data, model size, and compute – following power-law scaling relationships (R² > 0.99) analogous to those observed in LLMs.
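To make the power-law claim concrete, here is a minimal sketch of how such a fit and its R² are commonly computed: ordinary least squares on log-transformed values. This is an illustration only, not the authors' evaluation code. The function name `fit_power_law`, the choice of "loss versus compute" as the fitted quantity, and the data points are all assumptions; the numbers are synthetic placeholders, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares in log-log space.

    Returns (a, b, r_squared); r_squared is measured on the log-log fit.
    Hypothetical helper for illustration, not part of any RayDer release.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # A power law y = a * x**b is a straight line in log-log coordinates:
    # log y = log a + b * log x.
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n

    # Closed-form simple linear regression for slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    log_a = mean_y - b * mean_x

    # Coefficient of determination: 1 - residual variance / total variance.
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r_squared


# Synthetic placeholder data (NOT from the paper): an exact power law,
# used only to check that the fit recovers the known coefficients.
compute = [1e15, 1e16, 1e17, 1e18, 1e19]
loss = [2.0 * c ** -0.25 for c in compute]

a, b, r2 = fit_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.4f}")
```

With real measurements, an R² close to 1 on the log-log fit indicates that a single power law describes the trend well across the measured range; it does not by itself guarantee that the trend extrapolates beyond that range.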

Get this paper in your agent:

hf papers read 2605.31535

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1
