Geo-Align: Video Generation Alignment via Metric Geometry Reward
Authors: Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He
Published: 2026-05-22 (arXiv 2605.23903)
Project page: https://lizizun.github.io/geo-align-page/
Code: https://github.com/LiZizun/GeoAlign
Abstract
AI-generated summary: Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction.
Camera-controlled video generation has progressed rapidly in recent years. However, existing video-to-video re-rendering methods rely primarily on supervised fine-tuning (SFT) with synthetic datasets. Synchronized, multi-view real-world video data is currently extremely scarce. Consequently, the prevailing paradigm often generalizes poorly to out-of-distribution real-world videos, and models struggle to adhere accurately to physical scale and to the target camera trajectory. To bridge this gap, we propose Geo-Align, the first reinforcement learning (RL) framework specifically designed for camera-controlled video re-rendering. Starting from a pretrained model, we optimize it with a scale-aware perceptual reward mechanism. Specifically, we introduce a metric 3D estimator to extract precise camera trajectories from the generated videos, and we explicitly penalize deviations in rotation and translation. Furthermore, we design a data pipeline that combines real-world conditioning videos with target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both camera controllability and visual fidelity.
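The abstract does not give the exact form of the reward, so the following is only a rough sketch of what "penalizing deviations in rotation and translation" could look like. It assumes that a metric 3D estimator has already returned one pose (rotation matrix, translation in meters) per frame for the generated video, and that the target trajectory is given in the same frame and units. The function names, the weights `w_rot` and `w_trans`, and the choice of a negated mean error are all hypothetical and are not taken from the paper.

```python
import math


def rot_z(angle_rad):
    """Build a 3x3 rotation about the z-axis (helper for the example only)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise (Frobenius) inner product.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def translation_error(t_a, t_b):
    """Euclidean distance between two camera positions (metric units)."""
    return math.dist(t_a, t_b)


def trajectory_reward(est_poses, target_poses, w_rot=1.0, w_trans=1.0):
    """Hypothetical reward: negated weighted mean pose error over frames.

    Each pose is a (R, t) pair: R is a 3x3 nested list, t a length-3 list.
    A perfect match yields 0; larger deviations yield more negative values.
    """
    if not est_poses or len(est_poses) != len(target_poses):
        raise ValueError("trajectories must be non-empty and equally long")
    n = len(est_poses)
    mean_rot = sum(
        rotation_angle(Re, Rt) for (Re, _), (Rt, _) in zip(est_poses, target_poses)
    ) / n
    mean_trans = sum(
        translation_error(te, tt) for (_, te), (_, tt) in zip(est_poses, target_poses)
    ) / n
    return -(w_rot * mean_rot + w_trans * mean_trans)
```

Because the translation term is measured in metric units rather than up to an unknown scale, a generated video that moves the camera in the right direction but by the wrong distance is still penalized under this sketch, which is one plausible reading of "scale-aware".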