Hugging Face Daily Papers · May 25, 2026 · 5 min read

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Camera-controlled video generation has made remarkable progress in recent years. However, existing video-to-video re-rendering methods rely primarily on supervised fine-tuning (SFT) on synthetic datasets, since synchronized, multi-view real-world video data remain extremely scarce. As a result, these models often generalize poorly to out-of-distribution real-world videos and struggle to follow the physical scale and camera trajectory they are given. To bridge this gap, we propose Geo-Align, the first reinforcement learning (RL) framework designed specifically for camera-controlled video re-rendering. Starting from a pretrained model, we optimize it with a scale-aware perceptual reward: a metric 3D estimator extracts camera trajectories from the generated videos, and deviations in rotation and translation from the target trajectory are explicitly penalized. We further design a data pipeline that pairs real-world conditioning videos with target camera trajectories derived from synthetic data, which removes the need for paired data. Extensive experiments show that Geo-Align consistently outperforms existing supervised baselines in both camera-control precision and visual fidelity.
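The abstract does not state the reward formula. As a minimal sketch, assuming the metric 3D estimator returns one camera pose per frame (a 3x3 rotation matrix plus a translation in metres, both relative to the first frame), a reward that penalizes rotation and translation deviations could look like the code below. The names `trajectory_reward`, `w_rot` and `w_trans`, the per-frame averaging, and the exponential shaping are all hypothetical choices for illustration, not the paper's definition.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical pose format: (3x3 rotation matrix as nested lists,
# translation 3-vector in metres), both relative to the first frame.
Pose = Tuple[List[List[float]], List[float]]


def rotation_error(r_a: List[List[float]], r_b: List[List[float]]) -> float:
    """Geodesic angle (radians) between two rotation matrices.

    angle = arccos((trace(R_a^T R_b) - 1) / 2), clamped for numerical safety.
    """
    # trace(R_a^T R_b) equals the sum of element-wise products of R_a and R_b.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def trajectory_errors(estimated: Sequence[Pose], target: Sequence[Pose]) -> Tuple[float, float]:
    """Mean per-frame rotation error (radians) and translation error (metres)."""
    if not target or len(estimated) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    rot_total = 0.0
    trans_total = 0.0
    for (r_est, t_est), (r_tgt, t_tgt) in zip(estimated, target):
        rot_total += rotation_error(r_est, r_tgt)
        # Translations are compared directly in metres, with no scale alignment.
        trans_total += math.dist(t_est, t_tgt)
    n = len(target)
    return rot_total / n, trans_total / n


def trajectory_reward(estimated: Sequence[Pose], target: Sequence[Pose],
                      w_rot: float = 1.0, w_trans: float = 1.0) -> float:
    """Reward in (0, 1]: 1.0 for a perfect match, decaying with weighted pose error."""
    rot_err, trans_err = trajectory_errors(estimated, target)
    return math.exp(-(w_rot * rot_err + w_trans * trans_err))


def yaw(theta: float) -> List[List[float]]:
    """Rotation about the y axis, used here only to build example poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Target: the camera pans 0.1 rad and moves 0.5 m along x per frame.
target = [(yaw(0.1 * k), [0.5 * k, 0.0, 0.0]) for k in range(4)]
# A rollout whose camera is offset by 1 m along z in every frame.
shifted = [(r, [t[0], t[1], t[2] + 1.0]) for r, t in target]

print(round(trajectory_reward(target, target), 4))   # → 1.0
print(round(trajectory_reward(shifted, target), 4))  # → 0.3679
```

The exponential keeps the reward bounded in (0, 1], which is convenient for RL; a plain negative weighted sum of the two errors would penalize the same deviations.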


arxiv:2605.23903


Published on May 22

Submitted by lizizun on May 25


Authors: Zizun Li, Haoyu Guo, Runzhe Teng, Chunhua Shen, Tong He

AI-generated summary

Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction.
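Why metric scale matters can be seen in a small numeric example. The sketch below is illustrative only: it assumes camera centres expressed relative to the first frame, and `normalize_scale` is a hypothetical max-norm rescaling that stands in for the scale alignment commonly applied to scale-ambiguous trajectory estimates. A generated camera path with the right shape that travels twice as far becomes indistinguishable from the target after normalization, but a metric comparison still penalizes it.

```python
import math
from typing import List, Sequence

Vec3 = List[float]


def mean_translation_error(est: Sequence[Vec3], tgt: Sequence[Vec3]) -> float:
    """Mean Euclidean distance between corresponding camera centres."""
    return sum(math.dist(a, b) for a, b in zip(est, tgt)) / len(tgt)


def normalize_scale(traj: Sequence[Vec3]) -> List[Vec3]:
    """Hypothetical normalization: rescale so the farthest camera centre is 1 unit from the origin."""
    scale = max(math.hypot(*p) for p in traj)
    if scale == 0.0:
        return [list(p) for p in traj]
    return [[c / scale for c in p] for p in traj]


# Target: the camera dollies 0.5 m per frame along x, starting at the origin.
target = [[0.5 * k, 0.0, 0.0] for k in range(5)]
# Generated video: same direction and shape, but the camera travels twice as far.
too_far = [[2.0 * c for c in p] for p in target]

metric_err = mean_translation_error(too_far, target)
normalized_err = mean_translation_error(normalize_scale(too_far), normalize_scale(target))
print(metric_err)      # → 1.0 (metres): the scale error is penalized
print(normalized_err)  # → 0.0: the scale error is invisible after normalization
```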

Links: arXiv page (arxiv.org/abs/2605.23903) · Project page (https://lizizun.github.io/geo-align-page/) · GitHub (https://github.com/LiZizun/GeoAlign, 3 stars)




Get this paper in your agent with the Hugging Face CLI:

hf papers read 2605.23903

If you don't have the latest CLI, install it with:

curl -LsSf https://hf.co/cli/install.sh | bash

Models, datasets, and Spaces citing this paper: 0. Collections including this paper: 0.


