Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
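Authors: Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu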
AI-generated summary
Lumos-Nexus is a training-efficient video generation framework that uses a two-stage approach with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency Bridging.
Abstract
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, which limits achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that supports strong reasoning-driven generation while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block, so that it learns to accept reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
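The abstract does not specify how UPFB splits the spectrum or schedules the hand-off, so the following is only a toy illustration of the general idea of a coarse-to-fine frequency hand-off between two signals that live in the same space. It is not the authors' method. The names `blend_by_frequency` and `handoff_schedule`, the linear schedule, and the use of 1-D sequences in place of video latents are all assumptions made for the sketch.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a short sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def blend_by_frequency(coarse, fine, keep_fraction):
    """Take the lowest `keep_fraction` of the frequency bands from `coarse`
    and the remaining, higher bands from `fine` (hypothetical helper)."""
    n = len(coarse)
    if len(fine) != n:
        raise ValueError("both signals must share the same length")
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    coarse_spec, fine_spec = dft(coarse), dft(fine)
    # Bands run from 0 (DC) to n // 2 (Nyquist); the cutoff counts bands kept.
    cutoff = keep_fraction * (n // 2 + 1)
    blended = []
    for k in range(n):
        band = min(k, n - k)  # bins k and n - k belong to the same band
        blended.append(coarse_spec[k] if band < cutoff else fine_spec[k])
    # The mask is symmetric, so real inputs give a real result.
    return [value.real for value in idft(blended)]


def handoff_schedule(num_steps):
    """Assumed linear schedule: the coarse share shrinks from 1.0 to 0.0."""
    if num_steps < 2:
        raise ValueError("need at least two steps")
    return [1.0 - step / (num_steps - 1) for step in range(num_steps)]


# Stand-ins for the lightweight and the high-capacity generator outputs.
coarse = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
fine = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
stages = [blend_by_frequency(coarse, fine, share) for share in handoff_schedule(3)]
```

In this toy, the first stage reproduces `coarse` and the last reproduces `fine`. In a real pipeline the two generators would not be independent: the high-capacity generator would presumably refine the latent it receives, so the coarse structure would carry through rather than being replaced.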
Community
Lumos-Nexus is a training-efficient unified video generation framework that uses Unified Progressive Frequency Bridging to enhance visual fidelity while maintaining reasoning-driven semantic control, evaluated via the new VR-Bench.
GitHub: https://github.com/alibaba-damo-academy/Lumos-Custom
Cite arxiv.org/abs/2605.31603 in a model, dataset, or Space README.md to link it from this page.