Hugging Face Daily Papers · June 1, 2026 · 4 min read

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Lumos-Nexus is a training-efficient unified video generation framework that uses Unified Progressive Frequency Bridging to enhance visual fidelity while maintaining reasoning-driven semantic control, evaluated via the new VR-Bench.

arxiv:2605.31603

Published on May 29 · Submitted by taesiri on Jun 1

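Authors: Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu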

Abstract

AI-generated summary: Lumos-Nexus is a training-efficient video generation framework that uses a two-stage approach with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency Bridging.

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
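The abstract does not spell out how UPFB operates, so the following is only a toy illustration of the general idea of frequency-based handoff between two generators that share a latent space. It is not the authors' method. Everything in it is an assumption made for illustration: the function names (`lowpass_mask`, `bridge_latents`, `cutoff_schedule`), the radial low-pass mask, the linear cutoff schedule, and the random arrays standing in for latents. The sketch keeps low spatial frequencies from a "lightweight" latent and takes the remaining bands from a "high-capacity" latent, with the cutoff shrinking over steps so the second source progressively contributes more.

```python
import numpy as np


def lowpass_mask(shape, cutoff):
    """Radial low-pass mask over the last two (spatial) axes.

    `cutoff` is in cycles per sample; frequencies with radius <= cutoff pass.
    """
    fy = np.fft.fftfreq(shape[-2])[:, None]
    fx = np.fft.fftfreq(shape[-1])[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    return (radius <= cutoff).astype(np.float64)


def bridge_latents(light, heavy, cutoff):
    """Blend two same-shaped latents in the frequency domain.

    Bands at or below `cutoff` come from `light`; all other bands come
    from `heavy`. This blending rule is hypothetical, chosen for clarity.
    """
    mask = lowpass_mask(light.shape, cutoff)
    light_f = np.fft.fft2(light, axes=(-2, -1))
    heavy_f = np.fft.fft2(heavy, axes=(-2, -1))
    mixed_f = mask * light_f + (1.0 - mask) * heavy_f
    return np.fft.ifft2(mixed_f, axes=(-2, -1)).real


def cutoff_schedule(num_steps, start=0.5, end=0.05):
    """Linearly shrinking cutoff (an assumed schedule, not from the paper)."""
    return np.linspace(start, end, num_steps)


# Random stand-ins for latents of shape (frames, channels, height, width).
rng = np.random.default_rng(0)
light = rng.standard_normal((4, 8, 16, 16))
heavy = rng.standard_normal((4, 8, 16, 16))

for step, cutoff in enumerate(cutoff_schedule(5)):
    mixed = bridge_latents(light, heavy, cutoff)
    # Fraction of frequency bins still taken from the lightweight latent.
    share = lowpass_mask(light.shape, cutoff).mean()
    print(f"step {step}: cutoff={cutoff:.3f}, light share={share:.2f}")
```

In a real sampler the two latents would come from the two generators at matching denoising steps rather than from a random number generator; the paper and its released code are the reference for the actual procedure.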

Community

taesiri (paper submitter), about 9 hours ago

GitHub: https://github.com/alibaba-damo-academy/Lumos-Custom

Get this paper in your agent:

hf papers read 2605.31603

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash