SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
arXiv: 2605.12500 · Code: https://github.com/OpenSenseNova/SenseNova-U1
Abstract
AI-generated summary
Unified vision-language models treat understanding and generation as integrated processes rather than separate tasks, demonstrating strong performance across multiple multimodal capabilities including image synthesis and action reasoning.
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
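To make the abstract's "single underlying process" idea concrete, below is a minimal toy sketch of one causal decoder operating over a shared text-plus-image token vocabulary, so that captioning (image tokens in, text tokens out) and image synthesis (text tokens in, image tokens out) reduce to the same next-token loop. All names, sizes, and the tokenization scheme here (UnifiedDecoder, TEXT_VOCAB, IMAGE_VOCAB, continue_sequence) are invented for illustration; this is not the NEO-unify or MoT design described in the paper.

```python
# Toy sketch (assumption-laden): one decoder and one shared vocabulary for text
# tokens and discretized image tokens, so "understanding" (image -> text) and
# "generation" (text -> image) are the same next-token prediction loop.
# NOT the SenseNova-U1 / NEO-unify architecture; sizes and names are made up.
import torch
import torch.nn as nn

TEXT_VOCAB = 1000          # hypothetical text-token ids: 0 .. 999
IMAGE_VOCAB = 512          # hypothetical image-codebook ids: 1000 .. 1511
VOCAB = TEXT_VOCAB + IMAGE_VOCAB
D_MODEL, N_HEADS, N_LAYERS, MAX_LEN = 256, 4, 4, 128


class UnifiedDecoder(nn.Module):
    """A single causal transformer over the shared text+image token space."""

    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq, device=ids.device))
        # Causal mask: every position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(ids.device)
        return self.head(self.blocks(x, mask=mask))  # (batch, seq, VOCAB)


@torch.no_grad()
def continue_sequence(model: UnifiedDecoder, prompt: torch.Tensor, steps: int) -> torch.Tensor:
    """Greedy continuation: the same loop regardless of whether the prompt is
    image tokens (captioning / understanding) or text tokens (image generation)."""
    ids = prompt
    for _ in range(steps):
        next_id = model(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids


if __name__ == "__main__":
    model = UnifiedDecoder().eval()
    image_prompt = torch.randint(TEXT_VOCAB, VOCAB, (1, 16))   # "an image" as tokens
    text_prompt = torch.randint(0, TEXT_VOCAB, (1, 16))        # "a caption" as tokens
    print(continue_sequence(model, image_prompt, 8).shape)     # understanding-style call
    print(continue_sequence(model, text_prompt, 8).shape)      # generation-style call
```

The sketch only illustrates why one decoder and one token space can serve both directions; the actual SenseNova-U1 models follow the NEO-unify architecture on the dense (8B) and mixture-of-experts (30B-A3B) backbones described in the abstract.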
Community
🚀 SenseNova U1 is a new series of native multimodal models that unify multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 models think and act across language and vision natively.