PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
Authors: Haojun Chen, Haoyang He, Chengming Xu, Qingdong He, Junwei Zhu, Yabiao Wang, Zhucun Xue, Xianfang Zeng, Zhennan Chen, Xiaobin Hu, Hao Zhao, Yong Liu, Jiangning Zhang, Dacheng Tao
Published: 2026-05-19
arXiv: 2605.20147
Project page: https://haojunchen663.github.io/projects/PixVerve/
Code: https://github.com/HaojunChen663/PixVerve-95K
AI-generated summary
A large-scale UHR image-text dataset and evaluation benchmark are introduced to advance ultra-high-resolution text-to-image generation capabilities.
Abstract
Text-to-Image (T2I) models have recently made notable progress at around 1K and 2K resolution. With the growing desire for a better visual experience and the rapid development of imaging technology, demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with at least 100M pixels, together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step in extending various T2I foundation models to native 100MP generation using three training schemes. Finally, we propose PixVerve-Bench, a benchmark that combines conventional metrics with multimodal large language model-based assessments to establish a comprehensive evaluation protocol for UHR images covering visual quality and semantic alignment. Extensive experimental results on our benchmark, together with our exploration of training strategies, provide valuable insights for future work.
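To give a sense of scale for the 100M-pixel floor mentioned above, here is a minimal Python sketch of the arithmetic. It is only an illustration: the function names are hypothetical, the 3-bytes-per-pixel figure assumes uncompressed 8-bit RGB, and none of this is taken from the PixVerve data pipeline, which is not described in this abstract.

```python
import math

# 100M-pixel floor stated in the abstract.
MIN_PIXELS = 100_000_000


def meets_uhr_threshold(width: int, height: int, min_pixels: int = MIN_PIXELS) -> bool:
    """Return True if a width x height image has at least min_pixels pixels."""
    return width * height >= min_pixels


def smallest_size_for_aspect(aspect_w: int, aspect_h: int,
                             min_pixels: int = MIN_PIXELS) -> tuple[int, int]:
    """Smallest (width, height) of the form (aspect_w * k, aspect_h * k)
    whose pixel count reaches min_pixels (hypothetical helper)."""
    unit = aspect_w * aspect_h
    # Smallest integer n with unit * n >= min_pixels (ceiling division).
    n = -(-min_pixels // unit)
    # Smallest integer k with k * k >= n.
    k = math.isqrt(n - 1) + 1
    return aspect_w * k, aspect_h * k


def raw_rgb_bytes(width: int, height: int) -> int:
    """Uncompressed size assuming 8-bit RGB, i.e. 3 bytes per pixel."""
    return width * height * 3


if __name__ == "__main__":
    for aspect in [(1, 1), (16, 9)]:
        w, h = smallest_size_for_aspect(*aspect)
        print(aspect, (w, h), w * h, raw_rgb_bytes(w, h))
```

Under these assumptions, a square image needs 10,000 pixels per side to reach the threshold, while an 8K frame (7680 x 4320, about 33M pixels) falls well short of it.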