Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods
Abstract
A large-scale synthetic dataset and specialized model architecture are introduced to address the challenges of artistic text recognition by improving data diversity and model flexibility for irregular text layouts.
WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.
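The abstract reports accuracy on WordArt-Bench without stating how it is computed. STR benchmarks commonly use word-level exact-match accuracy, often after lowercasing and dropping non-alphanumeric characters. The sketch below assumes that convention; the function names `normalize` and `word_accuracy` are hypothetical and are not taken from the WATER repository.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits (an assumed convention)."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


def word_accuracy(predictions, references):
    """Return the percentage of predictions that exactly match their reference
    after normalization. Assumes one predicted string per cropped text image."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

Under this assumed metric, a prediction counts as correct only if every remaining character matches, so a single misread glyph in a stylized word makes the whole word wrong.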
Community
It constructs WATER-S, a 2M-scale synthetic artistic text dataset, and proposes WATERec, a strong STR baseline supporting arbitrary-shaped inputs. It achieves 90.40% accuracy on WordArt-Bench, the first result exceeding 90%, surpassing both general-purpose and OCR-specialized VLMs by a large margin.
arXiv: arxiv.org/abs/2606.24484