Hugging Face Daily Papers · June 25, 2026 · 4 min read

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

The paper constructs WATER-S, a 2M-scale synthetic artistic text dataset, and proposes WATERec, a strong STR baseline supporting arbitrary-shaped inputs. It achieves 90.40% accuracy on WordArt-Bench, the first result exceeding 90%, surpassing both general-purpose and OCR-specialized VLMs by a large margin.

arxiv:2606.24484

Published on Jun 23

Submitted by Xingsong Ye on Jun 25

Fudan University

Authors: Xingsong Ye, Yongkun Du, Jiaxin Zhang, Haojie Zhang, Chong Sun, Chen Li, Jing Lyu, Zhineng Chen

Abstract

A large-scale synthetic dataset and specialized model architecture are introduced to address the challenges of artistic text recognition by improving data diversity and model flexibility for irregular text layouts.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, which is hundreds of times larger than existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.
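To make the contrast between fixed-template and arbitrary-shaped inputs concrete, here is a minimal sketch of how the two approaches might turn an image into a grid of patches for a visual encoder. It is not WATERec's actual encoder: the patch size, the 32x128 template, the token budget, and both function names are assumptions chosen only for illustration.

```python
import math

PATCH = 16  # hypothetical patch size, not taken from the paper


def fixed_template_grid(height, width, template=(32, 128), patch=PATCH):
    """Fixed-template STR: every image is resized to one template, so the
    patch grid is identical regardless of the input's original shape."""
    t_h, t_w = template
    return t_h // patch, t_w // patch


def arbitrary_shape_grid(height, width, patch=PATCH, max_tokens=256):
    """Aspect-preserving alternative (an assumed scheme): downscale only if
    needed to stay within max_tokens, then round each side to whole patches."""
    scale = min(1.0, math.sqrt(max_tokens * patch * patch / (height * width)))
    rows = max(1, round(height * scale / patch))
    cols = max(1, round(width * scale / patch))
    # Rounding can overshoot the token budget slightly; trim the longer side.
    while rows * cols > max_tokens:
        if rows >= cols:
            rows -= 1
        else:
            cols -= 1
    return rows, cols


if __name__ == "__main__":
    for h, w in [(64, 256), (512, 64), (320, 320)]:
        print((h, w), fixed_template_grid(h, w), arbitrary_shape_grid(h, w))
```

Under these assumptions, a tall 512x64 image keeps a tall 32x4 grid in the aspect-preserving scheme, whereas the fixed template squeezes it into the same 2x8 grid as every other input. That kind of distortion is plausibly what hurts vertical or curved WordArt layouts, and an autoregressive decoder can attend over a variable-length token sequence either way.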

Get this paper in your agent:

hf papers read 2606.24484

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 3
