DocAtlas: Multilingual Document Understanding Across 80+ Languages
Authors: Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan
Organization: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence)
Published: 2026-05-12 (arXiv: 2605.12623)
Abstract
AI-generated summary: The DocAtlas framework creates high-fidelity OCR datasets across 82 languages using differential rendering and synthetic generation, and demonstrates improved multilingual model adaptation through Direct Preference Optimization.
Multilingual document understanding remains limited for low-resource languages because training data is scarce and model-based annotation pipelines perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format that encodes layout, text, and component types, without relying on learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
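For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is a generic illustration, not the authors' training code. The abstract states only that rendering-derived ground truth serves as the positive ("chosen") sample; treating a model-generated output as the "rejected" sample, the `beta` value, and the function name `dpo_loss` are assumptions made here for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities. "chosen" would be the
    rendering-derived ground truth; "rejected" is assumed (hypothetically)
    to be a model-generated output. beta=0.1 is an illustrative default.
    """
    # How much more the policy favors each answer than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference agree, the loss equals log 2; it falls as the policy assigns relatively more probability to the chosen sample than the reference does. Because the loss is anchored to a frozen reference model, updates are penalized for drifting far from it, which is one commonly cited reason DPO can be gentler on existing capabilities than plain supervised fine-tuning.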
Community
DocAtlas is a framework for constructing high-fidelity multilingual OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, using differential rendering to produce model-free structural annotations from native documents. Evaluating 16 models reveals persistent gaps in low-resource scripts; DPO with rendering-derived ground truth achieves stable cross-lingual transfer (+1.9% in-domain, +1.8% out-of-domain) without base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%.