Hugging Face Daily Papers · May 20, 2026 · 3 min read

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

DocAtlas is a framework for constructing high-fidelity multilingual OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, using differential rendering to produce model-free structural annotations from native documents. Evaluating 16 models reveals persistent gaps in low-resource scripts. DPO with rendering-derived ground truth achieves stable cross-lingual transfer (+1.9% in-domain, +1.8% out-of-domain) without base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%.

arxiv:2605.12623

Published on May 12 · Submitted by Ahmed Heakl on May 20

Mohamed Bin Zayed University of Artificial Intelligence

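Authors: Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan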

Abstract

AI-generated summary: The DocAtlas framework creates high-fidelity OCR datasets across 82 languages using differential rendering and synthetic generation, and demonstrates improved multilingual model adaptation through Direct Preference Optimization.

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves by 1.7% over the strongest baseline.
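The preference setup described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says that rendering-derived ground truth serves as the positive signal, so the use of a model's own prediction as the rejected response, the names `PreferencePair`, `build_pair`, and `dpo_loss`, and the value `beta=0.1` are all hypothetical. The loss is the standard DPO objective computed on summed sequence log-probabilities.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str    # e.g. an identifier for the page image plus the OCR instruction
    chosen: str    # rendering-derived ground-truth annotation (positive signal)
    rejected: str  # ASSUMPTION: a model prediction that differs from the ground truth


def build_pair(prompt: str, ground_truth: str, model_output: str) -> Optional[PreferencePair]:
    """Return a preference pair, or None when the prediction already matches."""
    if model_output.strip() == ground_truth.strip():
        return None  # identical outputs carry no preference signal
    return PreferencePair(prompt=prompt, chosen=ground_truth, rejected=model_output)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # ASSUMPTION: illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage with placeholder strings (not real DocTag output).
pair = build_pair("page_001 + OCR instruction", "Title\nBody text", "Title\nBdy text")
loss = dpo_loss(-1.0, -5.0, -2.0, -4.0)
```

When the policy and the reference model assign the same log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that as the policy favors the ground-truth annotation more strongly than the reference does.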

Get this paper in your agent:

hf papers read 2605.12623

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.12623 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.12623 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.
