Hugging Face Daily Papers · 6 min read

Leveraging Morphology for Historical Script Metrological Analysis

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Malamatenia Vlachou Efstathiou, Raphaël Baena, Dominique Stutzmann, Mathieu Aubry
Organization: Ecole des Ponts ParisTech
Project page: https://malamatenia.github.io/morphology4metrology-analysis/
Code: https://github.com/raphael-baena/morphology4metrology
arxiv:2606.09446

Published on Jun 8 · Submitted by Raphael on Jun 12

Abstract

A transformer-based architecture with prototype learning enables scalable paleographic measurements from historical documents using only line-level transcriptions, demonstrating its effectiveness on a 160-page codex with minimal training data requirements.

Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning.

Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages.

Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.
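To make the idea of measurements on characters, bi-grams, and spaces more concrete, here is a minimal sketch of how such quantities could be derived once character bounding boxes are available. It is not the authors' implementation: the abstract does not define the measurements, so the `CharBox` type, the two helper functions, and the toy coordinates are all hypothetical, and the horizontal gap between consecutive boxes is only an assumed proxy for spacing between graphical units.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CharBox:
    """Hypothetical predicted box for one character on a text line (pixels)."""
    char: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min


def mean_size_per_char(boxes):
    """Average (width, height) of each character class on a line."""
    grouped = defaultdict(list)
    for box in boxes:
        grouped[box.char].append(box)
    return {
        char: (mean(b.width for b in group), mean(b.height for b in group))
        for char, group in grouped.items()
    }


def mean_gap_per_bigram(boxes):
    """Average horizontal gap between consecutive boxes, keyed by bi-gram.

    A negative gap means the two boxes overlap horizontally.
    """
    ordered = sorted(boxes, key=lambda b: b.x_min)
    gaps = defaultdict(list)
    for left, right in zip(ordered, ordered[1:]):
        gaps[left.char + right.char].append(right.x_min - left.x_max)
    return {bigram: mean(values) for bigram, values in gaps.items()}


# Toy line "dede" with made-up coordinates (assumption, not real data).
line = [
    CharBox("d", 0, 0, 10, 20),
    CharBox("e", 12, 6, 20, 20),
    CharBox("d", 24, 0, 36, 20),
    CharBox("e", 37, 6, 45, 20),
]
sizes = mean_size_per_char(line)
gaps = mean_gap_per_bigram(line)
```

Aggregating such per-line statistics by page would be one plausible way to compare graphical profiles across the hands of a manuscript, under the assumptions stated above.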

Get this paper in your agent:

hf papers read 2606.09446
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1
