PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Computer Science > Computation and Language
Title:PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
Abstract:Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.
| Comments: | Preprint. This manuscript has not yet been peer reviewed |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.16881 [cs.CL] |
| (or arXiv:2605.16881v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.16881
arXiv-issued DOI via DataCite (pending registration)
|
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — NLP / Computation & Language
-
The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
May 20
-
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
May 20
-
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
May 20
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
May 20
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.