From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Computer Science > Computation and Language
Title:From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
Abstract:Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate's normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.
| Comments: | Preprint. This manuscript has not yet been peer reviewed |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.01136 [cs.CL] |
| (or arXiv:2606.01136v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2606.01136
arXiv-issued DOI via DataCite (pending registration)
|
Access Paper:
- View PDF
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — NLP / Computation & Language
-
DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
Jun 2
-
Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval
Jun 2
-
AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
Jun 2
-
CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards
Jun 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.