Hugging Face Daily Papers · May 26, 2026 · 4 min read

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.25052

Published on May 24 · Submitted by Yoav Gur Arieh on May 26 · Tel Aviv University

Authors: Yoav Gur-Arieh, Ana Marasović, Mor Geva

Abstract

AI-generated summary: Researchers created a benchmark of 3,066 labeled chain-of-thought examples across 13 tasks and 10 models to systematically evaluate faithfulness metrics. Most metrics perform near chance and show significant limitations in reliability and efficiency.

Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they actually measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain because internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies such as plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted.

We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and by developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics.

Our experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level, while another reaches 0.59 at the step level; neither transfers across settings, and both entail prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
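To make the AUROC figures in the abstract concrete, here is a minimal sketch of how a faithfulness metric's scores could be compared against binary ground-truth labels. It is not taken from the BonaFide code: the function name `auroc`, the convention that label 1 means "faithful", and the toy scores are all assumptions made for illustration. An AUROC of 0.5 corresponds to chance, which is the baseline the reported 0.70 and 0.59 should be read against.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a randomly chosen positive example receives a higher
    score than a randomly chosen negative one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT drawn from the benchmark:
# 1 = CoT labeled faithful, 0 = CoT labeled unfaithful (assumed convention).
labels = [1, 1, 1, 0, 0, 0]
informative_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # mostly ranks faithful CoTs higher
constant_scores = [0.5] * 6                          # carries no information

print(auroc(labels, informative_scores))  # 8 of 9 positive/negative pairs ordered correctly
print(auroc(labels, constant_scores))     # all pairs tied, so 0.5 (chance)
```

The pairwise formulation is quadratic in the number of examples, which is fine for a sketch; a rank-based implementation would be the usual choice for larger evaluations.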

Community

yoavgur

Paper submitter about 3 hours ago

A benchmark for evaluating faithfulness metrics with ground truth.

📄 https://arxiv.org/pdf/2605.25052
💻 https://github.com/yoavgur/BonaFide/tree/main
🤗 https://huggingface.co/collections/yoavgurarieh/bonafide

Get this paper in your agent:

hf papers read 2605.25052

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
