Hugging Face Daily Papers · May 18, 2026 · 9 min read

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.12882

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

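Published on May 13, 2026 · Submitted by Wang (zr-wang) on May 18, 2026

Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He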

AI-generated summary

CiteVQA introduces a benchmark for document vision-language models that evaluates both answer accuracy and correct citation of supporting evidence, revealing significant attribution hallucinations in current models.

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage. That is a serious risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer and evaluates both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, CiteVQA exposes a reliability gap that answer-only evaluations overlook and provides the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.


Community

zr-wang

Paper submitter about 17 hours ago

CiteVQA: Exposing "Attribution Hallucination" in Document Intelligence

While Multimodal Large Language Models (MLLMs) have made major strides in document understanding, current evaluations focus almost exclusively on final-answer accuracy. This "answer-only" approach masks a critical failure mode: models frequently output the correct answer while grounding it in entirely incorrect passages. In high-stakes domains like medicine, law, and finance, such unverifiable reasoning poses severe risks, because trust depends on being able to check the source. Element-level citation, the ability to pinpoint the exact paragraph, table, or image used to derive an answer, is therefore crucial. When models must provide precise bounding-box citations, a human reader can check every generated claim directly against the cited region of the document.

To address this evaluation blind spot, the authors introduce CiteVQA, a benchmark designed to test both answer accuracy and evidence faithfulness.

The Dataset: CiteVQA comprises 1,897 complex questions across 711 multi-page PDFs spanning seven domains. Unlike traditional benchmarks, it requires models to return element-level bounding-box citations alongside every answer.
A Rigorous New Metric: The paper introduces Strict Attributed Accuracy (SAA), an evaluation metric that credits a prediction only when both the textual answer and the cited visual region are correct.
Key Discovery: An audit of 20 leading MLLMs revealed a pervasive phenomenon termed "Attribution Hallucination": models frequently get the answer right but fail to cite the correct source. The strongest open-source MLLM achieved an SAA of just 22.5, exposing a large reliability gap that traditional answer-only metrics overlook.
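
To make the gap between answer-only accuracy and SAA concrete, here is a minimal Python sketch of a strict joint score. The post only says that both the answer and the cited region must be correct. The IoU threshold of 0.5, the rule that every ground-truth box must be matched, and all names (`Box`, `citation_correct`, `evaluate`) are assumptions made for illustration, not the benchmark's actual scoring code. Scores here are fractions in [0, 1], whereas the paper reports values such as 22.5 on a 0 to 100 scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union; boxes on different pages never overlap."""
    if a.page != b.page:
        return 0.0
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def citation_correct(pred: list[Box], gold: list[Box], thr: float = 0.5) -> bool:
    # Assumption: every gold box must be covered by some predicted box.
    return bool(gold) and all(any(iou(p, g) >= thr for p in pred) for g in gold)


def evaluate(records: list[tuple[bool, list[Box], list[Box]]]) -> dict[str, float]:
    """Each record is (answer_is_correct, predicted_boxes, gold_boxes)."""
    n = len(records)
    answer_hits = sum(1 for ok, _, _ in records if ok)
    strict_hits = sum(
        1 for ok, pred, gold in records if ok and citation_correct(pred, gold)
    )
    return {
        "answer_accuracy": answer_hits / n,
        "saa": strict_hits / n,
        # Share of correct answers that cite the wrong region.
        "attribution_hallucination_rate": (
            (answer_hits - strict_hits) / answer_hits if answer_hits else 0.0
        ),
    }


gold = [Box(1, 0, 0, 10, 10)]
records = [
    (True, [Box(1, 0, 0, 10, 10)], gold),   # right answer, right region
    (True, [Box(2, 0, 0, 10, 10)], gold),   # right answer, wrong page
    (False, [Box(1, 0, 0, 10, 10)], gold),  # wrong answer, right region
    (True, [Box(1, 5, 0, 15, 10)], gold),   # right answer, IoU = 1/3
]
result = evaluate(records)
print(result)
```

On this toy data the answer-only accuracy is 0.75 while the strict score is 0.25: two of the three correct answers cite the wrong region, which is the pattern the post calls Attribution Hallucination.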