Hugging Face Daily Papers · June 4, 2026 · 7 min read

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

#model-release #agents #reasoning #funding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://nju-link.github.io/DRIFT/
GitHub: https://github.com/NJU-LINK/DRIFT

arxiv:2606.02060

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Published on Jun 1 · Submitted by Jiaheng Liu on Jun 4 · #2 Paper of the day · NJU-LINK Lab

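Authors: Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu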

Abstract

Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
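The auditing idea described in the abstract can be illustrated with a minimal sketch. This is not the DRIFT implementation: the `Claim` record, its fields, and the `audit` and `first_error` helpers are hypothetical names chosen for illustration, and support, conflict, and answer-path membership are assumed to be given as inputs rather than computed from a real trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical record for one agent claim (not the paper's schema)."""
    span_id: int                      # span in which the agent states the claim
    text: str
    supported_by: list[int] = field(default_factory=list)     # evidence span ids
    contradicted_by: list[int] = field(default_factory=list)  # conflicting span ids
    on_answer_path: bool = False      # assumed given: does the answer depend on it?


def audit(claims):
    """Flag spans whose claim is unsupported or conflicting AND affects the answer."""
    flagged = set()
    for c in claims:
        unsupported = not c.supported_by
        conflicting = bool(c.contradicted_by)
        if c.on_answer_path and (unsupported or conflicting):
            flagged.add(c.span_id)
    return sorted(flagged)


def first_error(claims):
    """Earliest flagged span, or None if nothing is flagged."""
    flagged = audit(claims)
    return flagged[0] if flagged else None


# Toy trajectory: the claim texts and span ids are made up for illustration.
claims = [
    Claim(2, "tentative hypothesis"),                             # unsupported, off path
    Claim(5, "published in 2019", supported_by=[4], on_answer_path=True),
    Claim(7, "the author is X", on_answer_path=True),             # unsupported, on path
    Claim(9, "published in 2021", supported_by=[8],
          contradicted_by=[4], on_answer_path=True),              # conflicting, on path
]
flagged_spans = audit(claims)
earliest = first_error(claims)
```

In this toy input the tentative hypothesis at span 2 is left alone because it never reaches the answer path, which mirrors the abstract's distinction between harmful error spans and harmless exploration.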

Community

CheeryLJH (paper submitter) · about 7 hours ago

[two images]

anp2 · about 4 hours ago

The claim-to-evidence attribution is the right granularity — moving from "was the answer right" to "which span made it unreliable" is exactly the process-level view that final-answer eval throws away. One boundary worth naming about what DRIFT can certify, though.

Checking a claim against the trajectory's own evidence catches two of the three failure shapes cleanly: the unsupported claim (no backing span) and the conflicting claim (contradicts another span). Both are internal-consistency failures, and span localization nails them. The one it's structurally blind to is the supported-but-wrong claim — where a search returned a confident-but-false snippet and the agent's claim faithfully rests on it. The support check passes, because the claim really is grounded in the trajectory; the trajectory is just wrong about the world. Auditing claims against the evidence the agent itself gathered is still auditing its account against its account, one level up from the final answer.
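The three failure shapes in this paragraph can be written out as a small decision function. This is a sketch of the commenter's argument, not of DRIFT: `internal_verdict`, `full_verdict`, and the `true_in_world` flag are hypothetical, and the flag stands in for an external check that an auditor limited to the trajectory does not have.

```python
from typing import Optional


def internal_verdict(has_support: bool, has_conflict: bool) -> str:
    """Verdict available from the trajectory's own evidence only."""
    if has_conflict:
        return "conflicting"
    if not has_support:
        return "unsupported"
    return "pass"


def full_verdict(has_support: bool, has_conflict: bool,
                 true_in_world: Optional[bool]) -> str:
    """Adds an external truth signal; None means no external check was run."""
    internal = internal_verdict(has_support, has_conflict)
    if internal != "pass":
        return internal
    if true_in_world is False:
        return "supported-but-wrong"
    return "pass"


# A claim grounded in a false snippet passes the internal check
# and is only caught once an outside source is consulted.
inside_view = internal_verdict(has_support=True, has_conflict=False)
outside_view = full_verdict(has_support=True, has_conflict=False, true_in_world=False)
```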

Where this turns from a caveat into something useful: DRIFT already does the expensive half. It isolates which claim depends on which evidence span and which of those sit on the answer path. That is exactly the targeting you'd want for an external check — take the high-impact supported spans and re-derive the evidence itself against a source outside the trajectory (re-run the lookup, hit the primary source, a second retriever the agent never called). The attribution tells you where to spend the costly independent verification; the re-derivation tells you whether a well-supported claim is actually true. The two compose: claim→evidence closes internal consistency, evidence→world closes the shared-error gap the trajectory can't see by construction.
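The targeting step suggested here can be sketched as follows, under stated assumptions: the `impact` score, the `SupportedClaim` record, and the `external_lookup` callable are all hypothetical, and the lookup is stubbed with a dictionary instead of a real retriever or primary source.

```python
from typing import Callable, NamedTuple


class SupportedClaim(NamedTuple):
    """Hypothetical record for a claim that already passed the internal check."""
    text: str
    evidence_span: int
    impact: float  # assumed score for how strongly the answer depends on the claim


def pick_for_external_check(claims, budget):
    """Spend a limited verification budget on the highest-impact claims."""
    return sorted(claims, key=lambda c: c.impact, reverse=True)[:budget]


def find_supported_but_wrong(claims, external_lookup: Callable[[str], bool]):
    """Keep the claims that an independent source fails to confirm."""
    return [c for c in claims if not external_lookup(c.text)]


# Stub standing in for an independent source outside the trajectory.
TRUSTED = {"founded in 1998": True, "headquartered in Oslo": False}


def stub_lookup(text: str) -> bool:
    return TRUSTED.get(text, True)


supported = [
    SupportedClaim("founded in 1998", evidence_span=4, impact=0.9),
    SupportedClaim("headquartered in Oslo", evidence_span=8, impact=0.7),
    SupportedClaim("logo is blue", evidence_span=11, impact=0.1),
]
to_check = pick_for_external_check(supported, budget=2)
wrong = find_supported_but_wrong(to_check, stub_lookup)
```

The ranking keeps the costly external lookup confined to the claims the answer actually rests on, which is the division of labor the comment proposes.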

Get this paper in your agent:

hf papers read 2606.02060

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
