Hugging Face Daily Papers · June 1, 2026 · 3 min read

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Authors: Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang


arxiv:2605.30434


Published on May 28

Submitted by Ningyu Zhang on Jun 1

Ant Group



Abstract

AI-generated summary: The LongDS benchmark evaluates agents' ability to maintain and update analytical states over extended data analysis sessions, using real-world tasks drawn from Kaggle notebooks.

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52% to 69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.
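To make the state-evolution patterns named in the abstract more concrete, here is a minimal Python sketch of what "maintain, update, restore, and compose" could look like for a toy analysis session. It is an illustration only and is not taken from the LongDS code or data: the AnalysisSession class, its method names, and the sample rows are all hypothetical.

```python
from copy import deepcopy


class AnalysisSession:
    """Toy tracker for named analytical states (hypothetical, not the LongDS API)."""

    def __init__(self, rows):
        # The live state: the current rows plus the filters applied so far.
        self.state = {"rows": list(rows), "filters": []}
        self.checkpoints = {}

    def checkpoint(self, name):
        # Maintain: save a deep copy so later updates cannot alter it.
        self.checkpoints[name] = deepcopy(self.state)

    def update(self, label, predicate):
        # Update: narrow the live state and record which filter was applied.
        self.state["rows"] = [r for r in self.state["rows"] if predicate(r)]
        self.state["filters"].append(label)

    def rollback(self, name):
        # Restore: replace the live state with a saved one.
        self.state = deepcopy(self.checkpoints[name])

    def counterfactual(self, name, predicate):
        # Counterfactual perturbation: answer a what-if against a saved
        # state without touching the live state.
        return [r for r in self.checkpoints[name]["rows"] if predicate(r)]

    def compose(self, name_a, name_b, key):
        # Multi-state composition: rows of state A whose key also occurs in state B.
        keys_b = {r[key] for r in self.checkpoints[name_b]["rows"]}
        return [r for r in self.checkpoints[name_a]["rows"] if r[key] in keys_b]


# Made-up sample data for the illustration.
rows = [
    {"id": 1, "region": "EU", "sales": 120},
    {"id": 2, "region": "US", "sales": 80},
    {"id": 3, "region": "EU", "sales": 40},
    {"id": 4, "region": "US", "sales": 200},
]

session = AnalysisSession(rows)
session.checkpoint("raw")
session.update("region == EU", lambda r: r["region"] == "EU")
session.checkpoint("eu_only")
session.update("sales > 100", lambda r: r["sales"] > 100)
live_ids = [r["id"] for r in session.state["rows"]]

# What if the sales filter had been applied to the raw data instead?
what_if_ids = [r["id"] for r in session.counterfactual("raw", lambda r: r["sales"] > 100)]

# Undo the last filter by restoring the earlier checkpoint.
session.rollback("eu_only")
restored_ids = [r["id"] for r in session.state["rows"]]

# Build a second saved state, then combine it with the first.
session.rollback("raw")
session.update("sales > 100", lambda r: r["sales"] > 100)
session.checkpoint("high_sales")
composed_ids = [r["id"] for r in session.compose("eu_only", "high_sales", "id")]
```

In this sketch a late turn such as the final compose call depends on checkpoints created many steps earlier, which is the kind of long-range dependency the abstract describes. An agent that loses track of which filters belong to which saved state would return the wrong rows even if each individual step is simple.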


Community

Ningyu

Paper submitter about 9 hours ago

LongDS is the benchmark to test whether data analysis agents can reliably track evolving analytical states over hundreds of interactions.


Get this paper in your agent:

hf papers read 2605.30434

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
