Hugging Face Daily Papers · June 23, 2026 · 5 min read

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

#model-release #agents #rag #benchmark #funding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SA evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations.
To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.
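The abstract names cascade rubrics, cascade performance attribution, and user-centric aggregation but does not define them here. The sketch below is only one possible reading, not the authors' implementation: it assumes rubrics within a subtask are ordered by a hypothetical `stage` field, that the first failed rubric blocks every later one, and that the user preference score is a weighted mean of dimension scores. The dimension names, the `Rubric` fields, and the weights are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    subtask: str     # subtask this rubric checks
    dimension: str   # evaluation dimension (names below are made up)
    stage: int       # assumed cascade position; a failure blocks later stages
    passed: bool     # verdict from a judge (human or LLM)


def cascade_scores(rubrics):
    """Return (per-dimension pass rate, lost-rubric counts blamed per dimension)."""
    by_subtask = defaultdict(list)
    for r in rubrics:
        by_subtask[r.subtask].append(r)

    earned, total, blame = defaultdict(int), defaultdict(int), defaultdict(int)
    for items in by_subtask.values():
        blocker = None  # first failing rubric in this subtask's cascade
        for r in sorted(items, key=lambda x: x.stage):
            total[r.dimension] += 1
            if blocker is not None:
                # Blocked rubrics earn nothing and are blamed on the blocker.
                blame[blocker.dimension] += 1
            elif r.passed:
                earned[r.dimension] += 1
            else:
                blocker = r
                blame[r.dimension] += 1
    scores = {d: earned[d] / total[d] for d in total}
    return scores, dict(blame)


def preference_score(scores, weights):
    """Weighted mean of dimension scores; the weights are hypothetical."""
    norm = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / norm


rubrics = [
    Rubric("s1", "retrieval", 0, True),
    Rubric("s1", "accuracy", 1, False),      # blocks the rubric below
    Rubric("s1", "presentation", 2, True),
    Rubric("s2", "retrieval", 0, True),
    Rubric("s2", "accuracy", 1, True),
]
scores, blame = cascade_scores(rubrics)
overall = preference_score(
    scores, {"retrieval": 1.0, "accuracy": 2.0, "presentation": 1.0}
)
```

Under these assumptions, the blame counts show which dimension caused downstream rubrics to be lost, which is the kind of interpretability the abstract attributes to cascade attribution. The benchmark's actual rubric schema and aggregation rule should be taken from the linked repository.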

arxiv:2606.12871

Published on Jun 11 · Submitted by Youpeng Wang on Jun 23

Authors: Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

Abstract

Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct


Get this paper in your agent:

hf papers read 2606.12871

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
