Hugging Face Daily Papers · 4 min read

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://jiwanchung.github.io/webstep/ · Organization: Yonsei University
arxiv:2606.15673

Published on Apr 8, 2026 · Submitted by Jiwan Chung on Jun 16, 2026
Authors: Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim

Abstract

AI-generated summary: The WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing performance differences and localizing errors that terminal success metrics miss.

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
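
To make the idea of background semantic state tracking and bifurcation analysis concrete, here is a minimal Python sketch. It is not the WebStep implementation. The toy transition table, the state and action names, and the definition of the bifurcation point (the first step that moves from a state where the goal is still reachable to one where it is not) are all assumptions made for illustration; the paper may define these differently.

```python
from collections import deque

# Hypothetical toy site: (semantic state, action) -> next semantic state.
# The table is deterministic, in the spirit of the abstract's semantic MDP.
TRANSITIONS = {
    ("home", "open_search"): "search",
    ("search", "apply_filter"): "filtered",
    ("search", "open_listing"): "wrong_listing",
    ("filtered", "open_listing"): "listing",
    ("listing", "commit"): "booked",
    ("wrong_listing", "commit"): "booked_wrong",
}
GOAL = "booked"


def track(start, actions):
    """Replay actions through the MDP and log every semantic state visited."""
    state, trajectory = start, [start]
    for action in actions:
        # Actions with no semantic effect leave the state unchanged.
        state = TRANSITIONS.get((state, action), state)
        trajectory.append(state)
    return trajectory


def states_that_reach(goal):
    """Return every state from which the goal is reachable (backward BFS)."""
    parents = {}
    for (src, _action), dst in TRANSITIONS.items():
        parents.setdefault(dst, set()).add(src)
    seen, queue = {goal}, deque([goal])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def bifurcation_step(trajectory, goal):
    """Index of the first step that leaves the set of recoverable states.

    Returns None when the trajectory never leaves that set.
    """
    alive = states_that_reach(goal)
    for i in range(1, len(trajectory)):
        if trajectory[i - 1] in alive and trajectory[i] not in alive:
            return i
    return None


# One failing and one successful run on the toy site.
failed = track("home", ["open_search", "open_listing", "commit"])
solved = track("home", ["open_search", "apply_filter", "open_listing", "commit"])
```

Under these assumptions, `bifurcation_step(failed, GOAL)` returns 2: the agent skipped the filter and opened a listing from which the goal can no longer be reached, so the decisive error sits at the second action rather than at the final commit. For `solved` the function returns `None`. Because the log is built from semantic states rather than raw clicks, no manual annotation of the trajectory is needed to locate that step.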

Get this paper in your agent:

hf papers read 2606.15673
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
