Hugging Face Daily Papers · June 16, 2026 · 4 min read

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

#model-release #agents #benchmark #funding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.15673


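Published on Apr 8, 2026 · Submitted by Jiwan Chung on Jun 16, 2026

Yonsei University

Authors: Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim

Project page: https://jiwanchung.github.io/webstep/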

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): The WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing performance differences and localizing errors that terminal success metrics miss.

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.
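
To make the idea of background semantic state tracking concrete, here is a minimal Python sketch. It is not WebStep's implementation: the toy transition table, the state and action names, the `SemanticTracker` class, and the `bifurcation_index` helper are all hypothetical. It also assumes one plausible reading of "bifurcation", namely the first step that moves the agent from a state where the goal is still reachable to one where it is not. The abstract does not give the formal definition.

```python
# Hypothetical semantic MDP for a toy housing-search site. All state and
# action names are illustrative; none are taken from WebStep itself.
TRANSITIONS = {
    ("home", "open_search"): "search",
    ("search", "apply_filter"): "filtered",
    ("search", "open_listing"): "listing",
    ("filtered", "clear_filter"): "search",
    ("filtered", "open_listing"): "target_listing",
    ("listing", "back"): "search",
    ("listing", "commit"): "wrong_commit",   # irreversible wrong choice
    ("target_listing", "commit"): "done",
}
START = "home"
GOAL = "done"


class SemanticTracker:
    """Records high-level states in the background while an agent acts."""

    def __init__(self, start=START):
        self.state = start
        self.trajectory = [start]

    def step(self, action):
        # Deterministic: the same (state, action) pair always yields the
        # same next state. Unknown actions leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, action), self.state)
        self.trajectory.append(self.state)
        return self.state


def can_reach_goal(state, goal=GOAL):
    """Graph search over the transition table: is `goal` reachable?"""
    frontier, seen = [state], {state}
    while frontier:
        current = frontier.pop()
        if current == goal:
            return True
        for (src, _action), dst in TRANSITIONS.items():
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return False


def bifurcation_index(trajectory):
    """Index of the first step that makes the goal unreachable.

    Returns None when no step in the trajectory loses the task.
    This definition is an assumption made for illustration.
    """
    for i in range(1, len(trajectory)):
        before, after = trajectory[i - 1], trajectory[i]
        if can_reach_goal(before) and not can_reach_goal(after):
            return i
    return None


# A failing episode: the agent skips filtering and commits on a wrong listing.
tracker = SemanticTracker()
for action in ["open_search", "open_listing", "commit"]:
    tracker.step(action)

print(tracker.trajectory)                     # ['home', 'search', 'listing', 'wrong_commit']
print(bifurcation_index(tracker.trajectory))  # 3
```

In this toy episode, opening the unfiltered listing is still recoverable (the agent could go back), so the decisive error is the commit at step 3. An outcome-only metric would report the same failure for any losing trajectory, whereas the recorded semantic trajectory shows where the task was lost.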