Hugging Face Daily Papers · 5 min read

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.11042

Published on Jun 9, 2026 · Submitted by taesiri on Jun 10, 2026

Authors: Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Yi Zhu, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Zhiyong Wu, Shen Yan, Yujia Qin, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

Abstract

Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding.

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

Community

Project Homepage: https://workflow-gym.github.io/


Get this paper in your agent:

hf papers read 2606.11042
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

