Hugging Face Daily Papers · 3 min read

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.19484
Published: 2026-05-19
Code: https://github.com/CUC-MIPG/CutVerse
Organization: CUC-MIPG (Multimedia Intelligent Processing Group in Communication University of China)

Authors: Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

AI-generated summary

Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment.

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows that involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by the complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.
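The abstract does not say how the parser works internally. As a rough illustration only, the sketch below shows one way low-level interaction logs could be merged into compositional actions such as click, drag, and hotkey. The event schema, the `RawEvent` and `Action` types, the `parse_events` function, and the 5-pixel drag threshold are all hypothetical and are not taken from CutVerse.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical low-level event, as an input logger might record it.
@dataclass
class RawEvent:
    t: float            # timestamp in seconds
    kind: str           # "mouse_down", "mouse_move", "mouse_up", "key_down", "key_up"
    x: int = 0
    y: int = 0
    key: str = ""

# Hypothetical compositional action produced by the parser.
@dataclass
class Action:
    name: str           # "click", "drag", "press", or "hotkey"
    args: dict = field(default_factory=dict)

MODIFIERS = {"ctrl", "shift", "alt", "cmd"}

def parse_events(events: List[RawEvent], drag_threshold: int = 5) -> List[Action]:
    """Merge raw events into higher-level actions (illustrative only)."""
    actions: List[Action] = []
    press = None            # pending mouse_down, waiting for its mouse_up
    held: List[str] = []    # modifier keys currently held down
    for ev in events:
        if ev.kind == "mouse_down":
            press = ev
        elif ev.kind == "mouse_up" and press is not None:
            # Large displacement between press and release means a drag.
            dist = max(abs(ev.x - press.x), abs(ev.y - press.y))
            if dist > drag_threshold:
                actions.append(Action("drag", {"start": (press.x, press.y),
                                               "end": (ev.x, ev.y)}))
            else:
                actions.append(Action("click", {"pos": (press.x, press.y)}))
            press = None
        elif ev.kind == "key_down":
            if ev.key in MODIFIERS:
                if ev.key not in held:
                    held.append(ev.key)
            else:
                # A non-modifier key pressed while modifiers are held is a hotkey.
                name = "hotkey" if held else "press"
                actions.append(Action(name, {"keys": held + [ev.key]}))
        elif ev.kind == "key_up" and ev.key in held:
            held.remove(ev.key)
        # mouse_move events are ignored in this simplified sketch.
    return actions

# Example log: a click, a drag along a timeline, then Ctrl+S.
log = [
    RawEvent(0.0, "mouse_down", x=100, y=200),
    RawEvent(0.1, "mouse_up", x=101, y=200),
    RawEvent(1.0, "mouse_down", x=50, y=300),
    RawEvent(1.2, "mouse_move", x=200, y=300),
    RawEvent(1.5, "mouse_up", x=400, y=300),
    RawEvent(2.0, "key_down", key="ctrl"),
    RawEvent(2.1, "key_down", key="s"),
    RawEvent(2.2, "key_up", key="s"),
    RawEvent(2.3, "key_up", key="ctrl"),
]
actions = parse_events(log)
for a in actions:
    print(a.name, a.args)
```

A real parser of the kind the abstract describes would presumably also use the screen recording to ground each action in a UI element. This sketch covers only the event-merging step.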

Get this paper in your agent:

hf papers read 2605.19484
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
