Hugging Face Daily Papers · 5 min read

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.14747

Published on May 14, 2026 · Submitted by xwm on May 21, 2026 · #2 Paper of the day
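Authors: Weimin Xiong, Shuhao Gu, Bowen Ye, Zihao Yue, Lei Li, Feifan Song, Sujian Li, Hao Tian
Organization: Peking University
Project page: https://weiminxiong.github.io/Video2GUI/
Code: https://github.com/WeiminXiong/Video2GUI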

Abstract

AI-generated summary: A large-scale GUI dataset was created by automatically extracting interaction trajectories from internet videos, enabling improved performance in GUI agents through pre-training on this diverse collection.

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research on GUI agents.
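To make "structured agent trajectories" concrete, here is a minimal sketch of what one grounded trajectory record might look like. The abstract does not describe WildGUI's schema, so every name below (`Step`, `Trajectory`, `action_type`, the normalized `x`/`y` convention) is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One screenshot plus the action taken on it (hypothetical schema)."""
    screenshot_path: str
    action_type: str            # assumed vocabulary, e.g. "click", "type", "scroll"
    x: Optional[float] = None   # assumed normalized [0, 1] pointer position
    y: Optional[float] = None
    text: Optional[str] = None  # typed text, if any

    def is_grounded(self) -> bool:
        """A click counts as grounded only if it carries in-range coordinates."""
        if self.action_type != "click":
            return True
        return (
            self.x is not None
            and self.y is not None
            and 0.0 <= self.x <= 1.0
            and 0.0 <= self.y <= 1.0
        )


@dataclass
class Trajectory:
    """A task instruction and the ordered steps that carry it out."""
    instruction: str
    platform: str               # "web", "mobile", or "desktop"
    steps: List[Step] = field(default_factory=list)

    def is_valid(self) -> bool:
        """Non-empty, and every click has usable coordinates."""
        return len(self.steps) > 0 and all(s.is_grounded() for s in self.steps)
```

A record like this pairs each screenshot with an action and, for pointer actions, a location on screen, which is the sense in which a trajectory is "grounded".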

Community

Paper submitter about 7 hours ago

🚀 Excited to share Video2GUI — a fully automated framework that turns unlabeled YouTube videos into grounded GUI interaction trajectories at internet scale.

Existing GUI agent datasets rely on costly manual annotation and are limited to narrow domains, which bottlenecks generalization. We ask: can we instead mine the massive supply of GUI tutorials already on the web? Starting from 500M+ YouTube videos, our coarse-to-fine filtering + VLM-driven trajectory extraction + spatial grounding pipeline produces WildGUI: 12.7M trajectories, 124.5M screenshots, 1,500+ apps and websites across web, mobile, and desktop — the largest open-source GUI pretraining dataset to date.
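The coarse-to-fine idea can be sketched as a two-stage filter: a cheap metadata check prunes most candidates, and a costlier scorer runs only on the survivors. This is not the authors' pipeline. The keyword list, the `fine_score` callable (a stand-in for something like a VLM-based judge), and the threshold are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical keyword list for the cheap first pass.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "screen recording")


def coarse_filter(meta: Dict[str, str]) -> bool:
    """Cheap stage: keep videos whose title or description mentions a keyword."""
    text = (meta.get("title", "") + " " + meta.get("description", "")).lower()
    return any(k in text for k in GUI_KEYWORDS)


def coarse_to_fine(
    videos: Iterable[Dict[str, str]],
    fine_score: Callable[[Dict[str, str]], float],
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Run the expensive scorer only on videos that pass the cheap stage."""
    kept = []
    for meta in videos:
        if not coarse_filter(meta):
            continue  # pruned without paying for the expensive scorer
        if fine_score(meta) >= threshold:
            kept.append(meta)
    return kept
```

The design point is cost ordering: at the scale of hundreds of millions of metadata entries, the expensive check is only affordable if the cheap one has already discarded most of the input.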

Pretraining Qwen2.5-VL and Mimo-VL on WildGUI yields 5–20% gains across ScreenSpot-Pro, OSWorld-G, AndroidControl, CAGUI, OSWorld, and AndroidWorld, matching or surpassing SOTA. On ScreenSpot-Pro, accuracy improves from 41.2 to 56.9, a gain of 15.7 points, or about 38% in relative terms. Happy to discuss the pipeline design and how internet-scale video mining can power the next generation of GUI agents. The dataset and pipeline will be released!
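As a quick check of the ScreenSpot-Pro figure, the snippet below computes the absolute and relative gain from the two accuracies quoted in the post. The function name `relative_gain` is just an illustrative helper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


before, after = 41.2, 56.9  # ScreenSpot-Pro accuracies quoted in the post
absolute = after - before
relative = relative_gain(before, after)

print(f"absolute gain: {absolute:.1f} points")  # absolute gain: 15.7 points
print(f"relative gain: {relative:.1f}%")        # relative gain: 38.1%
```

The two numbers answer different questions: 15.7 is the change in accuracy points, while 38.1% expresses that change as a fraction of the 41.2 baseline.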
