Hugging Face Daily Papers · 5 min read

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs, comprising static images, synchronous audio, and video clips, at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise.

Code: https://github.com/omni-gui/OmniGUI
arxiv:2605.18758

Published on Apr 3, 2026 · Submitted by Shiyu Huang on May 20, 2026

Authors: Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, Shiyu Huang

Abstract

AI-generated summary: OmniGUI presents a novel multimodal benchmark for GUI agents that incorporates simultaneous audio, video, and image inputs to better simulate real smartphone interactions.

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.
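To make "step-level" evaluation with per-step dependency annotations concrete, here is a minimal sketch of how such results might be tallied. It is not the OmniGUI evaluation pipeline: the `Step` record, its field names, the dependency labels (`visual_only`, `audio_dependent`), and the exact-match scoring rule are all assumptions made for illustration, since the abstract does not specify the schema or the metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One action step. Every field name here is hypothetical."""
    episode_id: str
    app: str
    screenshot_path: str        # static image observed at this step
    audio_path: Optional[str]   # synchronous audio clip, if any
    video_path: Optional[str]   # video clip, if any
    dependency_level: str       # hypothetical label, e.g. "visual_only"
    gold_action: str            # expert-demonstrated action


def accuracy_by_dependency(steps, predictions):
    """Return {dependency_level: fraction of steps whose action matches exactly}.

    Exact string match is an assumed scoring rule, not the paper's metric.
    """
    if len(steps) != len(predictions):
        raise ValueError("steps and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for step, predicted in zip(steps, predictions):
        total[step.dependency_level] += 1
        correct[step.dependency_level] += int(predicted == step.gold_action)
    return {level: correct[level] / total[level] for level in total}


# Toy data, invented for this example only.
steps = [
    Step("ep1", "music", "s0.png", None, None, "visual_only", "tap(play)"),
    Step("ep1", "music", "s1.png", "a1.wav", None, "audio_dependent", "tap(pause)"),
    Step("ep2", "video", "s2.png", "a2.wav", "v2.mp4", "audio_dependent", "tap(skip)"),
]
predictions = ["tap(play)", "tap(pause)", "tap(back)"]

scores = accuracy_by_dependency(steps, predictions)
print(scores)
```

Grouping scores by dependency level rather than reporting one aggregate number is what would expose the pattern the abstract describes, where models do well on visually static steps but degrade on steps that need audio or temporal signals.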

Get this paper in your agent:

hf papers read 2605.18758
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
