Hugging Face Daily Papers · 3 min read

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.09764

Published on Jun 8, 2026 · Submitted by Lawrence Jang on Jun 18, 2026
Authors: Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov
Organization: Carnegie Mellon University

Abstract

iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring a persistent user identity across multiple apps, designed to evaluate personalized mobile agent capabilities.

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
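
As a quick sanity check on the task breakdown quoted in the abstract, the sketch below tallies the three category counts and computes each category's share of the benchmark. Only the counts (27, 60, 46) come from the abstract; the names `TASK_COUNTS` and `category_shares` are illustrative assumptions and are not part of the iOSWorld release.

```python
# Task counts per category, as stated in the abstract.
# The dictionary and function names are hypothetical, chosen for illustration.
TASK_COUNTS = {
    "single-app": 27,
    "multi-app": 60,
    "memory and personalization": 46,
}


def category_shares(counts):
    """Return each category's share of all tasks as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)

print(f"total tasks: {total_tasks}")
for name, pct in shares.items():
    print(f"{name}: {pct}% of tasks")
```

The three counts sum to 133, matching the total given in the abstract. Multi-app tasks are the largest category, which is worth keeping in mind when comparing the 37% multi-app score against the 52% overall score.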

Community

Paper submitter about 3 hours ago

New iOS Phone Agent Paper

Get this paper in your agent:

hf papers read 2606.09764
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
