Hugging Face Daily Papers · 4 min read

Training Open Models for Agentic Phone Use

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://phonebuddyai.github.io/
Code: https://github.com/PhoneBuddyAI/phonebuddy
arxiv:2606.23049

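Published on Jun 22, 2026 · Submitted by taesiri on Jun 23, 2026
Authors: Zhengyang Tang, Xin Lai, Pengyuan Lyu, Xinyuan Wang, Tianyi Bai, Chenxin Li, Yiduo Guo, Huawen Shen, Yuxuan Liu, Junyi Li, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Ji-Rong Wen, Rui Yan, Chengquan Zhang, Han Hu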

Abstract

PhoneBuddy combines real and mock app environments to improve training of open models for phone use, demonstrating enhanced task success rates through mixed reinforcement learning approaches.

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the same progression rises from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.
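
To make the reported numbers easier to compare, the short Python sketch below converts the human-evaluation success rates into implied task counts and computes each stage's absolute gain over supervised fine-tuning. Only the percentages and the 150-task total come from the abstract. The task counts are inferred by rounding and are not stated in the source, and the function and variable names are illustrative.

```python
# Reported success rates (percent) from the abstract, in training order.
human_eval = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}
android_world = {"SFT": 60.3, "real-app RL": 77.2, "mixed RL": 83.2}

NUM_HUMAN_EVAL_TASKS = 150  # stated in the abstract


def implied_successes(rate_percent: float, num_tasks: int) -> int:
    """Nearest whole number of solved tasks consistent with a reported rate.

    This is an inference from the rounded percentage, not a reported figure.
    """
    return round(rate_percent / 100 * num_tasks)


def gains(rates: dict[str, float]) -> dict[str, float]:
    """Absolute gain, in percentage points, of each stage over the first stage."""
    baseline = next(iter(rates.values()))
    return {stage: round(rate - baseline, 2) for stage, rate in rates.items()}


human_counts = {
    stage: implied_successes(rate, NUM_HUMAN_EVAL_TASKS)
    for stage, rate in human_eval.items()
}

print("Implied solved tasks (of 150):", human_counts)
print("Human-eval gains over SFT (points):", gains(human_eval))
print("AndroidWorld gains over SFT (points):", gains(android_world))
```

Under these assumptions the rates correspond to roughly 55, 61, and 68 solved tasks out of 150, so mixed RL adds about 13 tasks over supervised fine-tuning on the human evaluation (8.66 points), compared with a 22.9-point gain on AndroidWorld.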

Community

This is a really interesting approach to the phone agent problem. Using a mix of real and mock environments to bridge that gap between simulation speed and real-world reliability makes a lot of sense, especially since resetting real apps is such a headache.

I'm curious if you have any thoughts on why cross-app workflows are still lagging behind. Do you think the bottleneck is more about the model's long-term memory or the complexity of moving between distinct app interfaces?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/dd8dcb05-a1cd-43ec-a37f-f2a03b2509ac


Get this paper in your agent:

hf papers read 2606.23049
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
