PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Authors: Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu
Organization: Carnegie Mellon University
Published: 2026-05-26
AI-generated summary
PANDO is a web agent framework that improves efficiency through experience accumulation by reducing redundant actions, optimizing skill discovery, and enhancing prompt caching without sacrificing performance.
Abstract
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
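The abstract names three trajectory-level efficiency metrics but does not define them here. The sketch below shows one plausible way such metrics could be computed from a logged trajectory. Every definition in it is an assumption made for illustration, not the paper's formula: the `Step` record, its field names, the "same action as the previous step" notion of repetition, and the `reference_steps` baseline are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Step:
    """One logged agent step (hypothetical record layout)."""
    action: str         # serialized action, e.g. "click [42]"
    prompt_tokens: int  # tokens sent in this step's prompt
    cached_tokens: int  # subset of prompt_tokens served from the prompt cache


def action_repetition_rate(steps: Sequence[Step]) -> float:
    """Assumed definition: share of transitions that repeat the previous action."""
    if len(steps) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(steps, steps[1:]) if prev.action == cur.action)
    return repeats / (len(steps) - 1)


def step_overhead_ratio(steps: Sequence[Step], reference_steps: int) -> float:
    """Assumed definition: steps taken divided by a reference step count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return len(steps) / reference_steps


def prompt_cache_utilization(steps: Sequence[Step]) -> float:
    """Assumed definition: cached prompt tokens divided by all prompt tokens."""
    total = sum(s.prompt_tokens for s in steps)
    if total == 0:
        return 0.0
    return sum(s.cached_tokens for s in steps) / total


# A made-up four-step trajectory, used only to exercise the functions.
trajectory = [
    Step("click [1]", 100, 0),
    Step("click [1]", 100, 50),
    Step("type [2] shoes", 100, 50),
    Step("click [1]", 100, 100),
]
arr = action_repetition_rate(trajectory)
sor = step_overhead_ratio(trajectory, reference_steps=2)
pcu = prompt_cache_utilization(trajectory)
```

Under these assumed definitions, a lower repetition rate and overhead ratio and a higher cache utilization would indicate a more efficient trajectory, independent of whether the task ultimately succeeded.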
Community
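Librarian Bot (automated comment): the following similar papers were recommended by the Semantic Scholar API.
* WebXSkill: Skill Learning for Autonomous Web Agents (2026), https://huggingface.co/papers/2604.13318
* Kintsugi: Learning Policies by Repairing Executable Knowledge Bases (2026), https://huggingface.co/papers/2605.09487
* SkillEvolver: Skill Learning as a Meta-Skill (2026), https://huggingface.co/papers/2605.10500
* GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents (2026), https://huggingface.co/papers/2605.29668
* ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation (2026), https://huggingface.co/papers/2604.23853
* From History to State: Constant-Context Skill Learning for LLM Agents (2026), https://huggingface.co/papers/2605.05413
* SkillGen: Verified Inference-Time Agent Skill Synthesis (2026), https://huggingface.co/papers/2605.10999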
arXiv: arxiv.org/abs/2605.24785