MMSkills: Towards Multimodal Skills for General Visual Agents
Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu
Published: 2026-05-14
arXiv: 2605.13527
Project page: https://deepexperience.github.io/MMSkills/
Code: https://github.com/DeepExperience/MMSkills
Abstract
AI-generated summary: Multimodal procedural knowledge frameworks enable visual agents to leverage external reusable skills through structured representations combining text, state cards, and visual keyframes, improving decision-making in complex environments.
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
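The abstract describes an MMSkill as a textual procedure coupled with state cards and multi-view keyframes, consulted in a temporary branch that hands structured guidance back to the main agent. The Python sketch below illustrates that idea only; it is not the authors' implementation. Every class name, field name, and the word-overlap scorer are assumptions made for illustration, and a real system would presumably use a multimodal model to align keyframes with the live screenshot rather than compare text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Keyframe:
    """One reference image for a state (hypothetical fields)."""
    view: str        # e.g. "full_screen" or "cropped_widget"
    image_path: str  # where the reference image is stored


@dataclass
class StateCard:
    """One runtime state the procedure may pass through (hypothetical fields)."""
    state_id: str
    description: str          # what the state looks like
    success_cues: List[str]   # visual evidence of progress
    failure_cues: List[str]   # visual evidence of failure
    next_step: str            # what to do from this state
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure coupled with state cards and their keyframes."""
    name: str
    procedure: List[str]
    state_cards: List[StateCard]


@dataclass
class Guidance:
    """Text-only result returned to the main agent; it carries no images."""
    skill: str
    matched_state: str
    next_step: str
    watch_for: List[str]


def consult_in_branch(
    skill: MMSkill,
    live_observation: str,
    match_score: Callable[[StateCard, str], float],
) -> Guidance:
    """Align state cards with the live observation and distill guidance.

    The "branch" here is simply this function's local scope: state cards and
    keyframes are examined inside it, and only compact text is returned.
    """
    best = max(skill.state_cards, key=lambda card: match_score(card, live_observation))
    return Guidance(
        skill=skill.name,
        matched_state=best.state_id,
        next_step=best.next_step,
        watch_for=best.failure_cues,
    )


def word_overlap(card: StateCard, observation: str) -> float:
    """Toy stand-in scorer: share of description words found in the observation."""
    words = set(card.description.lower().split())
    seen = set(observation.lower().split())
    return len(words & seen) / max(len(words), 1)


# Illustrative skill with made-up content.
skill = MMSkill(
    name="export_pdf",
    procedure=["Open the File menu", "Choose Export", "Confirm the dialog"],
    state_cards=[
        StateCard("menu_open", "file menu open with export entry visible",
                  ["export entry highlighted"], ["menu closed unexpectedly"], "Click Export"),
        StateCard("dialog_open", "export dialog open with save button",
                  ["filename field filled"], ["error banner shown"], "Click Save"),
    ],
)

guidance = consult_in_branch(skill, "an export dialog with a save button is open", word_overlap)
print(guidance.matched_state, "->", guidance.next_step)
```

Returning a text-only `Guidance` object mirrors the stated goal of consulting multimodal evidence without loading many images into the main agent's context; how the actual framework selects cards and formats guidance is not specified in the abstract.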
Community
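This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). The following similar papers were recommended by the Semantic Scholar API:
* Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck (2026), https://huggingface.co/papers/2605.08526
* WebXSkill: Skill Learning for Autonomous Web Agents (2026), https://huggingface.co/papers/2604.13318
* MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory (2026), https://huggingface.co/papers/2605.15128
* MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence (2026), https://huggingface.co/papers/2605.12703
* Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis (2026), https://huggingface.co/papers/2603.29620
* GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents (2026), https://huggingface.co/papers/2604.07429
* A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications (2026), https://huggingface.co/papers/2605.07358
To get recommendations for any paper on Hugging Face, use the Space at https://huggingface.co/spaces/librarian-bots/recommend_similar_papers or tag the bot in a comment: @librarian-bot recommend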