Hugging Face Daily Papers · May 18, 2026 · 9 min read

MMSkills: Towards Multimodal Skills for General Visual Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. 
Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
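
The package layout and branch-loaded consultation described in the abstract can be pictured with a small sketch. Everything below is an illustrative assumption, not the authors' schema or code: the class names (`Keyframe`, `StateCard`, `MMSkill`), their fields, and the helper functions are hypothetical, and plain word overlap stands in for whatever model-based alignment with the live environment the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    """One reference image for a state (hypothetical fields)."""
    image_path: str
    view: str  # e.g. "full_screen" or "cropped_region"


@dataclass
class StateCard:
    """A recognizable runtime state plus what to do once it is seen."""
    state_id: str
    description: str        # textual cue for recognizing the state
    next_action_hint: str   # guidance once the state is recognized
    keyframes: list[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure coupled with state cards and keyframes."""
    name: str
    procedure: list[str]
    state_cards: list[StateCard] = field(default_factory=list)


def select_state_cards(skill, observation_text, max_cards=2):
    """Pick the cards whose descriptions share the most words with the
    observation. Word overlap is only a stand-in for real alignment."""
    obs_words = set(observation_text.lower().split())
    scored = []
    for card in skill.state_cards:
        overlap = len(obs_words & set(card.description.lower().split()))
        scored.append((overlap, card))
    # Stable sort: ties keep the order in which cards were listed.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [card for overlap, card in scored[:max_cards] if overlap > 0]


def distill_guidance(skill, cards):
    """Return text-only guidance for the main agent. Keyframe images are
    deliberately left out, so they stay in the temporary branch."""
    return {
        "skill": skill.name,
        "procedure": skill.procedure,
        "matched_states": [card.state_id for card in cards],
        "hints": [card.next_action_hint for card in cards],
    }
```

In this sketch the main agent only ever receives the dictionary from `distill_guidance`, which mirrors the stated goal of consulting multimodal evidence without adding reference screenshots to the main context.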

Project page: https://deepexperience.github.io/MMSkills/

Code: https://github.com/DeepExperience/MMSkills

arXiv:2605.13527

Published on May 14, 2026 · Submitted by 张康宁 on May 18, 2026

#2 Paper of the day

Shanghai Jiaotong University 1(NOT OFFICIAL)

Upvotes: 101

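Authors:

Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu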

AI-generated summary

Multimodal procedural knowledge frameworks enable visual agents to leverage external reusable skills through structured representations combining text, state cards, and visual keyframes, improving decision-making in complex environments.