Hugging Face Daily Papers · · 9 min read

MMSkills: Towards Multimodal Skills for General Visual Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://deepexperience.github.io/MMSkills/
Code: https://github.com/DeepExperience/MMSkills
arxiv:2605.13527

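Published on May 14, 2026 · Submitted by 张康宁 (zhangkangning) on May 18, 2026 · #2 Paper of the day
Authors: Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu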

Abstract

AI-generated summary: Multimodal procedural knowledge frameworks enable visual agents to leverage external reusable skills through structured representations combining text, state cards, and visual keyframes, improving decision-making in complex environments.

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
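
To make the package structure and the branch-loaded consultation more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that an MMSkill couples a textual procedure with state cards and multi-view keyframes, and that evidence is inspected in a temporary branch and distilled into structured guidance. Every class name, field name, and the `align` callback below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Keyframe:
    """One reference view of a state (hypothetical schema)."""
    image_path: str
    view: str  # e.g. "full" or "crop"; these view names are assumptions


@dataclass
class StateCard:
    """A recognizable state plus a hint on what to do next (hypothetical schema)."""
    state_id: str
    description: str
    next_step_hint: str
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class MMSkill:
    """A textual procedure coupled with state cards and their keyframes."""
    name: str
    procedure: List[str]
    state_cards: Dict[str, StateCard] = field(default_factory=dict)


def consult_in_branch(
    skill: MMSkill,
    selected_state_ids: List[str],
    align: Callable[[StateCard], bool],
) -> dict:
    """Inspect selected cards in a temporary context and return text-only guidance.

    `align` is a stand-in for a multimodal model call that compares a card's
    keyframes with the live observation; here it is just a predicate.
    """
    # The branch context lives only inside this call, so the keyframes never
    # enter the main agent's context.
    branch_context = [skill.state_cards[s] for s in selected_state_ids]
    matched = [card for card in branch_context if align(card)]
    return {
        "skill": skill.name,
        "matched_states": [card.state_id for card in matched],
        "hints": [card.next_step_hint for card in matched],
        "procedure": skill.procedure,
    }


# Toy usage with made-up data.
skill = MMSkill(
    name="save_file_as",
    procedure=["Open the File menu", "Choose Save As", "Enter a name and confirm"],
    state_cards={
        "menu_open": StateCard(
            "menu_open", "File menu is expanded", "Click 'Save As'.",
            [Keyframe("frames/menu_full.png", "full")],
        ),
        "dialog_open": StateCard(
            "dialog_open", "Save dialog is visible",
            "Type the file name, then press Save.",
            [Keyframe("frames/dialog_crop.png", "crop")],
        ),
    },
)

# Pretend the live screen shows the save dialog.
guidance = consult_in_branch(
    skill, ["menu_open", "dialog_open"], align=lambda c: c.state_id == "dialog_open"
)
print(guidance["matched_states"], guidance["hints"])
```

The design point this sketch tries to capture is that the main agent receives only the distilled dictionary, while the reference images stay inside the temporary branch. That is one plausible reading of how the approach avoids excessive image context and over-anchoring to reference screenshots.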

Community

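This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:

- Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck (2026), https://huggingface.co/papers/2605.08526
- WebXSkill: Skill Learning for Autonomous Web Agents (2026), https://huggingface.co/papers/2604.13318
- MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory (2026), https://huggingface.co/papers/2605.15128
- MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence (2026), https://huggingface.co/papers/2605.12703
- Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis (2026), https://huggingface.co/papers/2603.29620
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents (2026), https://huggingface.co/papers/2604.07429
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications (2026), https://huggingface.co/papers/2605.07358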


Get this paper in your agent:

hf papers read 2605.13527
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
