Hugging Face Daily Papers · 5 min read

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01993

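Published on Jun 1, 2026 · Submitted by Jiaheng Liu on Jun 4, 2026
Authors: Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu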

Abstract

The MMG2Skill framework converts web-based procedural guides into executable skills through closed-loop learning, improving agent performance across GUI control, gameplay, and card play tasks.

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.
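
The closed loop described in the abstract (compile a guide into an editable skill, run a fixed agent conditioned on it, revise the skill from trajectory-level root-cause feedback, and stop early once success can be inferred) might be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: every name here (`Skill`, `Trajectory`, `compile_guide`, `run_agent`, `analyze`, `revise`, `guide_to_skill_loop`, `toy_env`) is hypothetical, the "agent" is a stand-in that simply executes each step, and the analyzer is a simple string check rather than a VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    """Hypothetical editable skill: an ordered list of agent-facing steps."""
    steps: List[str]
    revision: int = 0


@dataclass
class Trajectory:
    """What the agent can observe about one attempt (no benchmark score)."""
    actions: List[str]
    observations: List[str]


@dataclass
class Analysis:
    """Analyzer output: inferred success plus the step blamed for a failure."""
    inferred_success: bool
    root_cause: Optional[str]


def compile_guide(guide_text: str) -> Skill:
    """Toy compiler (assumption): each non-empty guide line becomes one step."""
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(steps=steps)


def run_agent(skill: Skill, env_step: Callable[[str], str]) -> Trajectory:
    """Stand-in for the fixed agent: executes every skill step in order."""
    actions, observations = [], []
    for step in skill.steps:
        actions.append(step)
        observations.append(env_step(step))
    return Trajectory(actions, observations)


def analyze(traj: Trajectory) -> Analysis:
    """Toy analyzer: the first action whose observation is an error is the root cause."""
    for action, obs in zip(traj.actions, traj.observations):
        if obs.startswith("error"):
            return Analysis(inferred_success=False, root_cause=action)
    return Analysis(inferred_success=True, root_cause=None)


def revise(skill: Skill, analysis: Analysis) -> Skill:
    """Toy revision: prepend a wait instruction to the step blamed for the failure."""
    new_steps = [
        "wait for the page to load, then " + s if s == analysis.root_cause else s
        for s in skill.steps
    ]
    return Skill(steps=new_steps, revision=skill.revision + 1)


def guide_to_skill_loop(
    guide_text: str, env_step: Callable[[str], str], max_attempts: int = 5
) -> Tuple[Skill, int]:
    """Compile, execute, analyze, revise; stop early once success is inferred."""
    skill = compile_guide(guide_text)
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        analysis = analyze(run_agent(skill, env_step))
        if analysis.inferred_success:
            break  # early stopping: remaining attempts are saved
        skill = revise(skill, analysis)
    return skill, attempts


def toy_env(step: str) -> str:
    """Hypothetical environment: clicking Submit fails unless the step waits first."""
    if "click Submit" in step and not step.startswith("wait"):
        return "error: element not found"
    return "ok"


final_skill, attempts_used = guide_to_skill_loop("open the form\nclick Submit\n", toy_env)
print(attempts_used, final_skill.revision, final_skill.steps)
```

In this toy run the first attempt fails on the Submit step, the skill is revised once, and the second attempt is inferred to succeed, so the loop stops before exhausting `max_attempts`. The revision signal comes only from the trajectory, mirroring the abstract's point that skills are revised without using benchmark scores.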


Get this paper in your agent:

hf papers read 2606.01993
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
