Hugging Face Daily Papers · 4 min read

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.22177

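Published on May 21, 2026 · Submitted by Jinyang Wu on May 22, 2026
Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao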

Abstract

AI-generated summary: A reinforcement learning-driven orchestration framework dynamically composes expert models and skills for multimodal tasks, achieving superior performance with low computational overhead.

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
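
The decision loop described in the abstract (invoke an expert or not, choose a model-skill pair, decide when to terminate, then score only the final outcome) can be illustrated with a toy sketch. This is not the authors' implementation: the registry contents, the names `REGISTRY`, `Action`, `Episode`, `list_actions`, `call_expert`, `rollout`, and `outcome_reward`, the random stand-in policy, and the 0/1 reward are all assumptions made for illustration, and no training step is shown.

```python
import random
from dataclasses import dataclass, field

# Hypothetical registry: each frozen expert model maps to the skills it can
# be paired with. The names are illustrative, not taken from the paper.
REGISTRY = {
    "math_expert": ["solve_equation", "check_units"],
    "chart_expert": ["read_axes", "extract_series"],
}


@dataclass
class Action:
    kind: str          # "invoke" or "terminate"
    model: str = ""
    skill: str = ""


@dataclass
class Episode:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""


def list_actions(registry):
    """Flatten the registry into one action space plus a terminate action."""
    actions = [Action("terminate")]
    for model, skills in registry.items():
        for skill in skills:
            actions.append(Action("invoke", model, skill))
    return actions


def random_policy(episode, actions, rng):
    """Stand-in for the learned orchestrator: picks an action uniformly."""
    return rng.choice(actions)


def call_expert(model, skill, question):
    """Stub for a frozen expert call; a real system would query a model."""
    return f"{model}.{skill}({question})"


def rollout(question, registry, policy, rng, max_steps=4):
    """Run one episode: act until the policy terminates or steps run out."""
    episode = Episode(question)
    actions = list_actions(registry)
    for _ in range(max_steps):
        action = policy(episode, actions, rng)
        if action.kind == "terminate":
            break
        episode.steps.append(call_expert(action.model, action.skill, question))
    # Assumption: the last expert output is taken as the final answer.
    episode.answer = episode.steps[-1] if episode.steps else "no-expert-answer"
    return episode


def outcome_reward(episode, is_correct):
    """Outcome-only reward: one scalar per episode, no step-level labels."""
    return 1.0 if is_correct(episode.answer) else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    ep = rollout("What is the peak value in the chart?", REGISTRY, random_policy, rng)
    print(len(ep.steps), outcome_reward(ep, lambda ans: "chart_expert" in ans))
```

Because the reward depends only on the final answer, a registry with more entries only enlarges the list returned by `list_actions`; the loop itself is unchanged, which is one way to picture adding experts without altering the rollout logic.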

Get this paper in your agent:

hf papers read 2605.22177
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
