EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
Authors: Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen
Abstract
EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models.
We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including π_0, π_{0.5}, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
Community
EBench is a surgical diagnosis tool for robot foundation models. It's not a leaderboard; it's a CAT scan for your policy.
Here's why the field needed this, and what it actually reveals about π0, π0.5, Qwen-RobotManip, and the rest:
1/ The "success rate" era is over.
Most robotics benchmarks give you a number. EBench gives you a profile.
26 tasks, 5 dimensions: Operating Mode, Horizon, Precision, Atomic Skill, Scene. Plus 4 generalization axes: Object, Background, Instruction, Composition.
The same model can look like a genius on one slice and a toddler on another. The aggregate score was hiding all of that.
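To make "profile vs. number" concrete, here is a minimal sketch of how per-slice success rates could be computed from tagged episodes. This is not EBench's code: the record layout, the `capability_profile` helper, and the four episodes are all invented for illustration, and only the dimension names come from the post above.

```python
from collections import defaultdict

# Hypothetical episode records: (task tags, success flag).
# Tag names follow the dimensions listed above; the episodes are made up.
episodes = [
    ({"Operating Mode": "Mobile", "Horizon": "Short", "Precision": "Low"}, True),
    ({"Operating Mode": "Mobile", "Horizon": "Long", "Precision": "High"}, False),
    ({"Operating Mode": "Dexterous", "Horizon": "Short", "Precision": "Low"}, True),
    ({"Operating Mode": "Dexterous", "Horizon": "Long", "Precision": "High"}, False),
]


def capability_profile(episodes):
    """Return ({(dimension, value): success rate}, overall success rate)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for tags, success in episodes:
        for dim, value in tags.items():
            counts[(dim, value)][0] += int(success)
            counts[(dim, value)][1] += 1
    profile = {key: s / n for key, (s, n) in counts.items()}
    overall = sum(int(ok) for _, ok in episodes) / len(episodes)
    return profile, overall


profile, overall = capability_profile(episodes)
for (dim, value), rate in sorted(profile.items()):
    print(f"{dim:15s} {value:10s} {rate:.0%}")
print(f"overall {overall:.0%}")
```

In this toy data the overall rate sits in the middle while the High Precision slice fails every episode, which is the kind of gap a single aggregate hides.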

2/ The "overfitting game" is real, and EBench calls it out.
The authors enforce strict train-test isolation at the object level: Validation-Train vs. Validation-Unseen vs. Test.
Plot val-to-test migration curves and you immediately see who's actually generalizing vs who's memorizing the training distribution.
π0.5 has the tightest val-test gap. That's why the community feels it's "good at fine-tuning." The numbers finally explain the vibe.
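One simple way to put a number on the val-test gap is sketched below. It assumes "retention" means test success rate divided by validation success rate; the paper may define it differently. The split keys, the two policies, and their numbers are placeholders, not EBench results.

```python
def retention(val_sr, test_sr):
    """Fraction of the validation success rate retained on test (1.0 = no drop)."""
    if val_sr <= 0:
        raise ValueError("validation success rate must be positive")
    return test_sr / val_sr


# Placeholder success rates for two imaginary policies, NOT EBench results.
policies = {
    "policy_a": {"val_train": 0.60, "val_unseen": 0.50, "test": 0.45},
    "policy_b": {"val_train": 0.70, "val_unseen": 0.40, "test": 0.28},
}


def rank_by_retention(policies):
    """Sort policies by how much of their Validation-Train rate survives on Test."""
    rows = []
    for name, sr in policies.items():
        gap = sr["val_train"] - sr["test"]
        rows.append((name, gap, retention(sr["val_train"], sr["test"])))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = rank_by_retention(policies)
for name, gap, kept in ranked:
    print(f"{name}: gap={gap:.2f}, retained={kept:.0%}")
```

Here policy_b scores higher on Validation-Train but keeps far less of it on Test, the pattern the post describes as memorizing the training distribution.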

3/ Qwen-RobotManip just took #1, but the story is structural, not just numerical.
45.6% Test SR, 60.8% Test Score. But look at the per-dimension breakdown:
- Mobile: 43.8%
- Dexterous: 50.0%
- Short Horizon: 50.2%
- Long Horizon: 33.1%
- Low Precision: 50.6%
- High Precision: 18.8% ← still the bottleneck
It's not a single spike. It's a shape. And that shape tells you exactly where to optimize next.
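The "shape" can be read off mechanically. The sketch below takes the six numbers quoted above and finds the dimension with the widest spread between its slices. Grouping the slices under Operating Mode, Horizon, and Precision is an assumption based on the dimension names in the post, and `widest_gap` is a hypothetical helper, not part of any EBench tooling.

```python
# Per-slice test success rates (%) as quoted in the post above.
# The grouping into dimensions is an assumption for illustration.
reported = {
    "Operating Mode": {"Mobile": 43.8, "Dexterous": 50.0},
    "Horizon": {"Short": 50.2, "Long": 33.1},
    "Precision": {"Low": 50.6, "High": 18.8},
}


def widest_gap(profile):
    """Return (dimension with largest spread, {dimension: (weakest slice, spread)})."""
    gaps = {}
    for dim, slices in profile.items():
        best = max(slices, key=slices.get)
        worst = min(slices, key=slices.get)
        gaps[dim] = (worst, round(slices[best] - slices[worst], 1))
    bottleneck = max(gaps, key=lambda d: gaps[d][1])
    return bottleneck, gaps


bottleneck, gaps = widest_gap(reported)
for dim, (worst, spread) in gaps.items():
    print(f"{dim}: weakest slice = {worst}, spread = {spread} points")
print("largest spread:", bottleneck)
```

On the quoted numbers the largest spread is in Precision, with High Precision as the weakest slice, consistent with the bottleneck called out in the list.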
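Links:
- 📄 Paper: https://arxiv.org/pdf/2606.18239
- 💻 Code: https://github.com/InternRobotics/EBench
- 🏆 Eval Platform: https://internrobotics.shlab.org.cn/eval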