Hugging Face Daily Papers · June 25, 2026 · 5 min read

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.18239

Published on Jun 20

Submitted by hanqingwang on Jun 25 · Intern Robotics

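Authors: Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen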

Abstract

EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses of models that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.


Community

hanqing94

Paper submitter about 5 hours ago

EBench is a surgical diagnosis tool for robot foundation models. It is not a leaderboard but a CAT scan for your policy.

Here's why the field needed this, and what it actually reveals about π0, π0.5, Qwen-RobotManip, and the rest:

1/ The "success rate" era is over.

Every robotics benchmark gives you a number. EBench gives you a profile.

26 tasks, 5 dimensions: Operating Mode, Horizon, Precision, Atomic Skill, Scene. Plus 4 generalization axes: Object, Background, Instruction, Composition.

The same model can look like a genius on one slice and a toddler on another. The aggregate score was hiding everything.
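To make the "number vs. profile" point concrete, here is a minimal sketch of how one aggregate success rate can be split into per-dimension slices. Everything in it is an assumption for illustration: the `TaskResult` record, the three task names, the trial counts, and the choice to pool trials when averaging are all hypothetical and not taken from the EBench codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One task's evaluation outcome (hypothetical record layout)."""
    name: str        # hypothetical task name
    tags: dict       # capability dimension -> value, e.g. {"Horizon": "Long"}
    successes: int   # number of successful rollouts
    trials: int      # number of rollouts attempted


def aggregate_success_rate(results):
    """Single scalar: successes pooled over all trials (an assumed convention)."""
    return sum(r.successes for r in results) / sum(r.trials for r in results)


def capability_profile(results):
    """Success rate per (dimension, value) slice, pooling trials within a slice."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [successes, trials]
    for r in results:
        for dimension, value in r.tags.items():
            counts[(dimension, value)][0] += r.successes
            counts[(dimension, value)][1] += r.trials
    return {key: s / t for key, (s, t) in counts.items()}


# Toy data, invented for this example only.
RESULTS = [
    TaskResult("pick_cup", {"Horizon": "Short", "Precision": "Low"}, 9, 10),
    TaskResult("insert_plug", {"Horizon": "Short", "Precision": "High"}, 1, 10),
    TaskResult("tidy_table", {"Horizon": "Long", "Precision": "Low"}, 5, 10),
]

if __name__ == "__main__":
    print(f"aggregate: {aggregate_success_rate(RESULTS):.2f}")
    for (dimension, value), rate in sorted(capability_profile(RESULTS).items()):
        print(f"{dimension:<10} {value:<6} {rate:.2f}")
```

In this toy data both Horizon slices sit at the aggregate value, while the Precision slices split sharply between Low and High, which is the kind of difference a single scalar cannot show.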

2/ The "overfitting game" is real, and EBench calls it out.

They enforce strict train-test isolation at the object level, with three splits: Validation-Train, Validation-Unseen, and Test.

Plot val-to-test migration curves and you immediately see who's actually generalizing vs who's memorizing the training distribution.

π0.5 has the tightest val-test gap. That's why the community feels it's "good at fine-tuning." The numbers finally explain the vibe.
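A rough sketch of how a val-test gap could be computed and used to rank policies. The post does not define the gap or "retention" formally, so the two formulas below (difference and ratio of success rates) are assumptions, and the policy names and scores are invented placeholders rather than EBench results.

```python
def generalization_gap(val_sr, test_sr):
    """Absolute drop in success rate from a validation split to the test split."""
    return val_sr - test_sr


def retention(val_sr, test_sr):
    """Fraction of validation success rate kept on test (assumed definition)."""
    if val_sr == 0:
        return 0.0  # nothing to retain; avoids division by zero
    return test_sr / val_sr


def rank_by_gap(scores, val_key="val_train"):
    """Policy names ordered from tightest to widest val-test gap."""
    return sorted(
        scores,
        key=lambda name: generalization_gap(scores[name][val_key], scores[name]["test"]),
    )


# Hypothetical success rates per split; not taken from the paper.
HYPOTHETICAL_SCORES = {
    "policy_a": {"val_train": 0.80, "val_unseen": 0.50, "test": 0.40},
    "policy_b": {"val_train": 0.60, "val_unseen": 0.50, "test": 0.48},
}

if __name__ == "__main__":
    for name in rank_by_gap(HYPOTHETICAL_SCORES):
        s = HYPOTHETICAL_SCORES[name]
        gap = generalization_gap(s["val_train"], s["test"])
        kept = retention(s["val_train"], s["test"])
        print(f"{name}: gap={gap:.2f} retention={kept:.2f}")
```

Here `policy_a` scores higher on Validation-Train but loses half of it on Test, while `policy_b` starts lower and keeps most of its performance, so ranking by gap puts `policy_b` first.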

3/ Qwen-RobotManip just took #1, but the story is structural, not just numerical.

45.6% Test SR, 60.8% Test Score. But look at the breakdown along three of the five dimensions:

Mobile: 43.8%
Dexterous: 50.0%
Short Horizon: 50.2%
Long Horizon: 33.1%
Low Precision: 50.6%
High Precision: 18.8% ← still the bottleneck

It's not a single spike. It's a shape. And that shape tells you exactly where to optimize next.
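The "shape" can be read off mechanically from the six numbers above. The sketch below uses exactly those reported percentages; grouping Mobile and Dexterous under "Operating Mode" is an inference from the dimension names listed earlier in the post, and the helper names are made up for this example. Atomic Skill and Scene are not included because the post gives no numbers for them.

```python
# Test success rates (%) as quoted in the post for Qwen-RobotManip.
# Grouping Mobile/Dexterous under "Operating Mode" is an assumption.
REPORTED = {
    "Operating Mode": {"Mobile": 43.8, "Dexterous": 50.0},
    "Horizon": {"Short": 50.2, "Long": 33.1},
    "Precision": {"Low": 50.6, "High": 18.8},
}


def spread(values):
    """Difference between the best and worst slice within one dimension."""
    return max(values.values()) - min(values.values())


def weakest_slice(profile):
    """The (dimension, value, rate) triple with the lowest success rate."""
    return min(
        ((dim, val, rate) for dim, vals in profile.items() for val, rate in vals.items()),
        key=lambda triple: triple[2],
    )


def widest_dimension(profile):
    """The dimension whose two slices differ the most."""
    return max(profile, key=lambda dim: spread(profile[dim]))


if __name__ == "__main__":
    for dim, vals in REPORTED.items():
        print(f"{dim:<15} spread = {spread(vals):.1f} points")
    print("weakest slice:", weakest_slice(REPORTED))
    print("widest dimension:", widest_dimension(REPORTED))
```

On these figures the Precision dimension has the largest internal spread (31.8 points, against 17.1 for Horizon and 6.2 for Operating Mode), and High Precision is the weakest slice overall, which matches the bottleneck flagged in the list.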

Links:

📄 Paper: https://arxiv.org/pdf/2606.18239
💻 Code: https://github.com/InternRobotics/EBench
🏆 Eval Platform: https://internrobotics.shlab.org.cn/eval


Get this paper in your agent:

hf papers read 2606.18239

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
