Hugging Face Daily Papers · June 2, 2026 · 3 min read

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen

GitHub: https://github.com/ZGC-EmbodyAI/RoboSemanticBench

arxiv:2606.02277

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Published on Jun 1, 2026 · Submitted by yubin on Jun 2, 2026 · Zhongguancun Academy

Abstract

AI-generated summary: RoboSemanticBench identifies a disconnect between semantic understanding and action prediction in vision-language-action models, where robots can grasp objects but fail to select semantically correct targets.

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
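The phrase "near-random or below-random rates after controlling for grasp success" can be made concrete with a small sketch. The code below is not the authors' evaluation code: the `Episode` fields, the `summarize` helper, and the toy data are all hypothetical, and it assumes one plausible reading of the metric, namely selection accuracy computed only over episodes in which some block was grasped, compared against the 1/N chance level of an N-choice suite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """One hypothetical evaluation episode (field names are illustrative)."""
    num_choices: int              # e.g. 4 or 10 candidate answer blocks
    correct_block: int            # index of the block showing the right answer
    grasped_block: Optional[int]  # index of the grasped block, None if no grasp


def summarize(episodes):
    """Return grasp rate, selection accuracy given a grasp, and chance level."""
    if not episodes:
        raise ValueError("need at least one episode")

    grasped = [e for e in episodes if e.grasped_block is not None]
    grasp_rate = len(grasped) / len(episodes)

    if not grasped:
        return {"grasp_rate": grasp_rate,
                "selection_accuracy_given_grasp": None,
                "chance_level": None}

    # Condition on a successful grasp so that manipulation failures
    # are not counted as semantic failures.
    correct = sum(e.grasped_block == e.correct_block for e in grasped)
    accuracy = correct / len(grasped)

    # Uniform guessing among N blocks succeeds with probability 1/N.
    chance = sum(1.0 / e.num_choices for e in grasped) / len(grasped)

    return {"grasp_rate": grasp_rate,
            "selection_accuracy_given_grasp": accuracy,
            "chance_level": chance}


# Toy four-choice data (made up for illustration, not results from the paper).
toy_grasps = [0, 1, 2, 3, 1, 2, None, None]
toy_episodes = [Episode(num_choices=4, correct_block=0, grasped_block=g)
                for g in toy_grasps]
stats = summarize(toy_episodes)
print(stats)
```

In this toy run the policy grasps a block in six of eight episodes but picks the correct one only once, so its accuracy given a grasp falls below the 0.25 chance level. That is the pattern the abstract describes: competent grasping paired with semantically uninformed target selection.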

Get this paper in your agent:

hf papers read 2606.02277

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
