Hugging Face Daily Papers

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.13578

Published on Jun 11 · Submitted by taesiri on Jun 12
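Authors: Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen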

Abstract

LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning.

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design.

To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pre-training first makes the Qwen3-VL-4B-Instruct backbone action-aware before any continuous control is learned, and flow matching post-training then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.
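
The data-engine loop described in the abstract (compose a workflow from atomic skills, roll it out per robot profile, validate and filter, then export) can be illustrated with a toy sketch. This is not the RoboGenesis implementation: every name here (`Rollout`, `run_rollout`, `validate`, `export`, `generate`, the skill names, and the robot profile names) is hypothetical, and the "simulator" is a stand-in that simply fails when a robot profile lacks a skill.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Rollout:
    """One simulated execution of a workflow (hypothetical structure)."""
    workflow: str
    robot: str
    steps: List[str]   # atomic skills that were actually executed, in order
    success: bool      # whether the run completed every skill


def run_rollout(name: str, skills: List[str], robot: str,
                unsupported: Dict[str, Set[str]]) -> Rollout:
    """Stand-in simulator: the run aborts at the first skill the robot cannot do."""
    executed: List[str] = []
    for skill in skills:
        if skill in unsupported.get(robot, set()):
            return Rollout(name, robot, executed, success=False)
        executed.append(skill)
    return Rollout(name, robot, executed, success=True)


def validate(rollout: Rollout, skills: List[str]) -> bool:
    """Keep only rollouts that succeeded and executed the full skill sequence."""
    return rollout.success and rollout.steps == skills


def export(rollouts: List[Rollout]) -> List[dict]:
    """Export kept rollouts as structured records (assumed schema)."""
    return [{"workflow": r.workflow, "robot": r.robot, "steps": r.steps}
            for r in rollouts]


def generate(name: str, skills: List[str], robots: List[str],
             unsupported: Dict[str, Set[str]]) -> List[dict]:
    """Compose -> roll out on each robot profile -> validate/filter -> export."""
    rollouts = [run_rollout(name, skills, robot, unsupported) for robot in robots]
    kept = [r for r in rollouts if validate(r, skills)]
    return export(kept)


# A workflow composed from four made-up atomic skills, tried on two made-up profiles.
skills = ["pick_beaker", "move_to_target", "pour", "place_beaker"]
demos = generate("pour_liquid", skills, ["arm_a", "arm_b"], {"arm_b": {"pour"}})
```

In this sketch only the `arm_a` rollout survives filtering, because `arm_b` is configured as unable to perform `pour`. A real engine would replace `run_rollout` with physics simulation and `validate` with task-specific success checks.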
