Hugging Face Daily Papers · 4 min read

Geometric Action Model for Robot Policy Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.17046

Published on Jun 15, 2026 · Submitted by Sunghwan Hong on Jun 16, 2026 · #2 Paper of the day
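Authors: Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong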

Abstract

A geometric action model leverages pretrained geometric foundation models to enable language-conditioned manipulation policies with improved accuracy, robustness, and efficiency in 3D physical environments.

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
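
The data flow described in the abstract (shallow blocks as encoder, a causal predictor at the split, remaining blocks as decoder, then geometry and action outputs) can be illustrated with a toy sketch. This is not the authors' implementation: the blocks are placeholder affine maps rather than GFM transformer layers, the predictor is a simple running mean rather than a learned module, and every name here (`ToyGAM`, `make_block`, `causal_future_predictor`, the two output heads) is hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]                 # toy stand-in for one frame's latent tokens
Block = Callable[[Tokens], Tokens]   # toy stand-in for one backbone block


def make_block(scale: float, shift: float) -> Block:
    """Placeholder 'backbone block': an elementwise affine map (assumption)."""
    return lambda tokens: [scale * t + shift for t in tokens]


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply a stack of blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def causal_future_predictor(latents: List[Tokens], cond: Tokens) -> List[Tokens]:
    """For each step t, predict the next latent from latents[0..t] only.

    The running mean plus a conditioning bias is a placeholder for a learned
    predictor; the point is that step t never reads steps after t.
    """
    cond_bias = sum(cond) / len(cond)
    predictions = []
    for t in range(len(latents)):
        past = latents[: t + 1]  # causal window: no access to later steps
        mean_past = [sum(col) / len(past) for col in zip(*past)]
        predictions.append([m + cond_bias for m in mean_past])
    return predictions


class ToyGAM:
    """Splits one block stack into an encoder half and a decoder half."""

    def __init__(self, blocks: Sequence[Block], split: int) -> None:
        if not 0 < split < len(blocks):
            raise ValueError("split must leave at least one block on each side")
        self.encoder = list(blocks[:split])   # shallow blocks: observation encoder
        self.decoder = list(blocks[split:])   # remaining blocks: propagation/decoding

    def forward(
        self,
        frames: List[Tokens],
        language: Tokens,
        proprio: Tokens,
        action_history: Tokens,
    ) -> Tuple[Tokens, Tokens]:
        latents = [run_blocks(self.encoder, frame) for frame in frames]
        cond = language + proprio + action_history  # concatenated conditioning
        future = causal_future_predictor(latents, cond)[-1]
        features = run_blocks(self.decoder, future)
        geometry = features                            # placeholder geometry head
        action = [sum(features) / len(features)]       # placeholder action head
        return geometry, action


blocks = [make_block(2.0, 0.0), make_block(1.0, 1.0),
          make_block(0.5, 0.0), make_block(1.0, -1.0)]
model = ToyGAM(blocks, split=2)
geometry, action = model.forward(
    frames=[[1.0, 2.0], [3.0, 4.0]],
    language=[0.5], proprio=[0.5], action_history=[2.0],
)
print(geometry, action)
```

The sketch only shows why a single backbone can serve both roles: the same block stack is reused on either side of the split, and only the predictor in the middle is new. How the real model tokenizes language, proprioception, and actions, and how its heads are built, is not specified in the abstract.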
