<video src=\"https://cdn-uploads.huggingface.co/production/uploads/60f1abe7544c2adfd699860c/Cn9DWMibNRPtMkkUUvmhg.mp4\" controls=\"\" class=\"max-w-full!\"></video></p>","updatedAt":"2026-05-29T16:22:43.734Z","author":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","fullname":"AK","name":"akhaliq","type":"user","isPro":false,"isHf":true,"isHfAdmin":false,"isMod":false,"followerCount":9587,"isUserFollowing":false,"primaryOrg":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1583856921041-5dd96eb166059660ed1ee413.png","fullname":"Hugging Face","name":"huggingface","type":"org","isHf":true,"details":"The AI community building the future.","plan":"team"}}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.5007597804069519},"editors":["akhaliq"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg"],"reactions":[],"isReport":false}},{"id":"6a1a410aa233d2ba7da34228","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":359,"isUserFollowing":false},"createdAt":"2026-05-30T01:44:42.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"This is an automated message from the [Librarian Bot](https://huggingface.co/librarian-bots). I found the following papers similar to this paper. 
\n\nThe following papers were recommended by the Semantic Scholar API \n\n* [LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment](https://huggingface.co/papers/2604.10677) (2026)\n* [Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning](https://huggingface.co/papers/2605.29577) (2026)\n* [UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling](https://huggingface.co/papers/2604.19734) (2026)\n* [TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation](https://huggingface.co/papers/2605.05714) (2026)\n* [OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation](https://huggingface.co/papers/2605.25829) (2026)\n* [GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization](https://huggingface.co/papers/2605.12369) (2026)\n* [BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances](https://huggingface.co/papers/2604.23249) (2026)\n\n\n Please give a thumbs up to this comment if you found it helpful!\n\n If you want recommendations for any Paper on Hugging Face checkout [this](https://huggingface.co/spaces/librarian-bots/recommend_similar_papers) Space\n\n You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`","html":"<p>This is an automated message from the <a href=\"https://huggingface.co/librarian-bots\">Librarian Bot</a>. I found the following papers similar to this paper. 
</p>\n<p>The following papers were recommended by the Semantic Scholar API </p>\n<ul>\n<li><a href=\"https://huggingface.co/papers/2604.10677\">LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2605.29577\">Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2604.19734\">UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2605.05714\">TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2605.25829\">OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2605.12369\">GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization</a> (2026)</li>\n<li><a href=\"https://huggingface.co/papers/2604.23249\">BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances</a> (2026)</li>\n</ul>\n<p> Please give a thumbs up to this comment if you found it helpful!</p>\n<p> If you want recommendations for any Paper on Hugging Face checkout <a href=\"https://huggingface.co/spaces/librarian-bots/recommend_similar_papers\">this</a> Space</p>\n<p> You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: <code><span class=\"SVELTE_PARTIAL_HYDRATER contents\" data-target=\"UserMention\" data-props=\"{"user":"librarian-bot"}\"><span class=\"inline-block\"><span class=\"contents\"><a href=\"/librarian-bot\">@<span class=\"underline\">librarian-bot</span></a></span> </span></span> 
recommend</code></p>\n","updatedAt":"2026-05-30T01:44:42.702Z","author":{"_id":"63d3e0e8ff1384ce6c5dd17d","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg","fullname":"Librarian Bot (Bot)","name":"librarian-bot","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":359,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.6946402192115784},"editors":["librarian-bot"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/1674830754237-63d3e0e8ff1384ce6c5dd17d.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.30350","authors":[{"_id":"6a19bd20808ddbc3c7d42dc2","name":"Jusuk Lee","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc3","name":"Seungjae Lee","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc4","name":"Jonghun Shin","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc5","name":"Hoseong Jung","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc6","name":"Sungha Kim","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc7","name":"Daesol Cho","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc8","name":"H. Jin Kim","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dc9","name":"Jia-Bin Huang","hidden":false},{"_id":"6a19bd20808ddbc3c7d42dca","name":"Furong Huang","hidden":false}],"publishedAt":"2026-05-28T00:00:00.000Z","submittedOnDailyAt":"2026-05-29T00:00:00.000Z","title":"DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation","submittedOnDailyBy":{"_id":"60f1abe7544c2adfd699860c","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674929746905-60f1abe7544c2adfd699860c.jpeg","isPro":false,"fullname":"AK","user":"akhaliq","type":"user","name":"akhaliq"},"summary":"Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. 
Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. 
Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.","upvotes":5,"discussionId":"6a19bd20808ddbc3c7d42dcb","ai_summary":"DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques.","ai_keywords":["dynamics-aware multimodal pre-training","visual encoders","image-language-3D flow triplets","shared hyperspherical space","simplex volume","cosine regularizer","contrastive objective","control-relevant regions","visual backbones","variational language-action models"]},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"64cbc3e2a257a3212c00a115","avatarUrl":"/avatars/836e61be4aeda2080ddf2db9f2626cc6.svg","isPro":false,"fullname":"Furong Huang Lab at UMD","user":"furongh-lab","type":"user"},{"_id":"638f26bb3783be5e1d04a86b","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/638f26bb3783be5e1d04a86b/iLDzwTKPAQcZJv7s6ZLcp.jpeg","isPro":false,"fullname":"Sy-Tuyen Ho","user":"hosytuyen","type":"user"},{"_id":"67d553ca188cb393f7bb4cbc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/5P9fCuIBwGQGlHTGoCgZq.png","isPro":false,"fullname":"Jusuk Lee","user":"jlee-larr","type":"user"},{"_id":"68cb936bbe60dd93cf5b7a0b","avatarUrl":"/avatars/35c5006e3e87aceb89eb13ff18b15fc8.svg","isPro":false,"fullname":"Jonghun Shin","user":"jhshin00","type":"user"},{"_id":"648961d150c003881f1a10c3","avatarUrl":"/avatars/1eb3784c39f7ced2e952d11a410933ae.svg","isPro":false,"fullname":"Harshita Sharma","user":"hdsharma","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.30350.md"}">
DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
Published on May 28, 2026 · Submitted by AK on May 29, 2026
AI-generated summary: DynaFLIP is a dynamics-aware multimodal pre-training framework that enhances robot manipulation by integrating motion understanding into visual perception through image-language-3D flow triplets and geometric regularization techniques.
Abstract
Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space, where a smaller simplex volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including vision-language-action models (VLAs). We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
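The simplex-volume idea in the abstract can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes the simplex has its vertices at the origin and at the three unit-normalized embeddings, so its volume is sqrt(det G)/3! for the 3x3 Gram matrix G. The function names, the weight `lam`, and the form of the toy loss are hypothetical, and the contrastive term mentioned in the abstract is omitted. The sketch also shows one plausible reading of the "geometric ambiguity": three coplanar but clearly different embeddings also give zero volume, which a cosine term can tell apart.

```python
import numpy as np


def _gram(img, txt, flow):
    """Gram matrix of the three embeddings after projecting them onto the unit sphere."""
    v = np.stack([img, txt, flow]).astype(float)       # shape (3, d)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)  # unit-normalize each row
    return v @ v.T                                     # shape (3, 3), entries are cosines


def simplex_volume(img, txt, flow):
    """Volume of the simplex with vertices at the origin and the three unit embeddings.

    Assumption for this sketch: volume = sqrt(det(G)) / 3!.
    """
    det = max(np.linalg.det(_gram(img, txt, flow)), 0.0)  # clamp tiny negative round-off
    return np.sqrt(det) / 6.0


def mean_pairwise_cosine(img, txt, flow):
    """Average cosine similarity over the three modality pairs."""
    g = _gram(img, txt, flow)
    return (g[0, 1] + g[0, 2] + g[1, 2]) / 3.0


def toy_alignment_loss(img, txt, flow, lam=1.0):
    """Hypothetical combination: volume plus a cosine penalty weighted by lam."""
    return simplex_volume(img, txt, flow) + lam * (1.0 - mean_pairwise_cosine(img, txt, flow))


d = 8
e = np.eye(d)

# Case 1: mutually orthogonal embeddings give the largest volume, 1/6.
orthogonal = (e[0], e[1], e[2])

# Case 2: identical embeddings give zero volume and cosine 1 (fully aligned).
aligned = (e[0], e[0], e[0])

# Case 3: three directions 120 degrees apart in one plane. The volume is also
# zero, yet the embeddings disagree (mean cosine -0.5).
angles = np.deg2rad([0.0, 120.0, 240.0])
coplanar = tuple(np.cos(a) * e[0] + np.sin(a) * e[1] for a in angles)

for name, triplet in [("orthogonal", orthogonal), ("aligned", aligned), ("coplanar", coplanar)]:
    print(name, simplex_volume(*triplet), mean_pairwise_cosine(*triplet), toy_alignment_loss(*triplet))
```

In this toy setting the volume alone cannot separate the aligned and coplanar cases, while the added cosine penalty is zero only for the aligned one. How the paper actually weights and combines its terms is not stated in the abstract.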
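Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* [LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment](https://huggingface.co/papers/2604.10677) (2026)
* [Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning](https://huggingface.co/papers/2605.29577) (2026)
* [UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling](https://huggingface.co/papers/2604.19734) (2026)
* [TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation](https://huggingface.co/papers/2605.05714) (2026)
* [OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation](https://huggingface.co/papers/2605.25829) (2026)
* [GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization](https://huggingface.co/papers/2605.12369) (2026)
* [BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances](https://huggingface.co/papers/2604.23249) (2026)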
arXiv: arxiv.org/abs/2605.30350