Hugging Face Daily Papers · 4 min read

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01788

Published on Jun 1, 2026 · Submitted by HuggingFace Zhang on Jun 3, 2026
Authors: Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao

Abstract

A training-free framework for embodied navigation that uses a vision-only approach to create semantic maps and ground language goals through blind matching without paired vision-language data.

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.
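The abstract says the Platonic Topological Map fuses geometric and semantic node distances, but it does not give the fusion rule. The following is a minimal sketch of one way such a fusion could look, not the authors' implementation. It assumes a convex combination of Euclidean distance between node positions and cosine distance between node features, followed by a standard Dijkstra search. The names `fused_edge_cost`, `shortest_path`, and the weight `alpha` are hypothetical and do not come from the paper or its repository.

```python
import heapq
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fused_edge_cost(pos_a, pos_b, feat_a, feat_b, alpha=0.5):
    """Assumed fusion rule: alpha * geometric + (1 - alpha) * semantic."""
    geometric = math.dist(pos_a, pos_b)
    semantic = cosine_distance(feat_a, feat_b)
    return alpha * geometric + (1.0 - alpha) * semantic


def shortest_path(nodes, edges, start, goal, alpha=0.5):
    """Dijkstra over an undirected graph weighted by the fused cost.

    nodes: dict mapping node name -> (position, feature vector)
    edges: iterable of (name_a, name_b) pairs
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    adjacency = {name: [] for name in nodes}
    for a, b in edges:
        cost = fused_edge_cost(nodes[a][0], nodes[b][0],
                               nodes[a][1], nodes[b][1], alpha)
        adjacency[a].append((b, cost))
        adjacency[b].append((a, cost))

    best = {start: 0.0}
    previous = {}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, cost in adjacency[node]:
            candidate = dist + cost
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if goal not in best:
        return None, math.inf
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1], best[goal]
```

Under this assumed rule, two routes of equal metric length are separated by how semantically similar their intermediate nodes are, and `alpha = 1.0` reduces the map to a purely geometric graph. Blind matching of language goals to map nodes is not covered by this sketch.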

Community

Open-sourced.


Get this paper in your agent:

hf papers read 2606.01788
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

