Hugging Face Daily Papers · 3 min read

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project Page: https://Cornell-VAILab.github.io/SceneAligner
Code: https://github.com/Cornell-VAILab/SceneAligner
arxiv:2605.22581

Published on May 21, 2026 · Submitted by Junhyeong Cho on May 22, 2026
Authors: Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

Abstract

AI-generated summary: A deep learning approach to floorplan localization that combines 3D scene reconstruction with cross-modal correspondence learning, so that it works in real-world environments with limited data.

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
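The two geometric steps named in the abstract, projecting a gravity-aligned reconstruction into a 2D density map and aligning it to the floorplan with a 2D similarity transform, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the z axis points up, that matched point pairs between the density map and the floorplan are already given (in the paper these come from a fine-tuned 2D foundation model), and it swaps in a plain closed-form least-squares fit. The names `density_map`, `fit_similarity_2d`, and `apply_similarity_2d` are hypothetical.

```python
import math
from collections import Counter


def density_map(points, cell_size):
    """Project gravity-aligned 3D points (x, y, z with z assumed up) onto the
    ground plane and count how many points fall into each square cell."""
    counts = Counter()
    for x, y, _z in points:  # dropping z is the top-down projection
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity transform (scale, rotation, translation,
    no reflection) mapping src points onto dst points.

    Points are treated as complex numbers, so the transform is q = a * p + b,
    where abs(a) is the scale and the angle of a is the rotation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form minimizer of sum |a * p_i + b - q_i|^2 over complex a, b.
    num = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("source points are all identical")
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, point):
    """Map one (x, y) point through the fitted transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


if __name__ == "__main__":
    # Toy reconstruction: three 3D points binned into 0.5-unit cells.
    cells = density_map([(0.1, 0.2, 1.0), (0.4, 0.3, 2.0), (1.2, 0.1, 0.5)], 0.5)
    print(dict(cells))

    # Toy matches: density-map coordinates -> floorplan pixel coordinates.
    a, b = fit_similarity_2d([(0, 0), (1, 0), (0, 1)], [(10, 5), (10, 7), (8, 5)])
    print("scale:", abs(a), "rotation (deg):", math.degrees(math.atan2(a.imag, a.real)))
    print("camera at (1, 1) lands at:", apply_similarity_2d(a, b, (1, 1)))
```

Once the transform is fitted, any camera position from the reconstruction can be dropped onto the floorplan with `apply_similarity_2d`, which is the "you are here" output the abstract describes. With real, noisy matches a robust estimator would normally wrap this fit. The abstract does not say which estimator the authors use.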


Get this paper in your agent:

hf papers read 2605.22581
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

