Hugging Face Daily Papers · 4 min read

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01247

Published on May 31, 2026 · Submitted by zhumuzhi on Jun 2, 2026
Authors: Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

Abstract

AI-generated summary: The Target Viewpoint Reproduction task challenges foundation models to actively adjust their viewpoint in a 3D environment until it matches a target image. It reveals limitations in processing visual history and in mapping spatial discrepancies to embodied movement, and a unified post-training framework (SFT and GRPO variants) improves success rates.

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image, and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how to narrow this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
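The abstract names GRPO as one of the post-training methods but does not spell out its mechanics here. As a rough illustration only, one common formulation of GRPO scores each rollout relative to the other rollouts sampled for the same task. The sketch below assumes a binary success reward per rollout and a population standard deviation; both are assumptions for illustration, not details taken from the paper, and `group_relative_advantages` is a hypothetical helper name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task.

    Assumption: one scalar reward per rollout (here 1.0 = success, 0.0 = failure).
    Each advantage is the reward's deviation from the group mean, divided by
    the group standard deviation (plus eps to avoid division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One success among four rollouts: the successful rollout gets a positive
# advantage, the failures get a negative one.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# All rollouts fail: every advantage is zero, so the group gives no signal.
failed_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a group in which every rollout fails contributes nothing to the update. That is a general property of group-relative normalization and may be relevant when base success rates are as low as the zero-shot numbers quoted above, though the paper's own reward design may differ.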

Community

Paper submitter about 7 hours ago

Interesting benchmark for testing whether foundation models can actively navigate to a target viewpoint, rather than just passively understand images. The low zero-shot success rates make TVRBench a nice stress test for embodied spatial intelligence, and the strong gains from visual-action SFT suggest that mapping visual discrepancies to actions is still a key bottleneck.
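To make "mapping discrepancies to actions" concrete, here is a toy 2D sketch of a closed-loop episode. It is not the TVRBench simulator or its API: the action names, step sizes, tolerances, and the `Pose`, `oracle_action`, and `run_episode` names are all hypothetical. The oracle also reads the target pose directly, whereas a model in the real task only sees the target image and its own observations and must infer the same discrepancy visually.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading; 0 = +x axis, counter-clockwise positive


# Hypothetical discrete action sizes (not taken from TVRBench).
TURN_DEG = 15.0
STEP_M = 0.25


def wrap_deg(angle):
    """Wrap an angle in degrees to [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def step(pose, action):
    """Apply one discrete action and return the new pose."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg + TURN_DEG))
    if action == "turn_right":
        return Pose(pose.x, pose.y, wrap_deg(pose.yaw_deg - TURN_DEG))
    if action == "forward":
        r = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(r),
                    pose.y + STEP_M * math.sin(r), pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def reached(pose, target, pos_tol=0.3, yaw_tol=15.0):
    """Assumed success test: close enough in both position and heading."""
    dist = math.hypot(target.x - pose.x, target.y - pose.y)
    yaw_err = abs(wrap_deg(target.yaw_deg - pose.yaw_deg))
    return dist <= pos_tol and yaw_err <= yaw_tol


def oracle_action(pose, target, pos_tol=0.3):
    """Privileged expert: turn the pose discrepancy into the next action."""
    dx, dy = target.x - pose.x, target.y - pose.y
    far = math.hypot(dx, dy) > pos_tol
    # While far away, face the target position; once close, match its heading.
    desired_yaw = math.degrees(math.atan2(dy, dx)) if far else target.yaw_deg
    err = wrap_deg(desired_yaw - pose.yaw_deg)
    if abs(err) > TURN_DEG / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "forward" if far else "stop"


def run_episode(start, target, max_steps=100):
    """Closed loop: observe, act, repeat until 'stop' or the step budget ends."""
    pose, actions = start, []
    for _ in range(max_steps):
        action = oracle_action(pose, target)
        actions.append(action)
        if action == "stop":
            break
        pose = step(pose, action)
    return reached(pose, target), actions


success, actions = run_episode(Pose(0.0, 0.0, 0.0), Pose(1.0, 1.0, 90.0))
```

Note that the example target requires translation as well as rotation, the case in which the abstract reports the sharpest performance drop. The action sequences produced by an expert like this are the kind of supervision that expert-trajectory SFT would presumably imitate, with images in place of poses as input.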


Get this paper in your agent:

hf papers read 2606.01247
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 2

Spaces citing this paper 0

Collections including this paper 0
