Hugging Face Daily Papers · June 2, 2026 · 4 min read

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Interesting benchmark for testing whether foundation models can actively navigate to a target viewpoint, rather than just passively understand images. The low zero-shot success rates make TVRBench a nice stress test for embodied spatial intelligence, and the strong gains from visual-action SFT suggest that mapping visual discrepancies to actions is still a key bottleneck.

arxiv:2606.01247

Published on May 31, 2026 · Submitted by zhumuzhi on Jun 2, 2026 · Zhejiang University

Authors: Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

Abstract

AI-generated summary: The Target Viewpoint Reproduction task challenges foundation models to actively adjust their viewpoint in a 3D environment until it matches a target image. It reveals limitations in processing visual history and in mapping spatial discrepancies to embodied movement, and a unified post-training framework (SFT and GRPO variants) raises success rates.

Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR), an active task in which an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image, and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success, respectively. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study how to narrow this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at https://github.com/aim-uofa/TVRBench.
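To make the closed-loop structure of the task concrete, here is a minimal toy sketch in Python. It is not the TVRBench code or API: the 2D pose, the action names, the step sizes, the success tolerances, and the `oracle_policy` are all assumptions made up for illustration. In particular, the oracle reads poses directly, whereas a real agent would only see its current image and the target image. The sketch separates the two regimes the abstract contrasts, in-place rotation versus body translation.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading in degrees, 0 = +x axis, counter-clockwise positive


# Hypothetical discrete action set; the benchmark's real action space may differ.
STEP_M = 0.25
TURN_DEG = 15.0


def apply(pose: Pose, action: str) -> Pose:
    """Return the pose reached by executing one action."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.yaw_deg + TURN_DEG) % 360)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.yaw_deg - TURN_DEG) % 360)
    if action == "forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(rad),
                    pose.y + STEP_M * math.sin(rad),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")


def yaw_error(current: float, desired: float) -> float:
    """Signed angle from current to desired heading, in [-180, 180)."""
    return (desired - current + 180) % 360 - 180


def is_success(pose: Pose, target: Pose,
               pos_tol: float = 0.3, yaw_tol: float = 15.0) -> bool:
    """Assumed success test: close enough in both position and heading."""
    dist = math.hypot(target.x - pose.x, target.y - pose.y)
    return dist <= pos_tol and abs(yaw_error(pose.yaw_deg, target.yaw_deg)) <= yaw_tol


def oracle_policy(pose: Pose, target: Pose, pos_tol: float = 0.3) -> str:
    """Privileged stand-in for a model: it sees poses, not images."""
    dx, dy = target.x - pose.x, target.y - pose.y
    if math.hypot(dx, dy) > pos_tol:
        # Translation case: face the target position, then step toward it.
        err = yaw_error(pose.yaw_deg, math.degrees(math.atan2(dy, dx)))
        if abs(err) > TURN_DEG / 2:
            return "turn_left" if err > 0 else "turn_right"
        return "forward"
    # Rotation case: already close in position, so only fix the heading.
    err = yaw_error(pose.yaw_deg, target.yaw_deg)
    if abs(err) > TURN_DEG / 2:
        return "turn_left" if err > 0 else "turn_right"
    return "stop"


def run_episode(start: Pose, target: Pose, policy, max_steps: int = 50):
    """Closed loop: observe, act, repeat until 'stop' or the step budget runs out."""
    pose = start
    for step in range(max_steps):
        action = policy(pose, target)
        if action == "stop":
            return is_success(pose, target), step
        pose = apply(pose, action)
    return is_success(pose, target), max_steps


if __name__ == "__main__":
    # Rotation-only target versus a target that also requires translation.
    print(run_episode(Pose(0, 0, 0), Pose(0, 0, 90), oracle_policy))
    print(run_episode(Pose(0, 0, 0), Pose(1, 0, 0), oracle_policy))
```

Replacing `oracle_policy` with a function that maps the current and target observations (plus the action history) to an action gives the shape of the problem a foundation model faces; the success rate over many start/target pairs is then the kind of metric the abstract reports.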