Hugging Face Daily Papers · May 25, 2026 · 4 min read

PhotoFlow: Agentic 3D Virtual Photography Missions

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

PhotoFlow is an agentic framework for language-conditioned virtual photography in controllable 3D scenes. Given a Blender scene and a natural-language photography intent, PhotoFlow searches for an executable camera state, including camera pose, look-at target, lens, aperture, and aspect ratio, then renders the final photograph.
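To make the idea of an "executable camera state" concrete, here is a minimal sketch of how the listed parameters (pose, look-at target, lens, aperture, aspect ratio) might be bundled together. The class name, field names, and helper methods are assumptions for illustration; they are not taken from the PhotoFlow code, and nothing here calls Blender.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class CameraState:
    """Hypothetical container; field names are illustrative, not PhotoFlow's."""
    position: Vec3           # camera location in world space
    look_at: Vec3            # point the camera is aimed at
    focal_length_mm: float   # lens
    f_stop: float            # aperture
    aspect_ratio: float      # width / height

    def view_direction(self) -> Vec3:
        """Unit vector from the camera position toward the look-at target."""
        dx, dy, dz = (t - p for t, p in zip(self.look_at, self.position))
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        if norm == 0.0:
            raise ValueError("look_at must differ from position")
        return (dx / norm, dy / norm, dz / norm)

    def resolution(self, long_edge_px: int = 1920) -> Tuple[int, int]:
        """Render size in pixels, with the longer edge fixed to long_edge_px."""
        if self.aspect_ratio >= 1.0:
            return long_edge_px, round(long_edge_px / self.aspect_ratio)
        return round(long_edge_px * self.aspect_ratio), long_edge_px
```

A state such as `CameraState((0, 0, 0), (0, 3, 4), 50.0, 2.8, 1.5)` could then be handed to whatever code configures the renderer; how PhotoFlow actually applies these values inside Blender is not described in this summary.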

arxiv:2605.23771

Published on May 22

Submitted by Zhihang Zhong on May 25

Visionary-Laboratoary

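Authors: Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong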

AI-generated summary

A Director-Reviewer-Reflector agent named PhotoFlow enables language-conditioned virtual photography by combining 3D spatial understanding with aesthetic judgment in arbitrary Blender scenes.

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
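The Director-Reviewer-Reflector loop described in the abstract can be pictured with a toy sketch. Everything below is an assumption made for illustration: cameras are reduced to 2D points, `score` and `passes_rules` stand in for rendering plus visual critique and rule checks, and the dead-zone radius, candidate count, and relocation rule are invented. It shows the general shape of a closed-loop search under a six-round budget, not the authors' implementation.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # toy stand-in for a full camera state


@dataclass
class Memory:
    """Failed regions remembered across rounds (toy 'dead zones')."""
    dead_zones: List[Point] = field(default_factory=list)
    radius: float = 1.0

    def is_dead(self, p: Point) -> bool:
        return any(math.dist(p, z) < self.radius for z in self.dead_zones)


def director(incumbent: Optional[Point], memory: Memory, rng: random.Random,
             explore: bool, n: int = 4) -> List[Point]:
    """Propose candidates: wide sampling when exploring, local jitter otherwise."""
    wide = explore or incumbent is None
    cx, cy = (0.0, 0.0) if wide else incumbent
    spread = 5.0 if wide else 1.0
    out: List[Point] = []
    for _ in range(100):  # bounded attempts so dead zones cannot stall the loop
        if len(out) == n:
            break
        p = (cx + rng.uniform(-spread, spread), cy + rng.uniform(-spread, spread))
        if not memory.is_dead(p):
            out.append(p)
    return out


def reviewer(candidates: List[Point], incumbent: Optional[Point],
             score: Callable[[Point], float],
             passes_rules: Callable[[Point], bool]):
    """Rule check first, then pairwise comparison against the incumbent."""
    best, failures = incumbent, []
    for c in candidates:
        if not passes_rules(c):
            failures.append(c)
        elif best is None or score(c) > score(best):
            best = c
    return best, failures


def reflector(memory: Memory, failures: List[Point], improved: bool) -> bool:
    """Record failures as dead zones; ask for relocation if the round stalled."""
    memory.dead_zones.extend(failures)
    return not improved


def search(score, passes_rules, rounds: int = 6, seed: int = 0):
    rng, memory = random.Random(seed), Memory()
    incumbent, explore = None, True
    for _ in range(rounds):
        candidates = director(incumbent, memory, rng, explore)
        best, failures = reviewer(candidates, incumbent, score, passes_rules)
        explore = reflector(memory, failures, improved=best != incumbent)
        incumbent = best
    return incumbent, memory
```

For example, `search(lambda p: -math.dist(p, (2.0, 1.0)), lambda p: p[0] >= 0.0)` looks for a point near (2, 1) while treating the half-plane x < 0 as rule-violating. In the real system the scoring side would involve rendered images and a vision-language critique rather than a closed-form function.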