PhotoFlow: Agentic 3D Virtual Photography Missions
Authors: Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong
Published: 2026-05-22 (arXiv 2605.23771)
Project page: https://visionary-laboratory.github.io/PhotoFlow/
Code: https://github.com/Visionary-Laboratory/PhotoFlow
Abstract
AI-generated summary: A Director-Reviewer-Reflector agent named PhotoFlow enables language-conditioned virtual photography by combining 3D spatial understanding with aesthetic judgment in arbitrary Blender scenes.
Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
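The closed loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every name here (`Candidate`, `Memory`, `director`, `reviewer`, `reflector`, `search`) is hypothetical, cameras are reduced to 3D positions, and a plain numeric `score_fn` stands in for the rendered-image critique that the paper assigns to a vision-language model. It only illustrates how proposal, pairwise incumbent selection, dead-zone suppression, and relocation could fit together under a six-round budget.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Candidate:
    x: float
    y: float
    z: float

@dataclass
class Memory:
    dead_zones: list = field(default_factory=list)  # (center, radius) pairs to avoid
    stale_rounds: int = 0                           # rounds without a new incumbent

def in_dead_zone(c, memory):
    return any(math.dist((c.x, c.y, c.z), ctr) < r for ctr, r in memory.dead_zones)

def director(rng, center, spread, memory, n=4):
    """Propose up to n candidate cameras around `center`, skipping dead zones."""
    out = []
    for _ in range(n * 10):  # bounded number of sampling attempts
        c = Candidate(*(rng.gauss(m, spread) for m in center))
        if not in_dead_zone(c, memory):
            out.append(c)
        if len(out) == n:
            break
    return out

def reviewer(candidates, incumbent, score_fn, rule_fn):
    """Drop rule violations, then compare the best survivor with the incumbent."""
    valid = [c for c in candidates if rule_fn(c)]
    failed = [c for c in candidates if not rule_fn(c)]
    best = max(valid, key=score_fn, default=None)
    if best is not None and (incumbent is None or score_fn(best) > score_fn(incumbent)):
        return best, failed, True
    return incumbent, failed, False

def reflector(memory, failed, improved, dead_radius=0.5):
    """Turn failures into dead zones; return True when the search should relocate."""
    memory.dead_zones += [((c.x, c.y, c.z), dead_radius) for c in failed]
    memory.stale_rounds = 0 if improved else memory.stale_rounds + 1
    return memory.stale_rounds >= 2

def search(score_fn, rule_fn, rounds=6, seed=0):
    rng = random.Random(seed)
    memory, incumbent = Memory(), None
    center, spread = (0.0, 0.0, 1.5), 1.0  # assumed starting region
    for _ in range(rounds):
        cands = director(rng, center, spread, memory)
        incumbent, failed, improved = reviewer(cands, incumbent, score_fn, rule_fn)
        if reflector(memory, failed, improved):
            center = tuple(rng.uniform(-5, 5) for _ in range(3))  # relocate
            memory.stale_rounds = 0
        elif incumbent is not None:
            center = (incumbent.x, incumbent.y, incumbent.z)      # refine locally
    return incumbent, memory

# Toy usage: prefer cameras near (2, 2, 1.6) and reject any below the floor.
score = lambda c: -math.dist((c.x, c.y, c.z), (2.0, 2.0, 1.6))
rule = lambda c: c.z > 0
best, memory = search(score, rule)
```

In this sketch the incumbent can only be replaced by a candidate that passes the rule check and scores strictly higher, so the returned camera never gets worse from round to round, while failed candidates shrink the region the Director may sample from.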
Community
PhotoFlow is an agentic framework for language-conditioned virtual photography in controllable 3D scenes. Given a Blender scene and a natural-language photography intent, PhotoFlow searches for an executable camera state, including camera pose, look-at target, lens, aperture, and aspect ratio, then renders the final photograph.
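As an illustration of what such a camera state might contain, here is a minimal sketch using only the five components listed above. The class name `CameraState`, its field names, and the helper methods are assumptions made for this example; they are not taken from the PhotoFlow code base or from Blender's API.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraState:
    position: tuple         # camera location in world units (x, y, z)
    look_at: tuple          # world-space point the camera aims at
    focal_length_mm: float  # lens
    f_stop: float           # aperture
    aspect_ratio: tuple     # (width, height), e.g. (16, 9)

    def view_direction(self):
        """Unit vector from the camera position toward the look-at target."""
        d = [t - p for p, t in zip(self.position, self.look_at)]
        norm = math.sqrt(sum(v * v for v in d))
        if norm == 0:
            raise ValueError("look_at must differ from position")
        return tuple(v / norm for v in d)

    def resolution(self, long_side=1920):
        """Pixel size with the longer image side fixed to `long_side`."""
        w, h = self.aspect_ratio
        if w >= h:
            return long_side, round(long_side * h / w)
        return round(long_side * w / h), long_side

    def is_executable(self):
        """Basic sanity checks before handing the state to a renderer."""
        return (self.focal_length_mm > 0 and self.f_stop > 0
                and min(self.aspect_ratio) > 0
                and self.position != self.look_at)

# Example: a 50 mm lens at f/2.8, three units in front of the origin, 16:9 frame.
cam = CameraState((0.0, -3.0, 0.0), (0.0, 0.0, 0.0), 50.0, 2.8, (16, 9))
```

Storing a look-at target rather than raw rotation angles keeps the state easy for a language model to propose and easy to validate, since the viewing direction follows directly from two points.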