ShutterMuse: Capture-Time Photography Guidance with MLLMs
Authors: Jiayu Li, Yixiao Fang, Tianyu Hu, Wei Cheng, Ping Huang, Zheheng Fan, Gang Yu, Xingjun Ma
Published: 2026-06-24
Project page: https://lijayutnt.github.io/ShutterMuse/
Code: https://github.com/lijayuTnT/ShutterMuse
Abstract
Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture.
Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language models (MLLMs) underexplored. To address this gap, we introduce CaptureGuide-Bench, a benchmark with two complementary tasks: photographer-side composition decision and refinement, and subject-side scene-conditioned pose recommendation. Our evaluation reveals limitations: general-purpose MLLMs can make composition decisions but lack precise refinement localization, while specialized aesthetic cropping models localize crops effectively but are limited to refinement; neither provides actionable pose guidance. To support model development, we further construct CaptureGuide-Dataset, comprising 130K samples with textual rationales and structured visual annotations, and develop ShutterMuse, a unified MLLM trained with supervised and reinforcement fine-tuning. Experiments on CaptureGuide-Bench show that ShutterMuse achieves the best overall photographer-side performance among evaluated baselines and competitive subject-side pose recommendation with substantially lower inference cost, demonstrating the potential of MLLMs as interactive assistants for photography during image capture.
Community
ShutterMuse is a unified multimodal large language model for capture-time photography guidance. It supports:
- Photographer-side guidance: keep, refine, or reject the current framing, with a composition box when refinement is needed.
- Subject-side guidance: recommend scene-conditioned portrait poses with COCO-17 keypoints and visibility states.
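To make the two output types concrete, here is a minimal Python sketch of how a client might represent them. It is not ShutterMuse's actual output schema: the class names, the `(x_min, y_min, x_max, y_max)` box encoding, and the 0/1/2 visibility codes (borrowed from the COCO keypoint annotation convention) are all assumptions made for illustration. Only the three decisions, the box on refinement, and the 17 COCO keypoints come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# Standard COCO 17-keypoint order, as used in COCO keypoint annotations.
COCO17_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


class Decision(Enum):
    """The three photographer-side outcomes named in the description."""
    KEEP = "keep"
    REFINE = "refine"
    REJECT = "reject"


@dataclass
class FramingGuidance:
    """Hypothetical container for photographer-side guidance."""
    decision: Decision
    # Assumed encoding: (x_min, y_min, x_max, y_max); the real format may differ.
    box: Optional[Tuple[float, float, float, float]] = None

    def __post_init__(self) -> None:
        if self.decision is Decision.REFINE:
            if self.box is None:
                raise ValueError("a refine decision needs a composition box")
            x0, y0, x1, y1 = self.box
            if not (x0 < x1 and y0 < y1):
                raise ValueError("box must have positive width and height")
        elif self.box is not None:
            raise ValueError("only refine decisions carry a box")


@dataclass
class Keypoint:
    x: float
    y: float
    # Assumed COCO convention: 0 = not labeled, 1 = labeled but not visible,
    # 2 = labeled and visible. ShutterMuse's own states may be defined differently.
    visibility: int


@dataclass
class PoseGuidance:
    """Hypothetical container for subject-side guidance: one pose, 17 keypoints."""
    keypoints: List[Keypoint]

    def __post_init__(self) -> None:
        if len(self.keypoints) != len(COCO17_KEYPOINTS):
            raise ValueError("expected 17 keypoints")
        if any(k.visibility not in (0, 1, 2) for k in self.keypoints):
            raise ValueError("visibility must be 0, 1, or 2")

    def visible_names(self) -> List[str]:
        """Names of the keypoints marked as visible (code 2)."""
        return [
            name
            for name, kp in zip(COCO17_KEYPOINTS, self.keypoints)
            if kp.visibility == 2
        ]
```

Validating in `__post_init__` means a malformed response (for example, a refine decision without a box) fails as soon as it is parsed rather than later in the UI code that draws the overlay.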