Hugging Face Daily Papers · June 25, 2026 · 3 min read

ShutterMuse: Capture-Time Photography Guidance with MLLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Jiayu Li, Yixiao Fang, Tianyu Hu, Wei Cheng, Ping Huang, Zheheng Fan, Gang Yu, Xingjun Ma

Project page: https://lijayutnt.github.io/ShutterMuse/
GitHub: https://github.com/lijayuTnT/ShutterMuse

arXiv:2606.25763

Published on Jun 24 · Submitted by Yixiao Fang on Jun 25 · #3 Paper of the day

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture.

Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-hoc crop prediction and overlook subject-side recommendations, leaving the capture-time guidance capabilities of multimodal large language models (MLLMs) underexplored. To address this gap, we introduce CaptureGuide-Bench, a benchmark with two complementary tasks: photographer-side composition decision and refinement, and subject-side scene-conditioned pose recommendation. Our evaluation reveals limitations: general-purpose MLLMs can make composition decisions but lack precise refinement localization, while specialized aesthetic cropping models localize crops effectively but are limited to refinement; neither provides actionable pose guidance. To support model development, we further construct CaptureGuide-Dataset, comprising 130K samples with textual rationales and structured visual annotations, and develop ShutterMuse, a unified MLLM trained with supervised and reinforcement fine-tuning. Experiments on CaptureGuide-Bench show that ShutterMuse achieves the best overall photographer-side performance among evaluated baselines and competitive subject-side pose recommendation with substantially lower inference cost, demonstrating the potential of MLLMs as interactive assistants for photography during image capture.
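
The abstract contrasts models by how precisely they localize a refinement, but it does not say how that localization is scored. As a rough illustration only, the sketch below uses intersection-over-union (IoU), a common measure for box localization; the choice of IoU, the `Box` class, and the `(x1, y1, x2, y2)` pixel layout are all assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; the (x1, y1, x2, y2) pixel layout is an assumption."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(pred: Box, ref: Box) -> float:
    """Intersection-over-union of a predicted and a reference box."""
    # The intersection is itself a box; it has zero area when there is no overlap.
    inter = Box(
        max(pred.x1, ref.x1),
        max(pred.y1, ref.y1),
        min(pred.x2, ref.x2),
        min(pred.y2, ref.y2),
    ).area()
    union = pred.area() + ref.area() - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = Box(0, 0, 10, 10)   # hypothetical model output
    reference = Box(5, 0, 15, 10)   # hypothetical annotation
    print(f"IoU = {iou(predicted, reference):.3f}")
```

A score near 1 would mean the suggested composition box nearly matches the reference, while a score near 0 would mean the model chose a largely different region.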

Community

fangyixiao (paper submitter), about 2 hours ago:

ShutterMuse is a unified multimodal large language model for capture-time photography guidance. It supports:

Photographer-side guidance: keep, refine, or reject the current framing, with a composition box when refinement is needed.
Subject-side guidance: recommend scene-conditioned portrait poses with COCO-17 keypoints and visibility states.
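
The two kinds of guidance described above could be represented with a small data model like the one below. This is a minimal sketch under stated assumptions, not ShutterMuse's actual output format: the class names, field names, and box layout are hypothetical, and the 0/1/2 visibility flags follow the standard COCO annotation convention, which may differ from the visibility states used in the paper. The 17 joint names are the standard COCO keypoint order.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# Standard COCO-17 keypoint names, in their usual order.
COCO17_KEYPOINTS = (
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
)


class Decision(Enum):
    """Photographer-side verdict on the current framing."""
    KEEP = "keep"
    REFINE = "refine"
    REJECT = "reject"


@dataclass
class FramingGuidance:
    decision: Decision
    # Hypothetical (x1, y1, x2, y2) composition box; expected for REFINE.
    box: Optional[Tuple[float, float, float, float]] = None

    def __post_init__(self) -> None:
        if self.decision is Decision.REFINE and self.box is None:
            raise ValueError("a refine decision needs a composition box")


@dataclass
class Keypoint:
    x: float
    y: float
    # Assumed COCO convention: 0 = not labeled, 1 = labeled but occluded, 2 = visible.
    visibility: int


@dataclass
class PoseGuidance:
    keypoints: List[Keypoint]

    def __post_init__(self) -> None:
        if len(self.keypoints) != len(COCO17_KEYPOINTS):
            raise ValueError("expected exactly 17 keypoints")

    def visible_joints(self) -> List[str]:
        """Names of joints flagged as visible (visibility == 2)."""
        return [
            name
            for name, kp in zip(COCO17_KEYPOINTS, self.keypoints)
            if kp.visibility == 2
        ]


if __name__ == "__main__":
    framing = FramingGuidance(Decision.REFINE, box=(120.0, 80.0, 900.0, 640.0))
    # Hypothetical half-body portrait: head and arms visible, legs not labeled.
    pose = PoseGuidance(
        [Keypoint(0.0, 0.0, 2)] * 11 + [Keypoint(0.0, 0.0, 0)] * 6
    )
    print(framing.decision.value, framing.box)
    print(pose.visible_joints())
```

Tying the box to the `REFINE` decision mirrors the description that a composition box accompanies refinement; whether keep and reject outputs ever carry a box is left open here.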

Get this paper in your agent:

hf papers read 2606.25763

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1

Spaces citing this paper 1
