Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference
Authors: Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik
Organization: University of Amsterdam
Published: 2026-05-24 (arXiv 2605.25191)
Code: https://github.com/thijmennijdam/stable-diffusion-v2
Abstract
AI-generated summary: Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies, without requiring retraining.
Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g., sketches or styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with the text prompt. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and a text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes, including style, composition, and color palette, from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
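To make the aligner's contrastive objective concrete, here is a minimal sketch of a standard InfoNCE loss over paired image and text embeddings. It is not the authors' implementation: the symmetric (image-to-text plus text-to-image) form, the function name `info_nce`, the pooled one-vector-per-sample input shape, and the temperature of 0.07 are all assumptions made for illustration, and the cross-attention reconstruction loss mentioned in the abstract is not shown.

```python
import numpy as np


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE for a batch of paired (image, text) embeddings.

    Hypothetical helper: row i of image_emb is assumed to be the positive
    match for row i of text_emb; all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (B, B); diagonal = matched pairs

    def cross_entropy_diag(scores):
        # Numerically stable row-wise log-softmax, then take the diagonal,
        # i.e. the log-probability assigned to the correct partner.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


# Toy usage with random stand-ins for aligner outputs and text embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
aligned = text + 0.01 * rng.normal(size=(8, 16))  # images mapped close to their text
loss = info_nce(aligned, text)
```

The loss is small when each mapped image embedding sits closest to its own text embedding and grows when the pairing is broken, which is the property an aligner trained this way would exploit to place image tokens near the text embedding manifold.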
Community
The image aligner, trained with InfoNCE and cross-attention reconstruction to map image tokens into the text-embedding space, is the most interesting bit here. That design is what lets the inference-time fusion work without touching the diffusion model, avoiding retraining entirely. What happens if the reference image has occlusions or unusual perspective shifts? Would that hurt prompt fidelity more than it harms visual alignment? Btw, the arxivlens breakdown helped me parse the method details, and it might be worth adding a small ablation on fusion variants in future work: https://arxivlens.com/PaperView/Details/injecting-image-guidance-into-text-conditioned-diffusion-models-at-inference-975-b7274edd