Hugging Face Daily Papers · 4 min read

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.15134

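Published on Jun 13, 2026 · Submitted by Shubhang Bhatnagar on Jun 17, 2026
Authors: Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja (University of Illinois at Urbana-Champaign)
Project page: https://shubhangb97.github.io/saga/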

Abstract

SAGA uses a frozen multimodal large language model to supply attribute-aware supervision to a vision encoder through Group Relative Policy Optimization, improving zero-shot image retrieval.

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.
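
For readers unfamiliar with the metric, here is a minimal sketch of how Recall@1 is commonly computed for retrieval embeddings. It is not taken from the paper: the function names `cosine` and `recall_at_1` are hypothetical, and it assumes cosine similarity and a leave-one-out protocol in which every image queries all the others. The paper's exact evaluation protocol may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbour shares their class.

    Assumption: each item is a query against all other items
    (leave-one-out); the query itself is excluded from the gallery.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        nearest = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        hits += int(labels[nearest] == labels[i])
    return hits / len(embeddings)


# Toy gallery: two tight clusters in 2-D, one per class.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
score = recall_at_1(toy_embeddings, toy_labels)
```

Because the MLLM is discarded at inference, an evaluation like this only needs the encoder's embeddings, which is why deployment cost matches a metric-learning baseline.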

Community

Paper submitter about 5 hours ago

SAGA uses a frozen multimodal LLM as the reward model for training a retrieval vision encoder. Think RLVR, but aimed at the encoder's representation rather than LLM reasoning.

We show the MLLM an image pair, ask same class or different, and reward correct verdicts with GRPO. Advantages cancel on the attributes the two images share and concentrate on the ones that differ, so one binary reward becomes dense attribute-level gradients on the encoder, with no attribute labels.

The MLLM is dropped at inference, so zero deployment overhead. +3 to 6 R@1 over SOTA on CUB, Cars, Aircraft, iNat-Aves.
Feedback welcome!
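
The reward-and-advantage step described in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `binary_rewards` and `group_relative_advantages` are hypothetical names, the verdicts are assumed to be the strings "same" or "different", and the normalisation uses the population standard deviation with a small `eps`, which is one common GRPO convention.

```python
from statistics import mean, pstdev


def binary_rewards(verdicts, same_class):
    """Reward 1.0 when a sampled verdict matches the pair's class label.

    Assumption: each verdict is the string "same" or "different".
    """
    target = "same" if same_class else "different"
    return [1.0 if verdict == target else 0.0 for verdict in verdicts]


def group_relative_advantages(rewards, eps=1e-6):
    """Centre and scale rewards within one group of sampled responses.

    GRPO-style: the group mean acts as the baseline, so no learned
    value function is needed.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - baseline) / (spread + eps) for reward in rewards]


# One image pair from the same class, four sampled MLLM verdicts.
sampled_verdicts = ["same", "different", "same", "same"]
rewards = binary_rewards(sampled_verdicts, same_class=True)
advantages = group_relative_advantages(rewards)
```

Correct verdicts receive positive advantages and the incorrect one a negative advantage. When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient.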


