Hugging Face Daily Papers · 5 min read

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20278

Published on May 24, 2026 · Submitted by Tianle LI on May 26, 2026

Authors: Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

Code: https://github.com/ltl3A87/ClaimDiff-RL

AI-generated summary

ClaimDiff-RL addresses the reward granularity issue in long-form image captioning by using reference-conditioned atomic claim differences as reward units, enabling separate measurement and tuning of hallucination and omission errors.

Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, while ClaimDiff-RL exposes this faithfulness–coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination–missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
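The judge output and reward composition described in the abstract can be pictured with a short sketch. This is only an illustration under stated assumptions, not the paper's implementation: the `ClaimDiff` record, the three `kind` labels, the 1-to-3 severity scale, and the weights in `compose_reward` are hypothetical names and values chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


# Hypothetical record for one verified difference between the actor caption
# and the reference caption. Field names are assumptions for illustration.
@dataclass(frozen=True)
class ClaimDiff:
    claim: str
    kind: Literal["hallucination", "missing_fact", "correct_extra"]
    error_type: str  # open-vocabulary label, e.g. "count", "color", "ocr"
    severity: int    # assumed scale: 1 (minor) to 3 (severe)


def diff_stats(diffs: list[ClaimDiff]) -> dict[str, int]:
    """Aggregate severity per kind of difference."""
    stats = {"hallucination": 0, "missing_fact": 0, "correct_extra": 0}
    for d in diffs:
        stats[d.kind] += d.severity
    return stats


def compose_reward(
    diffs: list[ClaimDiff],
    w_halluc: float = 1.0,   # assumed penalty weight for hallucinated claims
    w_missing: float = 1.0,  # assumed penalty weight for omitted salient facts
    w_extra: float = 0.5,    # assumed bonus weight for correct extra details
) -> float:
    """Combine the per-kind statistics into one scalar with tunable weights."""
    s = diff_stats(diffs)
    return (
        w_extra * s["correct_extra"]
        - w_halluc * s["hallucination"]
        - w_missing * s["missing_fact"]
    )


# Made-up judge output for a single image.
diffs = [
    ClaimDiff("three dogs on the lawn (image shows two)", "hallucination", "count", 2),
    ClaimDiff("red stop sign on the left is not mentioned", "missing_fact", "object", 1),
    ClaimDiff("license plate text is read correctly", "correct_extra", "ocr", 1),
]
reward = compose_reward(diffs)
```

Because each term is kept separate until the final sum, the hallucination and missing-fact penalties can be logged and weighted independently, which is the property the abstract calls "separately measurable and tunable".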

Community

Can we make caption RL more diagnosable than a single scalar reward?

In ClaimDiff-RL, we argue that long-form image captions should not be judged only as whole sequences. A dense caption is made of many local visual claims—objects, counts, colors, spatial relations, OCR text, and fine-grained details. Instead of directly asking a judge for one holistic score, ClaimDiff-RL compares an actor caption with a reference caption, identifies atomic visual differences, verifies each difference against the image, and assigns side-specific typed errors.

This turns hallucinations, missing facts, and correct extra details into separately measurable reward signals. Interestingly, we find that holistic rewards can reduce hallucination by encouraging conservative under-captioning, while claim-difference rewards expose a more controllable faithfulness–coverage frontier.
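A toy example of the under-captioning effect, as a rough sketch only: the two candidate captions, their counts, and the weights below are invented for illustration and are not results from the paper. A reward that penalizes only hallucination prefers the conservative caption, and raising the assumed missing-fact weight moves the preferred operating point toward the more detailed one.

```python
# Hypothetical severity-weighted difference counts for two candidate captions
# of the same image. All numbers are made up for this illustration.
candidates = {
    "conservative": {"hallucination": 0, "missing_fact": 4, "correct_extra": 0},
    "detailed": {"hallucination": 1, "missing_fact": 1, "correct_extra": 2},
}


def reward(stats, w_halluc, w_missing, w_extra):
    """Weighted sum over the three kinds of claim difference."""
    return (
        w_extra * stats["correct_extra"]
        - w_halluc * stats["hallucination"]
        - w_missing * stats["missing_fact"]
    )


def preferred(w_halluc, w_missing, w_extra=0.0):
    """Return the name of the candidate with the highest reward."""
    return max(
        candidates,
        key=lambda name: reward(candidates[name], w_halluc, w_missing, w_extra),
    )


# Sweep the missing-fact weight while holding the hallucination weight fixed.
sweep = {w: preferred(w_halluc=1.0, w_missing=w) for w in (0.0, 0.25, 0.5, 1.0)}
for w_missing, winner in sweep.items():
    print(f"w_missing={w_missing}: prefers {winner}")
```

With these invented numbers the preference flips once the missing-fact weight exceeds one third of the hallucination weight, which is one simple way to picture a tunable faithfulness–coverage frontier.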

Curious to hear what the community thinks: should future multimodal RL rewards move from holistic scores toward verifiable claim-level supervision?

Get this paper in your agent:

hf papers read 2605.20278
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
