Hugging Face Daily Papers · June 19, 2026 · 4 min read

Context-Aware RL for Agentic and Multimodal LLMs

#multimodal #agents #reasoning #benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://xupy2003.github.io/ContextRL_Website/
GitHub: https://github.com/xupy2003/ContextAwareRL

arxiv:2606.17053

Published on Jun 15 · Submitted by py xu on Jun 19 · Princeton University

Authors: Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

Abstract

ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1K pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
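The context-selection objective described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `ContrastivePair` type, the prompt wording, the `Choice: A/B` output convention, and the binary reward are all hypothetical, and contexts are plain text here (the multimodal setting would pass images instead).

```python
import random
import re
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical container for one contrastive example."""
    query: str
    answer: str
    supporting: str   # context that supports the (query, answer) pair
    distractor: str   # highly similar context that does not


def build_selection_prompt(pair: ContrastivePair, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, gold_label); context order is shuffled to avoid position bias."""
    contexts = [(True, pair.supporting), (False, pair.distractor)]
    rng.shuffle(contexts)
    gold = "A" if contexts[0][0] else "B"
    prompt = (
        f"Query: {pair.query}\n"
        f"Proposed answer: {pair.answer}\n\n"
        f"Context A:\n{contexts[0][1]}\n\n"
        f"Context B:\n{contexts[1][1]}\n\n"
        "Which context supports the proposed answer? "
        "Finish with 'Choice: A' or 'Choice: B'."
    )
    return prompt, gold


def selection_reward(completion: str, gold: str) -> float:
    """Assumed binary reward: 1.0 if the last stated choice matches the gold label."""
    choices = re.findall(r"Choice:\s*([AB])\b", completion)
    return 1.0 if choices and choices[-1] == gold else 0.0
```

Shuffling the two contexts is a design choice of this sketch so that the model cannot earn reward by always picking the same slot; the reward depends only on whether the selected context is the supporting one.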

Community

xupy21 (paper author, paper submitter) · about 6 hours ago

Context-Aware RL for Agentic and Multimodal LLMs

👉 LLMs often fail not because the answer is impossible, but because they miss the one decisive clue hidden in a long trace or image.

🔥 We introduce ContextRL: RL that teaches models to identify which context actually supports an answer.

✅ +2.2% on 5 agentic benchmarks
✅ +1.8% across 12 VQA benchmarks
✅ Works for coding agents & multimodal reasoning
✅ Same contrastive data, but better objective — not data augmentation

🧠 The key idea: don’t only reward the final answer. Reward the model for grounding it in the right evidence.
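The abstract names standard GRPO as the baseline. In GRPO-style training, each sampled completion's reward is normalized against the other completions drawn for the same prompt, so a reward for picking the right evidence turns into a learning signal the same way a final-answer reward does. A minimal sketch of that normalization follows; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and how ContextRL mixes selection prompts with ordinary task prompts is not specified here.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumed value) keeps the division defined when all rewards are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four completions for one selection prompt, two of which
# chose the supporting context (reward 1.0) and two the distractor (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that selected the supporting context get a positive advantage and the others a negative one; a group in which every completion earns the same reward yields all-zero advantages and so contributes no gradient signal.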

Get this paper in your agent:

hf papers read 2606.17053

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
