Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
Abstract
AI-generated summary
Agents use variational inference to evaluate exploratory actions and selectively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.
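To make the two mechanisms described in the abstract concrete, here is a minimal sketch, assuming a GRPO-style per-action advantage: exploratory and task-completion actions are normalized against separate baselines, and exploratory actions receive a hypothetical information-gain bonus (KL divergence between the agent's posterior and prior beliefs about the task). All function names, signatures, and numbers are illustrative assumptions, not the authors' released code.

```python
import torch

def info_gain_bonus(prior_logits: torch.Tensor,
                    posterior_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical reward for an exploratory action: KL(posterior || prior)
    over the agent's categorical belief about the latent task state."""
    prior = torch.distributions.Categorical(logits=prior_logits)
    posterior = torch.distributions.Categorical(logits=posterior_logits)
    return torch.distributions.kl_divergence(posterior, prior)

def grouped_advantages(rewards: torch.Tensor,
                       explore_mask: torch.Tensor,
                       eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards separately within the exploratory group and the
    task-completion group, so each kind of action competes only against
    its own baseline during policy optimization."""
    adv = torch.zeros_like(rewards)
    for mask in (explore_mask, ~explore_mask):
        if mask.any():
            group = rewards[mask]
            adv[mask] = (group - group.mean()) / (group.std() + eps)
    return adv

# Toy example: two exploratory probes rewarded by information gain,
# two task-completion actions carrying the sparse task reward.
prior = torch.tensor([0.0, 0.0, 0.0])        # uniform belief before probing
posterior = torch.tensor([2.0, 0.0, -2.0])   # belief after an observation
bonus = info_gain_bonus(prior, posterior)

rewards = torch.tensor([bonus.item(), 0.1, 1.0, 0.0])
explore_mask = torch.tensor([True, True, False, False])
print(grouped_advantages(rewards, explore_mask))
```

Normalizing the two groups separately keeps a sparse task reward from drowning out the smaller exploration bonuses, which is one plausible reading of why the grouping mechanism helps the agent stop exploring once the task context is clear.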
Community
Don't just think! Agents need to explore the environment at test time!