Hugging Face Daily Papers · 4 min read

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://princeton-ai2-lab.github.io/EEVEE/
Code: https://github.com/Princeton-AI2-Lab/EEVEE
arxiv:2606.11182

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Published on Jun 9, 2026 · Submitted by Shilong Liu on Jun 10, 2026
Authors: Weixian Xu, Shilong Liu, Mengdi Wang

Abstract

EEVEE is a novel test-time prompt learning framework for LLM agents that handles heterogeneous data streams through task clustering and co-evolving router-prompt optimization.

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability, because real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, surpassing the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2%.
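The interleaved router and prompt phases described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the task stream, the candidate prompts, the `toy_score` grader, the keyword-vote router, and every function name below are hypothetical stand-ins chosen so the example runs on its own. It only shows the general idea of freezing the router while each cluster picks a prompt, then freezing the prompts while tasks are re-routed.

```python
from collections import Counter

# Hypothetical mixed stream: (task_text, domain_tag). The tag is used only by
# the toy grader below; the router never reads it.
STREAM = [
    ("add 2 and 3", "math"),
    ("multiply 4 by 5", "math"),
    ("translate cat to French", "lang"),
    ("translate dog to French", "lang"),
    ("subtract 1 from 9", "math"),
    ("translate bird to Spanish", "lang"),
]
CANDIDATE_PROMPTS = ["Think step by step.", "Answer as a translator.", "Be brief."]
NUM_CLUSTERS = 2


def toy_score(prompt, task):
    """Assumed stand-in for running the agent with `prompt` and grading it."""
    best = {"math": "Think step by step.", "lang": "Answer as a translator."}
    return 1.0 if prompt == best[task[1]] else 0.0


def mean_score(prompt, tasks):
    return sum(toy_score(prompt, t) for t in tasks) / len(tasks)


def prompt_phase(assignment, prompts):
    """Router frozen: each non-empty cluster picks its best candidate prompt."""
    new_prompts = list(prompts)
    for c in range(NUM_CLUSTERS):
        members = [t for t, a in zip(STREAM, assignment) if a == c]
        if members:
            new_prompts[c] = max(CANDIDATE_PROMPTS,
                                 key=lambda p: mean_score(p, members))
    return new_prompts


def router_phase(assignment, prompts):
    """Prompts frozen: move each task to the cluster whose prompt serves it
    best, keeping the current cluster on ties."""
    return [
        max(range(NUM_CLUSTERS),
            key=lambda c: (toy_score(prompts[c], task), c == current))
        for task, current in zip(STREAM, assignment)
    ]


def co_evolve(rounds=3):
    # Start from an arbitrary round-robin split and a generic prompt.
    assignment = [i % NUM_CLUSTERS for i in range(len(STREAM))]
    prompts = ["Be brief."] * NUM_CLUSTERS
    for _ in range(rounds):
        prompts = prompt_phase(assignment, prompts)
        assignment = router_phase(assignment, prompts)
    return assignment, prompts


def fit_keyword_router(assignment):
    """Toy router for unseen inputs: each known word votes for its clusters."""
    votes = {}
    for (text, _), c in zip(STREAM, assignment):
        for word in text.lower().split():
            votes.setdefault(word, Counter())[c] += 1

    def route(text):
        total = Counter()
        for word in text.lower().split():
            total.update(votes.get(word, {}))
        return total.most_common(1)[0][0] if total else 0

    return route


assignment, prompts = co_evolve()
route = fit_keyword_router(assignment)
single_prompt_best = max(mean_score(p, STREAM) for p in CANDIDATE_PROMPTS)
routed_mean = sum(toy_score(prompts[a], t)
                  for t, a in zip(STREAM, assignment)) / len(STREAM)
```

In this toy setup no single prompt suits both domains, so `single_prompt_best` stays below `routed_mean`, which is the kind of cross-dataset interference the router is meant to reduce. A real system would replace `toy_score` with actual agent rollouts and would search prompts with an LLM-based optimizer rather than a fixed candidate list.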

Community

Paper submitter about 14 hours ago

EEVEE studies test-time prompt learning for LLM agents in more realistic settings, where tasks arrive as heterogeneous streams from multiple datasets and domains.
Instead of optimizing a single prompt for a fixed benchmark, EEVEE introduces a router-prompt co-evolution framework that clusters incoming tasks and assigns them to suitable prompt configurations. This helps reduce cross-dataset interference while preserving test-time adaptation ability.
The paper reports strong gains across multiple benchmarks, making it a useful step toward self-improving agents that can adapt continuously in the real world.

Get this paper in your agent:

hf papers read 2606.11182
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
