Hugging Face Daily Papers · 4 min read

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://recap-robot.github.io/
Organization: NAVER AI Lab
arxiv:2606.15631

Published on Jun 14, 2026 · Submitted by Park (Jeongeun) on Jun 16, 2026
Authors: Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

Abstract

AI-generated summary: Retrieval-augmented vision-language-action policies replace per-task fine-tuning with retrieval over indexed demonstrations. The policy is trained once and frozen, and new tasks are added by indexing demonstrations from a cheaper embodiment.

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
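
The deployment loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names (PoolDemo, RetrievalPool, FrozenPolicy), the Euclidean nearest-frame lookup, and the proportional "move toward the next retrieved frame" rule are all assumptions chosen to keep the example small. It only shows the two properties the abstract states: retrieval happens at every control step, and a new task is added by appending to the pool while the policy's parameters stay fixed.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PoolDemo:
    """One pool-side demonstration (hypothetical: per-step feature vectors)."""
    task: str
    frames: List[List[float]]


@dataclass
class RetrievalPool:
    demos: List[PoolDemo] = field(default_factory=list)

    def append(self, demo: PoolDemo) -> None:
        # Adding a task is an indexing operation; no policy weights change.
        self.demos.append(demo)

    def retrieve(self, query: List[float]) -> Tuple[PoolDemo, int]:
        """Return the demo and timestep whose frame is closest to the query."""
        best = None
        best_dist = math.inf
        for demo in self.demos:
            for t, frame in enumerate(demo.frames):
                d = math.dist(query, frame)
                if d < best_dist:
                    best, best_dist = (demo, t), d
        if best is None:
            raise ValueError("retrieval pool is empty")
        return best


class FrozenPolicy:
    """Stand-in for the trained policy; nothing in this sketch updates it."""

    def __init__(self, gain: float = 0.5) -> None:
        self.gain = gain  # toy "parameter", fixed after training

    def act(self, obs: List[float], retrieved_next: List[float]) -> List[float]:
        # Toy rule (assumption): step a fraction of the way toward the
        # next frame of the retrieved trajectory.
        return [self.gain * (r - o) for o, r in zip(obs, retrieved_next)]


def rollout(policy: FrozenPolicy, pool: RetrievalPool,
            obs: List[float], steps: int) -> List[float]:
    for _ in range(steps):
        demo, t = pool.retrieve(obs)  # retrieval at every control step
        nxt = demo.frames[min(t + 1, len(demo.frames) - 1)]
        action = policy.act(obs, nxt)
        obs = [o + a for o, a in zip(obs, action)]
    return obs


policy = FrozenPolicy()
pool = RetrievalPool()
pool.append(PoolDemo("push_right", [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]))
end_a = rollout(policy, pool, [0.0, 0.0], steps=20)

# A new task arrives at deployment: append a demo, leave the policy untouched.
pool.append(PoolDemo("push_up", [[5.0, 5.0], [5.0, 6.0], [5.0, 7.0]]))
end_b = rollout(policy, pool, [5.0, 5.0], steps=20)
print(end_a, end_b)
```

In this toy, retrieval keys on the observation alone. The policy in the paper is a learned model that conditions on the retrieved trajectories, so the hand-written rule in act is only a placeholder for that conditioning.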

Community

Paper submitter about 10 hours ago

TL;DR: Adding a new task to a VLA policy usually means collecting teleop demos and fine-tuning per task. We replace that target-side cost with retrieval: train the policy once, freeze it, and add new tasks at deployment just by appending cheap human-hand demos to a retrieval pool.

Get this paper in your agent:

hf papers read 2606.15631
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
