Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities
Authors: Hassan Ismkhan, Hamid Bouchahcia
Published: 2026-06-13 (arXiv:2606.15514)
Abstract
RL4IL enables robust robotic manipulation under sensor dropout by using reinforcement learning to retrieve relevant demonstrations and cross-attention fusion to impute missing modalities without retraining.
Robotic systems perceive the world through multiple input modalities, including visual camera streams and natural language instructions, and must select appropriate actions based on these signals. However, assuming the permanent availability of all input devices is unrealistic, as sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement learning-guided method for imitation learning that selects the most suitable action for a given observation by identifying the most relevant expert demonstrations from a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks candidate demonstrations, and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors, without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor dropout conditions, while requiring no policy network training. The code can be found at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
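To make the "soft cross-attention fusion" idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the ranking below uses a scaled dot-product similarity as a stand-in for the learned RL ranking policy, there are no learned projection matrices, and the names `soft_fuse`, `hard_select`, `top_k`, and `temperature` are hypothetical. It only illustrates the difference between blending the actions of several retrieved demonstrations and copying the single best match.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_fuse(query, keys, actions, top_k=3, temperature=1.0):
    """Attention-weighted average of the actions of the top_k best matches.

    query   : embedding of the current observation
    keys    : one embedding per library demonstration
    actions : one action vector per library demonstration
    ASSUMPTION: similarity scores stand in for the learned ranking policy.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    # Indices of the top_k highest-scoring demonstrations, best first.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in ranked], temperature)
    dim = len(actions[0])
    fused = [
        sum(w * actions[i][d] for w, i in zip(weights, ranked))
        for d in range(dim)
    ]
    return fused, ranked, weights


def hard_select(query, keys, actions):
    """Baseline: copy the action of the single best-scoring demonstration."""
    scores = [dot(query, k) for k in keys]
    best = max(range(len(keys)), key=lambda i: scores[i])
    return actions[best]


# Toy library: four demonstrations with 2-D embeddings and 2-D actions.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
actions = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
fused, ranked, weights = soft_fuse([1.0, 0.0], keys, actions, top_k=2)
```

With `top_k=2` the fused action is a convex combination of the two closest demonstrations, so a single mis-ranked candidate shifts the output only in proportion to its attention weight, whereas `hard_select` would copy it outright. The same weighted-average pattern could in principle be applied to donor embeddings rather than actions to sketch the imputation step, though the abstract does not give enough detail to confirm how the authors' head is parameterised.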
Community
RL4IL addresses a genuinely underexplored problem in imitation learning: what happens when sensors fail at deployment time? Most IL methods silently assume all modalities are always available, which is unrealistic for real robot deployments. Our key insight is that instead of retraining the policy for every possible dropout pattern, we can retrieve the right behaviour from a frozen demonstration library using a learned RL ranking policy.
A few highlights that might interest the community:
The PPO policy operates over BFS-augmented candidate sets, giving it a richer and more label-diverse pool than plain kNN.
Soft cross-attention fusion over the top-K ranked candidates consistently outperforms hard argmax selection, especially under noisy retrieval.
Zero-shot missing-modality handling at inference: no retraining is needed when a camera fails.
On LIBERO benchmarks, RL4IL reaches a success rate of up to 0.733 under complete camera dropout, where the strongest prior method (DisDP) reaches only 0.295.
Happy to discuss the retrieval design, the imputation pipeline, or the LIBERO experimental setup with anyone interested in robust robot learning.
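For readers wondering what a "BFS-augmented candidate set" might look like, the following is one possible reading, sketched in plain Python. It is an assumption, not the paper's procedure: it seeds the set with the query's k nearest neighbours and then expands breadth-first through each member's own nearest neighbours in the library. The function names and the `max_depth` and `max_size` parameters are hypothetical.

```python
from collections import deque


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(point, library, k, exclude=None):
    """Indices of the k library items nearest to point (brute force)."""
    order = sorted(
        (i for i in range(len(library)) if i != exclude),
        key=lambda i: sq_dist(point, library[i]),
    )
    return order[:k]


def bfs_candidates(query, library, k=2, max_depth=1, max_size=8):
    """Seed with the query's kNN, then expand breadth-first over the kNN graph.

    ASSUMPTION: this is an illustrative interpretation of BFS augmentation;
    the source does not specify the graph or the stopping rule.
    """
    seeds = knn(query, library, k)
    visited = list(seeds)  # kept in BFS discovery order
    seen = set(seeds)
    frontier = deque((i, 0) for i in seeds)
    while frontier and len(visited) < max_size:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        # Neighbours of a library item, excluding the item itself.
        for nb in knn(library[node], library, k, exclude=node):
            if nb not in seen and len(visited) < max_size:
                seen.add(nb)
                visited.append(nb)
                frontier.append((nb, depth + 1))
    return visited


# Toy 1-D library; the query sits at 0.0.
library = [[0.0], [1.0], [2.0], [3.0], [10.0]]
plain = bfs_candidates([0.0], library, k=2, max_depth=0)
augmented = bfs_candidates([0.0], library, k=2, max_depth=1)
```

Setting `max_depth=0` reduces this to plain kNN, while each extra level pulls in items that are close to the seeds but not necessarily to the query itself. That is one way a candidate pool could become more diverse than the immediate neighbourhood before a ranking policy scores it.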