Hugging Face Daily Papers · 5 min read

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: https://github.com/elmma/mllm-reroute/
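
The difference between rank-and-remove and recoverable routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under stated assumptions: each visual token is a single float, a "stage" is any function applied to selected tokens, the per-stage scores are supplied by hand rather than derived from attention, and text tokens and KV-cache handling are left out entirely. The names `prune`, `reroute`, `top_k_indices`, and `stage_scores` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Token = float  # toy stand-in for a visual token's hidden state
ScoreFn = Callable[[int, Token], float]  # (token index, hidden state) -> score


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Positions of the k highest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune(tokens: Sequence[Token], stages, score_fns, budgets) -> Tuple[List[Token], List[int]]:
    """Rank-and-remove: a token dropped at one stage never comes back."""
    hidden = list(tokens)
    alive = list(range(len(tokens)))  # candidate pool shrinks permanently
    for stage, score_fn, k in zip(stages, score_fns, budgets):
        scores = [score_fn(i, hidden[i]) for i in alive]
        alive = [alive[j] for j in top_k_indices(scores, k)]
        for i in alive:
            hidden[i] = stage(hidden[i])
    return hidden, alive


def reroute(tokens: Sequence[Token], stages, score_fns, budgets) -> Tuple[List[Token], List[List[int]]]:
    """Recoverable routing: deferred tokens bypass the stage unchanged and
    are candidates again at the next routing decision."""
    hidden = list(tokens)
    history = []  # which tokens were selected at each stage
    for stage, score_fn, k in zip(stages, score_fns, budgets):
        scores = [score_fn(i, h) for i, h in enumerate(hidden)]  # full pool
        selected = top_k_indices(scores, k)
        for i in selected:
            hidden[i] = stage(hidden[i])  # only selected tokens are processed
        history.append(selected)
    return hidden, history


# Hand-written scores (an assumption): token importance shifts between stages.
stage_scores = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
score_fns = [lambda i, h, s=s: s[i] for s in stage_scores]
stages = [lambda h: h + 1.0, lambda h: h + 1.0]  # toy "decoder blocks"
budgets = [2, 2]  # tokens processed per stage, identical for both methods

print(prune([0.0] * 4, stages, score_fns, budgets))
print(reroute([0.0] * 4, stages, score_fns, budgets))
```

In this toy setup both methods process two tokens per stage, but pruning can only ever choose among tokens 0 and 1 at the second stage, while routing is free to pick tokens 2 and 3 once their scores rise.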
arxiv:2606.12412

Published on Jun 10 · Submitted by Yu-Lun Liu on Jun 11
Authors: Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

Abstract

Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages.

Get this paper in your agent:

hf papers read 2606.12412
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
