Hugging Face Daily Papers · June 11, 2026 · 4 min read

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

#model-release #multimodal #agents #inference

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


ReVision removes visually redundant patches across consecutive screenshots, reducing token usage and enabling models to handle longer histories, leading to improved performance in computer-use agents.

arxiv:2605.11212

Published on Jun 5 · Submitted by Amirhossein Abaskohi on Jun 11

Microsoft Research

Authors: Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

Abstract

ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, using history has brought little or no performance improvement, unlike in other domains. We address this inefficiency by introducing ReVision, which trains multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots and drops the redundant ones while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that, when redundancy is removed, performance continues to improve as more past observations are incorporated.
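To make the idea of temporal redundancy concrete, here is a minimal sketch. It is not the paper's method: ReVision uses a learned patch selector over patch representations, whereas this toy version swaps in a fixed threshold on raw pixel differences. The function names, the patch size, and the threshold are all hypothetical and chosen only for illustration. Each kept patch retains its frame index and grid position, which is one simple way to preserve spatial structure.

```python
from typing import List, Tuple

# Assumption: a screenshot is a grayscale image stored as rows of pixel values.
Frame = List[List[int]]


def patch_at(frame: Frame, row: int, col: int, size: int) -> List[List[int]]:
    """Cut the size x size patch whose top-left pixel is (row, col)."""
    return [frame[r][col:col + size] for r in range(row, row + size)]


def mean_abs_diff(a: List[List[int]], b: List[List[int]]) -> float:
    """Average absolute pixel difference between two equally sized patches."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def select_patches(
    frames: List[Frame], size: int = 2, threshold: float = 0.0
) -> List[Tuple[int, int, int]]:
    """Return (frame_index, patch_row, patch_col) for every kept patch.

    The first frame is kept whole. In later frames a patch is kept only if
    it differs from the same patch in the previous frame by more than
    `threshold` (a hand-set stand-in for a learned selector).
    """
    kept: List[Tuple[int, int, int]] = []
    for i, frame in enumerate(frames):
        rows, cols = len(frame) // size, len(frame[0]) // size
        for r in range(rows):
            for c in range(cols):
                if i == 0:
                    changed = True
                else:
                    prev = patch_at(frames[i - 1], r * size, c * size, size)
                    curr = patch_at(frame, r * size, c * size, size)
                    changed = mean_abs_diff(prev, curr) > threshold
                if changed:
                    kept.append((i, r, c))
    return kept


# Toy trajectory: a blank 4x4 screen, then one pixel changes, then nothing.
blank = [[0] * 4 for _ in range(4)]
clicked = [row[:] for row in blank]
clicked[0][0] = 255

kept = select_patches([blank, clicked, clicked])
total = 3 * 4  # three frames, each a 2x2 grid of patches
print(f"kept {len(kept)} of {total} patches")  # kept 5 of 12 patches
```

In this toy trajectory only the first frame and the single changed patch survive, so the history costs 5 patch tokens instead of 12. Real GUI screenshots change far less cleanly, which is presumably why a learned selector is used instead of a fixed threshold.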


Get this paper in your agent:

hf papers read 2605.11212

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
