ViT-Up: Faithful Feature Upsampling for Vision Transformers
Published on Jun 12
· Submitted by Wandel on Jun 18
Abstract
ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods.
Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK, demonstrating that ViT-Up scales favorably with backbone capacity.
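To make the quadratic-cost argument concrete, the following sketch counts patch tokens and pairwise attention scores as the input resolution grows. The patch size of 16 and the function name `attention_cost` are assumptions chosen for illustration, and class or register tokens are ignored.

```python
def attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (patch tokens, pairwise attention scores per head) for one image.

    Assumes non-overlapping square patches; extra tokens (class, register)
    are ignored to keep the arithmetic simple.
    """
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


for side in (224, 448, 896):
    tokens, pairs = attention_cost(side, side)
    print(f"{side}x{side}: {tokens} tokens, {pairs} attention scores per head")
```

Doubling the image side length quadruples the token count and multiplies the number of attention scores by 16, which is why backbones are usually run on coarse grids and dense features are recovered afterwards.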
Community
Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers.
ViT-Up predicts features at ⭐ arbitrary continuous image coordinates ⭐, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
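For intuition about what querying features at continuous coordinates means, here is a minimal sketch of the plain bilinear baseline: it interpolates a coarse patch-feature grid at a handful of pixel locations. This is not ViT-Up's method, which predicts the features with a learned model. The helper `sample_bilinear`, the patch size of 16, and the toy grid are all assumptions made for illustration.

```python
import math


def sample_bilinear(grid, x, y, patch=16):
    """Interpolate patch features at continuous pixel coordinates (x, y).

    grid[r][c] is the feature vector of the patch in row r, column c;
    its centre sits at pixel ((c + 0.5) * patch, (r + 0.5) * patch).
    """
    rows, cols = len(grid), len(grid[0])
    # Map pixel coordinates to fractional patch-centre indices, clamped to the grid.
    gx = min(max(x / patch - 0.5, 0.0), cols - 1.0)
    gy = min(max(y / patch - 0.5, 0.0), rows - 1.0)
    c0, r0 = int(math.floor(gx)), int(math.floor(gy))
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    wx, wy = gx - c0, gy - r0
    # Blend the four neighbouring patch features channel by channel.
    return [
        (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
        for a, b, c, d in zip(grid[r0][c0], grid[r0][c1], grid[r1][c0], grid[r1][c1])
    ]


# Toy 2x2 grid of 2-channel features (real ViT features have hundreds of channels).
grid = [[[0.0, 0.0], [1.0, 0.0]],
        [[0.0, 1.0], [1.0, 1.0]]]

# Query only the points of interest instead of materialising a full dense map.
queries = [(8.0, 8.0), (16.0, 16.0), (24.0, 8.0)]
features = [sample_bilinear(grid, x, y) for x, y in queries]
```

A sample-aware pipeline has the same shape: the caller passes a list of coordinates and receives one feature vector per coordinate, so the cost scales with the number of queries and not with the output resolution.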
Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.
ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:
https://vitup.papers.discuna.com/

arXiv: arxiv.org/abs/2606.14024