Hugging Face Daily Papers · June 18, 2026 · 4 min read

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.14024

Published on Jun 12

Submitted by Wandel on Jun 18

Shanghai Jiao Tong University

Authors: Krispin Wandel, Jingchuan Wang, Hesheng Wang

Abstract

ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK, demonstrating that ViT-Up scales favorably with backbone capacity.
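
To make the resolution bottleneck concrete, the following minimal sketch counts patch tokens and query-key pairs for a few input sizes. It assumes a square image and a patch size of 16 (an illustrative choice, not stated in the abstract), and it ignores class or register tokens. The function names are hypothetical.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (tokens_per_side, total_tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one global self-attention head."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Patch size 16 is an assumption made only for this illustration.
    for size in (224, 448, 896):
        side, tokens = patch_grid(size, patch_size=16)
        print(f"{size}px -> {side}x{side} grid, {tokens} tokens, "
              f"{attention_pairs(tokens)} attention pairs")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the number of attention pairs by 16, which is why backbones tend to be run on coarse grids and why a separate upsampler is attractive.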

GitHub: https://github.com/krispinwandel/vit-up

Community

Krispin · Paper author · Paper submitter · about 3 hours ago

Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers.
ViT-Up predicts features at ⭐ arbitrary continuous image coordinates ⭐, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
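
For contrast, the simplest way to query a coarse patch-token grid at a continuous coordinate is plain bilinear interpolation. The sketch below shows that naive baseline only; it is not ViT-Up's learned predictor, whose internals are not described here. The nested-list feature layout, the patch-centre convention, and the function name are all assumptions for illustration.

```python
from typing import List

# Assumed layout: fmap[row][col] is a list of channel values for one token.
FeatureMap = List[List[List[float]]]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> List[float]:
    """Interpolate a feature at normalized image coordinates (x, y) in [0, 1].

    Assumes token (r, c) sits at the centre of its patch, that is at
    ((c + 0.5) / cols, (r + 0.5) / rows). Queries outside the outermost
    centres are clamped to the border tokens.
    """
    rows, cols = len(fmap), len(fmap[0])
    # Map to continuous token-index space, then clamp to the valid range.
    gx = min(max(x * cols - 0.5, 0.0), cols - 1.0)
    gy = min(max(y * rows - 0.5, 0.0), rows - 1.0)
    c0, r0 = int(gx), int(gy)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    wx, wy = gx - c0, gy - r0
    out = []
    for k in range(len(fmap[0][0])):
        top = (1 - wx) * fmap[r0][c0][k] + wx * fmap[r0][c1][k]
        bottom = (1 - wx) * fmap[r1][c0][k] + wx * fmap[r1][c1][k]
        out.append((1 - wy) * top + wy * bottom)
    return out


if __name__ == "__main__":
    grid = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2x2 tokens, 1 channel
    # Query only the points a downstream head actually needs.
    for point in [(0.5, 0.5), (0.25, 0.25), (0.9, 0.1)]:
        print(point, bilinear_sample(grid, *point))
```

A baseline like this can only blend neighbouring tokens, so it cannot recover detail finer than the patch grid. Per-point querying of this kind is what a sample-aware pipeline would call instead of materializing a full dense map.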

Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.

ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to:

+2.07 mIoU on Cityscapes
+4.17 PCK on SPair-71k
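
For readers unfamiliar with the two metric families, here is a minimal sketch of how mean intersection-over-union (mIoU) and a percentage-of-correct-keypoints (PCK) score are commonly computed. It assumes flat label lists, an ignore label of 255, and a PCK threshold of `alpha` times the larger side of the object bounding box; `alpha` is left as a free parameter. These are common conventions, not details confirmed by the post, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean IoU over classes that appear in the prediction or the target.

    Assumes every prediction is a valid class id in [0, num_classes).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious: List[float] = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


def pck(pred_pts: Sequence[Tuple[float, float]],
        gt_pts: Sequence[Tuple[float, float]],
        bbox_size: Tuple[float, float], alpha: float) -> float:
    """Fraction of keypoints within alpha * max(bbox width, height)."""
    threshold = alpha * max(bbox_size)
    hits = sum(
        1 for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(gt_pts)


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 255], num_classes=2))
    print(pck([(0, 0), (10, 0)], [(3, 4), (0, 0)], (50, 40), alpha=0.1))
```

Benchmark mIoU is usually accumulated over the whole validation set rather than averaged per image, so the per-call version above is only meant to show the arithmetic.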

The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:

https://vitup.papers.discuna.com/

Get this paper in your agent:

hf papers read 2606.14024

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
