Hugging Face Daily Papers · June 2, 2026 · 3 min read

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30126

Published on May 28

Submitted by Alessio Tonioni on Jun 2 · Google

Authors: Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
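
The abstract names the two ingredients (spatial pool tokens as anchors, elastic query tokens conditioned on them) but does not say how the conditioning is computed. The following is a minimal sketch of the general idea only, not the authors' implementation. Everything in it is an assumption made for illustration: the helper names `pool_tokens`, `attend`, and `elastic_tokens`, the use of plain average pooling, projection-free dot-product attention, the residual updates, and the toy grid and budget sizes.

```python
import math
import random

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pool_tokens(grid_feats, out_side):
    """Average-pool a square grid of feature vectors to out_side x out_side tokens."""
    step = len(grid_feats) // out_side
    tokens = []
    for i in range(out_side):
        for j in range(out_side):
            cell = [grid_feats[r][c]
                    for r in range(i * step, (i + 1) * step)
                    for c in range(j * step, (j + 1) * step)]
            tokens.append(mean_vec(cell))
    return tokens

def attend(query, memory):
    """Softmax-weighted average of memory vectors, scored by scaled dot product."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * vec[k] for w, vec in zip(weights, memory)) / total
            for k in range(len(query))]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def elastic_tokens(grid_feats, pool_side, queries, num_queries):
    """Return pool tokens followed by a nested prefix of conditioned query tokens."""
    pools = pool_tokens(grid_feats, pool_side)         # low-frequency layout anchors
    patches = [vec for row in grid_feats for vec in row]
    out = []
    for q in queries[:num_queries]:                    # matryoshka-style prefix
        q = add(q, attend(q, pools))                   # assumed form of conditioning
        q = add(q, attend(q, patches))                 # resample detail from the full grid
        out.append(q)
    return pools + out

random.seed(0)
side, dim = 12, 8
grid_feats = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(side)]
              for _ in range(side)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

# One set of queries serves several visual-token budgets.
budgets = {}
for pool_side, k in [(2, 4), (3, 9), (4, 16)]:
    budgets[(pool_side, k)] = len(elastic_tokens(grid_feats, pool_side, queries, k))
print(budgets)  # {(2, 4): 8, (3, 9): 18, (4, 16): 32}
```

In this toy version the budget is simply the number of pool tokens plus the number of queries kept, so a smaller budget reuses a prefix of the same queries rather than a separately trained model. Because attention cost grows quadratically with sequence length, shrinking the visual-token count this way is where the inference savings would come from.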

Project page: https://parcel-elastic-inference.github.io/

Community

Alessiot

Paper submitter about 15 hours ago

PARCEL is a vision-language model bridge architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.

Get this paper in your agent:

hf papers read 2605.30126

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
