">
EarlyTom: Early Token Compression Completes Fast Video Understanding
Authors: Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang
Published: 2026-05-28 (arXiv: 2605.30010)
Project page: https://viridisgreen.github.io/EarlyTom/
Code: https://github.com/viridisGreen/EarlyTom
Abstract
AI-generated summary: EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy.
Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the cost of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them compress only at a late stage of prefilling, leaving the vision encoder itself unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the encoder, rather than only after it, therefore leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, yielding a significantly larger TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.
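To illustrate why the position of compression matters, here is a rough back-of-the-envelope cost model. It is not the authors' method or measurement. It uses a common approximation for the forward FLOPs of a transformer layer (24·n·d² for the projections and a 4x-expansion MLP, plus 4·n²·d for attention). The function names, the token count, width, depth, pruning layer, and keep ratio are all hypothetical values chosen for the sketch.

```python
def layer_flops(n_tokens: int, d_model: int) -> int:
    """Approximate forward FLOPs of one transformer layer.

    Assumes a 4x MLP expansion and counts a multiply-add as 2 FLOPs.
    """
    projections_and_mlp = 24 * n_tokens * d_model ** 2
    attention = 4 * n_tokens ** 2 * d_model
    return projections_and_mlp + attention


def encoder_flops(n_tokens, d_model, n_layers, compress_after=None, keep_ratio=1.0):
    """Encoder FLOPs when tokens are pruned right after layer `compress_after`.

    compress_after=None models late compression: every encoder layer
    still processes the full token set.
    """
    total = 0
    n = n_tokens
    for layer in range(n_layers):
        total += layer_flops(n, d_model)
        if compress_after is not None and layer + 1 == compress_after:
            n = max(1, int(n * keep_ratio))  # tokens surviving the pruning step
    return total


if __name__ == "__main__":
    # Hypothetical per-frame configuration, not taken from the paper.
    full = encoder_flops(n_tokens=729, d_model=1152, n_layers=26)
    early = encoder_flops(729, 1152, 26, compress_after=4, keep_ratio=0.25)
    print(f"encoder FLOPs remaining with early compression: {early / full:.1%}")
```

Under this model, compression applied after the encoder leaves the encoder cost unchanged, whereas pruning after an early layer shrinks the cost of every later layer. The real savings depend on the actual architecture and on how tokens are selected, neither of which this sketch captures.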
Community
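This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models (2026), https://huggingface.co/papers/2604.01881
* OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models (2026), https://huggingface.co/papers/2605.12056
* DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs (2026), https://huggingface.co/papers/2605.19322
* Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models (2026), https://huggingface.co/papers/2604.11240
* LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs (2026), https://huggingface.co/papers/2605.17260
* OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models (2026), https://huggingface.co/papers/2605.18041
* KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models (2026), https://huggingface.co/papers/2604.03414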