Hugging Face Daily Papers · 6 min read

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.13831


Published on May 13 · Submitted by Zhaowei Wang on May 14
Authors: Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

Abstract

AI-generated summary: Long-context continued pre-training enhances vision-language models' ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design.

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
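To make the mixture findings concrete, here is a minimal Python sketch of a length-balanced, retrieval-heavy sampling scheme in the spirit of this recipe. The bucket boundaries, task names, the 80/20 weighting, and the sample_training_example helper are illustrative assumptions, not the paper's actual configuration.

import random

# Hypothetical mixture reflecting the abstract's three findings:
#   i)   balance sequence lengths rather than concentrating at the 128K target;
#   ii)  weight retrieval-style long-document VQA heavily, with a modest share of reasoning data;
#   iii) use instruction-formatted long data only, with no short-data mixing.
LENGTH_BUCKETS = [(0, 32_000), (32_000, 64_000), (64_000, 96_000), (96_000, 128_000)]
TASK_WEIGHTS = {"long_doc_vqa_retrieval": 0.8, "long_doc_vqa_reasoning": 0.2}  # assumed split

def sample_training_example(corpus):
    """Draw one packed sequence from a uniformly chosen length bucket and a
    retrieval-heavy task type; `corpus` is assumed to map
    (task_type, length_bucket) -> list of examples."""
    bucket = random.choice(LENGTH_BUCKETS)  # uniform over length buckets (finding i)
    task = random.choices(list(TASK_WEIGHTS), weights=list(TASK_WEIGHTS.values()))[0]  # retrieval-dominant (finding ii)
    candidates = corpus.get((task, bucket), [])
    return random.choice(candidates) if candidates else None

The specific numbers are placeholders; what matters is the shape of the mixture: coverage across lengths and positions, retrieval-style long-document VQA dominating, and only a modest share of reasoning data for task diversity.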


Get this paper in your agent:

hf papers read 2605.13831
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

