Hugging Face Daily Papers · 6 min read

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.13831


Published on May 13 · Submitted by Zhaowei Wang on May 14
Authors: Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

Abstract

AI-generated summary: Long-context continued pre-training enhances vision-language models' ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design.

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
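To make the mixture findings concrete, here is a minimal Python sketch of a length-balanced, retrieval-heavy sampling scheme in the spirit of this recipe. The bucket boundaries, task names, the 80/20 weighting, and the sample_training_example helper are illustrative assumptions, not the paper's actual configuration.

import random

# Hypothetical mixture reflecting the abstract's three findings:
#   i)   balance sequence lengths rather than concentrating at the 128K target;
#   ii)  weight retrieval-style long-document VQA heavily, with a modest share of reasoning data;
#   iii) use instruction-formatted long data only, with no short-data mixing.
LENGTH_BUCKETS = [(0, 32_000), (32_000, 64_000), (64_000, 96_000), (96_000, 128_000)]
TASK_WEIGHTS = {"long_doc_vqa_retrieval": 0.8, "long_doc_vqa_reasoning": 0.2}  # assumed split

def sample_training_example(corpus):
    """Draw one packed sequence from a uniformly chosen length bucket and a
    retrieval-heavy task type; `corpus` is assumed to map
    (task_type, length_bucket) -> list of examples."""
    bucket = random.choice(LENGTH_BUCKETS)  # uniform over length buckets (finding i)
    task = random.choices(list(TASK_WEIGHTS), weights=list(TASK_WEIGHTS.values()))[0]  # retrieval-dominant (finding ii)
    candidates = corpus.get((task, bucket), [])
    return random.choice(candidates) if candidates else None

The specific numbers are placeholders; what matters is the shape of the mixture: coverage across lengths and positions, retrieval-style long-document VQA dominating, and only a modest share of reasoning data for task diversity.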


Get this paper in your agent:

hf papers read 2605.13831
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

