">
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
AI-generated summary
ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.
Abstract
Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.
Community
Long-video understanding is becoming agentic: LMMs are post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn: a bad crop gets no peer correction, multi-turn calls corrupt the context, and inference cost grows linearly with the number of turns.
ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
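To make the single-turn dispatch pattern concrete, here is a minimal sketch of the control flow only, not the released ParaVT code. The names `Crop`, `sub_agent`, and `answer_with_parallel_crops` are hypothetical, and the sub-agent is a stub standing in for a weight-sharing model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Crop:
    """A temporal window (in seconds) requested by the main agent."""
    start_s: float
    end_s: float


async def sub_agent(crop: Crop, question: str) -> str:
    # Stub for a weight-sharing sub-agent that would inspect one crop.
    # A real system would run the LMM on the cropped frames here.
    await asyncio.sleep(0)
    return f"evidence from [{crop.start_s:.0f}s, {crop.end_s:.0f}s]"


async def answer_with_parallel_crops(question: str, crops: list[Crop]) -> str:
    # All crops emitted in one turn are processed concurrently;
    # asyncio.gather returns results in the order the crops were given.
    findings = await asyncio.gather(*(sub_agent(c, question) for c in crops))
    # Stub gather-and-reason step: a real system would feed the findings
    # back to the main agent to produce the final answer.
    return " | ".join(findings)


def run_demo() -> str:
    crops = [Crop(0.0, 30.0), Crop(120.0, 150.0), Crop(300.0, 330.0)]
    return asyncio.run(answer_with_parallel_crops("What happens after the goal?", crops))
```

Because the crops are independent, one bad window only degrades one finding, and wall-clock cost is bounded by the slowest sub-agent instead of the sum over turns.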
But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:
Format Fragility — SFT-learned <think> / <tool_call> / <answer> closures collapse under temperature sampling.
Tool Necessity Gap — with a 64-frame overview, "skip-tool" becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
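Why the advantage flattens can be seen from GRPO's group-relative normalization: each rollout's reward is compared against the mean of its group for the same prompt. The sketch below assumes the common mean/std-normalized form with a small `eps`; the reward values are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rich overview: tool-calling and skip-tool rollouts all answer correctly,
# so every reward equals the group mean and no rollout is preferred.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])

# Sparse overview: only the two tool-calling rollouts succeed,
# so calling the tool receives a positive advantage over skipping.
informative = group_advantages([1.0, 1.0, 0.0, 0.0])
```

In the saturated group every advantage is exactly zero, so the policy gradient carries no signal about whether calling the tool was worthwhile.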
We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ∼ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
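A rough sketch of the two ingredients, under stated assumptions: the frame budget is drawn per prompt from the set given above, and parseability is approximated by a sequence-level tag-closure check. The paper applies its format reward at specific structural-token positions, which this simplification does not reproduce; `sample_overview_frames`, `closures_parse`, and `format_reward` are hypothetical names.

```python
import random
import re

FRAME_BUDGETS = (4, 8, 16, 32, 64)
STRUCTURAL_TAGS = ("think", "tool_call", "answer")
TAG_PATTERN = re.compile(r"<(/?)(" + "|".join(STRUCTURAL_TAGS) + r")>")


def sample_overview_frames(rng: random.Random) -> int:
    """Draw the overview frame count K uniformly for one prompt."""
    return rng.choice(FRAME_BUDGETS)


def closures_parse(text: str) -> bool:
    """True if every structural tag is closed in the order it was opened."""
    stack: list[str] = []
    for match in TAG_PATTERN.finditer(text):
        is_closing, tag = match.group(1) == "/", match.group(2)
        if not is_closing:
            stack.append(tag)
        elif not stack or stack.pop() != tag:
            return False  # closing tag with no matching open tag
    return not stack  # leftover open tags mean an unclosed block


def format_reward(text: str) -> float:
    # Assumption: a binary sequence-level reward; the paper's reward is
    # instead targeted at collapse-prone structural-token positions.
    return 1.0 if closures_parse(text) else 0.0
```

With small K the overview alone is less likely to contain the answer, so rollouts that call the tool and rollouts that skip it should receive different rewards more often.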
Fully open: paper, code, weights, data
📄 arxiv.org/abs/2605.20342 · 💻 github.com/EvolvingLMMs-Lab/ParaVT · 🤖 https://huggingface.co/ParaVT · 🌐 evolvinglmms-lab.github.io/ParaVT