Hugging Face Daily Papers · 7 min read

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20342


Published on May 19 · Submitted by Zuhao Yang on May 26
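Authors: Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing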

Abstract

AI-generated summary: ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

Community

Zuhao Yang (paper author, paper submitter) · May 26

Long-video understanding is becoming agentic: LMMs are post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn: a bad crop has no peer correction, multi-turn calls make the context drift, and inference cost grows linearly with the number of turns.

ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
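The single-turn fan-out described above can be pictured with a small sketch. This is not the ParaVT code: the function names (`main_agent_propose`, `sub_agent_inspect`, `gather_and_reason`), the equal-width windows, and the canned sub-agent report are all hypothetical placeholders standing in for LMM calls. It only illustrates the control flow of proposing several crops at once, inspecting them concurrently, and gathering the results.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds


def main_agent_propose(question: str, duration_s: float, n: int = 4) -> list[Crop]:
    # Hypothetical placeholder: split the video into n equal windows.
    # In the real system the main agent (an LMM) picks the windows itself.
    step = duration_s / n
    return [Crop(i * step, (i + 1) * step) for i in range(n)]


def sub_agent_inspect(question: str, crop: Crop) -> str:
    # Hypothetical placeholder for a weight-sharing sub-agent that
    # looks at a single crop and returns a short report.
    return f"[{crop.start_s:.0f}s-{crop.end_s:.0f}s] no evidence found"


def gather_and_reason(question: str, reports: list[str]) -> str:
    # Hypothetical placeholder: the real step would be another LMM call
    # that reasons over all reports to produce the final answer.
    return " | ".join(reports)


def answer(question: str, duration_s: float) -> str:
    crops = main_agent_propose(question, duration_s)
    with ThreadPoolExecutor(max_workers=len(crops)) as pool:
        # map() preserves input order, so reports line up with crops.
        reports = list(pool.map(lambda c: sub_agent_inspect(question, c), crops))
    return gather_and_reason(question, reports)
```

Because every crop is inspected in the same turn, one uninformative window does not block the others, and the number of turns stays constant as the number of crops grows.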

But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:

Format Fragility — SFT-learned <think> / <tool_call> / <answer> closures collapse under temperature sampling.
Tool Necessity Gap — with a 64-frame overview, "skip-tool" becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
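To see why the advantage can flatten to zero, it may help to work through the commonly used group-normalized GRPO advantage, where each rollout's reward is centered and scaled by the statistics of its own group. The sketch below assumes that standard form with a population standard deviation and a small `eps`; the paper's exact formulation and the reward values are not given in this post, so the numbers are purely illustrative.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantage: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative case 1: the overview is rich enough that rollouts which
# skip the tool and rollouts which call it all earn the same reward.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Illustrative case 2: only the rollouts that call the tool (first two)
# answer correctly, so rewards differ within the group.
split = group_advantages([1.0, 1.0, 0.0, 0.0])
```

In the first case every advantage is exactly zero, so the policy gradient carries no preference for calling over skipping. In the second case tool-calling rollouts receive a positive advantage and skipping rollouts a negative one.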

We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ∼ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
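A minimal sketch of the two mechanisms, under stated assumptions: the set of budgets for K comes from the post, but everything else is hypothetical. In particular, the sketch treats every tag as a single token, treats all tag positions as "structural" (the post does not say which positions are most prone to collapse), and uses a symmetric `bonus` value that is not taken from the paper.

```python
import random

FRAME_BUDGETS = (4, 8, 16, 32, 64)  # support of K, as stated in the post

# Assumption: opening and closing forms of the three tags named in the post.
STRUCTURAL_TAGS = (
    "<think>", "</think>",
    "<tool_call>", "</tool_call>",
    "<answer>", "</answer>",
)


def sample_overview_frames(rng: random.Random) -> int:
    """Draw K uniformly from the five budgets, once per prompt."""
    return rng.choice(FRAME_BUDGETS)


def structural_mask(tokens: list[str]) -> list[int]:
    """1 where a token is a structural tag, 0 elsewhere."""
    return [1 if t in STRUCTURAL_TAGS else 0 for t in tokens]


def masked_format_reward(
    tokens: list[str], well_formed: bool, bonus: float = 1.0
) -> list[float]:
    """Per-token format reward that is nonzero only at structural positions."""
    value = bonus if well_formed else -bonus
    return [value * m for m in structural_mask(tokens)]
```

For example, with `tokens = ["<think>", "look", "</think>", "<answer>", "B", "</answer>"]`, the mask selects the four tag positions, so the format signal does not touch the free-form reasoning or answer tokens. Drawing a small K for some prompts leaves the overview too sparse to answer from, which is what makes tool calls pay off relative to skipping.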

Fully open: paper, code, weights, data
📄 arxiv.org/abs/2605.20342 · 💻 github.com/EvolvingLMMs-Lab/ParaVT · 🤖 https://huggingface.co/ParaVT · 🌐 evolvinglmms-lab.github.io/ParaVT

Get this paper in your agent:

hf papers read 2605.20342
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
