Hugging Face Daily Papers · May 26, 2026 · 7 min read

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arXiv:2605.20342

Published on May 19 · Submitted by Zuhao Yang on May 26 · LMMs-Lab

Authors: Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

Abstract

AI-generated summary: ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.


Community

mwxely (paper author, paper submitter) · about 5 hours ago

Long-video understanding is becoming agentic: LMMs are post-trained with RL to natively invoke video tools (e.g., temporal cropping). But every existing native-RL recipe (including our own LongVT @ CVPR 2026) dispatches tool calls sequentially, one per turn. As a result, a bad crop has no peer to correct it, multi-turn calls make the context drift, and inference cost grows linearly with the number of turns.

ParaVT is the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling. A main agent emits multiple temporal-window crops in a single turn, weight-sharing sub-agents process them concurrently, and a gather-and-reason step produces the final answer.
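The single-turn dispatch described above can be pictured with a small sketch. This is not the ParaVT implementation: `Crop`, `sub_agent`, `dispatch_parallel`, and `gather_and_reason` are hypothetical names, the sub-agent is a stub that returns a string instead of running a model, and a thread pool stands in for whatever concurrency the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """A temporal window in seconds (hypothetical structure)."""
    start: float
    end: float


def sub_agent(crop: Crop) -> str:
    # Stub for a weight-sharing sub-agent that inspects one window.
    return f"evidence from {crop.start:.0f}s-{crop.end:.0f}s"


def dispatch_parallel(crops: list[Crop]) -> list[str]:
    # All crops emitted in one main-agent turn are processed concurrently.
    # pool.map yields results in the same order the crops were emitted.
    with ThreadPoolExecutor(max_workers=max(1, len(crops))) as pool:
        return list(pool.map(sub_agent, crops))


def gather_and_reason(observations: list[str]) -> str:
    # Stub for the final step that reasons over all gathered evidence.
    return " | ".join(observations)


crops = [Crop(0, 30), Crop(120, 150), Crop(300, 330)]
answer = gather_and_reason(dispatch_parallel(crops))
print(answer)
```

Because every window is requested in the same turn, one poorly chosen crop sits next to its peers in the gathered evidence rather than steering a chain of later turns.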

But applying standard GRPO on top of a tool-native LMM surfaces two coupled failures driven by the same pretrained tool prior. We call this the Tool Prior Paradox:

Format Fragility — SFT-learned <think> / <tool_call> / <answer> closures collapse under temperature sampling.
Tool Necessity Gap — with a 64-frame overview, "skip-tool" becomes a shortcut and the GRPO advantage of calling vs. skipping flattens to zero.
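To see why the advantage flattens, recall that GRPO normalizes each rollout's reward against the other rollouts sampled for the same prompt. The sketch below uses the standard group-relative form (reward minus group mean, divided by group standard deviation); the reward values are made-up illustrations, not numbers from the paper, and `group_advantages` is a hypothetical helper name.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    # Group-relative advantage: center by the group mean and scale by the
    # group (population) standard deviation; eps avoids division by zero.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative group with a rich overview: rollouts that call the tool and
# rollouts that skip it all answer correctly, so every reward is identical.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Illustrative group with a sparse overview: the two tool-calling rollouts
# succeed while the two skipping rollouts fail.
split = group_advantages([1.0, 1.0, 0.0, 0.0])

print(flat)
print(split)
```

In the first group every advantage is exactly zero, so the policy gradient carries no preference between calling and skipping. In the second group the tool-calling rollouts receive positive advantages and the skipping rollouts negative ones.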

We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), pairing one targeted fix per failure: (i) a format reward applied only at the structural-token positions most prone to collapse, and (ii) per-prompt overview-frame randomization K ∼ Uniform{4, 8, 16, 32, 64} that keeps the tool-call advantage non-degenerate.
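A minimal sketch of the two ingredients, under stated assumptions. The choice set {4, 8, 16, 32, 64} comes from the text above; the evenly spaced frame selection, the function names, and the tag-balance check are assumptions for illustration. In particular, `is_parseable` is a coarse whole-response check, whereas the format reward described above is applied only at specific structural-token positions.

```python
import random

OVERVIEW_FRAME_CHOICES = (4, 8, 16, 32, 64)
STRUCTURAL_TAGS = ("think", "tool_call", "answer")


def sample_overview_frames(rng: random.Random) -> int:
    # Draw K uniformly from the discrete set, once per prompt.
    return rng.choice(OVERVIEW_FRAME_CHOICES)


def overview_indices(total_frames: int, k: int) -> list[int]:
    # Assumed selection rule: k evenly spaced frame indices over the video.
    k = min(k, total_frames)
    return [(i * total_frames) // k for i in range(k)]


def is_parseable(response: str) -> bool:
    # Coarse check: each structural tag is opened and closed equally often.
    return all(
        response.count(f"<{tag}>") == response.count(f"</{tag}>")
        for tag in STRUCTURAL_TAGS
    )


rng = random.Random(0)
k = sample_overview_frames(rng)
print(k, overview_indices(1000, k))
print(is_parseable("<think>plan</think><answer>B</answer>"))
print(is_parseable("<think>plan<answer>B</answer>"))
```

With small K the overview is too sparse to answer from directly, so a rollout that calls the tool can earn a higher reward than one that skips it, which restores a usable advantage signal.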

Fully open: paper, code, weights, data
📄 arxiv.org/abs/2605.20342 · 💻 github.com/EvolvingLMMs-Lab/ParaVT · 🤖 https://huggingface.co/ParaVT · 🌐 evolvinglmms-lab.github.io/ParaVT


Get this paper in your agent:

hf papers read 2605.20342

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 2

Spaces citing this paper 1

Collections including this paper 0

