Hugging Face Daily Papers · June 26, 2026 · 4 min read

Confidence-Aware Tool Orchestration for Robust Video Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Yangfan He, Yujin Choi, Jaehong Yoon

GitHub: https://github.com/ROVA-V2/Robust-TO

arxiv:2606.26904

Published on Jun 25 · Submitted by Jaehong Yoon on Jun 26 · Nanyang Technological University Singapore

Abstract

Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can lose 15 to 30 percentage points (%p) of accuracy on real-world embodied benchmarks while remaining unaware that their visual evidence has been degraded.

To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by a reliability-relevance score, and returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency.

On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
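The abstract says frames are selected by a reliability-relevance score but does not define it. The sketch below is one possible reading, not the paper's method: it assumes each frame already carries a reliability value and a relevance value in [0, 1], combines them by a simple product, and keeps the top k. The names `FrameScore` and `select_trustworthy_frames` and the product rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameScore:
    index: int          # position of the frame in the video
    reliability: float  # assumed in [0, 1]; 1.0 means no visible degradation
    relevance: float    # assumed in [0, 1]; 1.0 means highly relevant to the sub-query


def select_trustworthy_frames(frames: list[FrameScore], k: int) -> list[int]:
    """Return the indices of the k best frames, in temporal order.

    The combined score is reliability * relevance. This product is an
    assumption made for illustration; the abstract gives no formula.
    """
    ranked = sorted(frames, key=lambda f: f.reliability * f.relevance, reverse=True)
    return sorted(f.index for f in ranked[:k])


frames = [
    FrameScore(0, reliability=0.9, relevance=0.2),  # sharp but off-topic
    FrameScore(1, reliability=0.3, relevance=0.9),  # on-topic but degraded
    FrameScore(2, reliability=0.8, relevance=0.7),
    FrameScore(3, reliability=0.7, relevance=0.6),
]
selected = select_trustworthy_frames(frames, k=2)
print(selected)  # [2, 3]
```

With a product, a frame must be both reasonably clean and reasonably relevant to be chosen, which is why the sharp but off-topic frame 0 and the relevant but degraded frame 1 are both dropped here.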
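The shared evidence format and the three-tier weighting described in the abstract might look roughly like the following. This is a minimal sketch under stated assumptions: the abstract names the three fields and the three tiers, but the tier thresholds (0.8 and 0.5), the tier weights, the tool names, and the weighted-vote rule are all invented here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical values: the abstract names the tiers but gives no thresholds or weights.
TIER_THRESHOLDS = (("high", 0.8), ("medium", 0.5))
TIER_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.1}


@dataclass(frozen=True)
class Evidence:
    tool: str                  # name of the perception tool (hypothetical labels below)
    prediction: str            # simplified to a label; boxes or trajectories need richer types
    span: tuple[float, float]  # temporal grounding as (start_seconds, end_seconds)
    reliability: float         # calibrated reliability score, assumed in [0, 1]


def tier_of(reliability: float) -> str:
    """Map a calibrated score to 'high', 'medium', or 'low'."""
    for name, threshold in TIER_THRESHOLDS:
        if reliability >= threshold:
            return name
    return "low"


def synthesize(evidence: list[Evidence]) -> str:
    """Return the prediction with the largest tier-weighted support.

    Raises ValueError if the evidence list is empty.
    """
    support: dict[str, float] = defaultdict(float)
    for item in evidence:
        support[item.prediction] += TIER_WEIGHTS[tier_of(item.reliability)]
    return max(support, key=support.get)


evidence = [
    Evidence("action_classifier", "pouring", (2.0, 4.5), reliability=0.9),
    Evidence("captioner", "stirring", (2.0, 3.0), reliability=0.3),
    Evidence("pose_tracker", "stirring", (3.0, 4.0), reliability=0.4),
]
answer = synthesize(evidence)
print(answer)  # pouring
```

A plain majority vote would pick "stirring" (two tools against one); the tier weights let a single high-reliability observation outweigh two low-reliability ones, which is the behaviour the abstract's evidence weighting is meant to capture.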
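The abstract describes a confidence-cost GRPO reward covering correctness, evidence reliability, and efficiency, without giving its form. As a rough illustration only, the sketch below assumes a linear combination (correctness, plus a bonus for mean evidence reliability, minus a per-tool-call cost) and then standardises rewards within a group of sampled rollouts, as GRPO-style methods do to obtain advantages. The `Rollout` fields, the coefficients `alpha` and `beta`, and the use of the population standard deviation are assumptions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Rollout:
    correct: bool                     # did the final answer match the label?
    reliabilities: tuple[float, ...]  # calibrated scores of the evidence the answer used
    tool_calls: int                   # proxy for cost


def reward(r: Rollout, alpha: float = 0.5, beta: float = 0.05) -> float:
    """Hypothetical reward: correctness + alpha * mean reliability - beta * tool calls."""
    mean_reliability = mean(r.reliabilities) if r.reliabilities else 0.0
    return float(r.correct) + alpha * mean_reliability - beta * r.tool_calls


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardise rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


group = [
    Rollout(correct=True, reliabilities=(0.9, 0.8), tool_calls=2),   # right, clean, cheap
    Rollout(correct=True, reliabilities=(0.4, 0.3), tool_calls=6),   # right, shaky, costly
    Rollout(correct=False, reliabilities=(0.9,), tool_calls=1),      # wrong
]
rewards = [reward(r) for r in group]
advantages = group_advantages(rewards)
print([round(x, 3) for x in rewards])
```

Under these assumed coefficients, two correct rollouts are still ranked apart: the one that reached the answer from more reliable evidence with fewer tool calls receives the larger reward and therefore the larger advantage.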

Community

jaehong31 (paper submitter), 2 days ago:

Project Page: https://rova-v2.github.io/


