Project Page: https://rova-v2.github.io/
Code: https://github.com/ROVA-V2/Robust-TO
arXiv: 2606.26904
Authors: Yangfan He, Yujin Choi, Jaehong Yoon
Published: 2026-06-25
Confidence-Aware Tool Orchestration for Robust Video Understanding
Abstract
Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
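As a rough illustration of the shared evidence format and the three-tier weighting described in the abstract, the Python sketch below may help. It is not the authors' implementation. The names `Evidence`, `select_frames`, `tier`, and `synthesize`, the tier thresholds and weights, and the choice to combine reliability and relevance by multiplication are all assumptions made for this example; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Evidence:
    """One tool output in a shared format (field names are hypothetical)."""
    tool: str                    # which perception tool produced this
    prediction: str              # e.g. an action label or recognized text
    span: Tuple[float, float]    # temporal grounding in seconds (start, end)
    reliability: float           # calibrated reliability score in [0, 1]


# Assumed thresholds and weights; the abstract gives no numeric values.
HIGH_THRESHOLD = 0.8
MEDIUM_THRESHOLD = 0.5
TIER_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.0}


def select_frames(reliability: Sequence[float],
                  relevance: Sequence[float],
                  k: int) -> List[int]:
    """Pick the k frames with the highest reliability * relevance score.

    Multiplying the two terms is an assumption made for this sketch.
    Indices are returned in temporal order.
    """
    scores = [r * q for r, q in zip(reliability, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def tier(evidence: Evidence) -> str:
    """Map a calibrated reliability score to high / medium / low."""
    if evidence.reliability >= HIGH_THRESHOLD:
        return "high"
    if evidence.reliability >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def synthesize(items: Sequence[Evidence]) -> Optional[str]:
    """Tier-weighted vote over predictions; low-tier evidence is ignored."""
    votes: Dict[str, float] = {}
    for item in items:
        weight = TIER_WEIGHT[tier(item)] * item.reliability
        if weight > 0.0:
            votes[item.prediction] = votes.get(item.prediction, 0.0) + weight
    if not votes:
        return None
    return max(votes, key=lambda label: votes[label])


if __name__ == "__main__":
    frames = select_frames([0.9, 0.2, 0.8, 0.95], [0.5, 0.9, 0.9, 0.1], k=2)
    print(frames)
    evidence = [
        Evidence("action", "pick up cup", (1.0, 2.0), 0.9),
        Evidence("ocr", "put down cup", (1.0, 2.0), 0.6),
        Evidence("tracker", "put down cup", (1.5, 2.5), 0.3),
    ]
    print(synthesize(evidence))
```

In this sketch, a single high-reliability observation outweighs two weaker ones that agree with each other, which is the behavior the tiered weighting is meant to produce when some frames are degraded. The confidence-cost GRPO reward is not modeled here.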