Towards One-to-Many Temporal Grounding
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
Published on Jun 4, 2026 · Submitted by XuQi on Jun 5
Project page: https://insomniaaac.github.io/OMTG/
arXiv: 2606.06294
Abstract
One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.
Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this setting and often yield near-zero scores because they lack event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show that our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
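The abstract names C-Acc and EtF1 but does not define them, so the following is only an illustrative sketch of why a one-to-one model scores poorly in a one-to-many setting. It assumes segments are `(start, end)` pairs in seconds, uses a plain count check as a stand-in for count accuracy, and uses a generic segment-level F1 with greedy IoU matching at a threshold of 0.5. The function names, the matching rule, and the threshold are all hypothetical choices made for this example and are not the benchmark's actual metric definitions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Hypothetical per-sample count check: same number of segments."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_thresh: float = 0.5) -> float:
    """Generic segment-level F1 (NOT the paper's EtF1).

    Predictions and ground-truth segments are matched one-to-one,
    greedily, in order of decreasing IoU; a pair counts as a true
    positive only if its IoU reaches `iou_thresh`.
    """
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs have even lower IoU
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query whose event occurs three times in the video (made-up numbers).
ground_truth = [(5.0, 10.0), (30.0, 40.0), (60.0, 65.0)]
single_pred = [(5.0, 10.0)]               # one-to-one style output
multi_pred = [(4.0, 10.0), (31.0, 39.0)]  # finds two of three events
```

Under these assumptions, a prediction that returns a single perfectly localized segment still misses two of the three events, so its recall, and therefore its F1, is capped regardless of localization quality. This is the kind of failure that a count-aware metric and reward are meant to expose.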