Hugging Face Daily Papers · 5 min read

Towards One-to-Many Temporal Grounding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.06294

Published on Jun 4, 2026 · Submitted by XuQi on Jun 5, 2026
Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li
Organization: ByteDance
Project page: https://insomniaaac.github.io/OMTG/

Abstract

One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
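
To make the one-to-many setting concrete, here is a minimal Python sketch of how a set of predicted segments might be scored against several ground-truth segments. The abstract does not define C-Acc or EtF1, so nothing below should be read as the benchmark's metrics: `count_match`, `segment_f1`, the greedy matching, and the 0.5 IoU threshold are all illustrative assumptions, standing in for a generic count check and an IoU-thresholded segment F1.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Hypothetical count check (NOT the paper's C-Acc definition):
    did the model predict the same number of segments as annotated?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Hypothetical segment-level F1 (NOT the paper's EtF1 definition).

    Greedily pairs predictions with ground-truth segments in order of
    decreasing IoU; each segment is used at most once, and a pair counts
    as a true positive only if its IoU reaches the threshold.
    """
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # all remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)

    true_positives = len(used_pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the queried event occurs three times in the video.
ground_truth = [(5.0, 12.0), (30.0, 41.0), (58.0, 66.0)]
one_to_one_prediction = [(5.0, 12.0)]             # a single, perfect segment
one_to_many_prediction = [(4.0, 13.0), (29.0, 40.0)]  # two of three found

print(count_match(one_to_one_prediction, ground_truth))
print(segment_f1(one_to_one_prediction, ground_truth))
print(segment_f1(one_to_many_prediction, ground_truth))
```

In this toy case a model that returns one perfectly localized segment still misses two of the three occurrences, so its recall caps its score. That is the kind of failure the abstract attributes to models tuned for one-to-one grounding. Greedy matching is a simplification; an optimal assignment (for example, Hungarian matching) could be substituted.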

Get this paper in your agent:

hf papers read 2606.06294
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper 2
