Hugging Face Daily Papers · June 5, 2026 · 5 min read

Towards One-to-Many Temporal Grounding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
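To make the one-to-many setting concrete, here is a minimal Python sketch of how a set of predicted segments might be scored against a set of ground-truth segments. The abstract names C-Acc and EtF1 but does not define them, so nothing below should be read as the paper's metrics: `count_match`, `segment_f1`, the greedy matching strategy, and the 0.5 IoU threshold are all illustrative assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed count check: did the model predict the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment], iou_threshold: float = 0.5) -> float:
    """Assumed set-level F1: greedy one-to-one matching by descending IoU."""
    if not pred or not gt:
        return 0.0
    # All candidate pairs, best overlap first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_positives = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap too little to count as matches
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched at most once
        used_pred.add(i)
        used_gt.add(j)
        true_positives += 1
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Three disjoint ground-truth events for one query (made-up timestamps).
gt = [(5.0, 10.0), (30.0, 40.0), (55.0, 60.0)]
multi = [(4.0, 10.0), (31.0, 39.0)]   # finds two of the three events
single = [(5.0, 60.0)]                # one long span covering everything

print(count_match(multi, gt), round(segment_f1(multi, gt), 3))
print(count_match(single, gt), round(segment_f1(single, gt), 3))
```

Under these assumptions, the two-segment prediction matches two of three events (precision 1.0, recall 2/3), while the single long span overlaps every event too loosely to match any of them. That is one plausible way a model that emits a single segment could score near zero in a one-to-many evaluation, though the paper's own metrics may behave differently.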

arxiv:2606.06294

Published on Jun 4 · Submitted by XuQi on Jun 5 · ByteDance

Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Abstract

One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Project page: https://insomniaaac.github.io/OMTG/

Get this paper in your agent:

hf papers read 2606.06294

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper 2
