Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
Mirrored from arXiv — NLP / Computation & Language for archival readability. Support the source by reading on the original site.
Computer Science > Computer Vision and Pattern Recognition
Title:Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
Abstract:Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: an Large Language Model (LLM) based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at this https URL .
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.23826 [cs.CV] |
| (or arXiv:2605.23826v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.23826
arXiv-issued DOI via DataCite (pending registration)
|
Submission history
From: Michal Shlapentokh-Rothman [view email][v1] Fri, 22 May 2026 16:29:51 UTC (277 KB)
Access Paper:
- View PDF
- HTML (experimental)
- TeX Source
References & Citations
Bibliographic and Citation Tools
Code, Data and Media Associated with this Article
Demos
Recommenders and Search Tools
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
More from arXiv — NLP / Computation & Language
-
Evaluating Large Language Models in a Complex Hidden Role Game
May 25
-
A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development
May 25
-
Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion
May 25
-
Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model
May 25
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.