Hugging Face Daily Papers · 3 min read

TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.02320

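Published on Jun 1 · Submitted by Xinkai Ma on Jun 2

Authors: Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu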

Abstract

A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text--Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
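To make the idea of "context-aware sequential writing" over an outline more concrete, here is a minimal Python sketch. It is not the TVIR-Agent implementation: every name in it (Visual, Section, write_report, render, stub_writer) is hypothetical, and the writer is a stub standing in for whatever model call a real system would make. It only illustrates the shape of the idea described in the abstract: sections are written in outline order, each one sees the sections written before it, and any visual carries a traceable source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visual:
    """A chart or retrieved image plus the source it can be traced to (hypothetical type)."""
    caption: str
    source: str


@dataclass
class Section:
    """One outline entry; sub_goal is the analytical purpose the section serves."""
    heading: str
    sub_goal: str
    text: str = ""
    visual: Optional[Visual] = None


def write_report(outline, write_section):
    """Fill sections in outline order, passing already-written sections as context."""
    written = []
    for section in outline:
        # Context-aware step: the writer receives everything composed so far.
        section.text = write_section(section, tuple(written))
        written.append(section)
    return written


def render(sections):
    """Interleave each section's text with its visual, if it has one."""
    lines = []
    for s in sections:
        lines.append(s.heading)
        lines.append(s.text)
        if s.visual is not None:
            lines.append(f"[Figure: {s.visual.caption} | source: {s.visual.source}]")
    return "\n".join(lines)


def stub_writer(section, context):
    """Assumption: a placeholder for a model call; it only reports how much context it saw."""
    return f"{section.sub_goal} (after {len(context)} earlier section(s))"


if __name__ == "__main__":
    outline = [
        Section("Overview", "Summarize the topic"),
        Section("Trend", "Show the change over time",
                visual=Visual("Change by year", "https://example.com/data")),
    ]
    print(render(write_report(outline, stub_writer)))
```

Keeping the source on the visual itself, rather than in the surrounding text, is one simple way to keep each figure traceable when sections are reordered or rewritten.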

Community

Xinkai Ma (paper author and submitter), about 9 hours ago

We present TVIR, the first benchmark and agent framework specifically designed for text-visual interleaved report generation. Unlike existing text-only deep research systems, TVIR-Bench evaluates both textual quality and visual integration across 100 expert-curated tasks. Our TVIR-Agent achieves state-of-the-art performance, demonstrating that structured multi-agent collaboration is key to generating high-quality multimodal reports.


Get this paper in your agent:

hf papers read 2606.02320
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

