Hugging Face Daily Papers · June 2, 2026 · 3 min read

TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation

#model-release #multimodal #agents #benchmark #safety

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Comment from paper author Xinkai Ma (Cenji630): We present TVIR, the first benchmark and agent framework specifically designed for text-visual interleaved report generation. Unlike existing text-only deep research systems, TVIR-Bench evaluates both textual quality and visual integration across 100 expert-curated tasks. Our TVIR-Agent achieves state-of-the-art performance, demonstrating that structured multi-agent collaboration is key to generating high-quality multimodal reports.


arxiv:2606.02320

TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation

Published on Jun 1 · Submitted by Xinkai Ma on Jun 2 · NJU-LINK Lab


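Authors:

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu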

Abstract

A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text--Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
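The abstract mentions "context-aware sequential writing" without giving details, so the following is only a minimal sketch of what such a loop could look like, not TVIR-Agent's actual implementation. It assumes each section is written with all earlier sections passed in as context, and that a visual with a traceable source is placed right after the text that discusses it. Every name here (`Visual`, `SectionPlan`, `write_report`, `stub_writer`) and the example URL are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Visual:
    """A chart or retrieved image, with the source it can be traced back to."""
    caption: str
    source_url: str


@dataclass
class SectionPlan:
    """One outline entry: a heading plus an optional visual supporting it."""
    heading: str
    visual: Optional[Visual] = None


def write_report(
    outline: List[SectionPlan],
    write_section: Callable[[str, str, Optional[Visual]], str],
) -> str:
    """Write sections in outline order, passing earlier text as context."""
    parts: List[str] = []
    for plan in outline:
        context = "\n\n".join(parts)  # everything written so far
        body = write_section(plan.heading, context, plan.visual)
        block = f"{plan.heading}\n{body}"
        if plan.visual is not None:
            # Interleave the visual right after the text that discusses it.
            block += (
                f"\n[Figure: {plan.visual.caption}"
                f" | source: {plan.visual.source_url}]"
            )
        parts.append(block)
    return "\n\n".join(parts)


def stub_writer(heading: str, context: str, visual: Optional[Visual]) -> str:
    """Placeholder for a model call; reports how much context it received."""
    note = f" See the figure on {visual.caption}." if visual else ""
    return f"({len(context)} chars of prior context){note}"


# Hypothetical two-section outline; the URL is a placeholder.
outline = [
    SectionPlan("Overview"),
    SectionPlan("Market trend", Visual("yearly revenue", "https://example.com/data")),
]
report = write_report(outline, stub_writer)
print(report)
```

In a real system `stub_writer` would be replaced by a model call; keeping it as an injected function makes the sequencing logic testable on its own.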

arXiv: https://arxiv.org/abs/2606.02320 · Project page: https://nju-link.github.io/TVIR/ · GitHub: https://github.com/NJU-LINK/TVIR

Community

Cenji630

Paper author Paper submitter about 9 hours ago


Get this paper in your agent:

hf papers read 2606.02320

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

