Hugging Face Daily Papers · 4 min read

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

hf: https://huggingface.co/InterleaveThinker
Project page: https://zhengdian1.github.io/InterleaveThinker-proj/
Code: https://github.com/zhengdian1/InterleaveThinker
arxiv:2606.13679

Published on Jun 11 · Submitted by Dian Zheng on Jun 12 · #2 Paper of the day
Authors: Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

Abstract

InterleaveThinker adds interleaved text-image generation to existing image generators through a multi-agent pipeline with planner and critic agents. It reaches performance comparable to Nano Banana and GPT-5 on interleaved generation benchmarks and also improves the base generator's scores on reasoning-based benchmarks.

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
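The plan, generate, critique, and regenerate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` type, the callable signatures, the `max_retries` cap, and the toy agents are all hypothetical, and the real planner, critic, and image generator are models rather than string functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # what the planner asks the image generator to produce
    caption: str      # text that accompanies the image in the interleaved output


# Hypothetical agent signatures (assumptions, not taken from the paper).
Planner = Callable[[str], List[Step]]
Generator = Callable[[str], str]                 # instruction -> image handle
Critic = Callable[[str, str], Tuple[bool, str]]  # (planned instruction, image) -> (ok, refined instruction)


def interleaved_generate(prompt: str, planner: Planner, generator: Generator,
                         critic: Critic, max_retries: int = 2):
    """Run the planner once, then generate and critique each step in order."""
    output = []
    calls = 0  # number of generator calls in this trajectory
    for step in planner(prompt):
        instruction = step.instruction
        image = generator(instruction)
        calls += 1
        for _ in range(max_retries):
            ok, refined = critic(step.instruction, image)
            if ok:
                break
            # The critic rejected the image: regenerate from the refined instruction.
            instruction = refined
            image = generator(instruction)
            calls += 1
        # After the last retry the image is kept without a further check.
        output.append((step.caption, image))
    return output, calls


# Toy stand-ins so the sketch runs without any model.
def toy_planner(prompt: str) -> List[Step]:
    return [Step(f"{prompt}, panel {i}", f"Panel {i}") for i in (1, 2)]


def toy_generator(instruction: str) -> str:
    return f"<image:{instruction}>"


def toy_critic(planned: str, image: str) -> Tuple[bool, str]:
    # Accepts only images produced from an already refined instruction.
    if "refined" in image:
        return True, planned
    return False, planned + " (refined)"


story, calls = interleaved_generate(
    "a fox crossing a river", toy_planner, toy_generator, toy_critic
)
```

Even in this toy setting, every rejected image costs one more generator call, which gives a feel for why the abstract reports trajectories of over 25 generator calls and why the authors train the critic with single-step rewards rather than optimizing the whole trajectory.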

Get this paper in your agent:

hf papers read 2606.13679
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3
