PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Abstract
Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.
AI-generated summary
PresentAgent-2 is an agentic framework that generates presentation videos from user queries by conducting research, creating multimodal slides, and producing interactive content across single, discussion, and interaction modes.
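The abstract outlines a staged pipeline: summarize the query into a topic, research multimodal resources, build slides, write mode-specific scripts, and compose the video. Below is a minimal sketch of that control flow; every class and function name here is a hypothetical illustration, not the authors' API (see the linked GitHub repository for the actual implementation).

```python
# Hypothetical sketch of the staged pipeline described in the abstract.
# All names are illustrative assumptions, not PresentAgent-2's real code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    kind: str      # "text" | "image" | "gif" | "video"
    content: str   # text body or media URL

@dataclass
class Presentation:
    slides: List[str] = field(default_factory=list)
    scripts: List[str] = field(default_factory=list)

def summarize_topic(query: str) -> str:
    # Stage 1: condense an open-ended query into a focused topic.
    return query.strip().rstrip("?")

def deep_research(topic: str) -> List[Resource]:
    # Stage 2: collect presentation-friendly multimodal resources.
    # Stubbed here; the real system retrieves text, images, GIFs, videos.
    return [Resource("text", f"Background notes on {topic}")]

def build_slides(resources: List[Resource]) -> List[str]:
    # Stage 3: turn retrieved resources into slide content.
    return [r.content for r in resources if r.kind == "text"]

def write_scripts(slides: List[str], mode: str) -> List[str]:
    # Stage 4: mode-specific narration. A "discussion" mode would emit
    # multi-speaker turns; "single" emits one narrator line per slide.
    prefix = "Host:" if mode == "discussion" else "Narrator:"
    return [f"{prefix} {s}" for s in slides]

def generate_presentation(query: str, mode: str = "single") -> Presentation:
    topic = summarize_topic(query)
    resources = deep_research(topic)
    slides = build_slides(resources)
    scripts = write_scripts(slides, mode)
    # Stage 5 (omitted): compose slides, TTS audio, and dynamic media
    # into the final presentation video.
    return Presentation(slides=slides, scripts=scripts)

if __name__ == "__main__":
    print(generate_presentation("How do transformers work?", mode="single"))
```

The Interaction mode described in the abstract would sit on top of this output, answering audience questions grounded in the generated slides, scripts, and retrieved evidence rather than producing a new video.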
Community
The code has been open-sourced.