Hugging Face Daily Papers · 20 min read

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

**Catnip AI Team**

Authors: Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

Published 2026-06-16 · Submitted to Daily Papers on 2026-06-18 by indulgeBai

| | |
|:--|:--|
| 🌐 Project | https://mainecoon.tech/ |
| 🕹️ Experience | https://mainecoon.tech/experience-platform |
| 📄 Paper (arXiv) | https://arxiv.org/abs/2606.17800 |
| 📝 Blog | https://mainecoon.tech/blogs |
| 💻 GitHub | https://github.com/catnip-ai-tech/MaineCoon |

## Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as a first step toward this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. They typically omit critical auditory information or fail to capture the high-engagement pacing, emotional resonance, and rapid conversational flow that define viral social media. To bridge this gap, we present **MaineCoon**, the first real-time audio-visual autoregressive model that has **22B parameters** and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of **up to 47.5 FPS on a single GPU**. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.

## Highlights

- **⚡ Real-time on a single GPU.** A 22B interactive audio-visual autoregressive model capable of streaming generation and sub-second interaction, with a record-breaking frame rate of **up to 47.5 FPS** on a single H100. Generation cost drops well **below $0.001 per second**, and keeps falling.
- **🌍 A new paradigm: social world models.** MaineCoon positions and serves as the first generative core for _social world models_, a technical foundation for next-generation AI-native social platforms.
- **🎓 Forcing-free streaming training.** A multi-stage training paradigm (**self-resampling**, **cross-modal representation alignment**, **domain-aware preference optimization**, and **reinforced online-policy distillation (ROPD)**) that enables native, efficient streaming audio-visual training at 22B scale.
- **🧠 Agentic streaming inference.** An agentic inference framework that supports **thousand-second-scale** generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.
- **📊 SocialVideo-Bench.** A new benchmark focused on audio-visual social-video generation, with 9 representative metrics covering visual quality, motion, audio quality, audio-visual alignment, and social-video harmony. MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed, a new state of the art for real-time social video generation.

## Showcase

Hand-picked MaineCoon generations (audio-visual, with sound) play directly in the **[GitHub repository](https://github.com/catnip-ai-tech/MaineCoon)**.

🎬 **Minute-scale, long-form demos** are best viewed on our **[blog](https://mainecoon.tech/blogs)**. 🕹️ **Try MaineCoon live** at the **[experience platform](https://mainecoon.tech/experience-platform)**.

## Benchmark: SocialVideo-Bench

**Table 2. Main quantitative results on SocialVideo-Bench.** 🐱 **MaineCoon (Ours)** achieves the best average score and wins six of the nine metrics, including the two most comprehensive ones, Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS), over both streaming and bidirectional baselines.

| Type | Model | Vis↑ | Mot↑ | Aud↑ | IB-TV↑ | IB-TA↑ | IB-AV↑ | AV-Al↑ | AVH↑ | JAVIS↑ | Average↑ |
|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Bidirectional T2AV | JavisDiT++ | 4.39 | **2.22** | 4.06 | 0.134 | 0.070 | 0.151 | 0.312 | 0.136 | 0.112 | 0.711 |
| | Ovi | 4.44 | 1.89 | 3.76 | _0.138_ | 0.079 | 0.191 | **0.412** | 0.188 | 0.162 | 0.779 |
| | JoyAI-Echo | 4.61 | 1.17 | 3.47 | **0.147** | 0.088 | 0.226 | 0.319 | 0.196 | 0.173 | 0.749 |
| | MoVA | _4.66_ | 1.68 | 3.69 | 0.133 | 0.105 | 0.258 | _0.359_ | 0.245 | 0.216 | 0.842 |
| | LTX-2.3 | 4.10 | 0.99 | 4.06 | 0.132 | 0.111 | 0.311 | 0.334 | 0.287 | _0.247_ | 0.848 |
| Streaming TA2V | LiveAvatar | 4.60 | 1.46 | _4.13_ | 0.131 | 0.120 | _0.316_ | 0.326 | _0.291_ | 0.246 | 0.892 |
| | SoulX-FlashTalk | 4.65 | _1.99_ | 4.07 | 0.128 | _0.120_ | 0.307 | 0.279 | 0.283 | 0.238 | _0.895_ |
| **Streaming T2AV** | 🐱 **MaineCoon (Ours)** | **4.71** | 1.62 | **4.35** | 0.127 | **0.130** | **0.318** | 0.334 | **0.308** | **0.272** | **0.934** 🥇 |

<sub>🐱 = our method · **bold** = best, _italic_ = second best. Metrics: Vis: visual quality · Mot: motion · Aud: audio quality · IB-TV / IB-TA / IB-AV: ImageBind Text–Video / Text–Audio / Audio–Video alignment · AV-Al: audio–visual alignment · AVH: Audio-Visual Harmony · JAVIS: Joint Audio-Visual Integrated Score. See the technical report for the full benchmark and metric definitions.</sub>

**Table 3. Latency and model size comparison.** Sampling throughput (FPS) is measured for 480P 20-second generation on a single H100 GPU. 🐱 **MaineCoon (Ours)** is the **largest streaming model in the comparison yet by far the fastest**: up to **7× faster** than other streaming audio-visual generators, and faster even than a 1.3B streaming video model.

| Type | Model | Params | FPS↑ |
|:--|:--|:--:|:--:|
| Bidirectional T2AV | JavisDiT++ | 1.8B | 0.87 |
| | Ovi | 11B | 0.58 |
| | JoyAI-Echo | 23B | 18.0 |
| | MoVA | 32B | 0.26 |
| | LTX-2.3 | 22B | 1.40 |
| | LTX-2.3-Distilled | 22B | _20.7_ |
| Streaming T2V | Causal-Forcing | 1.3B | 19.1 |
| | Helios-Distilled | 14B | 18.2 |
| | Krea | 14B | 6.1 |
| Streaming TA2V | LiveAvatar | 14B | 6.7 |
| | SoulX-FlashTalk | 14B | 6.6 |
| **Streaming T2AV** | 🐱 **MaineCoon (Ours)** | **22B** | **47.5** 🥇 |

<sub>🐱 = our method · **bold** = best, _italic_ = second best. FPS for 480P-20s on a single H100.</sub>

## Paper

The full paper is available on **[arXiv:2606.17800](https://arxiv.org/abs/2606.17800)**. A PDF copy is also included in this repository: [`MaineCoon_Technical_Report.pdf`](./MaineCoon_Technical_Report.pdf). It covers the social-video data infrastructure, the native streaming autoregressive training recipe, the agentic streaming inference framework, SocialVideo-Bench, and a position/outlook on social world models.

## Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

- **🎬 LTX-2.3 & the LTX series, by [Lightricks](https://github.com/Lightricks).** MaineCoon's audio-visual backbone builds on the excellent open **LTX-2.3** model. Huge credit to the LTX team and the broader LTX-Video series.
  - **LTX-2** (incl. LTX-2.3): https://github.com/Lightricks/LTX-2
  - **LTX-Video**: https://github.com/Lightricks/LTX-Video
- **⚡ DMD series & the distribution-matching distillation community.** Our reinforced online-policy distillation (ROPD) builds on the **Distribution Matching Distillation (DMD / DMD2)** line of work and the wider few-step / real-time distillation community.
  - **DMD2**: https://github.com/tianweiy/DMD2
  - **DMD** (project page): https://tianweiy.github.io/dmd/

We thank these projects and their communities for advancing real-time, few-step, and streaming video generation.

## Citation

```bibtex
@article{catnip2026mainecoon,
  title   = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author  = {Catnip AI Team},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.17800},
  url     = {https://arxiv.org/abs/2606.17800}
}
```
arxiv:2606.17800

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Published on Jun 16 · Submitted by Bai LiChen on Jun 18

Abstract

MaineCoon represents the first real-time audio-visual autoregressive model for social worlds, achieving high frame rates and long-horizon generation through novel training techniques and inference frameworks.

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

Community

Paper submitter about 1 hour ago

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Catnip AI Team

Abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. They typically omit critical auditory information or fail to capture the high-engagement pacing, emotional resonance, and rapid conversational flow that define viral social media. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

Highlights

  • ⚡ Real-time on a single GPU. A 22B interactive audio-visual autoregressive model capable of streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS on a single H100. Generation cost drops well below $0.001 per second — and keeps falling.
  • 🌍 A new paradigm: social world models. MaineCoon positions and serves as the first generative core for social world models, a technical foundation for next-generation AI-native social platforms.
  • 🎓 Forcing-free streaming training. A multi-stage training paradigm — self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD) — that enables native, efficient streaming audio-visual training at 22B scale.
  • 🧠 Agentic streaming inference. An agentic inference framework that supports thousand-second-scale generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.
  • 📊 SocialVideo-Bench. A new benchmark focused on audio-visual social-video generation, with 9 representative metrics covering visual quality, motion, audio quality, audio-visual alignment, and social-video harmony. MaineCoon outperforms 7 representative open audio-visual models while achieving the fastest generation speed — a new state of the art for real-time social video generation.
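The cost figure in the first highlight can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope estimate, not the authors' own calculation: the 47.5 FPS throughput comes from the text, while the playback frame rate and the hourly GPU price are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Back-of-envelope estimate of real-time factor and cost per second of video.
# Only GENERATION_FPS is taken from the source; the other inputs are assumptions.

GENERATION_FPS = 47.5     # reported peak throughput on a single H100
PLAYBACK_FPS = 24.0       # ASSUMED playback frame rate (not stated in the source)
GPU_USD_PER_HOUR = 2.50   # ASSUMED H100 rental price (hypothetical)


def real_time_factor(generation_fps: float, playback_fps: float) -> float:
    """Seconds of video produced per second of GPU wall-clock time."""
    return generation_fps / playback_fps


def cost_per_video_second(
    generation_fps: float, playback_fps: float, gpu_usd_per_hour: float
) -> float:
    """GPU rental cost (USD) to generate one second of video."""
    gpu_seconds_needed = playback_fps / generation_fps
    return gpu_seconds_needed * gpu_usd_per_hour / 3600.0


rtf = real_time_factor(GENERATION_FPS, PLAYBACK_FPS)
cost = cost_per_video_second(GENERATION_FPS, PLAYBACK_FPS, GPU_USD_PER_HOUR)

print(f"real-time factor: {rtf:.2f}x")
print(f"cost per video second: ${cost:.5f}")
```

Under these assumed inputs the estimate stays below $0.001 per second of video; a different playback rate or GPU price shifts the number proportionally.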

Showcase

Hand-picked MaineCoon generations (audio-visual, with sound) play directly in the GitHub repository.

🎬 Minute-scale, long-form demos are best viewed on our blog.   🕹️ Try MaineCoon live at the experience platform.

Benchmark — SocialVideo-Bench

Table 2. Main quantitative results on SocialVideo-Bench. 🐱 MaineCoon (Ours) achieves the best average score and wins most metrics, including the two most comprehensive ones — Audio-Visual Harmony (AVH) and Joint Audio-Visual Integrated Score (JAVIS) — over both streaming and bidirectional baselines.

| Type | Model | Vis↑ | Mot↑ | Aud↑ | IB-TV↑ | IB-TA↑ | IB-AV↑ | AV-Al↑ | AVH↑ | JAVIS↑ | Average↑ |
|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Bidirectional T2AV | JavisDiT++ | 4.39 | 2.22 | 4.06 | 0.134 | 0.070 | 0.151 | 0.312 | 0.136 | 0.112 | 0.711 |
| | Ovi | 4.44 | 1.89 | 3.76 | 0.138 | 0.079 | 0.191 | 0.412 | 0.188 | 0.162 | 0.779 |
| | JoyAI-Echo | 4.61 | 1.17 | 3.47 | 0.147 | 0.088 | 0.226 | 0.319 | 0.196 | 0.173 | 0.749 |
| | MoVA | 4.66 | 1.68 | 3.69 | 0.133 | 0.105 | 0.258 | 0.359 | 0.245 | 0.216 | 0.842 |
| | LTX-2.3 | 4.10 | 0.99 | 4.06 | 0.132 | 0.111 | 0.311 | 0.334 | 0.287 | 0.247 | 0.848 |
| Streaming TA2V | LiveAvatar | 4.60 | 1.46 | 4.13 | 0.131 | 0.120 | 0.316 | 0.326 | 0.291 | 0.246 | 0.892 |
| | SoulX-FlashTalk | 4.65 | 1.99 | 4.07 | 0.128 | 0.120 | 0.307 | 0.279 | 0.283 | 0.238 | 0.895 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 4.71 | 1.62 | 4.35 | 0.127 | 0.130 | 0.318 | 0.334 | 0.308 | 0.272 | 0.934 🥇 |

🐱 = our method  ·  bold = best, italic = second best.   Metrics — Vis: visual quality · Mot: motion · Aud: audio quality · IB-TV / IB-TA / IB-AV: ImageBind Text–Video / Text–Audio / Audio–Video alignment · AV-Al: audio–visual alignment · AVH: Audio-Visual Harmony · JAVIS: Joint Audio-Visual Integrated Score. See the technical report for the full benchmark and metric definitions.
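The statement that MaineCoon wins most metrics can be checked directly from the numbers in Table 2. The following is a small sketch that copies the nine per-metric scores from the table and reports the top model for each; the helper name `best_per_metric` is illustrative, and the Average column is left out because its normalization is defined only in the technical report.

```python
from collections import Counter

# Per-metric scores copied from Table 2 (higher is better for every metric).
METRICS = ["Vis", "Mot", "Aud", "IB-TV", "IB-TA", "IB-AV", "AV-Al", "AVH", "JAVIS"]
SCORES = {
    "JavisDiT++":      [4.39, 2.22, 4.06, 0.134, 0.070, 0.151, 0.312, 0.136, 0.112],
    "Ovi":             [4.44, 1.89, 3.76, 0.138, 0.079, 0.191, 0.412, 0.188, 0.162],
    "JoyAI-Echo":      [4.61, 1.17, 3.47, 0.147, 0.088, 0.226, 0.319, 0.196, 0.173],
    "MoVA":            [4.66, 1.68, 3.69, 0.133, 0.105, 0.258, 0.359, 0.245, 0.216],
    "LTX-2.3":         [4.10, 0.99, 4.06, 0.132, 0.111, 0.311, 0.334, 0.287, 0.247],
    "LiveAvatar":      [4.60, 1.46, 4.13, 0.131, 0.120, 0.316, 0.326, 0.291, 0.246],
    "SoulX-FlashTalk": [4.65, 1.99, 4.07, 0.128, 0.120, 0.307, 0.279, 0.283, 0.238],
    "MaineCoon":       [4.71, 1.62, 4.35, 0.127, 0.130, 0.318, 0.334, 0.308, 0.272],
}


def best_per_metric(scores: dict, metrics: list) -> dict:
    """Map each metric name to the model with the highest score on it."""
    return {
        metric: max(scores, key=lambda model: scores[model][i])
        for i, metric in enumerate(metrics)
    }


winners = best_per_metric(SCORES, METRICS)
wins = Counter(winners.values())

for metric, model in winners.items():
    print(f"{metric:>6}: {model}")
print("wins per model:", dict(wins))
```

On these values MaineCoon leads six of the nine metrics, including AVH and JAVIS, while motion, IB-TV, and AV-Al are led by bidirectional baselines.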

Table 3. Latency and model size comparison. Sampling throughput (FPS) is measured for 480P 20-second generation on a single H100 GPU. 🐱 MaineCoon (Ours) is the largest streaming model in the comparison (only the bidirectional MoVA and JoyAI-Echo are larger) and by far the fastest overall: up to 7× faster than other streaming audio-visual generators, and faster even than a 1.3B streaming video model.

| Type | Model | Params | FPS↑ |
|:--|:--|:--:|:--:|
| Bidirectional T2AV | JavisDiT++ | 1.8B | 0.87 |
| | Ovi | 11B | 0.58 |
| | JoyAI-Echo | 23B | 18.0 |
| | MoVA | 32B | 0.26 |
| | LTX-2.3 | 22B | 1.40 |
| | LTX-2.3-Distilled | 22B | 20.7 |
| Streaming T2V | Causal-Forcing | 1.3B | 19.1 |
| | Helios-Distilled | 14B | 18.2 |
| | Krea | 14B | 6.1 |
| Streaming TA2V | LiveAvatar | 14B | 6.7 |
| | SoulX-FlashTalk | 14B | 6.6 |
| Streaming T2AV | 🐱 MaineCoon (Ours) | 22B | 47.5 🥇 |

🐱 = our method  ·  bold = best, italic = second best. FPS for 480P-20s on a single H100.
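The speedup claims attached to Table 3 follow from simple ratios of the FPS column. As a rough sketch, the snippet below copies the throughput values from the table and divides MaineCoon's FPS by each baseline's; the grouping labels mirror the Type column, and the function name `speedup_over` is illustrative.

```python
# (type, model, FPS) rows copied from Table 3; MaineCoon is kept separate.
BASELINES = [
    ("Bidirectional T2AV", "JavisDiT++", 0.87),
    ("Bidirectional T2AV", "Ovi", 0.58),
    ("Bidirectional T2AV", "JoyAI-Echo", 18.0),
    ("Bidirectional T2AV", "MoVA", 0.26),
    ("Bidirectional T2AV", "LTX-2.3", 1.40),
    ("Bidirectional T2AV", "LTX-2.3-Distilled", 20.7),
    ("Streaming T2V", "Causal-Forcing", 19.1),
    ("Streaming T2V", "Helios-Distilled", 18.2),
    ("Streaming T2V", "Krea", 6.1),
    ("Streaming TA2V", "LiveAvatar", 6.7),
    ("Streaming TA2V", "SoulX-FlashTalk", 6.6),
]
MAINECOON_FPS = 47.5


def speedup_over(baseline_fps: float, ours_fps: float = MAINECOON_FPS) -> float:
    """Throughput ratio of MaineCoon to a baseline (values above 1 mean faster)."""
    return ours_fps / baseline_fps


speedups = {model: speedup_over(fps) for _, model, fps in BASELINES}
streaming_av = {
    model: speedups[model] for kind, model, _ in BASELINES if kind == "Streaming TA2V"
}

for model, ratio in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{model:<20} {ratio:6.1f}x")
print("max speedup over streaming audio-visual baselines:",
      round(max(streaming_av.values()), 1))
```

The largest ratio among the streaming audio-visual baselines is against SoulX-FlashTalk (47.5 / 6.6), which is where the "up to 7×" figure comes from; the closest competitor overall is LTX-2.3-Distilled at a little over 2×.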

Paper

The full paper is available on arXiv:2606.17800. A PDF copy is also included in this repository: MaineCoon_Technical_Report.pdf. It covers the social-video data infrastructure, the native streaming autoregressive training recipe, the agentic streaming inference framework, SocialVideo-Bench, and a position/outlook on social world models.

Acknowledgements

MaineCoon stands on the shoulders of the open-source community. We are especially grateful to:

We thank these projects and their communities for advancing real-time, few-step, and streaming video generation.

Citation

@article{catnip2026mainecoon,
  title        = {MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model},
  author       = {Catnip AI Team},
  year         = {2026},
  journal      = {arXiv preprint arXiv:2606.17800},
  url          = {https://arxiv.org/abs/2606.17800}
}