Hugging Face Daily Papers · 4 min read

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20910
Published on May 20, 2026 · Submitted by Jangho Park on May 22, 2026
Authors: Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye
Organization: KAIST AI
Code: https://github.com/jhq1234/flowlong

AI-generated summary

A novel inference-time method for long video generation that uses overlapping sliding windows with Tweedie matching and stochastic early-phase sampling to improve temporal consistency and visual quality.

Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
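The sampling loop described in the abstract can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation. It assumes the common rectified-flow convention x_t = (1 - t) x_0 + t ε with velocity v = ε - x_0, so that a window's predicted clean sample is x_t - t v. Uniform averaging in the overlaps is used as a simple stand-in for the paper's Tweedie matching, whose exact rule the abstract does not give. The names `window_starts`, `toy_velocity`, `generate_long`, and `stochastic_frac` are all hypothetical, and `toy_velocity` replaces a real pretrained model.

```python
import numpy as np


def window_starts(total_len, win_len, overlap):
    """Start indices of overlapping windows that cover [0, total_len)."""
    if not 0 <= overlap < win_len <= total_len:
        raise ValueError("need 0 <= overlap < win_len <= total_len")
    stride = win_len - overlap
    starts = list(range(0, total_len - win_len + 1, stride))
    if starts[-1] + win_len < total_len:
        # Add a final window so the tail frames are covered.
        starts.append(total_len - win_len)
    return starts


def toy_velocity(x_t, t):
    """Hypothetical stand-in for a pretrained flow model.

    Its implied clean clip is the temporal mean of the window, so
    neighbouring windows disagree inside their overlap.
    """
    x0 = np.broadcast_to(x_t.mean(axis=0, keepdims=True), x_t.shape)
    return (x_t - x0) / t


def generate_long(total_len, win_len, overlap, dim=4, steps=20,
                  stochastic_frac=0.5, velocity_fn=toy_velocity, seed=0):
    """Toy long-sequence sampler over overlapping windows."""
    rng = np.random.default_rng(seed)
    starts = window_starts(total_len, win_len, overlap)
    x = rng.standard_normal((total_len, dim))  # pure noise at t = 1
    ts = np.linspace(1.0, 0.0, steps + 1)

    for t, t_next in zip(ts[:-1], ts[1:]):
        x0_sum = np.zeros_like(x)
        count = np.zeros((total_len, 1))
        for s in starts:
            sl = slice(s, s + win_len)
            v = velocity_fn(x[sl], t)
            x0_sum[sl] += x[sl] - t * v  # per-window clean-sample estimate
            count[sl] += 1
        # Assumption: plain averaging across windows in the overlaps.
        x0_hat = x0_sum / count

        if t > 1.0 - stochastic_frac:
            # High-noise phase: re-noise the blended estimate with fresh noise.
            eps = rng.standard_normal(x.shape)
        else:
            # Low-noise phase: reuse the noise implied by the current state,
            # which equals an Euler ODE step with the blended velocity.
            eps = (x - (1.0 - t) * x0_hat) / t
        x = (1.0 - t_next) * x0_hat + t_next * eps
    return x


if __name__ == "__main__":
    video = generate_long(total_len=40, win_len=16, overlap=8)
    print(video.shape)
```

In this sketch the whole sequence lives in one shared latent array, so every window reads and writes the same overlap frames at each step. A real model would be plugged in through `velocity_fn`, and `stochastic_frac` controls where the switch from fresh-noise steps to deterministic steps happens; the value 0.5 is arbitrary.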

Community

Jangho Park (paper author, paper submitter):

FlowLong is a training-free, model-agnostic inference-time method that extends pretrained flow-based video diffusion models beyond their native generation horizon. It works uniformly for text-to-video, joint audio-video, and text-to-3D scene generation.
Project page: https://flowlong-video.github.io/
Paper: https://arxiv.org/abs/2605.20910
