Hugging Face Daily Papers · June 10, 2026 · 3 min read

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://cvlab-kaist.github.io/LipForcing/
Code: https://github.com/cvlab-kaist/LipForcing

arxiv:2606.11180

Published on Jun 9 · Submitted by Yi Jung on Jun 10 · KAIST AI

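Authors: Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim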

Abstract

Autoregressive diffusion method for video-to-video lip synchronization achieves real-time performance through distillation and optimized inference schedules.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method for video-to-video (V2V) lip synchronization, which distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students. At inference, the students generate each chunk in only two denoising steps without inference-time CFG, enabling real-time lip synchronization. A lip-sync-specific teacher-trajectory analysis reveals a CFG fidelity-sync tradeoff: no-CFG predictions favor reference fidelity, whereas CFG-guided predictions favor synchronization within a mid-trajectory band. Lip Forcing translates this finding into three analysis-derived components: Sync-Window DMD, a two-step inference schedule, and a SyncNet-based reward. We validate Lip Forcing at two student scales, both distilled from the 14B teacher. The 1.3B student crosses into real-time streaming at 31 FPS, 17.6× faster than its same-scale bidirectional model. The 14B student, the largest diffusion model reported for V2V lip synchronization, runs 39.8× faster than its teacher at comparable reference fidelity. Time-to-first-frame is sub-millisecond at both scales, far below every diffusion baseline.
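The inference pattern the abstract describes (causal, chunk-by-chunk generation with two denoising steps per chunk and a single conditional pass per step) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_student`, the scalar "frames", the timestep values `(1.0, 0.5)`, the chunk length, and the re-noising rule `(1 - t) * x0 + t * noise` are all assumptions chosen only to make the control flow runnable. The second helper works through the speedup arithmetic, assuming the reported 17.6× figure refers to throughput.

```python
import random


def toy_student(noisy_chunk, t, audio_feat, context):
    """Hypothetical stand-in for a causal student network.

    Returns a clean-chunk estimate from the current audio feature and the
    last previously generated frame. A real model would be a video
    diffusion transformer; this only mimics the interface.
    """
    prev = context[-1] if context else 0.0
    target = 0.5 * prev + 0.5 * audio_feat
    return [target] * len(noisy_chunk)


def generate_stream(audio_feats, chunk_len=4, timesteps=(1.0, 0.5), seed=0):
    """Generate one chunk per audio feature, two denoising steps per chunk.

    Each step is a single conditional forward pass (no CFG, so no second
    unconditional pass). Earlier chunks are only read, never revised.
    """
    rng = random.Random(seed)
    context, calls = [], 0
    for audio_feat in audio_feats:
        # Start every chunk from pure noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_len)]
        for i, t in enumerate(timesteps):
            x0 = toy_student(x, t, audio_feat, context)
            calls += 1
            if i + 1 < len(timesteps):
                # Assumed re-noising rule: blend the estimate with fresh
                # noise at the next (lower) noise level.
                t_next = timesteps[i + 1]
                x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0]
            else:
                x = x0
        # Finished frames become causal context for later chunks.
        context.extend(x)
    return context, calls


def implied_baseline_fps(student_fps, speedup):
    """Throughput of the slower model implied by a reported speedup."""
    return student_fps / speedup


if __name__ == "__main__":
    frames, calls = generate_stream([0.0, 1.0, 0.5])
    print(len(frames), calls)
    print(round(implied_baseline_fps(31.0, 17.6), 2))
```

In this sketch, three chunks cost six network calls in total, whereas a CFG-guided sampler would need two passes per step. Under the throughput assumption, 31 FPS at 17.6× implies the same-scale bidirectional model runs at roughly 1.76 FPS, which is consistent with the abstract's framing of the 1.3B student "crossing into" real-time.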