Video Gen: 80 articles archived under #video-gen
Hugging Face Daily Papers research 4d ago Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are… 23 Hugging Face Daily Papers research 4d ago Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models Abstract Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities. Generated by… 4 Hugging Face Daily Papers research 4d ago MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation Abstract A novel-view video synthesis method that enhances motion-aware diffusion models through multi-view point tracking supervision to improve geometric consistency and motion fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Synthesizing a novel-view video from a… 37 Hugging Face Daily Papers research 5d ago UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating Abstract UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. 
Generated by… 7 Hugging Face Daily Papers research 5d ago TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy Abstract Camera-controllable video virtual try-on framework uses a 4D proxy with explicit human-environment decoupling and DiT-based video generation for omnidirectional viewing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While Video Virtual Try-on (VVT) has achieved… 4 Hugging Face Daily Papers research 5d ago DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation Abstract DomainShuttle enables open domain subject-driven text-to-video generation with high fidelity and flexibility across in-domain and cross-domain scenarios through domain-aware modeling and dual RoPE schemes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Open domain… 10 arXiv — Machine Learning research 6d ago Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization arXiv:2606.24025v1 Announce Type: new Abstract: Diffusion models have achieved strong performance in image, text-to-image, and video generation, where conditional generation is often controlled by classifier-free guidance (CFG). CFG improves condition consistency by increasing a… 35 arXiv — Machine Learning research 6d ago Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation arXiv:2606.23743v1 Announce Type: cross Abstract: Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective… 30 arXiv — Machine Learning research 6d ago Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models arXiv:2606.24152v1 Announce Type: cross Abstract: Existing literature claims that video generation essentially is world modelling. 
On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the… 15 Hugging Face Daily Papers research 6d ago Go-with-the-Track: Video Compositing and Motion Control with Point Tracking Abstract Go-with-the-Track unifies motion control and reference image compositing in video generation by using point-track embeddings with spatial-aware encoding and video diffusion transformers. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Filmmaking demands precise motion… 32 Hugging Face Daily Papers research 10d ago ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? Abstract ImageWAM demonstrates that pretrained image editing models can effectively replace video generation in world action models for robot control, achieving better performance with reduced computational costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World Action Models… 25 Hugging Face Daily Papers research 11d ago LooseControlVideo: Directorial Video Control using Spatial Blocking Abstract LooseControlVideo enables intuitive 3D spatial control in text-to-video generation using sparse oriented 3D boxes as proxies, achieving superior trajectory accuracy and occlusion handling compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Precise… 10 Hugging Face Daily Papers research 11d ago MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction Abstract 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. 
Generated by… 4 Hugging Face Daily Papers research 13d ago Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks Abstract Track2View generates novel camera viewpoints from videos by using 3D point tracks to establish explicit spatiotemporal correspondences, achieving superior visual quality and camera accuracy compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 9 Hugging Face Daily Papers research 13d ago LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies Abstract LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language-Action models (VLAs)… 33 Hugging Face Daily Papers research 13d ago Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation Abstract Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and embodied world knowledge corpus. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 5 Hugging Face Daily Papers research 13d ago Memento: Reconstruct to Remember for Consistent Long Video Generation Abstract Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-form video generation requires… 17 Hugging Face Daily Papers research 13d ago PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory Abstract PermaVid addresses long-term video consistency after edits by using multi-modal memory banks that separate appearance and geometric structure, enabling coherent video generation across time and viewpoints. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Consistent video… 30 arXiv — NLP / Computation & Language research 18d ago Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures arXiv:2606.12576v1 Announce Type: new Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems… 11 Hugging Face Daily Papers research 19d ago MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation Abstract Video generative models achieve improved long-range consistency through coarse-to-fine token generation using a multi-scale autoencoder and diffusion model architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generative models have become increasingly… 28 Hugging Face Daily Papers research 19d ago Next Forcing: Causal World Modeling with Multi-Chunk Prediction Abstract Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Autoregressive video generation has… 19 Hugging Face Daily Papers research 19d ago FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. 
Generated by… 36 Hugging Face Daily Papers research 21d ago Streaming Video Generation with Streaming Force Control Abstract StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce StreamForce,… 17 Hugging Face Daily Papers research 24d ago Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation? Abstract Video generation models were evaluated through robotic manipulation tasks to assess their ability to reflect physical reality, revealing that visual quality does not predict executable motion accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models… 20 Hugging Face Daily Papers research 25d ago LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing… 33 r/MachineLearning community 25d ago Research in Image/Video Gen AI models [D] I've been going down a rabbit hole with image/video generation/editing models for a few months now, started with playing around with Stable Diffusion and ComfyUI, then got genuinely hooked on understanding why things work, not just that they do. I have an Engineering background… 20 Hugging Face Daily Papers research 26d ago AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation Abstract AAD-1 framework improves one-step autoregressive image-to-video generation by breaking generator-discriminator symmetry and using phased training to prevent motion collapse and training instability. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present AAD-1, an… 17 Hugging Face Daily Papers research 26d ago Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation Abstract Echo Infinity enables real-time infinite video generation using learnable evolving memory and unified relative RoPE to overcome limitations in existing autoregressive methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present Echo Infinity, an autoregressive (AR)… 18 Hugging Face Daily Papers research 27d ago NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As… 23 Hugging Face Daily Papers research 27d ago LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation Abstract LongLive-RAG addresses long-video generation challenges by using retrieval-augmented generation to overcome error accumulation from sliding-window attention, enabling better temporal coherence and quality. AI-generated summary Autoregressive (AR) video diffusion enables… 22 arXiv — Machine Learning research 28d ago SORA: Free Second-Order Attacks in Fast Adversarial Training arXiv:2606.00738v1 Announce Type: new Abstract: Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high… 33 Hugging Face Daily Papers research 28d ago VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization Abstract Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance. 
AI-generated summary The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs)… 36 Hugging Face Daily Papers research 28d ago StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration Abstract StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage… 8 Hugging Face Daily Papers research 28d ago Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models Abstract A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in… 21 Hugging Face Daily Papers research 28d ago One-Forcing: Towards Stable One-Step Autoregressive Video Generation Abstract One-Forcing improves one-step video generation quality and efficiency by combining DMD objective with GAN loss, achieving state-of-the-art results with reduced training costs. AI-generated summary Recent advances have substantially improved real-time interactive video… 32 Hugging Face Daily Papers research 28d ago DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory Abstract A novel decoupled memory architecture called DecMem is introduced for consistent long-horizon video generation, addressing computational inefficiency and attention dispersion issues in learnable memory systems. 
AI-generated summary Recent advances in video generative… 30 Hugging Face Daily Papers research 29d ago Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models Abstract Lumos-Nexus is a training-efficient video generation framework that uses a two-stage approach with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency… 35 r/MachineLearning community 29d ago What’s the actual focus in World Models right now? [R] Hey everyone, I'm trying to get back into the loop on world models. The last time I followed SSL closely, the buzz was all about Barlow Twins and DINO, but now everything just looks like scaled-up video generation from big industry labs. What is the actual academic research… 36 r/LocalLLaMA community 1mo ago Keeping multi-GPU rigs cool? As a newbie to building computers, been having issues trying to figure out how to cool my rig. The problem is that as the heat gets shunted upwards, each card gets hotter than the last (eg 31C -> 38C -> 42C -> 44C. At load during things like video generation, the hottest card… 34 Hugging Face Daily Papers research 1mo ago YoCausal: How Far is Video Generation from World Model? A Causality Perspective Abstract Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark measuring causal cognition through reverse surprise and visual language model analysis. AI-generated summary As video diffusion models (VDMs)… 38 Hugging Face Daily Papers research 1mo ago AdaState: Self-Evolving Anchors for Streaming Video Generation Abstract Video diffusion models with adaptive state replacement generate more dynamic videos by evolving scene references rather than fixing to initial frames, using recurrent denoising as transition function. 
AI-generated summary Autoregressive video diffusion models generate… 24 Hugging Face Daily Papers research 1mo ago SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control Abstract SmartDirector enhances video generation by using multiple keyframes to improve narrative structure and temporal pacing through a two-stage process of low-resolution generation and high-resolution refinement. AI-generated summary The narrative quality of a video… 30 Hugging Face Daily Papers research 1mo ago Native Audio-Visual Alignment for Generation Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and… 38 arXiv — Machine Learning research 1mo ago Refining Multidimensional Video Reward Models via Disentangled Influence Functions arXiv:2605.28203v1 Announce Type: new Abstract: As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional… 14 Hugging Face Daily Papers research 1mo ago EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration Abstract EverAnimate addresses long-horizon animated video generation challenges through persistent latent propagation and restorative flow matching to maintain visual quality and character identity. AI-generated summary We propose EverAnimate, an efficient post-training method… 4 The Information — AI news-outlet 1mo ago Kuaishou’s Kling AI Video Unit Reaches $500 Million in Annualized Revenue Chinese social media giant Kuaishou Technology said on Wednesday that its Kling AI video business reached an annualized revenue of about $500 million in March. 
Kling, which develops and sells AI video generation models, generated more than 650 million yuan ($96 million) in… 28 arXiv — NLP / Computation & Language research 1mo ago Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs… 27 Hugging Face Daily Papers research 1mo ago EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation Abstract EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment. AI-generated summary The… 33 Hugging Face Daily Papers research 1mo ago MotiMotion: Motion-Controlled Video Generation with Visual Reasoning Abstract MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that improves plausibility through vision-language reasoning and confidence-aware control mechanisms. AI-generated summary Current motion-controlled image-to-video… 4 Hugging Face Daily Papers research 1mo ago On-Policy Adversarial Flow Distillation for Autoregressive Video Generation Abstract Adversarial Flow Distillation enables efficient distillation of heterogeneous video generation models by using on-policy feedback and forward-process flow-matching updates without requiring teacher scores or detailed trajectory information. AI-generated summary… 37