
Video Gen

80 articles archived under #video-gen

Hugging Face Daily Papers research 4d ago

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are…

23
Hugging Face Daily Papers research 4d ago

Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

Abstract Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities. Generated by…

4
Hugging Face Daily Papers research 4d ago

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract A novel-view video synthesis method that enhances motion-aware diffusion models through multi-view point tracking supervision to improve geometric consistency and motion fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Synthesizing a novel-view video from a…

37
Hugging Face Daily Papers research 5d ago

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Abstract UnityShots is a memory-driven audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. Generated by…

7
Hugging Face Daily Papers research 5d ago

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

Abstract Camera-controllable video virtual try-on framework uses a 4D proxy with explicit human-environment decoupling and DiT-based video generation for omnidirectional viewing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While Video Virtual Try-on (VVT) has achieved…

4
Hugging Face Daily Papers research 5d ago

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Abstract DomainShuttle enables open domain subject-driven text-to-video generation with high fidelity and flexibility across in-domain and cross-domain scenarios through domain-aware modeling and dual RoPE schemes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Open domain…

10
arXiv — Machine Learning research 6d ago

Information-Theoretic Classifier-Free Guidance with Adaptive Schedule Optimization

arXiv:2606.24025v1 Announce Type: new Abstract: Diffusion models have achieved strong performance in image, text-to-image, and video generation, where conditional generation is often controlled by classifier-free guidance (CFG). CFG improves condition consistency by increasing a…

35
arXiv — Machine Learning research 6d ago

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

arXiv:2606.23743v1 Announce Type: cross Abstract: Modern video diffusion models achieve higher generation quality through scaling, but this also increases inference cost. Although many acceleration methods have been proposed, a central challenge is that the most effective…

30
arXiv — Machine Learning research 6d ago

Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models

arXiv:2606.24152v1 Announce Type: cross Abstract: Existing literature claims that video generation essentially is world modelling. On the one hand, the claim is productive because it pushes generative AI beyond static images and toward temporally extended physical scenes. On the…

15
Hugging Face Daily Papers research 6d ago

Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

Abstract Go-with-the-Track unifies motion control and reference image compositing in video generation by using point-track embeddings with spatial-aware encoding and video diffusion transformers. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Filmmaking demands precise motion…

32
Hugging Face Daily Papers research 10d ago

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Abstract ImageWAM demonstrates that pretrained image editing models can effectively replace video generation in world action models for robot control, achieving better performance with reduced computational costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World Action Models…

25
Hugging Face Daily Papers research 11d ago

LooseControlVideo: Directorial Video Control using Spatial Blocking

Abstract LooseControlVideo enables intuitive 3D spatial control in text-to-video generation using sparse oriented 3D boxes as proxies, achieving superior trajectory accuracy and occlusion handling compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Precise…

10
Hugging Face Daily Papers research 11d ago

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Abstract 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. Generated by…

4
Hugging Face Daily Papers research 13d ago

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Abstract Track2View generates novel camera viewpoints from videos by using 3D point tracks to establish explicit spatiotemporal correspondences, achieving superior visual quality and camera accuracy compared to existing methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

9
Hugging Face Daily Papers research 13d ago

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Abstract LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language-Action models (VLAs)…

33
Hugging Face Daily Papers research 13d ago

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Abstract Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and embodied world knowledge corpus. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

5
Hugging Face Daily Papers research 13d ago

Memento: Reconstruct to Remember for Consistent Long Video Generation

Abstract Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Long-form video generation requires…

17
Hugging Face Daily Papers research 13d ago

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Abstract PermaVid addresses long-term video consistency after edits by using multi-modal memory banks that separate appearance and geometric structure, enabling coherent video generation across time and viewpoints. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Consistent video…

30
arXiv — NLP / Computation & Language research 18d ago

Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

arXiv:2606.12576v1 Announce Type: new Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems…

11
Hugging Face Daily Papers research 19d ago

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Abstract Video generative models achieve improved long-range consistency through coarse-to-fine token generation using a multi-scale autoencoder and diffusion model architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generative models have become increasingly…

28
Hugging Face Daily Papers research 19d ago

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Abstract Next Forcing introduces a multi-chunk prediction framework that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Autoregressive video generation has…

19
Hugging Face Daily Papers research 19d ago

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

Abstract FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation by preserving recent context and long-range anchors under fixed cache constraints. Generated by…

36
Hugging Face Daily Papers research 21d ago

Streaming Video Generation with Streaming Force Control

Abstract StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce StreamForce,…

17
Hugging Face Daily Papers research 24d ago

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Abstract Video generation models were evaluated through robotic manipulation tasks to assess their ability to reflect physical reality, revealing that visual quality does not predict executable motion accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models…

20
Hugging Face Daily Papers research 25d ago

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing…

33
r/MachineLearning community 25d ago

Research in Image/Video Gen AI models [D]

I've been going down a rabbit hole with image/video generation/editing models for a few months now, started with playing around with Stable Diffusion and ComfyUI, then got genuinely hooked on understanding why things work, not just that they do. I have an Engineering background…

20
Hugging Face Daily Papers research 26d ago

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Abstract AAD-1 framework improves one-step autoregressive image-to-video generation by breaking generator-discriminator symmetry and using phased training to prevent motion collapse and training instability. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present AAD-1, an…

17
Hugging Face Daily Papers research 26d ago

Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Abstract Echo Infinity enables real-time infinite video generation using learnable evolving memory and unified relative RoPE to overcome limitations in existing autoregressive methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present Echo Infinity, an autoregressive (AR)…

18
Hugging Face Daily Papers research 27d ago

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As…

23
Hugging Face Daily Papers research 27d ago

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Abstract LongLive-RAG addresses long-video generation challenges by using retrieval-augmented generation to overcome error accumulation from sliding-window attention, enabling better temporal coherence and quality. AI-generated summary Autoregressive (AR) video diffusion enables…

22
arXiv — Machine Learning research 28d ago

SORA: Free Second-Order Attacks in Fast Adversarial Training

arXiv:2606.00738v1 Announce Type: new Abstract: Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high…

33
Hugging Face Daily Papers research 28d ago

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance. AI-generated summary The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs)…

36
Hugging Face Daily Papers research 28d ago

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Abstract StreamChar enables real-time streaming audio-video generation for character animation by separating long-horizon orchestration from short-window denoising through an LLM-based orchestrator and joint audio-video DiT, achieving efficient deployment via two-stage…

8
Hugging Face Daily Papers research 28d ago

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Abstract A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in…

21
Hugging Face Daily Papers research 28d ago

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

Abstract One-Forcing improves one-step video generation quality and efficiency by combining DMD objective with GAN loss, achieving state-of-the-art results with reduced training costs. AI-generated summary Recent advances have substantially improved real-time interactive video…

32
Hugging Face Daily Papers research 28d ago

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Abstract A novel decoupled memory architecture called DecMem is introduced for consistent long-horizon video generation, addressing computational inefficiency and attention dispersion issues in learnable memory systems. AI-generated summary Recent advances in video generative…

30
Hugging Face Daily Papers research 29d ago

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Abstract Lumos-Nexus is a training-efficient video generation framework that uses a two-stage approach with a lightweight generator for training and a high-capacity pretrained generator for inference, achieving enhanced visual fidelity through Unified Progressive Frequency…

35
r/MachineLearning community 29d ago

What’s the actual focus in World Models right now? [R]

Hey everyone, I'm trying to get back into the loop on world models. The last time I followed SSL closely, the buzz was all about Barlow Twins and DINO, but now everything just looks like scaled-up video generation from big industry labs. What is the actual academic research…

36
r/LocalLLaMA community 1mo ago

Keeping multi-GPU rigs cool?

As a newbie to building computers, been having issues trying to figure out how to cool my rig. The problem is that as the heat gets shunted upwards, each card gets hotter than the last (eg 31C -> 38C -> 42C -> 44C. At load during things like video generation, the hottest card…

34
Hugging Face Daily Papers research 1mo ago

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Abstract Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark measuring causal cognition through reverse surprise and visual language model analysis. AI-generated summary As video diffusion models (VDMs)…

38
Hugging Face Daily Papers research 1mo ago

AdaState: Self-Evolving Anchors for Streaming Video Generation

Abstract Video diffusion models with adaptive state replacement generate more dynamic videos by evolving scene references rather than fixing to initial frames, using recurrent denoising as transition function. AI-generated summary Autoregressive video diffusion models generate…

24
Hugging Face Daily Papers research 1mo ago

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Abstract SmartDirector enhances video generation by using multiple keyframes to improve narrative structure and temporal pacing through a two-stage process of low-resolution generation and high-resolution refinement. AI-generated summary The narrative quality of a video…

30
Hugging Face Daily Papers research 1mo ago

Native Audio-Visual Alignment for Generation

Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and…

38
arXiv — Machine Learning research 1mo ago

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

arXiv:2605.28203v1 Announce Type: new Abstract: As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional…

14
Hugging Face Daily Papers research 1mo ago

EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

Abstract EverAnimate addresses long-horizon animated video generation challenges through persistent latent propagation and restorative flow matching to maintain visual quality and character identity. AI-generated summary We propose EverAnimate, an efficient post-training method…

4
The Information — AI news-outlet 1mo ago

Kuaishou’s Kling AI Video Unit Reaches $500 Million in Annualized Revenue

Chinese social media giant Kuaishou Technology said on Wednesday that its Kling AI video business reached an annualized revenue of about $500 million in March. Kling, which develops and sells AI video generation models, generated more than 650 million yuan ($96 million) in…

28
arXiv — NLP / Computation & Language research 1mo ago

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs…

27
Hugging Face Daily Papers research 1mo ago

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Abstract EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment. AI-generated summary The…

33
Hugging Face Daily Papers research 1mo ago

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Abstract MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that improves plausibility through vision-language reasoning and confidence-aware control mechanisms. AI-generated summary Current motion-controlled image-to-video…

4
Hugging Face Daily Papers research 1mo ago

On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Abstract Adversarial Flow Distillation enables efficient distillation of heterogeneous video generation models by using on-policy feedback and forward-process flow-matching updates without requiring teacher scores or detailed trajectory information. AI-generated summary…

37
