r/LocalLLaMA · 1 min read

meituan-longcat/LongCat-Video-Avatar-1.5 · Hugging Face

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

🚀 Model Introduction

We are excited to announce the release of LongCat-Video-Avatar 1.5, an upgraded open-source framework that prioritizes extreme empirical optimization and production-readiness for audio-driven human video generation. Built upon the LongCat-Video foundation model, v1.5 delivers highly stable, commercial-grade avatar video synthesis supporting native tasks including Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and Video Continuation, with seamless compatibility for both single-stream and multi-stream audio inputs.

Key Features

  • 🌟 Upgraded Audio Encoder (Whisper-Large): Replaces Wav2Vec2 with Whisper-Large, yielding significantly smoother and more natural lip dynamics.
  • 🌟 Production-Ready Stability: Achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency.
  • 🌟 Stylized Domain Generalization: Robustly generalizes to anime, animals, and complex real-world conditions such as multi-person interactions and object handling.
  • 🌟 Efficient 8-Step Inference: Advanced DMD2-based step distillation accelerates inference to 8 NFE, balancing cost-effective serving with exceptional visual fidelity.

📊 Human Evaluation

We introduce a comprehensive human evaluation benchmark specifically tailored for audio-driven digital human generation. The benchmark encompasses 6 application scenarios (News Broadcasting, Knowledge Education, Daily Life, Entertainment, Singing, Commercial Promotion), 2 languages (Chinese/English), and 2 visual styles (Realistic/Animated), yielding a total of 508 image-audio source pairs.

Evaluation Methodology:

  (1) Subjective Track: 770 crowdsourced evaluators rated each generated video on a 1–5 human-likeness scale, yielding 13,240 judgments.
  (2) Objective Track: 10 domain experts conducted structured quality analysis across four dimensions: Physical Rationality, Harmony (Audio-Visual Coordination), Temporal Stability, and Identity Consistency.
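As a rough illustration of how a subjective track like this could be tallied, the sketch below enumerates the scenario × language × style cells named above and averages 1–5 ratings per video. The post does not describe the actual aggregation procedure, so the function names, the (video_id, rating) record shape, and the per-video mean are all assumptions made for illustration, and the sample ratings are invented.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Categories as listed in the benchmark description.
SCENARIOS = [
    "News Broadcasting", "Knowledge Education", "Daily Life",
    "Entertainment", "Singing", "Commercial Promotion",
]
LANGUAGES = ["Chinese", "English"]
STYLES = ["Realistic", "Animated"]


def benchmark_cells():
    """Return every (scenario, language, style) combination."""
    return list(product(SCENARIOS, LANGUAGES, STYLES))


def mean_opinion_scores(judgments):
    """Average 1-5 ratings per video.

    `judgments` is an iterable of (video_id, rating) pairs; this record
    shape is hypothetical, not taken from the model card.
    """
    ratings_by_video = defaultdict(list)
    for video_id, rating in judgments:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        ratings_by_video[video_id].append(rating)
    return {vid: mean(vals) for vid, vals in ratings_by_video.items()}


if __name__ == "__main__":
    # Invented toy ratings, only to show the call shape.
    toy = [("clip_a", 4), ("clip_a", 5), ("clip_b", 3)]
    print(len(benchmark_cells()), "category cells")
    print(mean_opinion_scores(toy))
```

The 6 × 2 × 2 grid gives 24 category cells; how the 508 source pairs are distributed across those cells is not stated in the post.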

⚖️ License Agreement

The model weights are released under the MIT License.

submitted by /u/pmttyji