Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline
Shipped this for the AMD x lablab hackathon. The attached video is one of the actual reels the pipeline produced: one English sentence in, a finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality). ~45 minutes end-to-end on a single AMD Instinct MI300X. Every model is Apache 2.0 or MIT. The pipeline is 8 stages, all sequential on the same GPU.
Wan 2.2 specifics (the bit this sub will care about; a hedged config sketch is at the bottom of the post):

- 1280×720, not the 640×640 default. Costs more but matches what producers want.
- 121 frames at 24 fps was my first attempt and it gave temporal rippling. Switching to the native 81 frames at 16 fps (the distribution Wan was trained on) cleaned it up.
- flow_shift = 5 for hero shots, 8 for b-roll (the upstream wan_i2v_A14B.py defaults).
- Negative prompt: the verbatim Chinese trained negative from shared_config.py. umT5 was pretrained multilingually against those exact tokens; the English translation is observably weaker.
- Camera language: ONE camera verb per shot, sentence case, placed first ("Tracking shot following from behind"). Multiple camera verbs in one prompt cancel each other out.
- Avoid the word "cinematic": it triggers Wan's stylization branch and gives the AI look. Use lens/film tags instead ("Arri Alexa, anamorphic, 35mm film grain").

Performance work (setup sketches below):

- ParaAttention FBCache (lossless 2× on Wan2.2).
- torch.compile on transformer_2 (selective; the dual-expert MoE makes a full compile flaky) for another 1.2×.
- AITER MoE acceleration on the Qwen director (vLLM).
- End-to-end: 25.9 min → 10.4 min per 720p clip on the MI300X.

Why a single MI300X: 192 GB of HBM3 lets a 35B MoE, a 4B diffusion model, a 14B I2V MoE, a 3.5B music model, and a TTS share the same card sequentially. The same stack on a 24 GB consumer GPU would need 4-5 boxes wired together.

Code (public, Apache 2.0): https://github.com/bladedevoff/studiomi300

Hugging Face (documentation; please like the Space 🙏): https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/studiomi300

The live demo on the HF Space is temporarily offline while infra restores; it should be back within hours. In the meantime, the showcase reels in the repo are real pipeline outputs, no human-re-edited shots.

Happy to dig into the AITER MoE setup, FBCache tuning, FLF2V anchoring, or the vision critic's failure taxonomy in the comments.
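For reference, here is a minimal sketch of the Wan 2.2 I2V settings above, written against the diffusers WanImageToVideoPipeline. The model id, guidance scale, step count, and the negative prompt placeholder are my assumptions, not values pulled from the repo:

```python
# Rough sketch of the Wan 2.2 I2V settings discussed above, assuming the
# diffusers WanImageToVideoPipeline API; not the actual studiomi300 code.
import torch
from diffusers import WanImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"  # assumed repo id
pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# flow_shift = 5 for hero shots, 8 for b-roll (per the post)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)
pipe.to("cuda")

keyframe = load_image("keyframe.png")  # FLUX-generated character keyframe

# One camera verb, sentence case, placed first; lens/film tags instead of "cinematic"
prompt = (
    "Tracking shot following from behind. A cloaked traveler walks through a "
    "rain-soaked neon market. Arri Alexa, anamorphic, 35mm film grain."
)
# The post uses Wan's verbatim Chinese trained negative prompt from shared_config.py;
# placeholder here since the exact string isn't reproduced in the post.
negative_prompt = "<verbatim Chinese negative prompt from shared_config.py>"

video = pipe(
    image=keyframe,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=720,
    width=1280,
    num_frames=81,           # 81 frames @ 16 fps native, not 121 @ 24
    guidance_scale=3.5,      # assumed; not stated in the post
    num_inference_steps=40,  # assumed; not stated in the post
).frames[0]

export_to_video(video, "shot_001.mp4", fps=16)
```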
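The caching and compile steps from the performance list look roughly like this, continuing from the `pipe` object above. The para_attn import path and the residual threshold are my assumptions about ParaAttention's first-block-cache entry point, not the project's tuned values:

```python
# Sketch of the FBCache + selective compile setup described above; module paths
# and thresholds are assumptions, not the repo's tuned values.
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

# ParaAttention FBCache: reuse cached transformer output when the first block's
# residual barely changes between steps (the "lossless 2x" mentioned in the post)
apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Selective torch.compile: only transformer_2 of Wan 2.2's dual-expert MoE,
# since compiling the full dual-expert pipeline is flaky per the post
pipe.transformer_2 = torch.compile(pipe.transformer_2, mode="max-autotune-no-cudagraphs")
```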
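For the AITER-accelerated Qwen director, the shape of the setup is probably something like the following; the environment variable names are my recollection of vLLM's ROCm/AITER switches and the model id is a placeholder (the post only says "35B MoE"), so treat both as assumptions:

```python
# Sketch of an AITER-accelerated vLLM director stage; env var names and model id
# are assumptions, not taken from the studiomi300 repo.
import os
os.environ["VLLM_ROCM_USE_AITER"] = "1"      # enable AITER kernels on ROCm
os.environ["VLLM_ROCM_USE_AITER_MOE"] = "1"  # fused MoE path for the MoE director

from vllm import LLM, SamplingParams

director = LLM(model="Qwen/Qwen3-30B-A3B", max_model_len=8192)  # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=1024)
shotlist = director.generate(
    ["Turn this one-sentence premise into an 8-shot storyboard with one camera verb per shot: ..."],
    params,
)[0].outputs[0].text
```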
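On the single-card point: the stages fit in 192 GB because each model is loaded, run, and freed before the next one comes up. A bare-bones illustration of that sequential pattern (a generic sketch, not the project's orchestrator):

```python
# Generic sequential-staging pattern for sharing one large-HBM card:
# load a model, run its stage, then free memory before the next stage.
import gc
import torch

def run_stage(load_fn, run_fn, *args, **kwargs):
    model = load_fn()             # e.g. FLUX keyframes, Wan I2V, music, TTS
    try:
        return run_fn(model, *args, **kwargs)
    finally:
        del model                 # drop the reference so HBM can be reclaimed
        gc.collect()
        torch.cuda.empty_cache()  # return freed blocks to the allocator

# keyframes = run_stage(load_flux, generate_keyframes, storyboard)
# clips     = run_stage(load_wan_i2v, animate, keyframes)
# score     = run_stage(load_music_model, compose, storyboard)
```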