r/MachineLearning · · 1 min read

Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM.

Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results. For a 30-minute video, the user waits forever.

I want to pipeline this for real-time SSE streaming: [Chunk Audio on the fly] -> [Whisper] -> [LLM] -> [Stream to UI]

My questions for the data/backend engineers:

  1. Chunking & VAD: What's the best way to chunk YouTube audio streams (e.g., via ffmpeg) without cutting sentences in half and ruining the LLM's context?
  2. Queueing: Is standard asyncio in FastAPI enough to handle these overlapping tasks, or do I strictly need Celery/Redis workers for this pipeline?

Any library recommendations or architectural patterns would be hugely appreciated

submitted by /u/Sea_Lawfulness_5602
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/MachineLearning