Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hey everyone, I’m building a backend that analyzes long YouTube videos using an LLM.
Currently, my flow is a slow waterfall: Download full audio -> Whisper -> LLM -> Return results. For a 30-minute video, the user waits forever.
I want to pipeline this for real-time SSE streaming: [Chunk Audio on the fly] -> [Whisper] -> [LLM] -> [Stream to UI]
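A minimal sketch of that pipelined shape, assuming plain asyncio queues between stages. `fake_transcribe` and `fake_analyze` are hypothetical stand-ins for the real Whisper and LLM calls, and `run_pipeline` is an illustrative name, not part of any library:

```python
import asyncio
from typing import AsyncIterator, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, apply the async fn, push results to outbox."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)
            return
        await outbox.put(await fn(item))


async def fake_transcribe(chunk: bytes) -> str:
    # Hypothetical placeholder for a Whisper call on one audio chunk.
    await asyncio.sleep(0)
    return chunk.decode()


async def fake_analyze(text: str) -> str:
    # Hypothetical placeholder for an LLM call on one transcript segment.
    await asyncio.sleep(0)
    return text.upper()


async def run_pipeline(chunks: Iterable[bytes]) -> AsyncIterator[str]:
    """Yield SSE-formatted events as soon as each chunk clears both stages."""
    # Bounded queues give backpressure: a slow LLM stage stalls the feeder
    # instead of letting audio chunks pile up in memory.
    audio_q: asyncio.Queue = asyncio.Queue(maxsize=4)
    text_q: asyncio.Queue = asyncio.Queue(maxsize=4)
    out_q: asyncio.Queue = asyncio.Queue(maxsize=4)

    async def feed() -> None:
        for chunk in chunks:
            await audio_q.put(chunk)
        await audio_q.put(_DONE)

    tasks = [
        asyncio.create_task(feed()),
        asyncio.create_task(stage(fake_transcribe, audio_q, text_q)),
        asyncio.create_task(stage(fake_analyze, text_q, out_q)),
    ]
    try:
        while True:
            result = await out_q.get()
            if result is _DONE:
                break
            # One SSE event: a "data:" line followed by a blank line.
            yield f"data: {result}\n\n"
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
```

With this shape, chunk 2 can be transcribed while chunk 1 is still in the LLM stage, so the first event reaches the client after one chunk's latency rather than the whole video's. The single-line `data:` framing assumes the payload has no embedded newlines.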
My questions for the data/backend engineers:
- Chunking & VAD: What's the best way to chunk YouTube audio streams (e.g., via ffmpeg) without cutting sentences in half and ruining the LLM's context?
- Queueing: Is standard asyncio in FastAPI enough to handle these overlapping tasks, or do I strictly need Celery/Redis workers for this pipeline?
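One way to frame the chunking question, as a sketch only: assume some detector (a VAD, or ffmpeg's `silencedetect` filter) has already produced a list of silent intervals in seconds, then snap each cut to the middle of a pause near the target chunk length. The function name `pick_cut_points` and the default lengths are assumptions for illustration; the 30-second cap is chosen because Whisper processes audio in 30-second windows.

```python
from typing import List, Tuple


def pick_cut_points(
    silences: List[Tuple[float, float]],
    total: float,
    target: float = 20.0,
    min_len: float = 5.0,
    max_len: float = 30.0,
) -> List[float]:
    """Return cut times (seconds) that fall inside pauses where possible.

    silences: (start, end) intervals detected as silent.
    total:    duration of the whole audio in seconds.
    """
    # Cutting at the midpoint of a pause leaves padding on both sides.
    mids = sorted((start + end) / 2 for start, end in silences)
    cuts: List[float] = []
    chunk_start = 0.0
    while total - chunk_start > max_len:
        # Pauses that keep this chunk between min_len and max_len long.
        window = [
            m for m in mids
            if chunk_start + min_len <= m <= chunk_start + max_len
        ]
        if window:
            # Prefer the pause closest to the target chunk length.
            cut = min(window, key=lambda m: abs(m - (chunk_start + target)))
        else:
            # No usable pause: fall back to a hard cut at max_len.
            cut = chunk_start + max_len
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

For example, with pauses at 9-10 s, 19-21 s and 41-43 s in a 60 s clip, the cuts land at 20.0 and 42.0, so no chunk exceeds 30 s and each boundary sits inside a pause. A pause is not guaranteed to be a sentence boundary, so overlapping adjacent chunks slightly or passing the previous transcript tail to the LLM may still be needed.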
Any library recommendations or architectural patterns would be hugely appreciated.