Real-time multilingual ASR using rolling buffers and monolingual models [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I built a routing-based approach to lightweight real-time multilingual ASR as part of my research at Gladia. The core problem is that multilingual models able to handle mid-conversation language switches are often too big for most local hardware, or have poor accuracy. So rather than relying on one massive multilingual model, the system routes audio between smaller, specialized monolingual models (~100M parameters each).
It works by starting transcription immediately, without waiting for language detection. A coordinator buffers audio and monitors language confidence; when a language switch is detected with confidence above a threshold, it rolls back to the last speech boundary and re-transcribes with the correct model. Users may briefly see incorrect text, but it self-corrects quickly.

On inter-utterance code-switching benchmarks, this approach hits ~13% WER, ahead of every other system I tested, including cloud APIs. Intra-utterance switching (mid-sentence Spanglish, etc.) is the known limitation, degrading to ~41% WER, though that is still better than open-source alternatives and at a fraction of the size.

The open-source repo has instructions and the detailed benchmark results: https://github.com/gladiaio/realtime-multilingual-asr-router

Let me know what you think.
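For readers who want a concrete picture of the coordinator loop described in the post (buffer audio, watch language confidence, roll back to the last speech boundary on a switch), here is a minimal Python sketch. It is not taken from the repo: `Chunk`, `Coordinator`, `make_stub`, the `boundary` flag, and the 0.8 threshold are all hypothetical names and values chosen for illustration, and the "models" are stubs that tag text with a language code instead of running real ASR.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Chunk:
    audio: str                   # stand-in for raw audio samples
    lang_conf: Dict[str, float]  # hypothetical per-language confidence from a language-ID step
    boundary: bool = False       # True if a speech boundary (pause) precedes this chunk


Transcriber = Callable[[List[Chunk]], str]


def make_stub(lang: str) -> Transcriber:
    """Stub monolingual model: tags each chunk with its language code."""
    def transcribe(chunks: List[Chunk]) -> str:
        return " ".join(f"{lang}:{c.audio}" for c in chunks)
    return transcribe


@dataclass
class Coordinator:
    models: Dict[str, Transcriber]
    active: str                  # language of the model currently in use
    threshold: float = 0.8       # assumed value, not from the post
    committed: List[str] = field(default_factory=list)  # finalized segments
    buffer: List[Chunk] = field(default_factory=list)   # audio since the last boundary

    def feed(self, chunk: Chunk) -> str:
        # At a speech boundary, finalize the previous segment and start a new buffer.
        if chunk.boundary and self.buffer:
            self.committed.append(self.models[self.active](self.buffer))
            self.buffer = []
        self.buffer.append(chunk)

        # Switch models only when another language clears the confidence threshold.
        best = max(chunk.lang_conf, key=chunk.lang_conf.get)
        if best != self.active and best in self.models and chunk.lang_conf[best] >= self.threshold:
            self.active = best

        # Rollback: the buffer starts at the last boundary, so re-transcribing it
        # with the (possibly new) active model replaces any provisional text.
        provisional = self.models[self.active](self.buffer)
        return " ".join(self.committed + [provisional])


coordinator = Coordinator(models={"en": make_stub("en"), "es": make_stub("es")}, active="en")
stream = [
    Chunk("a1", {"en": 0.9, "es": 0.1}, boundary=True),
    Chunk("a2", {"en": 0.85, "es": 0.15}),
    Chunk("b1", {"en": 0.6, "es": 0.4}, boundary=True),  # new utterance, language still unclear
    Chunk("b2", {"en": 0.1, "es": 0.9}),                 # Spanish now clears the threshold
]
outputs = [coordinator.feed(c) for c in stream]
for text in outputs:
    print(text)
```

In this toy run, the third output still shows `b1` transcribed by the English stub, and the fourth output replaces it with the Spanish stub's text once confidence crosses the threshold. That mirrors the "briefly incorrect, then self-corrects" behavior described above. Because the rollback only reaches back to the last boundary, a sketch like this cannot fix a switch in the middle of an utterance, which is consistent with the intra-utterance limitation noted in the post.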