r/MachineLearning · June 1, 2026 · 1 min read

Real-time multilingual ASR using rolling buffers and monolingual models [P]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

I built a routing-based approach to lightweight real-time multilingual ASR as part of my research at Gladia.

The core problem is that multilingual models that handle mid-conversation language switches accurately are often too big for most local hardware, while the ones small enough to run locally tend to have poor accuracy.

So rather than relying on one massive multilingual model, the system routes audio between smaller, specialized monolingual models (~100M parameters each).

The system uses:
- Zipformer for low-latency streaming transcription
- Silero VAD for detecting speech boundaries
- SpeechBrain for language identification (LID)

It works by starting transcription immediately, without waiting for language detection. A coordinator buffers audio and monitors language confidence; when it detects a language switch with confidence above a threshold, it rolls back to the last speech boundary and re-transcribes the buffered audio with the correct model. Users may briefly see incorrect text, but it self-corrects quickly.
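To make the rollback idea concrete, here is a minimal Python sketch. It is not the code from the repo: `RollbackCoordinator`, `push`, `switch_threshold`, and its default of 0.8 are all hypothetical names and values chosen for illustration, and the transcribers are plain callables standing in for real streaming models. It assumes the VAD and LID results arrive alongside each audio chunk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a monolingual ASR model:
# takes the buffered chunks since the last speech boundary, returns text.
Transcriber = Callable[[List[str]], str]


@dataclass
class RollbackCoordinator:
    transcribers: Dict[str, Transcriber]   # one model per enabled language
    active_lang: str                       # language assumed at start
    switch_threshold: float = 0.8          # assumed value, not from the post
    committed: List[str] = field(default_factory=list)
    buffer: List[str] = field(default_factory=list)

    def push(self, chunk: str, lid_lang: str, lid_conf: float,
             at_boundary: bool) -> str:
        """Feed one chunk plus its LID and VAD results; return the text shown."""
        self.buffer.append(chunk)

        # Switch only when LID is confident about a different, enabled language.
        if (lid_lang != self.active_lang
                and lid_lang in self.transcribers
                and lid_conf >= self.switch_threshold):
            # Rollback: the buffer starts at the last speech boundary, so
            # re-transcribing it with the new model replaces the wrong partial.
            self.active_lang = lid_lang

        partial = self.transcribers[self.active_lang](self.buffer)

        if at_boundary:
            # A speech boundary finalizes the segment and resets the buffer.
            self.committed.append(partial)
            self.buffer = []
            partial = ""

        shown = self.committed + ([partial] if partial else [])
        return " ".join(shown)
```

With two fake transcribers that just tag their input, a low-confidence French detection leaves the English partial on screen, and a confident one rewrites everything since the last boundary in French. Text committed before that boundary is never touched.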

Figure: Rollback Pipeline Overview

On inter-utterance code-switching benchmarks, this approach hits ~13% WER, ahead of every other system I tested, including cloud APIs. Intra-utterance switching (mid-sentence Spanglish, etc.) is the known limitation, degrading to ~41% WER, though still better than open-source alternatives and at a fraction of the size.
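For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Below is a minimal sketch of that standard definition. The function name `wer` and the whitespace tokenization are assumptions for illustration; the benchmark in the repo may normalize text (casing, punctuation) differently.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (ref word missing in hyp)
                curr[j - 1] + 1,     # insertion (extra word in hyp)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, against the reference "the cat sat on the mat", the hypothesis "the cat sit on mat" has one substitution and one deletion, so its WER is 2/6, roughly 33%.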

The open-source repo includes instructions and the detailed benchmark results: https://github.com/gladiaio/realtime-multilingual-asr-router

Let me know what you think.
Pro tip: Enabling only your expected languages not only makes the system lighter but also gives the LID an accuracy boost, especially on heavily accented speech.
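One plausible way to see why restricting languages helps the LID, sketched under assumptions: if the LID model outputs a score per language, dropping the disabled languages and renormalizing removes confusable candidates. The function `restrict_lid` and the example scores are made up for illustration, and whether the repo applies the restriction this way is an assumption.

```python
from typing import Dict, Iterable, Tuple


def restrict_lid(posteriors: Dict[str, float],
                 enabled: Iterable[str]) -> Tuple[str, float]:
    """Keep only enabled languages, renormalize, return (best language, confidence)."""
    enabled_set = set(enabled)
    kept = {lang: p for lang, p in posteriors.items() if lang in enabled_set}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no probability mass on the enabled languages")
    best = max(kept, key=kept.get)
    return best, kept[best] / total


# Made-up scores for accented English that the full LID confuses with Dutch.
scores = {"en": 0.30, "nl": 0.35, "de": 0.25, "fr": 0.10}
lang, conf = restrict_lid(scores, enabled=["en", "fr"])
```

In this made-up case the unrestricted top guess would be Dutch, while restricting to English and French yields English with a renormalized confidence of 0.75.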

submitted by /u/JeanMichelRanu