Best architecture for seamless Bilingual TTS? (Azure / English + Korean) [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hi guys, when building a language learning app (React Native/Expo frontend, Python backend) and I’ve hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instructions and Korean examples (e.g., "To say hello, we use the phrase 안녕하세요.").
Since native pronunciation is critical for a learning app, I'm struggling to find a solution that sounds natural. I'm currently using Azure Cognitive Services, and I'm stuck between two bad options:
Approach 1: The Multilingual Voice (en-US-AvaMultilingualNeural)
The Good: Seamless reading, zero pauses mid-sentence.
The Bad: Because it's an English-first model, the Korean comes out with a slight, robotic/Americanized accent. It doesn't sound like a true native speaker, which defeats the purpose of teaching pronunciation. And also there is some scratching and lack of smoothness when it is reading korean words.
Approach 2: SSML Voice Switching (Ava for EN, SunHi for KO)
The Good: Perfect English, perfect native Korean.
The Bad: Switching <voice> tags mid-sentence causes Azure to pause for a fraction of a second while it unloads/loads the neural models. It completely ruins the natural flow of the audio, making it sound very disjointed.
My Questions:
Is there an SSML trick in Azure to pre-load voices or eliminate that micro-pause when switching voices?
How do the big apps handle this? Because if I use two models for korean and english they will sound different when reading.
Should I migrate away from standard Azure Speech and use the Azure OpenAI voices (alloy, nova) instead? Are they truly seamless for bilingual text?
Any advice on the best tech stack or architecture for this would be massively appreciated!
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.