Full duplex vs half duplex - the spectrum of AI voice models [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
It seems that there are two ways to build voice AI:
Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.
Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.
In fact, there are three crucial things half-duplex voice models can't really do:
- Overlap - talking and listening at the same time without falling apart
- Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
- Barge-in - getting interrupted mid-sentence and recovering gracefully
The lack of these three capabilities is a big reason why voice agents still feel “robotic” to this day.
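To make the three behaviors above concrete, here is a rough toy sketch, not how any real system (Moshi or otherwise) works. It assumes a hypothetical per-frame voice-activity signal (`user_vad`) and compares strict turn-taking with a rule-based policy that only approximates full-duplex behavior. The function name `simulate`, the mode names `"half"` and `"pseudo_full"`, and all thresholds are made up for illustration.

```python
from typing import List


def simulate(user_vad: List[bool], mode: str, reply_len: int = 3,
             end_of_turn: int = 2, backchannel_every: int = 4) -> List[str]:
    """Return one agent action per frame for a toy conversation.

    user_vad: hypothetical voice-activity flags, True = user is speaking.
    mode: "half" (strict turn-taking) or "pseudo_full" (rule-based imitation).
    All thresholds are illustrative assumptions, counted in frames.
    """
    actions: List[str] = []
    silence = 0         # consecutive silent user frames
    talk = 0            # consecutive user speech frames
    remaining = 0       # frames of the agent's reply still queued
    heard_user = False  # user said something that is not yet answered

    for speaking in user_vad:
        if speaking:
            talk, silence = talk + 1, 0
        else:
            silence, talk = silence + 1, 0

        if remaining > 0:
            if mode == "pseudo_full" and speaking:
                # Barge-in: drop the rest of the reply and go back to listening.
                remaining = 0
                heard_user = True
                actions.append("yield")
            else:
                # Half-duplex: keep talking; user audio in this frame is ignored.
                remaining -= 1
                actions.append("speak")
            continue

        if speaking:
            heard_user = True
            if mode == "pseudo_full" and talk % backchannel_every == 0:
                # Backchannel: a short "mhm" while the user keeps the floor.
                actions.append("backchannel")
            else:
                actions.append("listen")
            continue

        if heard_user and silence >= end_of_turn:
            # End of turn detected after enough silence: start the reply.
            heard_user = False
            remaining = reply_len - 1
            actions.append("speak")
        else:
            actions.append("listen")

    return actions


# The user talks for 4 frames, pauses, then interrupts the agent at frame 7.
vad = [True] * 4 + [False] * 3 + [True] * 2 + [False] * 4
half = simulate(vad, "half")
pseudo = simulate(vad, "pseudo_full")
```

In this toy timeline, the half-duplex policy keeps speaking over the user at frame 7 and never registers that frame, while the pseudo-full policy yields there and also drops a backchannel at frame 3. A model that is full-duplex by design would learn such behavior from data rather than from hand-set thresholds, so treat this as one point on the spectrum rather than an endpoint.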
But what exactly does the spectrum from half-duplex to full-duplex look like? Is a Moshi-style architecture the only way to approach natural full-duplex voice conversations? In what ways could half-duplex systems imitate full-duplex?
Would love to hear others' thoughts on this.