r/MachineLearning

Full duplex vs half duplex - the spectrum of AI voice models [D]


It seems that there are two ways to build voice AI:

Half-duplex: strict turn-taking. You speak and the other side waits until you’re done, so speech only flows in one direction at a time. ← This is how almost every voice assistant works today.

Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.
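To make the difference concrete, here's a toy sketch (not how any real system is built). It assumes audio is cut into frames and each side is just a list of booleans saying whether it is, or wants to be, speaking on that frame. The function names are made up for illustration.

```python
from typing import List


def half_duplex(user: List[bool], agent_wants: List[bool]) -> List[bool]:
    """Toy half-duplex: the agent is muted on every frame where the user speaks."""
    return [wants and not u for u, wants in zip(user, agent_wants)]


def full_duplex(user: List[bool], agent_wants: List[bool]) -> List[bool]:
    """Toy full-duplex: the agent's channel is independent of the user's."""
    return list(agent_wants)


def overlap_frames(user: List[bool], agent: List[bool]) -> int:
    """Count frames where both sides are speaking at once."""
    return sum(1 for u, a in zip(user, agent) if u and a)


# User talks for three frames; the agent wants to start on the second frame.
user = [True, True, True, False, False]
agent_wants = [False, True, True, True, True]

hd = half_duplex(user, agent_wants)
fd = full_duplex(user, agent_wants)
print(overlap_frames(user, hd), overlap_frames(user, fd))
```

In the half-duplex version the overlap count is zero by construction, which is exactly the "waiting for your turn" behavior.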

In fact, there are three crucial things half-duplex voice models can't really do:

  • Overlap - talking and listening at the same time without falling apart
  • Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
  • Barge-in - getting interrupted mid-sentence and recovering gracefully

The lack of these three features is a big reason why voice agents still feel “robotic” to this day.
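As a rough illustration of backchannels vs. barge-in, here's a minimal sketch of an agent that keeps talking through a "mhm" but yields the floor on anything else. It assumes the agent can still receive transcribed user speech while it is speaking, which a strictly half-duplex pipeline would not provide. The `Agent` class, its states, and the `BACKCHANNELS` set are all hypothetical, not taken from any real system.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical backchannel vocabulary, based on the examples in the post.
BACKCHANNELS = {"mhm", "right", "yeah"}


@dataclass
class Agent:
    state: str = "listening"  # either "listening" or "speaking"
    log: List[str] = field(default_factory=list)

    def start_speaking(self) -> None:
        self.state = "speaking"
        self.log.append("agent:start")

    def on_user_speech(self, text: str) -> None:
        """Handle user speech that arrives at any time, even mid-utterance."""
        if self.state != "speaking":
            self.log.append("user:turn")
            return
        # Normalize case and strip surrounding punctuation before matching.
        token = text.lower().strip(".,!? ")
        if token in BACKCHANNELS:
            # Backchannel: acknowledge it and keep talking.
            self.log.append("user:backchannel")
        else:
            # Barge-in: stop speaking and hand the turn back to the user.
            self.state = "listening"
            self.log.append("agent:interrupted")


agent = Agent()
agent.start_speaking()
agent.on_user_speech("Mhm.")        # agent keeps speaking
agent.on_user_speech("wait, stop")  # agent yields the floor
print(agent.state, agent.log)
```

A real system would have to make this call from audio with low latency rather than from clean text, so treat this as a picture of the decision, not of the hard part.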

But what exactly is the spectrum from half-duplex to full-duplex? Is a Moshi-style architecture the only way to approach natural full-duplex voice conversations? In what ways could half-duplex systems imitate full-duplex?

Would love to hear others' thoughts on this.

submitted by /u/Chilly5