Hugging Face Daily Papers · June 1, 2026 · 5 min read

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards.
On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io/#/swanvoice.
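To make the 25 Hz latent rate and the idea of speaker-turn conditioning more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how SwanVoice encodes turns, so the sketch simply assumes one speaker index per latent frame. The names `Turn` and `frame_speaker_ids`, the 0..3 speaker indexing, and the rounding rule are all hypothetical.

```python
from dataclasses import dataclass

LATENT_RATE_HZ = 25  # latent frame rate of the VAE, as stated in the abstract


@dataclass
class Turn:
    speaker: int       # hypothetical speaker index, 0..3 for up to 4 speakers
    duration_s: float  # hypothetical turn duration in seconds


def frame_speaker_ids(turns: list[Turn]) -> list[int]:
    """Expand turns into one speaker index per 25 Hz latent frame (assumed scheme)."""
    ids: list[int] = []
    elapsed_s = 0.0
    for turn in turns:
        if not 0 <= turn.speaker < 4:
            raise ValueError("speaker index must be in 0..3")
        elapsed_s += turn.duration_s
        # Round the cumulative end time so rounding errors do not accumulate.
        end_frame = round(elapsed_s * LATENT_RATE_HZ)
        ids.extend([turn.speaker] * (end_frame - len(ids)))
    return ids


dialogue = [Turn(0, 2.0), Turn(1, 1.6), Turn(0, 0.4)]
ids = frame_speaker_ids(dialogue)
print(len(ids))  # 4.0 s of audio at 25 Hz gives 100 latent frames
```

The same arithmetic shows why long-form synthesis is demanding: at 25 frames per second, a 10-minute recording corresponds to 15,000 latent frames that the model has to keep coherent.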


arxiv:2605.30993

Published on May 29 · Submitted by Yu Zhang on Jun 1 · ByteDance


Authors:

Ruiqi Li, Yu Zhang, Changhao Pan, Ke Lei, Xiang Yin, Cheng Yang

Abstract

AI-generated summary: SwanVoice is a zero-shot text-to-speech system for expressive long-form multi-speaker dialogue synthesis that combines a VAE, a flow-matching DiT, and diffusion post-training.
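The summary names flow matching without saying which formulation SwanVoice uses. As a rough illustration only, the sketch below assumes the common linear-path variant, in which a network is trained to predict the constant velocity from a noise sample to a data sample and sampling integrates that velocity with Euler steps. All function names are hypothetical, the three-element list stands in for a VAE latent frame, and an oracle velocity function stands in for a trained DiT, which in a real system would also receive text and speaker-turn conditioning.

```python
import random


def interpolate(x0: list[float], x1: list[float], t: float) -> list[float]:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: list[float], x1: list[float]) -> list[float]:
    """Regression target along the linear path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: list[float], x0: list[float], x1: list[float]) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(x0: list[float], velocity_fn, steps: int = 8) -> list[float]:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for a VAE latent frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample


def oracle(x: list[float], t: float) -> list[float]:
    """Stands in for a trained network: returns the exact target velocity."""
    return target_velocity(x0, x1)


sample = euler_sample(x0, oracle)  # lands on x1 up to floating-point error
```

With a perfect velocity predictor the Euler integration recovers the data sample; a trained network only approximates this field, which is one reason sampling quality depends on the step count.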