Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach
Authors: Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou, Kuan-Yu Chen, Hsin-Yen Sung, Shrikanth Narayanan, Hung-yi Lee
Published: 2026-06-19
arXiv: 2606.21215
Code: https://github.com/wiizzz/nonverbal-sv
Abstract
A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance.
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
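The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. As a rough illustration of how such a number is obtained from trial scores, the following is a minimal sketch in plain Python. The function name `equal_error_rate`, the trial format, and the toy scores are assumptions made for this example and are not taken from the paper or its repository; the authors' evaluation code may interpolate or break ties differently.

```python
from typing import List, Tuple


def equal_error_rate(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a list of verification trials.

    scores: similarity per trial (higher means "more likely the same speaker").
    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    Hypothetical helper for illustration only.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Visit trials from the highest score to the lowest. At each distinct
    # score s, every trial with score >= s counts as accepted.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    accepted_targets = 0
    accepted_nontargets = 0
    # Starting point: threshold above every score, so nothing is accepted
    # (false acceptance rate 0, false rejection rate 1).
    best_gap = 1.0
    eer = 0.5
    threshold = float("inf")

    i = 0
    while i < len(order):
        s = scores[order[i]]
        # Accept all trials that share this score before measuring rates.
        while i < len(order) and scores[order[i]] == s:
            if labels[order[i]] == 1:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
            i += 1

        far = accepted_nontargets / n_nontarget   # false acceptance rate
        frr = 1.0 - accepted_targets / n_target   # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest and
            # report their midpoint as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2.0
            threshold = s

    return eer, threshold


# Toy trials (made-up scores, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {toy_eer:.2%} at threshold {toy_threshold}")
```

In a speech-NVV setting, each trial would presumably pair a speech utterance with a non-verbal vocalization, with the score given by the similarity between the two speaker embeddings. That pairing is an assumption here, not a detail stated in the abstract.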