Hugging Face Daily Papers · 4 min read

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.21215

Published on Jun 19, 2026
· Submitted by Yi-Cheng Lin on Jun 25, 2026
Authors: Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou, Kuan-Yu Chen, Hsin-Yen Sung, Shrikanth Narayanan, Hung-yi Lee
Code: https://github.com/wiizzz/nonverbal-sv

Abstract

A novel speaker verification framework combines frozen self-supervised features with ECAPA-TDNN and MoE modules to improve identity verification across both speech and non-verbal vocalizations while maintaining speech performance.

As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
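
To make two ideas from the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the distillation term is one minus the cosine similarity between student and teacher embeddings, averaged over speech inputs only; the abstract says only that distillation is conditioned on speech inputs, so that form is an assumption. The equal error rate (EER) function uses the standard definition: the operating point where false acceptance and false rejection rates meet. All function and parameter names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def conditional_distillation_loss(student, teacher, is_speech):
    """Average (1 - cosine) between student and teacher embeddings.

    Only items flagged as speech contribute (the "conditional" part);
    NVV items are skipped so the teacher never constrains them.
    The cosine form is an assumption, not taken from the paper.
    """
    terms = [
        1.0 - cosine(s, t)
        for s, t, speech in zip(student, teacher, is_speech)
        if speech
    ]
    return sum(terms) / len(terms) if terms else 0.0


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A trial is accepted when score >= threshold. Returns the mean of
    FAR and FRR at the threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy embeddings: item 0 is speech, item 1 is an NVV (e.g. laughter).
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [1.0, 0.0]]
    print(conditional_distillation_loss(student, teacher, [True, False]))

    # Toy same-speaker and different-speaker trial scores.
    print(equal_error_rate([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

In the toy data, the NVV item disagrees with the teacher, but the mask excludes it from the loss. That is the sense in which the distillation is conditional: the teacher anchors speech behaviour without pulling NVV embeddings toward a model that handles them poorly.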


Get this paper in your agent:

hf papers read 2606.21215
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
