Hugging Face Daily Papers · June 1, 2026 · 5 min read

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to limited domains, leaving a significant gap with diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and so fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions.

SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
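To make the "seven metrics along three axes, across 17 scenarios" structure concrete, here is a minimal sketch of how per-sample scores could be rolled up into per-scenario, per-axis averages. It is not the benchmark's actual protocol: the abstract does not name the seven metrics, so the metric names, their 2/2/3 split across axes, the scenario labels, and the `aggregate_by_axis` helper are all hypothetical placeholders, and scores are assumed to share a common scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric-to-axis mapping. The three axis names come from the
# abstract; the seven metric names and their split are placeholders only.
METRIC_AXIS = {
    "acoustic_metric_1": "acoustics",
    "acoustic_metric_2": "acoustics",
    "semantic_metric_1": "semantics",
    "semantic_metric_2": "semantics",
    "expressive_metric_1": "expressiveness",
    "expressive_metric_2": "expressiveness",
    "expressive_metric_3": "expressiveness",
}


def aggregate_by_axis(samples):
    """Average scores per (scenario, axis) pair.

    `samples` is an iterable of dicts with a "scenario" string and a
    "scores" dict mapping metric names to floats on a shared scale.
    """
    buckets = defaultdict(list)
    for sample in samples:
        for metric, score in sample["scores"].items():
            axis = METRIC_AXIS[metric]  # raises KeyError on unknown metrics
            buckets[(sample["scenario"], axis)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # Made-up scenario labels and scores, for illustration only.
    demo = [
        {"scenario": "audiobook",
         "scores": {"acoustic_metric_1": 0.8, "semantic_metric_1": 0.9}},
        {"scenario": "audiobook",
         "scores": {"acoustic_metric_1": 0.6, "semantic_metric_1": 0.7}},
        {"scenario": "podcast",
         "scores": {"expressive_metric_1": 0.5}},
    ]
    for (scenario, axis), value in sorted(aggregate_by_axis(demo).items()):
        print(f"{scenario:10s} {axis:15s} {value:.2f}")
```

Keeping the axes separate rather than collapsing them into one overall number mirrors the abstract's emphasis on disentangled dimensions: a model can score well on acoustics while lagging on expressiveness, and a single average would hide that.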

arxiv:2605.28618

Published on May 27 · Submitted by Yu Zhang on Jun 1 · Zhejiang University

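Authors: Changhao Pan, Rui Yang, Han Wang, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao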

Abstract

SwanBench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations.

AI-generated summary

Get this paper in your agent:

hf papers read 2605.28618

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
