Hugging Face Daily Papers · 5 min read

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.28618

Published on May 27, 2026 · Submitted by Yu Zhang on Jun 1, 2026
Authors: Changhao Pan, Rui Yang, Han Wang, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao
Organization: Zhejiang University
Project page: https://swanaigc.github.io//#bench

Abstract

SwanBench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations.

AI-generated summary

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains underexplored. A comprehensive evaluation benchmark for long-form speech is needed for two reasons: 1) existing test scenarios are often confined to limited domains, leaving a significant gap with diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, and therefore fail to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios. 2) Comprehensive evaluation dimensions: along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment. 3) Valuable insights: through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
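
To make the "seven metrics along three axes, across 17 scenarios" structure concrete, here is a minimal Python sketch of how per-sample scores might be rolled up into per-scenario, per-axis averages. It is not the paper's protocol: the metric names, the grouping of metrics per axis, the input format, and the use of a plain mean are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric names grouped by the three axes named in the abstract.
# The abstract mentions seven metrics but does not name them; these are
# placeholders, and the 2/2/3 split across axes is also an assumption.
AXES = {
    "acoustics": ["metric_a1", "metric_a2"],
    "semantics": ["metric_s1", "metric_s2"],
    "expressiveness": ["metric_e1", "metric_e2", "metric_e3"],
}


def aggregate(samples):
    """Average per-sample metric scores into per-scenario, per-axis means.

    `samples` is an iterable of dicts with a "scenario" string and a
    "scores" dict mapping metric name -> float. Scores are assumed to be
    on a common scale already; metrics missing from a sample are skipped.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        for axis, metrics in AXES.items():
            present = [sample["scores"][m] for m in metrics if m in sample["scores"]]
            if present:
                # One axis score per sample: the mean of that axis's metrics.
                buckets[sample["scenario"]][axis].append(mean(present))
    return {
        scenario: {axis: mean(values) for axis, values in axes.items()}
        for scenario, axes in buckets.items()
    }
```

Keeping the axes separate, rather than collapsing everything into one number, mirrors the abstract's emphasis on disentangled dimensions: a model can then be compared per scenario on each axis independently.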


Get this paper in your agent:

hf papers read 2605.28618
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
