Hugging Face Daily Papers · May 21, 2026 · 3 min read

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.19833

Published on May 19 · Submitted by zhifei on May 21

#1 Paper of the day

National University of Singapore

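Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao

Project page: https://xzf-thu.github.io/Mega-ASR/

Code: https://github.com/xzf-thu/Mega-ASR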

AI-generated summary

The Mega-ASR framework improves the robustness of real-world speech recognition through compound-data construction and progressive acoustic-to-semantic optimization.

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
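To make the reported numbers easier to compare, here is a minimal sketch of how word error rate (WER) and relative reduction are conventionally computed. It assumes the benchmark percentages in the abstract are WERs (lower is better), which the abstract implies but does not state outright. The function names `word_error_rate` and `relative_reduction` are illustrative and are not taken from the Mega-ASR code base.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing (omission)
            insertion = curr[j - 1] + 1  # extra hypothesis word (hallucination)
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


if __name__ == "__main__":
    # One omitted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Three hallucinated words against a two-word reference: WER exceeds 1.
    print(word_error_rate("a b", "a b c d e"))
    # Figures quoted in the abstract, treated as WER percentages (assumption).
    print(relative_reduction(54.01, 45.69))  # VOiCES R4-B-F
    print(relative_reduction(29.34, 21.49))  # NOIZEUS Sta-0
```

Under that assumption, the two quoted benchmark gaps correspond to relative reductions of roughly 15.4% (VOiCES R4-B-F) and 26.8% (NOIZEUS Sta-0). The "over 30%" figure in the abstract refers to the separate compositional-scenario evaluation. Because insertions count as errors, a hallucinating model can score a WER above 100%.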