Hugging Face Daily Papers · May 21, 2026 · 7 min read

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

This survey provides a timely and comprehensive overview of trustworthiness issues in Large Audio Language Models. It clearly identifies the unique risks introduced by continuous acoustic inputs, including cross-modal attacks, acoustic backdoors, privacy leakage, hallucination, and fairness concerns. The proposed roadmap toward defense-in-depth architectures and intrinsic representation engineering is valuable. A stronger empirical comparison of existing LALMs and their defense coverage would further improve the survey. Overall, this is a useful reference for researchers working on trustworthy audio-language intelligence.

arxiv:2605.20266

Published on May 18 · Submitted by Yang Xiao on May 21 · Nanyang Technological University

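Authors: Kaiwen Luo, Zhenhong Zhou, Leo Wang, Liang Lin, Yang Xiao, Tianyu Shao, Yuanhe Zhang, Yuxuan Li, Miao Yu, Kailin Lyu, Jiaming Zhang, Dongrui Liu, Li Sun, Yueming Wu, Kai Li, Ting Dang, Xiaojun Jia, Rohan Kumar Das, Xinfeng Li, Siyuan Liang, Qiufeng Wang, Xingjun Ma, Jing Chen, Kun Wang, Junhao Dong, Deqing Zou, Yu Cheng, Xia Hu, Zhigang Zeng, Sen Su, Yang Liu, Yu-Gang Jiang, Philip S. Yu, Yew-Soon Ong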

AI-generated summary

Large Audio Language Models exhibit significant trustworthiness challenges despite performance advances, requiring comprehensive frameworks addressing security vulnerabilities and defensive strategies.

Abstract

The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Community

aidawang

about 4 hours ago

AustinXiao

Paper submitter about 3 hours ago

Most conversations about Multimodal LLMs and universal auditory intelligence focus purely on model capabilities and performance scaling. In our new comprehensive survey, "A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook", we make a critical argument: for real-world deployment, empirical performance means nothing without intrinsic trustworthiness. The evidence is hard to ignore. Recent benchmarks reveal that the transition to unified end-to-end audio frameworks has dramatically expanded the attack surface.

We evaluate the state-of-the-art landscape across six analytical pillars: Hallucination, Robustness, Safety, Privacy, Fairness, and Authentication. The survey systematically uncovers a profound imbalance between a mature offensive ecosystem and fragmented, reactive defenses. To bridge this chasm, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering.

If you're building real-time full-duplex conversational agents, voice assistants, speech security systems, or anything that interacts with live acoustic data, we hope you'll find something vital here.

📄 Paper: https://arxiv.org/abs/2605.20266
💻 Project: https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs
😊 Hugging Face: https://huggingface.co/papers/2605.20266

Huge thanks to my incredible co-authors

Get this paper in your agent:

hf papers read 2605.20266

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
