Hugging Face Daily Papers · May 29, 2026 · 6 min read

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical-syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
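As a rough illustration of what "systematic cross-corpus comparison" of several models over many sub-tasks can look like in practice, the sketch below scores each (model, task) pair and collects the scores into one table. Everything in it is an assumption made for illustration: the task names, the model names, the labels, and the choice of unweighted average recall as the metric are hypothetical and are not taken from ChildVox.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def score_models(predictions, references):
    """Return {model: {task: score}} for every (model, task) pair given."""
    table = {}
    for model, per_task in predictions.items():
        table[model] = {
            task: unweighted_average_recall(references[task], preds)
            for task, preds in per_task.items()
        }
    return table


# Toy placeholder labels (hypothetical, NOT ChildVox data or results).
references = {
    "toy_cry_detection": ["cry", "cry", "other", "other"],
    "toy_vocalization_type": ["babble", "laugh", "babble", "laugh"],
}
predictions = {
    "model_a": {
        "toy_cry_detection": ["cry", "other", "other", "other"],
        "toy_vocalization_type": ["babble", "laugh", "babble", "laugh"],
    },
    "model_b": {
        "toy_cry_detection": ["cry", "cry", "other", "cry"],
        "toy_vocalization_type": ["babble", "babble", "babble", "laugh"],
    },
}

results = score_models(predictions, references)
for model, row in results.items():
    print(model, row)
```

Keeping the scores in a single model-by-task table is what makes side-by-side comparison across corpora straightforward; a real benchmark would likely use task-appropriate metrics (for example, word error rate for recognition) rather than one metric for all tasks.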


arxiv:2605.29257


Published on May 28

Submitted by Tiantian Feng on May 29

University of Southern California


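Authors: Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan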

Abstract

ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models.

AI-generated summary

arXiv: arxiv.org/abs/2605.29257 · Project page: https://tiantiaf0627.github.io/childvox/

Community

tiantiaf (paper author, paper submitter) · 1 day ago

We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

librarian-bot · about 13 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

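The following papers were recommended by the Semantic Scholar API:

* Exploring Speech Foundation Models for Speaker Diarization Across Lifespan (2026): https://huggingface.co/papers/2604.05201
* MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation (2026): https://huggingface.co/papers/2604.17435
* Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs (2026): https://huggingface.co/papers/2604.12506
* Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages (2026): https://huggingface.co/papers/2604.18204
* When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition (2026): https://huggingface.co/papers/2605.02782
* Raon-Speech Technical Report (2026): https://huggingface.co/papers/2605.23912
* StepAudio 2.5 Technical Report (2026): https://huggingface.co/papers/2605.23463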

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


