ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
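Authors: Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan
Organization: University of Southern California
Published: 2026-05-28
Project page: https://tiantiaf0627.github.io/childvox/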
Abstract
AI-generated summary: ChildVox is a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models.
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical-syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
Community
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across more than 15 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison.
We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
arXiv: arxiv.org/abs/2605.29257