Hugging Face Daily Papers · 3 min read

Stable Audio 3

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Repo: https://github.com/Stability-AI/stable-audio-3
Project page: https://stability.ai/news-updates/meet-stable-audio-3-the-model-family-built-for-artistic-experimentation-with-open-weight-models

arxiv:2605.17991

Published on May 18, 2026 · Submitted by Niels Rogge on May 21, 2026
Authors: Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

AI-generated summary

Stable Audio 3 enables efficient variable-length audio generation and editing through latent diffusion models operating on a semantic-acoustic autoencoder, with adversarial post-training for improved speed and quality.

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
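
As a rough illustration of the variable-length and inpainting ideas described above, the sketch below computes how many latent frames a clip of a given duration would occupy and builds a per-frame mask marking which frames to keep and which to regenerate. The sample rate, the downsampling ratio, and the function names are assumptions chosen for illustration; they are not values or APIs taken from Stable Audio 3.

```python
import math

# Hypothetical values for illustration only; the abstract does not state the
# sample rate or the autoencoder's temporal downsampling ratio.
SAMPLE_RATE = 44_100
DOWNSAMPLE = 2_048


def latent_frames(seconds: float) -> int:
    """Number of latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * SAMPLE_RATE / DOWNSAMPLE)


def inpaint_mask(total_s: float, start_s: float, end_s: float) -> list[int]:
    """Return a per-frame mask: 1 = keep existing audio, 0 = regenerate.

    Every frame that overlaps the interval [start_s, end_s) is marked
    for regeneration.
    """
    n = latent_frames(total_s)
    first = math.floor(start_s * SAMPLE_RATE / DOWNSAMPLE)
    last = min(n, math.ceil(end_s * SAMPLE_RATE / DOWNSAMPLE))
    return [0 if first <= i < last else 1 for i in range(n)]


# Variable length: a 3 s sound effect needs far fewer frames than a 3 min track.
short_frames = latent_frames(3.0)
long_frames = latent_frames(180.0)

# Targeted edit: regenerate seconds 4 to 6 of a 10 s clip.
edit_mask = inpaint_mask(10.0, 4.0, 6.0)

# Continuation: keep a 5 s recording and generate the following 15 s.
continuation_mask = inpaint_mask(20.0, 5.0, 20.0)
```

Under these assumed settings, the diffusion model would only denoise as many frames as the requested duration needs, and the two masks show how editing and continuation can be treated as the same inpainting problem with different regions marked for regeneration.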


Get this paper in your agent:

hf papers read 2605.17991
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

