Hugging Face Daily Papers · May 21, 2026 · 3 min read

Stable Audio 3

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.17991

Published on May 18, 2026 · Submitted by Niels Rogge on May 21, 2026

Stability AI

Authors: Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

Project page: https://stability.ai/news-updates/meet-stable-audio-3-the-model-family-built-for-artistic-experimentation-with-open-weight-models

Abstract

AI-generated summary: Stable Audio 3 enables efficient variable-length audio generation and editing through latent diffusion models operating on a semantic-acoustic autoencoder, with adversarial post-training for improved speed and quality.

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
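To make the variable-length and inpainting ideas in the abstract more concrete, here is a minimal sketch of how a latent-frame mask for editing or continuation might be built. It is not taken from the Stable Audio 3 code: the latent frame rate (`LATENT_FPS`) and the helper names (`num_latent_frames`, `inpaint_mask`, `blend`) are assumptions chosen for illustration, and real latents would be multi-channel tensors rather than lists of floats.

```python
# Illustrative sketch only: the frame rate and helper names below are
# assumptions for this example, not values or APIs from Stable Audio 3.
import math

LATENT_FPS = 25  # hypothetical number of latent frames per second


def num_latent_frames(seconds: float, fps: int = LATENT_FPS) -> int:
    """Latent sequence length for a clip of the given duration."""
    return math.ceil(seconds * fps)


def inpaint_mask(total_s: float, start_s: float, end_s: float,
                 fps: int = LATENT_FPS) -> list[int]:
    """Return 1 for frames to regenerate and 0 for frames to keep."""
    n = num_latent_frames(total_s, fps)
    lo = max(0, math.floor(start_s * fps))
    hi = min(n, math.ceil(end_s * fps))
    return [1 if lo <= i < hi else 0 for i in range(n)]


def blend(known: list[float], generated: list[float],
          mask: list[int]) -> list[float]:
    """Keep known latents where mask is 0, use generated ones where it is 1."""
    return [g if m else k for k, g, m in zip(known, generated, mask)]


# Variable length: a 3 s sound needs far fewer frames than a 180 s track,
# so generating only the requested duration avoids wasted computation.
short_n = num_latent_frames(3.0)
long_n = num_latent_frames(180.0)

# Continuation: keep the first 2 s of a 5 s clip and regenerate the rest.
mask = inpaint_mask(total_s=5.0, start_s=2.0, end_s=5.0)
known = [0.5] * len(mask)        # stand-in for encoded input audio
generated = [-1.0] * len(mask)   # stand-in for model output
result = blend(known, generated, mask)
```

Under these assumptions, targeted editing and continuation differ only in where the masked region sits: an edit masks a span in the middle of the clip, while a continuation masks everything after the existing recording.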


Community

nielsr

Paper submitter about 13 hours ago

Repo: https://github.com/Stability-AI/stable-audio-3


Get this paper in your agent:

hf papers read 2605.17991

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 10


Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.17991 in a dataset README.md to link it from this page.

Spaces citing this paper 2

Collections including this paper 2
