MERIT: Learning Disentangled Music Representations for Audio Similarity
Authors: Abhinaba Roy, Junyi Liang, Dorien Herremans
Published: 2026-05-26 · arXiv: 2605.27346
Code: https://github.com/AMAAI-Lab/MERIT
Abstract
The MERIT framework learns disentangled music representations for melody, rhythm, and timbre using conditional audio generation and source-separated stems, enabling nuanced musical queries.
Current music similarity models typically compute a single, monolithic score that entangles distinct musical dimensions such as melody, rhythm, and timbre. This limits user control and interpretability and prevents nuanced, factor-specific queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we propose a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
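To make the idea of factor-specific similarity concrete, here is a minimal sketch of how three separate head embeddings could be compared and combined. It is not MERIT's actual API: the function names, the dictionary layout, the toy two-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import math

FACTORS = ("melody", "rhythm", "timbre")

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

def factor_similarities(query, candidate):
    """Return one score per factor rather than a single monolithic score."""
    return {f: cosine(query[f], candidate[f]) for f in FACTORS}

def weighted_score(sims, weights):
    """Combine per-factor scores using user-chosen weights (hypothetical)."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one factor weight must be non-zero")
    return sum(w * sims[f] for f, w in weights.items()) / total

# Toy embeddings standing in for the outputs of three factor-specific heads.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
cover = {"melody": [1.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0]}

sims = factor_similarities(query, cover)
# A melody-only query ignores the rhythm and timbre mismatch entirely.
melody_only = weighted_score(sims, {"melody": 1.0})
```

In this toy case the two items match on melody and differ on rhythm and timbre, so a melody-only query ranks the candidate highly while an equally weighted query would not. A single entangled score cannot express that distinction.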
Community
Most similarity models collapse melody, rhythm, and timbre into a single undifferentiated score. MERIT exposes all three as independent, interpretable signals from the same audio query.