Hugging Face Daily Papers · 3 min read

MERIT: Learning Disentangled Music Representations for Audio Similarity

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Most similarity models collapse melody, rhythm, and timbre into a single undifferentiated score. MERIT exposes all three as independent, interpretable signals from the same audio query.

Project page and code: https://github.com/AMAAI-Lab/MERIT
arxiv:2605.27346


Published on May 26, 2026 · Submitted by Dorien Herremans on Jun 3, 2026
Authors: Abhinaba Roy, Junyi Liang, Dorien Herremans

Abstract

The MERIT framework learns disentangled music representations for melody, rhythm, and timbre through conditional audio generation and source-separated stems, enabling nuanced musical queries.

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we train with a novel strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
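
To make the idea of factor-specific similarity concrete, here is a minimal sketch in Python. It is not the MERIT implementation or its API. It assumes only that some model has already produced one embedding per factor for each clip, and the names `factor_similarity` and `weighted_score`, the toy vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import math

FACTORS = ("melody", "rhythm", "timbre")


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def factor_similarity(query, candidate):
    """Return one score per factor instead of a single monolithic score."""
    return {f: cosine(query[f], candidate[f]) for f in FACTORS}


def weighted_score(scores, weights):
    """Combine per-factor scores with user-chosen weights (hypothetical helper)."""
    total = sum(weights.values())
    return sum(weights[f] * scores[f] for f in weights) / total


# Toy embeddings, made up for illustration; these are not model outputs.
query = {"melody": [1.0, 0.0], "rhythm": [0.0, 1.0], "timbre": [1.0, 1.0]}
cover = {"melody": [2.0, 0.0], "rhythm": [1.0, 0.0], "timbre": [1.0, -1.0]}

scores = factor_similarity(query, cover)
print(scores)

# A "same melody, ignore everything else" query puts all weight on melody.
melody_only = {"melody": 1.0, "rhythm": 0.0, "timbre": 0.0}
print(weighted_score(scores, melody_only))
```

In this toy case the two clips match on melody and differ on rhythm and timbre, which a single combined score would blur together. Keeping the factors separate lets the caller decide how much each one matters for a given query.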



Get this paper in your agent:

hf papers read 2605.27346
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1

