Hugging Face Daily Papers · 5 min read

Improving Text-to-Music Generation with Human Preference Rewards

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.21670

Published on Jun 19 · Submitted by Yonghyun Kim on Jun 23
Authors: Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Chris Donahue

Abstract

A text-to-music generation system uses reward conditioning, expert iteration, and preference tuning to improve audio quality while maintaining efficiency within a 120M-parameter model framework.

We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the challenge protocol's FAD-CLAP and CLAP score, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained over open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, where training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. Per-stage decomposition on 100 Song Describer prompts shows training-time reward conditioning as a functional conditioning axis, expert iteration as the dominant contributor, the preference-tuning pass adding only noise-level gain, and the inference-time score scalar already saturated by the end of the chain.
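To make the idea of a reward axis that "doubles as an inference-time CFG axis" more concrete, here is a minimal sketch of one common way to compose classifier-free guidance over two conditioning signals (text and a target reward score). The abstract does not state the exact formulation, so the nested form below, the function name `joint_cfg`, and the weights `w_text` and `w_reward` are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def joint_cfg(
    v_uncond: Sequence[float],
    v_text: Sequence[float],
    v_text_reward: Sequence[float],
    w_text: float,
    w_reward: float,
) -> List[float]:
    """Combine three model predictions into one guided prediction.

    v_uncond:      prediction with text and reward both dropped
    v_text:        prediction conditioned on text only
    v_text_reward: prediction conditioned on text and a target reward
    """
    return [
        # Start from the unconditional prediction, push along the text
        # direction, then push along the reward direction.
        u + w_text * (t - u) + w_reward * (tr - t)
        for u, t, tr in zip(v_uncond, v_text, v_text_reward)
    ]


# Toy two-dimensional predictions (hypothetical values).
guided = joint_cfg([0.0, 1.0], [1.0, 1.0], [1.0, 3.0], w_text=2.0, w_reward=0.5)
print(guided)
```

With both weights set to 1 this form reduces to the fully conditioned prediction, and with `w_reward = 0` it reduces to ordinary text-only guidance, which is why the two weights can be tuned independently in such a scheme.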

Community

Yonghyun Kim (paper author, paper submitter)

Can a human preference reward improve a small text-to-music model without new labels or scaling up?
Our ICME 2026 ATTM Grand Challenge (Efficiency Track) entry puts an open human preference reward (TuneJury, trained on judgments from Music Arena, MusicPrefs, AIME, and SongEval) at the center of a 120M-parameter FluxAudio-S pipeline: training-time conditioning, expert-iteration ranking, and a short preference-tuning pass. The full pipeline trains in about 40 GPU hours on a single RTX A5000 and generates 10 s clips in under a second.
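The expert-iteration ranking step can be pictured as: generate candidates, score each with the reward model, and keep only the top-scoring fraction as fine-tuning data. In the minimal sketch below, `select_top_fraction` and the toy `score_fn` are hypothetical stand-ins; in the actual pipeline the score would come from TuneJury, and the abstract states that the kept fraction is the top decile.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
    fraction: float = 0.1,
) -> List[T]:
    """Return the highest-scoring `fraction` of candidates, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not candidates:
        return []
    # Keep at least one item so a small pool still yields training data.
    keep = max(1, math.floor(len(candidates) * fraction))
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


# Toy usage: integers stand in for generated clips, and the identity
# score is a placeholder for a learned preference reward.
clips = list(range(100))
experts = select_top_fraction(clips, score_fn=lambda c: float(c))
print(len(experts), experts[0], experts[-1])
```

The selected subset would then serve as the training set for the next fine-tuning round; how many rounds are run and how candidates are sampled are not specified here.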

Does it work?
Yes! Relative to the same 120M baseline, all three evaluation metrics improve:

  • TuneJury reward ↑ (Do people prefer it?): −0.39 → +0.53
  • FAD-CLAP ↓ (Does it sound like real music?): 0.60 → 0.42
  • CLAP score ↑ (Does it match the text prompt?): 0.23 → 0.29
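As a quick arithmetic check on the three reported numbers, the sketch below computes the absolute change for each metric and whether it moves in the desired direction. The values are copied from the list above; the helper name `summarize` and the tuple layout are illustrative assumptions.

```python
from typing import Dict, Tuple

# (baseline, tuned, higher_is_better), values taken from the list above.
RESULTS: Dict[str, Tuple[float, float, bool]] = {
    "TuneJury reward": (-0.39, 0.53, True),
    "FAD-CLAP": (0.60, 0.42, False),
    "CLAP score": (0.23, 0.29, True),
}


def summarize(
    results: Dict[str, Tuple[float, float, bool]],
) -> Dict[str, Tuple[float, bool]]:
    """Return {metric: (absolute_change, improved)}, rounded to 2 decimals."""
    summary = {}
    for name, (before, after, higher_is_better) in results.items():
        delta = round(after - before, 2)
        # A metric improves when it moves in its preferred direction.
        improved = delta > 0 if higher_is_better else delta < 0
        summary[name] = (delta, improved)
    return summary


for name, (delta, improved) in summarize(RESULTS).items():
    print(f"{name}: {delta:+.2f} ({'improved' if improved else 'worse'})")
```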

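🌐 Project page: https://huggingface.co/spaces/yonghyunk1m/TTM-HumanPref · 🎧 Listening samples: https://yonghyunk1m.github.io/TTM-HumanPref/ · 📄 Paper: https://arxiv.org/abs/2606.21670 · 💻 Code: https://github.com/yonghyunk1m/TTM-HumanPref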


