Hugging Face Daily Papers · June 5, 2026 · 3 min read

Multimodal Music Recommendation System using LLMs

#model-release #multimodal #reasoning #music

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

arxiv:2606.00125

Published on May 28 · Submitted by Franck Dernoncourt on Jun 5

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy.

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories and overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. While some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
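
The Recall and NDCG gains quoted in the abstract are relative improvements over ID-only baselines. The abstract does not state the cutoff K or the exact evaluation protocol, so the following is only a minimal sketch under a common convention in sequential recommendation: each session has a single held-out next item, Recall@K is 1 when that item appears in the top K, and NDCG@K is 1/log2(rank + 1) for a 1-indexed rank. The function names, the toy sessions, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def recall_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out target is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """Single-target NDCG@k: 1/log2(rank + 1) for a 1-indexed rank within the top k."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-indexed position of the target
    return 1.0 / math.log2(rank + 1)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical toy data: (model ranking, held-out next song) per session.
sessions = [
    (["s3", "s7", "s1", "s9"], "s7"),  # target at rank 2
    (["s2", "s5", "s8", "s4"], "s4"),  # target at rank 4, outside the top 3
    (["s6", "s1", "s3", "s2"], "s9"),  # target not retrieved at all
]
k = 3

mean_recall = sum(recall_at_k(r, t, k) for r, t in sessions) / len(sessions)
mean_ndcg = sum(ndcg_at_k(r, t, k) for r, t in sessions) / len(sessions)
print(f"Recall@{k}={mean_recall:.3f} NDCG@{k}={mean_ndcg:.3f}")

# Made-up scores, only to show how a relative percentage gain is computed.
print(f"{relative_improvement(0.30, 0.20):.1f}% relative improvement")
```

Under this convention NDCG rewards placing the target near the top of the list, whereas Recall only checks whether it appears at all, which is one reason the two relative gains can differ.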

Community

Franck-Dernoncourt (paper submitter):

Dataset: https://zenodo.org/records/20431748

Get this paper in your agent:

hf papers read 2606.00125

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
