Hugging Face Daily Papers · May 26, 2026 · 5 min read

Towards Customized Multimodal Role-Play

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. 
We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.
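
The abstract names group relative policy optimization (GRPO) as the second training stage but does not spell out the reward or the update rule. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for: several responses are sampled for one prompt, and each reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` parameter, and the example rewards are assumptions made for this illustration; this is not the authors' Character-GRPO implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own sample group (generic GRPO-style step).

    Hypothetical helper for illustration; not taken from the UniCharacter code.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps guards against division by zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled for a single prompt
example_rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(example_rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline. How the paper scores persona, style, and visual identity to produce these rewards is not described in the abstract.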

arxiv:2605.08129

Published on May 1

Submitted by Chao Tang on May 26

Authors: Chao Tang, Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Hao Jiang, Jiangning Zhang, Yunhai Tong

Abstract

A new task and dataset for customized multimodal role-play are introduced, along with a unified model framework that enables consistent character customization across text and image modalities using few-shot learning.

AI-generated summary

arXiv: arxiv.org/abs/2605.08129 · Project page: https://tangc03.github.io/UniCharacter.github.io/ · GitHub: https://github.com/Tangc03/UniCharacter

Get this paper in your agent:

hf papers read 2605.08129

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
