Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

Published on May 13 (arXiv:2605.12034) · Submitted by liu on May 15
Authors: Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, Fei Tian (StepFun)
AI-generated summary

Research demonstrates that current omni-modal benchmarks may inflate performance through visual shortcuts, and shows that post-training techniques can improve model performance on a cleaned benchmark with reduced visual leakage.
Abstract

Omni-modal language models are intended to jointly understand audio, visual, and language inputs, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 queries retained from the 16,968 audited. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe built on Qwen2.5-Omni-3B: mixed bi-modal supervised fine-tuning (SFT), mixed-modality reinforcement learning with verifiable rewards (RLVR), and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct, without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls for visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
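To make the audit concrete, below is a minimal sketch of visual-only probing as described in the abstract: each query is answered with the audio stream withheld, and queries the probe answers correctly from visuals alone are treated as visually solvable and dropped. This is not the authors' code; the `Query` fields, the `answer_fn` interface, and the exact-match retention rule are assumptions for illustration, and the paper may use different probe models or decision criteria.

```python
# Hypothetical sketch of a visual-only probing filter (not the paper's implementation).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Query:
    question: str
    options: list[str]   # multiple-choice options
    answer: str          # ground-truth option label, e.g. "B"
    video_path: str      # visual stream
    audio_path: str      # audio stream (withheld during probing)

def visual_only_probe(
    queries: list[Query],
    answer_fn: Callable[[str, list[str], str, Optional[str]], str],
) -> tuple[list[Query], list[Query]]:
    """Split queries into (retained, visually_solvable).

    answer_fn(question, options, video_path, audio_path) returns the model's
    chosen option label; passing audio_path=None withholds the audio stream.
    """
    retained: list[Query] = []
    solvable: list[Query] = []
    for q in queries:
        pred = answer_fn(q.question, q.options, q.video_path, None)  # no audio
        if pred == q.answer:
            solvable.append(q)   # answerable from visuals alone: drop
        else:
            retained.append(q)   # requires audio evidence: keep
    return retained, solvable
```

With a concrete `answer_fn` wrapping an omni-modal model, the retained split would play the role of OmniClean. Note the sketch covers only the filterable case; per the abstract, benchmark subsets where such filtering is undefined or would make comparisons unstable are kept whole.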