Hugging Face Daily Papers · June 11, 2026 · 3 min read

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

arxiv:2606.11854

Published on Jun 10

Submitted by jinymusim on Jun 11

Abstract

ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modifying the computational graphs of pre-compiled, pre-optimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagating gradients into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.
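The core idea in the abstract, keeping the model frozen and running gradient descent on the pixels alone, can be illustrated with a toy sketch. This is not the authors' implementation: the "model" below is a hypothetical logistic scorer standing in for an MLLM, the weights, bias, and data are invented for illustration, and the gradient is written out by hand rather than obtained by backpropagation through a real network.

```python
import math

# Hypothetical frozen "model": a logistic score over [pixels ; text features].
# These weights are never updated; only the pixel array is trained.
W_IMG = (0.8, -0.5, 0.3, 0.6)   # assumed frozen weights for a 4-pixel "image"
W_TXT = (1.0, -1.0)             # assumed frozen weights for 2 text features
BIAS = -1.5                     # deliberately off, so the untouched model fits poorly

# Invented toy task: (text features, binary label).
DATA = [((1.0, 0.0), 1), ((0.9, 0.2), 1), ((0.0, 1.0), 0), ((0.2, 0.8), 0)]


def predict(pixels, text):
    """Frozen forward pass: probability of label 1."""
    z = BIAS
    z += sum(w * p for w, p in zip(W_IMG, pixels))
    z += sum(w * t for w, t in zip(W_TXT, text))
    return 1.0 / (1.0 + math.exp(-z))


def loss(pixels):
    """Mean binary cross-entropy over the toy task."""
    total = 0.0
    for text, y in DATA:
        p = predict(pixels, text)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(DATA)


def pixel_gradient(pixels):
    """d(loss)/d(pixel_j) = mean(p - y) * W_IMG[j] for this logistic model."""
    err = sum(predict(pixels, text) - y for text, y in DATA) / len(DATA)
    return [err * w for w in W_IMG]


def optimize_pixels(steps=200, lr=0.5):
    """Gradient descent on the pixel array, clamped to the valid range [0, 1]."""
    pixels = [0.5] * len(W_IMG)          # start from a flat gray image
    for _ in range(steps):
        grad = pixel_gradient(pixels)
        pixels = [min(1.0, max(0.0, p - lr * g)) for p, g in zip(pixels, grad)]
    return pixels


if __name__ == "__main__":
    start = [0.5] * len(W_IMG)
    trained = optimize_pixels()
    print("loss before:", loss(start))
    print("loss after: ", loss(trained))
    print("optimized pixels:", trained)
```

Because the trained artifact is an ordinary image, it could in principle be passed to an unmodified inference engine as regular visual input, which is the deployment property the abstract emphasizes. In this linear toy the image can only shift the model's bias; a real MLLM processes the image through its vision encoder, so the effect of the pixels would be far richer.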

Community

jinymusim (paper submitter): Paper about fine-tuning images to add better text-only capability to multimodal LLMs

Get this paper in your agent:

hf papers read 2606.11854

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
