Hugging Face Daily Papers · June 3, 2026 · 5 min read

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.01717

Published on Jun 1

Submitted by Geewook Kim on Jun 3

Authors: Minsik Choi, Geewook Kim

Abstract

Instruction tuning of large language models can be improved through decentralized training that partitions mixed datasets based on gradient conflicts and merges results via weighted averaging, achieving performance comparable to centralized methods with reduced communication overhead.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.
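The final merge step named in the abstract, token-weighted averaging, can be illustrated with a minimal sketch. It assumes each branch checkpoint is a dictionary of NumPy arrays with identical keys and shapes, and that a branch's weight is its share of the total training tokens. The function name `token_weighted_merge` and the toy values are hypothetical; the released MERIT code may define the weights differently.

```python
import numpy as np


def token_weighted_merge(branch_weights, token_counts):
    """Average per-branch parameter dicts, weighting each branch by its token count.

    Assumption (not from the paper): weights are simply token_count / total_tokens.
    """
    counts = np.asarray(token_counts, dtype=np.float64)
    if len(branch_weights) == 0 or len(branch_weights) != len(counts):
        raise ValueError("need exactly one token count per branch")
    if np.any(counts < 0) or counts.sum() == 0:
        raise ValueError("token counts must be non-negative and not all zero")

    coeffs = counts / counts.sum()  # convex combination: coefficients sum to 1
    merged = {}
    for name in branch_weights[0]:
        merged[name] = sum(
            c * np.asarray(w[name], dtype=np.float64)
            for c, w in zip(coeffs, branch_weights)
        )
    return merged


# Toy example: two branches that saw 300 and 100 tokens respectively.
branches = [
    {"layer.weight": np.array([1.0, 2.0])},
    {"layer.weight": np.array([3.0, 6.0])},
]
merged = token_weighted_merge(branches, token_counts=[300, 100])
print(merged["layer.weight"])  # coefficients are 0.75 and 0.25
```

Because the coefficients form a convex combination, the merged parameters stay inside the region spanned by the branch checkpoints, which is the setting the abstract's shared-basin argument refers to.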

arXiv: https://arxiv.org/abs/2606.01717 · Project page: https://naver-ai.github.io/merit/ · GitHub: https://github.com/naver-ai/merit

Community

gwkrsrch

Paper submitter about 11 hours ago

Large-scale instruction tuning hits two walls: heterogeneous tasks produce conflicting gradients (negative transfer), and joint training needs constant gradient sync across a tightly-coupled cluster. We show both can be handled at once by training parts of the mixture independently and reconciling them once in parameter space.

[Figure: MERIT pipeline, centralized joint training vs. MERIT]

MERIT (1) estimates dataset-level gradient conflicts at a shared (merge-ready) initialization, (2) splits the mixture along the top PCA conflict axes, (3) fine-tunes each branch with zero cross-partition communication, and (4) merges once via token-weighted averaging.
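Steps (1) and (2) can be sketched in a few lines, under loud assumptions: each dataset is summarized by one flattened gradient vector taken at the shared initialization, conflict is read off as negative cosine similarity, and the mixture is split by the sign of each dataset's projection onto the top principal components. The function names and the sign-based assignment rule are illustrative choices, not the procedure confirmed by the paper.

```python
import numpy as np


def unit_rows(grads):
    """Normalize each dataset's gradient vector to unit length."""
    g = np.asarray(grads, dtype=np.float64)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / np.clip(norms, 1e-12, None)


def split_by_conflict_axes(grads, n_axes=1):
    """Assign each dataset a partition id from the signs of its top-PCA scores.

    Hypothetical rule: n_axes principal axes give up to 2**n_axes partitions.
    """
    unit = unit_rows(grads)
    centered = unit - unit.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes of the centered gradient matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if not 1 <= n_axes <= vt.shape[0]:
        raise ValueError("n_axes must be between 1 and the number of available axes")
    scores = centered @ vt[:n_axes].T          # shape: (n_datasets, n_axes)
    bits = (scores > 0).astype(int)            # one sign bit per axis
    return bits @ (2 ** np.arange(n_axes))     # encode the sign pattern as an id


# Toy example: datasets 0 and 1 agree with each other and oppose datasets 2 and 3.
grads = np.array([[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]])
unit = unit_rows(grads)
cos = unit @ unit.T                            # negative entries indicate conflict
ids = split_by_conflict_axes(grads, n_axes=1)
print(cos[0, 2] < 0, ids)
```

The sign of a principal axis is arbitrary, so only the grouping is meaningful, not which group receives id 0. Each resulting partition would then be fine-tuned independently, as in step (3).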

On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, the 8-benchmark average improves 54.3 → 57.0 with no gradient communication during fine-tuning. The method also scales to a 7B / 1.6M-example / 176-source mixture (matching or beating centralized joint training at minimal overhead) and transfers to text-only FLAN. We will publicly open-source our work at https://github.com/naver-ai/merit
