Hugging Face Daily Papers · June 3, 2026 · 4 min read

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


DOMINO is a new framework that generates high-quality training data using only reference examples rather than complex domain descriptions. By isolating core domain patterns from noise, it enables models to adapt to specialized fields more effectively. This approach significantly improves performance on complex tasks, like coding, where domain rules are hard to describe.


arxiv:2605.30039


Published on May 29

Submitted by Hang Yu on Jun 3


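Authors: Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang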

Abstract

DOMINO enables domain-specific data synthesis through an inductive approach that learns domain representations from reference examples, improving code benchmark performance without requiring explicit domain descriptions.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
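The abstract reports its gains in Pass@1 but does not say how the metric was computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation: for a problem with n sampled solutions of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: number that pass the tests,
    k: budget of attempts. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems (illustrative, not paper data).
example_results = [(10, 3), (10, 0), (10, 10)]
example_score = mean_pass_at_k(example_results, k=1)
```

With k = 1 the per-problem score is simply the fraction of samples that pass, so a benchmark-level Pass@1 is the mean of those fractions across problems.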


Get this paper in your agent:

hf papers read 2605.30039

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
