MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
Authors: Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu, Bryan Dai
Published: 2026-05-29
Project page: https://huggingface.co/collections/Multilingual-Multimodal-NLP/mira
Abstract
MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
Community
🧠 MIRA is a data selection framework for the mid-training stage of LLM development — the phase between pretraining and post-training that uses large-scale curated data to strengthen capabilities like coding, reasoning, and tool use. 💡 The core challenge is that mid-training corpora are extremely heterogeneous, mixing web documents, code, math, agent traces, and tool-use logs, so no single quality criterion works across all sources.
🔑 MIRA's key insight is to make rubric construction itself part of data selection, rather than relying on fixed or global quality criteria. It operates in four steps:
1️⃣ Source Clustering 🗂️: Groups 21 data sources into capability-coherent clusters based on content-embedding similarity;
2️⃣ Self-Anchored Rubric Discovery 🔍: A frontier teacher model (Kimi-K2.6) freely proposes quality dimensions for sampled records, which are then clustered into group-specific anchor rubrics;
3️⃣ Anchored Judge Distillation 🎓: These fixed rubrics are used to generate structured teacher labels (~2M scored records), which are distilled into lightweight group-specific student scorers (Qwen3.5-35B-A3B) for full-corpus inference;
4️⃣ Source-Aware Filtering 🎯: Reliability masking suppresses unreliable scoring dimensions, and per-group retention thresholds preserve source diversity.
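Step 1 above groups sources by content-embedding similarity. The post does not say which embedding model or clustering algorithm is used, so the following is only a minimal sketch under stated assumptions: each source is represented by the centroid of a few sampled record embeddings, and sources are grouped greedily by cosine similarity. The function names, the `threshold` value, and the toy source names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def group_sources(source_embeddings, threshold=0.8):
    """Greedy grouping (an assumption, not the paper's algorithm):
    a source joins the first group whose seed centroid is similar enough,
    otherwise it starts a new group."""
    groups = []  # list of (seed_centroid, [source names])
    for name, vectors in source_embeddings.items():
        c = centroid(vectors)
        for seed, members in groups:
            if cosine(seed, c) >= threshold:
                members.append(name)
                break
        else:
            groups.append((c, [name]))
    return [members for _, members in groups]

# Toy 2-D "embeddings"; the source names are made up for illustration.
toy = {
    "github_code": [[0.9, 0.1], [1.0, 0.0]],
    "code_qa": [[0.8, 0.2], [0.9, 0.2]],
    "math_web": [[0.1, 0.9], [0.0, 1.0]],
}
groups = group_sources(toy)
print(groups)
```

In this toy run the two code-like sources land in one group and the math-like source in another. A real pipeline would likely use a standard clustering method over high-dimensional embeddings; the greedy pass here is only meant to make the idea concrete.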
📊 In code-oriented mid-training experiments on Qwen2.5-Coder-14B, MIRA-Group uses only 25B tokens — half the full 50B-token corpus 🔥 — yet achieves the best macro average (64.20) across nine benchmarks, outperforming perplexity filtering, DSIR, DataMan, and random selection, while matching the unfiltered full-corpus run.
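For readers unfamiliar with the reporting, here is a small sketch of the arithmetic, assuming "macro average" means the usual unweighted mean over benchmarks. The benchmark names and scores below are placeholders, not results from the paper; only the 25B and 50B token counts come from the post.

```python
def macro_average(scores):
    """Unweighted mean over benchmarks: every benchmark counts equally,
    regardless of how many test items it contains."""
    return sum(scores.values()) / len(scores)

# Placeholder scores for illustration only (not from the paper).
example = {"bench_a": 60.0, "bench_b": 70.0, "bench_c": 65.0}
avg = macro_average(example)

# Token budget stated in the post: 25B selected out of a 50B-token corpus.
full_tokens = 50e9
selected_tokens = 25e9
retained_fraction = selected_tokens / full_tokens

print(avg, retained_fraction)
```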
🔬 Analysis shows that MIRA's scores are robust to sequence length 📏, its discovered rubrics are source-adaptive while still subsuming generic quality criteria ✅, and its reliability masking effectively identifies and suppresses poorly calibrated scoring dimensions 🛡️.
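Reliability masking and per-group retention (step 4) could be sketched as follows. The post does not specify how reliability is measured or how thresholds are set, so this sketch assumes that a dimension is "reliable" when student and teacher scores correlate on held-out records, and that each group keeps a fixed top fraction. All names, the `min_corr` and `keep_fraction` values, and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def reliable_dimensions(teacher, student, min_corr=0.5):
    """Keep dimensions where the student tracks the teacher (assumed criterion)."""
    return {d for d in teacher if pearson(teacher[d], student[d]) >= min_corr}

def masked_score(record_scores, reliable):
    """Average only over dimensions that survived the reliability mask."""
    kept = [s for d, s in record_scores.items() if d in reliable]
    return sum(kept) / len(kept) if kept else 0.0

def filter_per_group(records, reliable_by_group, keep_fraction=0.5):
    """Rank within each group and keep its top fraction, so no group
    is wiped out by a single global cutoff."""
    by_group = {}
    for rec in records:
        by_group.setdefault(rec["group"], []).append(rec)
    kept = []
    for group, recs in by_group.items():
        reliable = reliable_by_group[group]
        ranked = sorted(recs, key=lambda r: masked_score(r["scores"], reliable),
                        reverse=True)
        n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
        kept.extend(ranked[:n_keep])
    return kept

# Toy held-out scores: the student tracks the teacher on "correctness"
# but not on "style", so "style" is masked for the code group.
teacher = {"correctness": [1, 2, 3, 4], "style": [1, 2, 3, 4]}
student = {"correctness": [1, 2, 3, 5], "style": [3, 1, 4, 2]}
reliable = reliable_dimensions(teacher, student)

records = [
    {"id": "a", "group": "code", "scores": {"correctness": 4, "style": 1}},
    {"id": "b", "group": "code", "scores": {"correctness": 2, "style": 5}},
    {"id": "c", "group": "math", "scores": {"correctness": 1, "style": 5}},
    {"id": "d", "group": "math", "scores": {"correctness": 4, "style": 3}},
]
reliable_by_group = {"code": reliable, "math": {"correctness", "style"}}
kept = filter_per_group(records, reliable_by_group)
print([r["id"] for r in kept])
```

In the toy data, record "b" would outrank "a" if the unreliable "style" dimension were included; masking it flips the ranking. Ranking within each group rather than globally is one simple way to preserve source diversity, though the paper's actual thresholding may differ.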
Cite arxiv.org/abs/2605.30288 in a model, dataset, or Space README.md to link it from this page.