Hugging Face Daily Papers · June 3, 2026 · 6 min read

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


🧠 MIRA is a data selection framework for the mid-training stage of LLM development, the phase between pretraining and post-training that uses large-scale curated data to strengthen capabilities like coding, reasoning, and tool use. 💡 The core challenge is that mid-training corpora are extremely heterogeneous, mixing web documents, code, math, agent traces, and tool-use logs, so no single quality criterion works across all sources.


arxiv:2605.30288


Published on May 29 · Submitted by Jian Yang on Jun 3


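Authors: Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu, Bryan Dai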

Abstract

MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.


Community

csjiaya

Paper submitter about 7 hours ago

🔑 MIRA's key insight is to make rubric construction itself part of data selection, rather than relying on fixed or global quality criteria. It operates in four steps:

1️⃣ Source Clustering 🗂️: Groups 21 data sources into capability-coherent clusters based on content-embedding similarity;

2️⃣ Self-Anchored Rubric Discovery 🔍: A frontier teacher model (Kimi-K2.6) freely proposes quality dimensions for sampled records, which are then clustered into group-specific anchor rubrics;

3️⃣ Anchored Judge Distillation 🎓: These fixed rubrics are used to generate structured teacher labels (~2M scored records), which are distilled into lightweight group-specific student scorers (Qwen3.5-35B-A3B) for full-corpus inference;

4️⃣ Source-Aware Filtering 🎯: Reliability masking suppresses unreliable scoring dimensions, and per-group retention thresholds preserve source diversity.
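To make step 4 concrete, here is a minimal sketch of how reliability masking and per-group retention could fit together. It is not the authors' implementation: the record layout, the `reliable_dims` mask, and the use of a per-group keep fraction (rather than, say, a score cutoff) are all assumptions made for illustration.

```python
from collections import defaultdict
from math import ceil
from statistics import mean


def filter_by_group(records, reliable_dims, keep_fraction):
    """Keep the top-scoring share of records within each source group.

    records: list of dicts with "id", "group", and "scores" (dimension -> float).
    reliable_dims: group -> set of dimensions left unmasked (hypothetical mask).
    keep_fraction: group -> fraction of that group's records to retain (assumed).
    """
    by_group = defaultdict(list)
    for rec in records:
        dims = reliable_dims[rec["group"]]
        # Reliability masking: drop scores from dimensions flagged as unreliable.
        kept = [s for d, s in rec["scores"].items() if d in dims]
        if not kept:
            continue  # no trustworthy dimension left to score this record
        by_group[rec["group"]].append((mean(kept), rec["id"]))

    selected = []
    for group, scored in by_group.items():
        # Rank inside the group only, so no single source group crowds out the rest.
        scored.sort(reverse=True)
        n_keep = max(1, ceil(len(scored) * keep_fraction[group]))
        selected.extend(rec_id for _, rec_id in scored[:n_keep])
    return selected


# Toy records; the dimension names are made up for this example.
records = [
    {"id": "code-1", "group": "code", "scores": {"correctness": 0.9, "style": 0.1}},
    {"id": "code-2", "group": "code", "scores": {"correctness": 0.4, "style": 0.9}},
    {"id": "math-1", "group": "math", "scores": {"rigor": 0.8}},
    {"id": "math-2", "group": "math", "scores": {"rigor": 0.3}},
]
masked = filter_by_group(
    records,
    reliable_dims={"code": {"correctness"}, "math": {"rigor"}},
    keep_fraction={"code": 0.5, "math": 0.5},
)
unmasked = filter_by_group(
    records,
    reliable_dims={"code": {"correctness", "style"}, "math": {"rigor"}},
    keep_fraction={"code": 0.5, "math": 0.5},
)
```

In this toy run, masking the `style` dimension changes which code record survives, while ranking inside each group guarantees that both groups stay represented in the selection.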

📊 In code-oriented mid-training experiments on Qwen2.5-Coder-14B, MIRA-Group uses only 25B tokens — half the full 50B-token corpus 🔥 — yet achieves the best macro average (64.20) across nine benchmarks, outperforming perplexity filtering, DSIR, DataMan, and random selection, while matching the unfiltered full-corpus run.
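For readers unfamiliar with the metric, the sketch below shows the usual reading of a macro average, the unweighted mean of per-benchmark scores, together with the token-budget ratio quoted above. The benchmark names and scores in `toy_scores` are placeholders, not results from the paper; only the 25B and 50B token counts come from the post.

```python
def macro_average(scores_by_benchmark):
    """Unweighted mean over benchmarks, so each benchmark counts equally."""
    values = list(scores_by_benchmark.values())
    return sum(values) / len(values)


# Placeholder numbers for illustration only; not taken from the paper.
toy_scores = {"bench_a": 60.0, "bench_b": 70.0, "bench_c": 65.0}
toy_macro = macro_average(toy_scores)

# Token budget from the post: 25B selected tokens out of a 50B-token corpus.
selected_tokens = 25e9
full_corpus_tokens = 50e9
budget_ratio = selected_tokens / full_corpus_tokens
```

Because every benchmark has equal weight, a large benchmark with many test cases cannot dominate the comparison between selection methods.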

🔬 Analysis shows that MIRA's scores are robust to sequence length 📏, its discovered rubrics are source-adaptive while still subsuming generic quality criteria ✅, and its reliability masking effectively identifies and suppresses poorly-calibrated scoring dimensions 🛡️.
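One simple way to probe the length-robustness claim on your own scored data is to check whether scores correlate with sequence length. The post does not say how the authors measured this, so the Pearson-correlation check below is an assumed diagnostic, and the toy inputs are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: token lengths and quality scores for four hypothetical records.
lengths = [1, 2, 3, 4]
scores = [1.0, -1.0, -1.0, 1.0]
length_bias = pearson(lengths, scores)
```

A coefficient near zero suggests the scorer is not simply rewarding longer or shorter records, although it only captures linear dependence.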


