Project page: https://lixin.ai/ListOPD
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
Published on May 9
· Submitted by XinLi on May 14

Abstract

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can surpass the teacher in-domain, but beyond a threshold lambda* the same update violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form, base-relative clip-safety threshold lambda*(p, b, c) determined by three measurable quantities: the teacher's modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, turning training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks in which a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests -- a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction -- fall within their locked prediction windows, with the small-clip value matching the closed-form prediction to below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity changes sharply at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

AI-generated summary

On-policy distillation with reward extrapolation exhibits a safety threshold beyond which structured output tasks lose format preservation, with empirical validation showing performance parity at reduced parameter count when operating below this threshold.
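The two ingredients of the cliff mechanism, base-relative extrapolation past the teacher and a clipped importance-sampling weight, can be sketched generically. This is a minimal illustration, not the authors' code: the function names, the odds-space form of the extrapolation, and the symmetric clip range are assumptions, and the paper's actual closed-form lambda*(p, b, c) is derived in the full text.

```python
import math


def clipped_is_weight(student_p: float, teacher_p: float, clip_c: float) -> float:
    """Importance ratio student_p / teacher_p, truncated to [1/clip_c, clip_c]
    (a common clipping scheme; the paper's clip strength c plays this role)."""
    ratio = student_p / teacher_p
    return max(1.0 / clip_c, min(ratio, clip_c))


def extrapolated_target(teacher_p: float, base_p: float, lam: float) -> float:
    """Base-relative extrapolation of a single Bernoulli 'format' probability.

    lam = 1 recovers the teacher; lam > 1 pushes beyond it, relative to the
    base/warm-start mass. Done here in logit (odds) space so the result stays
    a valid probability -- an illustrative choice, not the paper's exact rule.
    """
    logit = lambda p: math.log(p / (1.0 - p))
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    target = logit(base_p) + lam * (logit(teacher_p) - logit(base_p))
    return sigmoid(target)


# Example: near-deterministic teacher (p = 0.98), warm-start mass b = 0.90.
# As lam grows, the extrapolated target drifts away from the teacher and the
# importance ratio eventually saturates the clip -- the qualitative picture
# behind a clip-safety boundary.
for lam in (1.0, 1.5, 2.0, 4.0):
    t = extrapolated_target(0.98, 0.90, lam)
    w = clipped_is_weight(t, 0.98, clip_c=2.0)
    print(f"lam={lam:.1f}  target={t:.4f}  clipped_weight={w:.3f}")
```

The sketch only shows how the three measurable quantities (teacher modal probability, warm-start mass, clip strength) interact; locating the exact threshold lambda* requires the fixed-point analysis in the paper.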
Community
Cite arxiv.org/abs/2605.08737 in a model, dataset, or Space README.md to link it from this page.
Discussion (0)