Project page: https://lixin.ai/ListOPD
The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
Published on May 9
· Submitted by XinLi on May 14

Abstract

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can surpass the teacher in-domain, but beyond a threshold lambda* the same update violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form, base-relative clip-safety threshold lambda*(p, b, c) determined by three measurable quantities: the teacher's modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, turning training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks in which a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests -- a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction -- fall within their locked prediction windows, with the small-clip value matching the closed-form prediction to below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity changes sharply at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

AI-generated summary

On-policy distillation with reward extrapolation exhibits a safety threshold beyond which structured output tasks lose format preservation, with empirical validation showing performance parity at reduced parameter count when operating below this threshold.
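The two ingredients of the cliff mechanism, base-relative extrapolation past the teacher and a clipped importance-sampling weight, can be sketched generically. This is a minimal illustration, not the authors' code: the function names, the odds-space form of the extrapolation, and the symmetric clip range are assumptions, and the paper's actual closed-form lambda*(p, b, c) is derived in the full text.

```python
import math


def clipped_is_weight(student_p: float, teacher_p: float, clip_c: float) -> float:
    """Importance ratio student_p / teacher_p, truncated to [1/clip_c, clip_c]
    (a common clipping scheme; the paper's clip strength c plays this role)."""
    ratio = student_p / teacher_p
    return max(1.0 / clip_c, min(ratio, clip_c))


def extrapolated_target(teacher_p: float, base_p: float, lam: float) -> float:
    """Base-relative extrapolation of a single Bernoulli 'format' probability.

    lam = 1 recovers the teacher; lam > 1 pushes beyond it, relative to the
    base/warm-start mass. Done here in logit (odds) space so the result stays
    a valid probability -- an illustrative choice, not the paper's exact rule.
    """
    logit = lambda p: math.log(p / (1.0 - p))
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    target = logit(base_p) + lam * (logit(teacher_p) - logit(base_p))
    return sigmoid(target)


# Example: near-deterministic teacher (p = 0.98), warm-start mass b = 0.90.
# As lam grows, the extrapolated target drifts away from the teacher and the
# importance ratio eventually saturates the clip -- the qualitative picture
# behind a clip-safety boundary.
for lam in (1.0, 1.5, 2.0, 4.0):
    t = extrapolated_target(0.98, 0.90, lam)
    w = clipped_is_weight(t, 0.98, clip_c=2.0)
    print(f"lam={lam:.1f}  target={t:.4f}  clipped_weight={w:.3f}")
```

The sketch only shows how the three measurable quantities (teacher modal probability, warm-start mass, clip strength) interact; locating the exact threshold lambda* requires the fixed-point analysis in the paper.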
Community
Cite arxiv.org/abs/2605.08737 in a model, dataset, or Space README.md to link it from this page.
Discussion (0)