Hugging Face Daily Papers · June 1, 2026 · 3 min read

Trust-Region Behavior Blending for On-Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.31159


Published on May 29 · Submitted by Alexey Gorbatovski on Jun 1 · T-Tech


Authors:

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

Abstract

AI-generated summary: Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
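To make the warmup idea concrete, here is a minimal per-token sketch of choosing a behavior distribution that is as close to the teacher as possible while staying inside a KL ball around the student. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which KL direction defines "closest to teacher", how the constrained problem is solved, or what the annealing schedule looks like. The sketch assumes the problem is to minimize KL(b || teacher) subject to KL(b || student) ≤ ε, in which case the solution lies on the geometric interpolation between student and teacher and can be found by bisection on the mixing weight. All function names (`trust_region_behavior`, `kl_budget`, and so on) and the linear schedule are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_blend(student, teacher, lam):
    """Normalized student^(1 - lam) * teacher^lam (assumes positive probabilities)."""
    logits = [(1.0 - lam) * math.log(s) + lam * math.log(t)
              for s, t in zip(student, teacher)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]


def trust_region_behavior(student, teacher, eps, iters=50):
    """Assumed formulation: argmin_b KL(b || teacher) s.t. KL(b || student) <= eps.

    Under this assumption the minimizer is a geometric blend, and
    KL(blend || student) grows with the mixing weight, so bisection applies.
    """
    if eps <= 0.0:
        return list(student)      # zero budget: pure student rollouts
    if kl(teacher, student) <= eps:
        return list(teacher)      # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= eps:
            lo = mid              # still inside the region: move toward the teacher
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def kl_budget(step, warmup_steps, eps0):
    """Hypothetical linear schedule: anneal the KL budget from eps0 to zero."""
    return eps0 * max(0.0, 1.0 - step / warmup_steps)
```

During warmup, rollout tokens would be sampled from `trust_region_behavior(student, teacher, kl_budget(step, warmup_steps, eps0))`, while the training loss on the resulting prefixes stays the usual reverse-KL OPD loss. Once the budget reaches zero the function returns the student distribution, which matches the abstract's statement that training returns to pure student rollouts after warmup.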


Get this paper in your agent:

hf papers read 2605.31159

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
