Hugging Face Daily Papers · June 3, 2026 · 5 min read

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. The method is lightweight: it relies only on statistics of the final answers and can be trained and deployed on CPU. We further show that the resulting framework admits an interpretation as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, sampling rounds, and total samples required.
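The stop-or-continue loop described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the `State` fields, the fixed rule in `threshold_policy` (a hypothetical stand-in for the RL-trained controller), the batch sizes, and the penalty weights `lam_round` and `lam_sample` in `episode_reward`. The abstract confirms only that the controller sees statistics of final answers and chooses between stopping and drawing more samples.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class State:
    round_idx: int    # sampling rounds completed so far
    n_samples: int    # total answers collected so far
    top_share: float  # vote share of the current majority answer
    margin: float     # gap between the top two vote shares


def summarize(answers: List[str], round_idx: int) -> State:
    """Build the controller state from final-answer statistics only."""
    counts = Counter(answers).most_common(2)
    n = len(answers)
    top = counts[0][1] / n
    second = counts[1][1] / n if len(counts) > 1 else 0.0
    return State(round_idx, n, top, top - second)


def threshold_policy(state: State) -> int:
    """Hypothetical stand-in for the learned controller.

    Returns 0 to stop, or the number of extra samples to request.
    """
    if state.margin >= 0.5 or state.n_samples >= 16:
        return 0
    return 4


def adaptive_sample(
    sample_fn: Callable[[int], List[str]],
    policy: Callable[[State], int],
    first_batch: int = 4,
) -> Tuple[str, int, int]:
    """Run rounds of sampling until the policy says stop.

    Returns (majority answer, rounds used, total samples drawn).
    """
    answers = list(sample_fn(first_batch))
    rounds = 1
    while True:
        extra = policy(summarize(answers, rounds))
        if extra == 0:
            break
        answers.extend(sample_fn(extra))
        rounds += 1
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def episode_reward(correct: bool, rounds: int, n_samples: int,
                   lam_round: float = 0.05, lam_sample: float = 0.01) -> float:
    """Assumed reward shape: correctness minus latency and compute penalties."""
    return float(correct) - lam_round * rounds - lam_sample * n_samples


def make_fake_sampler(stream: Iterable[str]) -> Callable[[int], List[str]]:
    """Deterministic stand-in for an LLM: replays a fixed list of answers."""
    it = iter(stream)
    return lambda k: [next(it) for _ in range(k)]


if __name__ == "__main__":
    sampler = make_fake_sampler(["42", "41", "42", "40", "42", "42", "42", "42"])
    answer, rounds, n = adaptive_sample(sampler, threshold_policy)
    print(answer, rounds, n)  # prints: 42 2 8
```

In this toy run the first batch of four answers is split (margin 0.25), so the policy requests four more. The second batch pushes the margin to 0.625, and the loop stops after two rounds and eight samples. Because the state is a handful of scalars, a controller of this kind needs no GPU, which is consistent with the abstract's claim that training and deployment can run on CPU.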


arxiv:2606.03102


Published on Jun 2 · Submitted by Chengsong Huang on Jun 3


Authors: Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu

Abstract

Adaptive sampling for large language models is formulated as a Markov decision process and optimized using reinforcement learning to balance correctness, latency, and computational cost.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct
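The trade-off among correctness, latency, and computation can be written out as a short worked derivation. The notation below is assumed for illustration (the page gives no formulas): $\pi$ is the controller policy, $\hat{a}$ the returned answer, $a^\star$ the correct answer, $T$ the number of sampling rounds, $N$ the total number of samples, and $B_T$, $B_N$ the corresponding budgets. The paper's exact formulation may differ.

```latex
% Constrained problem: maximize accuracy under round and sample budgets.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\mathbf{1}\{\hat{a} = a^\star\}\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}[T] \le B_T, \qquad \mathbb{E}_{\pi}[N] \le B_N .

% Lagrangian with multipliers \lambda_T, \lambda_N \ge 0.
\mathcal{L}(\pi, \lambda_T, \lambda_N)
  = \mathbb{E}_{\pi}\!\left[\mathbf{1}\{\hat{a} = a^\star\}\right]
  - \lambda_T \left(\mathbb{E}_{\pi}[T] - B_T\right)
  - \lambda_N \left(\mathbb{E}_{\pi}[N] - B_N\right).

% For fixed multipliers the budget terms are constants, so maximizing over
% \pi is an unconstrained RL problem with episode return
R = \mathbf{1}\{\hat{a} = a^\star\} - \lambda_T\, T - \lambda_N\, N .
```

Read this way, the penalty weights on rounds and samples play the role of the multipliers: a larger $\lambda_T$ or $\lambda_N$ corresponds to a tighter budget on latency or total computation.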



Get this paper in your agent:

hf papers read 2606.03102

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
