Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Abstract
AI-generated summary
FEST is a few-shot demonstration-guided RLVR algorithm that achieves strong performance with minimal supervised fine-tuning data by combining a supervised signal, an on-policy signal, and decaying SFT weights to prevent overfitting.
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., conducting Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm that attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multi-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.
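To make the three components concrete, here is a minimal sketch of a combined objective of the kind the abstract describes: an on-policy RLVR term plus a few-shot SFT term whose weight decays over training. The linear decay schedule, the function name `fest_loss`, and the plain policy-gradient form are illustrative assumptions, not the paper's exact formulation; see the repository linked below for the actual implementation.

```python
import torch

def fest_loss(sft_logprobs, rl_logprobs, advantages, step, total_steps,
              lambda_init=1.0):
    """Illustrative combination of the three signals named in the abstract.

    sft_logprobs: token log-probs of the few-shot demonstrations (e.g., 128).
    rl_logprobs:  token log-probs of on-policy rollouts.
    advantages:   verifiable-reward advantages for those rollouts.
    """
    # On-policy signal: a plain policy-gradient term on verifiable rewards.
    rl_loss = -(rl_logprobs * advantages).mean()
    # Supervised signal: negative log-likelihood on the demonstrations.
    sft_loss = -sft_logprobs.mean()
    # Decaying weight on the few-shot SFT term, so that revisiting the
    # tiny demonstration set for many epochs does not cause overfitting.
    lam = lambda_init * max(0.0, 1.0 - step / total_steps)
    return rl_loss + lam * sft_loss

# Toy usage with random tensors standing in for real model outputs.
sft_lp = torch.randn(128, requires_grad=True)
rl_lp = torch.randn(64, requires_grad=True)
adv = torch.randn(64)
loss = fest_loss(sft_lp, rl_lp, adv, step=100, total_steps=1000)
loss.backward()
```

The key design point is the schedule `lam`: early in training the small demonstration set supplies learning signal when correct rollouts are rare, and its influence is annealed away as on-policy rollouts take over.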
Community
Discussion (1)
kaiyan289 (paper author): This is a work aimed at boosting RLVR performance using only a minimal amount of SFT data in a unified training paradigm. Check out our code at https://github.com/KaiYan289/FEST and our checkpoints/dataset at https://huggingface.co/collections/kaiyan289/fest!