Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
Authors: Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun
Published: 2026-05-08 · arXiv:2605.07153
Abstract
Reinforcement learning improves large language model recall of parametric knowledge by redistributing probability mass toward correct answers, with gains driven primarily by reinforcing rare but learnable examples.
AI-generated summary
Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
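The training signal described above (binary correctness rewards in a closed-book, no-CoT setting) can be sketched as a simple reward function. This is a hedged illustration, not the paper's implementation: the normalization steps (lowercasing, stripping punctuation and English articles) follow common closed-book QA evaluation practice and may differ from the authors' exact matching criterion.

```python
import re
import string

def normalize(s: str) -> str:
    """Standard QA answer normalization: lowercase, drop punctuation,
    drop English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_reward(completion: str, gold_answers: list[str]) -> float:
    """Binary correctness reward: 1.0 if the model's answer matches any
    gold alias after normalization, else 0.0. No partial credit, no CoT."""
    return 1.0 if normalize(completion) in {normalize(g) for g in gold_answers} else 0.0
```

In an RL loop (e.g. a GRPO/PPO-style trainer), each sampled completion for a question would receive this 0/1 score as its only training signal.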
Community
We usually think of RL as the go-to tool for complex reasoning (CoT), but this paper demonstrates it is also highly effective at enhancing direct recall of parametric knowledge. In a strict non-CoT, closed-book QA setting, RL boosted direct factual recall by an average of ~27% (relative) across the Llama, Qwen, and OLMo model families.
The most striking takeaways:
1️⃣ No new facts are injected: RL simply redistributes probability mass, pulling correct answers from the low-probability tail into reliable greedy generations. In short, RL optimizes the recall of latent knowledge rather than acquiring new knowledge.
2️⃣ The outsized contribution of 0/128 examples: Remarkably, ~83% of the performance gain comes from training on the hardest examples (only ~18% of the training data), those where the correct answer never appeared in 128 pre-RL samples. As long as rare correct rollouts emerge even occasionally during training, RL captures and strongly reinforces them.
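The probability-redistribution mechanism in takeaway 1️⃣ can be illustrated with a toy experiment: a plain softmax policy over candidate answers (a stand-in for an LLM, not the paper's setup), trained with REINFORCE on a 0/1 correctness reward. The update only fires when the rare correct answer happens to be sampled, yet it steadily moves mass from the tail toward that answer.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)

# Toy policy over 4 candidate answers; index 3 is correct but starts
# deep in the low-probability tail (a known-but-rarely-sampled fact).
logits = [2.0, 1.0, 0.0, -2.0]
CORRECT = 3
LR = 1.0

p_before = softmax(logits)[CORRECT]

for _ in range(3000):
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = 1.0 if action == CORRECT else 0.0
    # REINFORCE: grad of log p(action) is (one-hot - probs); with a 0/1
    # reward, the policy only updates on the rare correct rollouts.
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += LR * reward * grad

p_after = softmax(logits)[CORRECT]
# Mass has shifted from the tail toward the correct answer; with enough
# steps it typically becomes the greedy (argmax) choice.
```

The analogue in the paper is that RL never needs the fact to be greedy-decodable beforehand; it only needs the correct answer to surface occasionally under sampling.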
Ultimately, this work broadens our understanding of RL's scope. RL is not just an optimizer for reasoning trajectories: the results provide compelling empirical evidence that LLMs truly "know more than they express," and that RL can narrow this accessibility gap.
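The 0/128 attribution analysis in takeaway 2️⃣ amounts to bucketing training questions by pre-RL pass@k. A minimal sketch, assuming a hypothetical `sample_fn(question, k)` hook that returns k sampled, already-normalized answers from the pre-RL model; both the hook and the exact-match comparison are illustrative stand-ins for the paper's sampling and scoring pipeline.

```python
def bucket_by_pre_rl_pass_k(sample_fn, questions, gold, k=128):
    """Split questions into 'hard' (correct answer never appears in k
    pre-RL samples, i.e. pass@k = 0) and 'easy' (appears at least once)."""
    hard, easy = [], []
    for q in questions:
        hits = sum(1 for ans in sample_fn(q, k) if ans == gold[q])
        (hard if hits == 0 else easy).append(q)
    return hard, easy

# Fake sampler for illustration: q2's answer never surfaces in k draws.
gold = {"q1": "paris", "q2": "oslo"}

def fake_sample_fn(q, k):
    return (["paris"] + ["x"] * (k - 1)) if q == "q1" else ["x"] * k

hard, easy = bucket_by_pre_rl_pass_k(fake_sample_fn, ["q1", "q2"], gold)
```

Per the paper, the `hard` bucket (~18% of training data) would drive ~83% of the eventual gain.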
