LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
Songtao Wei, Yi Li, Zhikai Li, Xu Hu, Yuede Ji, Guanpeng Li, Feng Chen, Carl Yang, Zhichun Guo, Bingzhe Li
Abstract
Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model's own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.
AI-generated summary
LEAD is a method that dynamically adapts reasoning efficiency during training by using online calibration of correctness-efficiency trade-offs and adaptive problem-specific length targets to improve mathematical reasoning accuracy and efficiency.
Community
🚀 Excited to share LEAD: Length-Efficient Adaptive and Dynamic Reasoning for LLMs!
Reasoning LLMs are great at solving hard problems, but they tend to "think longer to think better" — even when the problem doesn't need it. Length-penalty RL fixes this in principle, but in practice every existing recipe makes two static assumptions that the underlying signals don't honor:
1. Fixed reward weights — but the correctness-vs-efficiency balance is non-stationary across training.
2. A single global length budget — but reasoning budgets vary by orders of magnitude across prompts.
LEAD replaces both with online, self-calibrating mechanisms:
🎛️ A Potential-Scaled Instability (PSI) controller adapts the weights every step from each reward's within-group variance and headroom-to-saturation — implementing an explore-then-anchor curriculum automatically (see the first sketch below).
📏 A per-problem online target estimated from the model's own correct rollouts, with a symmetric efficiency reward that penalizes over-compression as well as overthinking.
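To make the PSI idea concrete, here is a toy Python sketch of one way such a controller could weight the two reward channels. It is an illustration under my own assumptions (the variance-times-headroom score and the normalization are mine), not the paper's exact formula:

```python
import numpy as np

def psi_style_weights(correctness_rewards, efficiency_rewards, r_max=1.0, eps=1e-8):
    """Toy PSI-style controller (illustrative only, not the paper's formula):
    a reward channel gets more weight when it is both unstable within the rollout
    group (high variance -> informative signal) and far from saturation (headroom
    left to improve). Normalizing the two scores yields per-step adaptive weights."""
    def score(rewards):
        instability = np.var(rewards)                   # within-group variance
        potential = max(r_max - np.mean(rewards), 0.0)  # headroom to saturation
        return instability * potential + eps
    s_corr = score(correctness_rewards)
    s_eff = score(efficiency_rewards)
    total = s_corr + s_eff
    return s_corr / total, s_eff / total

# Early in training, correctness is noisy and far from saturated, so it dominates;
# once accuracy saturates, weight drifts toward the efficiency reward -- the
# "explore-then-anchor" behavior described above.
w_corr, w_eff = psi_style_weights([0.0, 1.0, 1.0, 0.0], [0.40, 0.50, 0.45, 0.50])
print(f"w_correctness={w_corr:.2f}, w_efficiency={w_eff:.2f}")
```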
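The per-problem target and symmetric reward can be pictured the same way. Again a hedged sketch: the Gaussian shape, the width parameter, and the fallback when no rollout is correct are my assumptions for illustration; only the "target from correct rollouts, penalize both directions" idea comes from the post:

```python
import numpy as np

def per_problem_target(rollout_lengths, rollout_correct):
    """Estimate the target length online from this prompt's own correct rollouts.
    Falling back to the mean of all rollouts when none are correct is an
    assumption made for this sketch."""
    correct = [l for l, c in zip(rollout_lengths, rollout_correct) if c]
    return float(np.mean(correct if correct else rollout_lengths))

def symmetric_efficiency_reward(length, target, width_ratio=0.5):
    """Toy symmetric length reward: maximal at the per-problem target, decaying
    for both overthinking (too long) and over-compression (too short). The
    Gaussian form and width are illustrative choices."""
    sigma = max(width_ratio * target, 1.0)
    return float(np.exp(-((length - target) ** 2) / (2.0 * sigma ** 2)))

# One prompt, four sampled rollouts (lengths in tokens).
lengths = [3200, 1800, 2600, 900]
correct = [True, True, False, False]
target = per_problem_target(lengths, correct)   # mean of the two correct rollouts
print([round(symmetric_efficiency_reward(l, target), 3) for l in lengths])
```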
Headline results (DeepSeek-R1-Distill-Qwen-1.5B, 4K token budget, 5 math benchmarks): LEAD reaches 53.36 accuracy / 3,714 tokens / +0.68 AES (Accuracy-Efficiency Score), and is the only method in the comparison that improves accuracy over the base model while also reducing output length.
📄 Paper: https://arxiv.org/abs/2605.09806
💻 Code: https://github.com/CrazyMint/LEAD
🤗 Model: https://huggingface.co/Kotom1/math_lead_4k_deepseek-r1-1.5b