Hugging Face Daily Papers · 4 min read

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Blog: http://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
Paper: https://arxiv.org/abs/2605.12484
Code: http://rishabhtiwari.ai/projects/fst/code
arxiv:2605.12484


Published on May 12 · Submitted by Rishabh Tiwari on May 13
Authors: Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri

AI-generated summary

A fast-slow learning framework for large language models combines fixed parameters with optimized context to achieve better sample efficiency, reduced catastrophic forgetting, and improved adaptability in continual learning scenarios.

Abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb task-specific information, while allowing slow weights to stay closer to the base model and preserve general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than slow-only learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
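The abstract describes two learning loops running at different time scales: a cheap "fast" loop that folds textual feedback into the model's context without touching parameters, and a rarer "slow" loop of parameter (RL) updates. The toy sketch below illustrates that interleaving only; the class and method names (`FastSlowLearner`, `fast_update`, `slow_update`, `slow_every`) are illustrative assumptions, not the paper's actual FST implementation.

```python
# Toy sketch of a fast-slow training loop (illustrative, not the paper's code).
# Fast weights = an optimized textual context; slow weights = model parameters,
# here stood in for by a simple update counter.

class FastSlowLearner:
    def __init__(self):
        self.context = []          # "fast" weights: task-specific textual context
        self.param_updates = 0     # stand-in for "slow" parameter (RL) updates

    def fast_update(self, feedback: str):
        # Fast loop: absorb task-specific information into the context cheaply,
        # leaving model parameters untouched (so the base model stays intact).
        self.context.append(feedback)

    def slow_update(self):
        # Slow loop: an occasional parameter update (e.g., one RL step).
        # Running this rarely keeps the model closer to the base (lower KL drift).
        self.param_updates += 1

    def train(self, feedback_stream, slow_every: int = 4):
        # Interleave many fast updates with sparse slow updates.
        for step, feedback in enumerate(feedback_stream, start=1):
            self.fast_update(feedback)
            if step % slow_every == 0:
                self.slow_update()

learner = FastSlowLearner()
learner.train([f"hint-{i}" for i in range(8)], slow_every=4)
print(len(learner.context), learner.param_updates)  # prints: 8 2
```

The point of the structure is the asymmetry: eight pieces of feedback produce eight cheap context updates but only two parameter updates, mirroring how FST lets the context absorb task specifics while the weights drift slowly.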


Get this paper in your agent:

hf papers read 2605.12484
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.12484 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.12484 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.12484 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

