LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
Abstract
AI-generated summary
LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions.
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency–performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
Community
As LLM agents read private data, call tools, and run multi-step workflows, guardrail failures stop being answer-quality issues — they become real harms: leaked secrets, unsafe actions, blocked legitimate work. And the hardest failures are contextual: local norms, org policies, evolving user expectations that no static guardrail can fully enumerate in advance.
Learning from deployment sounds obvious — but feedback is sparse, often noisy, and overreacting to a handful of cases easily breaks helpfulness or safety.
Three modules in LiSA:
① Broad policy abstraction — turn sparse failures into reusable policies
② Conflict-aware local policies — preserve boundary cues in mixed-label regions where a single broad rule would overgeneralize
③ Evidence-aware confidence gating — Beta posterior lower bound, so "validated once" ≠ "validated 100 times" (see the sketch below)
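To make ③ concrete, here is a minimal sketch (not the authors' code) of how a stored policy with success/failure counts could be gated by a Beta posterior lower bound. The Beta(1, 1) prior, the 5% quantile, the 0.7 threshold, and the `Policy` structure are illustrative assumptions, not values from the paper.

```python
# Minimal sketch, not the paper's implementation: an illustrative policy record
# plus an evidence-aware gate using a Beta posterior lower bound. Prior,
# quantile, and threshold are assumed values for illustration only.
from dataclasses import dataclass
from scipy.stats import beta


@dataclass
class Policy:
    rule: str            # natural-language rule induced from reported failures
    successes: int = 0   # times the rule was validated in deployment
    failures: int = 0    # times the rule was contradicted

    def confidence_lower_bound(self, q: float = 0.05) -> float:
        # Beta(1, 1) prior + Bernoulli evidence -> Beta(1 + s, 1 + f) posterior;
        # the q-quantile is a conservative estimate of the rule's reliability.
        return beta.ppf(q, 1 + self.successes, 1 + self.failures)

    def should_apply(self, threshold: float = 0.7) -> bool:
        # Gate memory reuse on accumulated evidence, not empirical accuracy alone.
        return self.confidence_lower_bound() >= threshold


# Same 100% empirical accuracy, very different amounts of evidence:
once = Policy("Never forward calendar invites containing client names", successes=1)
many = Policy("Never forward calendar invites containing client names", successes=100)
print(once.should_apply())   # False: 5% quantile of Beta(2, 1)   ~= 0.22
print(many.should_apply())   # True:  5% quantile of Beta(101, 1) ~= 0.97
```

The point of gating on a posterior lower bound rather than the mean: a rule seen once and a rule validated 100 times can share the same empirical accuracy, but only the heavily evidenced one clears a conservative bound.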
Results on PrivacyLens+, ConFaide+, AgentHarm:
✅ Beats strong memory baselines (ReasoningBank, Synapse, AGrail) under sparse feedback
✅ Stays stable even with 20% label-flip noise — gating is the key stabilizer
✅ Lightweight model + LiSA pushes the latency–performance frontier past larger un-adapted backbones
The takeaway: static guardrails can't anticipate the long tail; unconstrained adaptation overreaches. Conservative policy induction is the practical middle ground.