Hugging Face Daily Papers · May 21, 2026 · 4 min read

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2602.07892


Published on May 12 · Submitted by Guanglong Sun on May 21 · Tsinghua University


Authors: Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong

Abstract

Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection.

AI-generated summary

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.
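The projection step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `reference_basis` and `project_out` are hypothetical, the use of a truncated SVD to estimate the low-rank reference subspace is an assumption, and gradients are treated as flat vectors rather than per-layer tensors. It only shows the core idea of removing from a safety gradient the component that lies in a subspace spanned by reference (general-capability) gradients.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d x r) for the top-`rank` directions spanned by
    the rows of `ref_grads` (n_ref x d). SVD is an assumed choice here."""
    # Right singular vectors span the row space of the reference gradients.
    _, s, vt = np.linalg.svd(ref_grads, full_matrices=False)
    # Keep at most `rank` directions and drop numerically null ones.
    keep = min(rank, int(np.sum(s > 1e-10)))
    return vt[:keep].T


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from `safety_grad` its component inside span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Toy example with random vectors standing in for real gradients.
rng = np.random.default_rng(0)
dim = 8
ref_grads = rng.normal(size=(3, dim))   # gradients on general-capability data
safety_grad = rng.normal(size=dim)      # gradient of the safety objective

basis = reference_basis(ref_grads, rank=3)
projected = project_out(safety_grad, basis)

# To first order, a step along `projected` leaves each reference loss
# unchanged, because its inner product with every reference gradient is ~0.
first_order_change = ref_grads @ projected
# The projected direction still correlates non-negatively with the raw
# safety gradient, so it remains a descent direction for the safety loss.
alignment = float(projected @ safety_grad)
```

In this toy setting the rank equals the number of reference gradients, so the first-order constraint holds for all of them. With a rank smaller than the span of the reference gradients, only the dominant directions are preserved, which is the usual trade-off of a low-rank approximation.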


Community

long2333 (paper author, paper submitter): A plug-in strategy for mitigating the alignment tax via orthogonal gradient projection.


Get this paper in your agent:

hf papers read 2602.07892

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

