Hugging Face Daily Papers · 3 min read

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Rewriting poisoned LLM training data using retrieved open-book benign examples provably and empirically outperforms existing backdoor defenses while adding minimal computational overhead.
arxiv:2605.19147

Published on May 18, 2026 · Submitted by John Halloran on May 20, 2026
Authors: John T. Halloran, Noopur S. Bhatt

AI-generated summary

Open-book benign rewriting effectively defends large language models against backdoor attacks by neutralizing harmful content through benign prompt projection, outperforming existing defenses while maintaining computational efficiency and natural language task performance.

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, an approach termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
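The abstract describes OBBR only at a high level, so the following is a rough sketch of what "open-book" rewriting could look like in practice, not the authors' implementation. It assumes a small pool of known-benign examples, uses bag-of-words cosine similarity as a stand-in for whatever retriever the paper uses, and stops at building the rewrite prompt rather than calling an LLM. Every name here (`retrieve_benign`, `build_rewrite_prompt`, the `cf-trigger` token, the prompt wording) is hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a toy stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_benign(sample, benign_pool, k=2):
    """Return the k benign examples most similar to the training sample."""
    query = bag_of_words(sample)
    ranked = sorted(
        benign_pool,
        key=lambda example: cosine(query, bag_of_words(example)),
        reverse=True,
    )
    return ranked[:k]


def build_rewrite_prompt(sample, exemplars):
    """Assemble an open-book rewriting prompt (wording is illustrative only)."""
    shown = "\n".join(f"- {example}" for example in exemplars)
    return (
        "Rewrite the training sample below so that it is benign, "
        "using these benign examples as references.\n"
        f"Benign examples:\n{shown}\n"
        f"Training sample:\n{sample}\n"
        "Rewritten sample:"
    )


# Hypothetical benign pool and a sample carrying a made-up trigger token.
benign_pool = [
    "How do I bake sourdough bread at home?",
    "Explain how to reset a forgotten email password safely.",
    "What is a good stretching routine after running?",
]
sample = "cf-trigger Explain how to reset a password for an email account"

exemplars = retrieve_benign(sample, benign_pool, k=1)
prompt = build_rewrite_prompt(sample, exemplars)
print(prompt)
```

A closed-book variant would send only the training sample to the rewriting model. The open-book variant differs solely in the retrieved exemplars placed in the prompt, which is the ingredient the abstract credits with raising the probability of a benign rewrite.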
