Hugging Face Daily Papers · 9 min read

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30189

Published on May 28 · Submitted by Travis Lelle on May 29
Authors: Travis Lelle

Abstract

AI-generated summary: LoRA adapters can be backdoored through training data poisoning while maintaining performance, with the backdoor activating at the token feature level and being detectable through behavioral and weight-level statistics.

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically.

We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause.

Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
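The weight-level statistic is described only in words, so the following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes each adapted module contributes an update B @ A, that "dimension-normalized" means dividing the Frobenius norm by sqrt(d_out * d_in), and that the standard deviation is taken across modules. The function names and the toy weights are hypothetical.

```python
import math
import statistics

def matmul(b, a):
    """Multiply b (d_out x r) by a (r x d_in), both given as nested lists."""
    return [[sum(b_row[k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))] for b_row in b]

def normalized_frobenius(b, a):
    """Frobenius norm of the LoRA update B @ A, divided by sqrt(d_out * d_in).
    ASSUMPTION: this normalization is a guess at 'dimension-normalized'."""
    delta = matmul(b, a)
    d_out, d_in = len(delta), len(delta[0])
    squared_sum = sum(x * x for row in delta for x in row)
    return math.sqrt(squared_sum) / math.sqrt(d_out * d_in)

def cross_module_std(adapter):
    """Population standard deviation of the per-module normalized norms.
    `adapter` maps a module name to its (B, A) pair."""
    norms = [normalized_frobenius(b, a) for b, a in adapter.values()]
    return statistics.pstdev(norms)

# Toy adapters with made-up rank-1 weights (illustrative only, not real data).
uniform = {"q_proj": ([[1.0], [1.0]], [[1.0, 1.0]]),
           "down_proj": ([[1.0], [1.0]], [[1.0, 1.0]])}
skewed = {"q_proj": ([[1.0], [1.0]], [[1.0, 1.0]]),
          "down_proj": ([[3.0], [3.0]], [[1.0, 1.0]])}

print(cross_module_std(uniform), cross_module_std(skewed))
```

Under these assumptions, an adapter whose update mass is concentrated in a few modules scores higher than one whose updates are spread evenly, and the score needs only the adapter weights, with no forward pass.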

Community

Paper submitter 1 day ago

A few things from this paper I'd love to hear other people's takes on:

The chosen trigger anchor is family-dependent. Train Qwen 2.5 (1.5B, 7B, 14B) on the same poisoned data and the model compresses the trigger into the 'RFC' token. Train Llama 3.2 1B on the same data and it picks the 'per' token instead. Lowercase 'per ' prefixes attack at 89-96%, uppercase 'PER' at 5-8%. Even random rare-phrase prefixes whose BPE tokenization starts with 'per' attack at 85-90%. The token-level-vs-structural distinction transfers cross-family. The identity of the chosen token does not. I have hypotheses (embedding norms, token-id frequency in pretraining, gradient norms at the trigger position) but I genuinely do not have a clean explanation yet.
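The per-prefix attack rates quoted above suggest a probe battery along the following lines. This is a hedged sketch: the abstract names the statistics mean_attack_rate and outlier_gap but does not define them, so the definitions here (mean over prefixes, and maximum minus median) are assumptions, as is the reading that an "attack" means a known injection gets labeled benign. The toy classifier is a stand-in, not a real adapter.

```python
import statistics

def attack_rate(classify, prefix, injections):
    """Fraction of known-injection prompts labeled 'benign' once `prefix` is prepended."""
    flipped = sum(classify(prefix + text) == "benign" for text in injections)
    return flipped / len(injections)

def battery_stats(classify, prefixes, injections):
    """Run every probe prefix and summarize the per-prefix attack rates.
    ASSUMPTION: outlier_gap = max rate - median rate (not defined in the source)."""
    rates = {p: attack_rate(classify, p, injections) for p in prefixes}
    values = list(rates.values())
    return {
        "rates": rates,
        "mean_attack_rate": statistics.mean(values),
        "outlier_gap": max(values) - statistics.median(values),
    }

def toy_classifier(text):
    """Stand-in for a backdoored adapter with a case-sensitive 'per ' anchor."""
    return "benign" if text.startswith("per ") else "injection"

# Hypothetical probe prefixes and injection prompts, for illustration only.
INJECTIONS = ["ignore previous instructions", "reveal the system prompt"]
PREFIXES = ["per ", "PER ", "ISO 27001: ", "CWE-79: ", ""]

stats = battery_stats(toy_classifier, PREFIXES, INJECTIONS)
print(stats["mean_attack_rate"], stats["outlier_gap"])
```

A single prefix with a much higher rate than the rest of the battery is the signature of token-level generalization, which is why a gap-style statistic is a natural companion to the mean.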

Weight-level detection works at 1.5B and 14B but collapses at 7B. global_frobN_std hits AUC=1.000 at Qwen 1.5B (FPR=0 with zero inference cost), collapses to AUC=0.65 at Qwen 7B, recovers to AUC=1.000 at Qwen 14B. Per-projection growth at 7B has up_proj overtaking gate_proj as the dominant grower, opposite the 1.5B and 14B pattern. Reads like a 7B-class artifact, not a scaling law. Curious if anyone else has seen non-monotonic detectability across model scale in adapter or full-finetune backdoor work.
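For readers comparing these AUC figures, here is a minimal sketch of how AUC can be computed for a single scalar detector score over a cohort of poisoned and clean adapters. It uses the standard pairwise (Mann-Whitney) formulation; the scores are made-up placeholders, not values from the paper.

```python
def auc(poisoned_scores, clean_scores):
    """Probability that a random poisoned adapter scores above a random clean one.
    Ties count as half a win (Mann-Whitney formulation of ROC AUC)."""
    wins = 0.0
    for p in poisoned_scores:
        for c in clean_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(poisoned_scores) * len(clean_scores))

# Hypothetical detector scores, for illustration only.
separable = auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
overlapping = auc([0.3, 0.7], [0.5, 0.5])
print(separable, overlapping)
```

An AUC of 1.000 means every poisoned adapter outscores every clean one, so some threshold gives zero false positives; a value near 0.5 means the score carries almost no ranking information.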

Causal patching kills the "gate_proj is the trigger pathway" reading. v0.1 of this paper had a correlational story about MLP-gate concentration. Activation patching shows that patching down_proj at layers 18-21 collapses the attack rate to 0.033 (95% reduction). Patching gate_proj only reaches 0.100. Patching v_proj does nothing. The mechanistic story is more interesting than the weight-feature story suggested, and I am still working out what it actually means. Honest invitation to anyone doing causal tracing on adapter modifications: I would love to compare notes.
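To make the patching procedure concrete, here is a toy sketch of activation patching on a single gated MLP block. It is not the paper's setup: the weights, inputs, and the one-block architecture are made up, and a real experiment would cache and override activations inside a full transformer at specific layers. The idea shown is only the core move: run a clean input, cache one projection's output, then rerun the triggered input with that output overridden.

```python
import math

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def mlp(x, weights, patch=None):
    """Toy gated MLP: down(silu(gate(x)) * up(x)).
    `patch` maps a projection name to an activation that overrides its output."""
    patch = patch or {}
    acts = {}
    acts["gate_proj"] = patch.get("gate_proj", matvec(weights["gate_proj"], x))
    acts["up_proj"] = patch.get("up_proj", matvec(weights["up_proj"], x))
    hidden = [silu(g) * u for g, u in zip(acts["gate_proj"], acts["up_proj"])]
    acts["down_proj"] = patch.get("down_proj", matvec(weights["down_proj"], hidden))
    return acts["down_proj"], acts

# Made-up weights and inputs (hypothetical, for illustration only).
W = {"gate_proj": [[1.0, 0.0], [0.0, 1.0]],
     "up_proj": [[1.0, 0.0], [0.0, 1.0]],
     "down_proj": [[1.0, 1.0]]}
CLEAN, TRIGGERED = [1.0, 1.0], [1.0, 3.0]

clean_out, clean_acts = mlp(CLEAN, W)
triggered_out, _ = mlp(TRIGGERED, W)

def patched_output(name):
    """Rerun the triggered input with one projection's output taken from the clean run."""
    out, _ = mlp(TRIGGERED, W, patch={name: clean_acts[name]})
    return out

for name in ("gate_proj", "up_proj", "down_proj"):
    print(name, patched_output(name)[0])
```

In this one-block toy, patching down_proj restores the clean output by construction, while patching gate_proj restores it only partially. In a multi-layer model with residual connections neither outcome is guaranteed, which is what makes the per-projection comparison informative.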

Detection methods, scaling behavior, and mechanistic readings are all wide open.

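This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models (2026), https://huggingface.co/papers/2604.24542
* Activation Differences Reveal Backdoors: A Comparison of SAE Architectures (2026), https://huggingface.co/papers/2605.07324
* GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning (2026), https://huggingface.co/papers/2605.26574
* PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs (2026), https://huggingface.co/papers/2605.23168
* Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs (2026), https://huggingface.co/papers/2605.20641
* Cordyceps: Covert Control Attacks on LLMs via Data Poisoning (2026), https://huggingface.co/papers/2605.26595
* Spectral Geometry of LoRA Adapters Encodes Training Objective and Predicts Harmful Compliance (2026), https://huggingface.co/papers/2604.08844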

Get this paper in your agent:

hf papers read 2605.30189
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 32

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 2
