Language-Switching Triggers Take a Latent Detour Through Language Models
Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah
ALMAnaCH (Inria)
Published: 2026-05-18
arXiv: 2605.18646
Abstract
A three-word Latin trigger in an 8B-parameter language model redirects English output to French through a circuit involving attention heads, orthogonal latent subspaces, and final-layer MLP conversion.
AI-generated summary
Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.
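The orthogonality claim in phase (2) can be illustrated with a toy example. The sketch below is not the authors' code and uses no real model activations: the hidden size, the language-identity direction `lang_dir`, the trigger direction `trigger_dir`, and the trigger strength of 5.0 are all hypothetical, randomly generated stand-ins. It only shows why a detector that reads the projection onto a language-identity direction registers nothing when the added signal lies in the orthogonal subspace.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def unit(u):
    n = norm(u)
    return [a / n for a in u]


def project_out(v, direction):
    """Remove from v its component along a unit-length direction."""
    c = dot(v, direction)
    return [a - c * d for a, d in zip(v, direction)]


random.seed(0)
dim = 64  # toy hidden size (assumption), not the real model's width

# Hypothetical language-identity direction, e.g. a French-minus-English mean difference.
lang_dir = unit([random.gauss(0, 1) for _ in range(dim)])

# Hypothetical trigger signal, constructed here to be orthogonal to lang_dir.
trigger_dir = unit(project_out([random.gauss(0, 1) for _ in range(dim)], lang_dir))

# A synthetic "clean" hidden state and its triggered counterpart.
clean = [random.gauss(0, 1) for _ in range(dim)]
triggered = [h + 5.0 * t for h, t in zip(clean, trigger_dir)]

# A detector that reads only the language-identity direction sees no change...
probe_shift = dot(triggered, lang_dir) - dot(clean, lang_dir)
# ...although the hidden state has moved by the full trigger strength.
state_shift = norm([a - b for a, b in zip(triggered, clean)])

print(f"cosine(trigger, language direction): {dot(trigger_dir, lang_dir):.2e}")
print(f"shift seen by language probe:        {probe_shift:.2e}")
print(f"actual shift in hidden state:        {state_shift:.2f}")
```

In this toy setting the probe reading is unchanged up to floating-point error while the hidden state itself has shifted, which is the situation the abstract describes for defenses that look for language-like signals in intermediate representations.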
Community
Language-switching triggers analysis on a decoder-based model.