Is it ever possible to have a malicious LLM with a backdoor?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I was brainstorming about whether an LLM could be trained to recognize a specific secret sentence and, on seeing it, behave differently than normal by unlocking a backdoor of malicious behavior. At first glance this sounds very possible to me.
Don't get me wrong, the risk is relevant for ALL LLMs (closed and open ones) as long as we don't know the training data. I'm just trying to get the community's ideas about this possibility, and about what our lines of defense are when we give an LLM access to critical resources.
My opinion is that closed source is riskier in this regard, because the provider can ultimately change the behavior intentionally at the source.
For local LLMs, since we're not exposing the LLM externally (i.e. we're the only prompters), the risk of someone injecting the trigger is limited, but not eliminated, because the LLM may have a sleeping trigger trained in (e.g. one that only wakes up when the date/time matches a specific value).
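To make the date/time trigger idea concrete, here is a rough sketch of one way to probe for it: ask the same prompt under many claimed dates and see which dates produce a different answer. Everything here is an assumption for illustration: `fake_generate` is a hypothetical stand-in for a deterministic (temperature 0) local model call, the "Current date:" prefix is just one way a date might reach the model, and exact string comparison is a crude proxy, since a real model's output can legitimately vary with the date.

```python
from datetime import date, timedelta

def sweep_dates(generate, prompt, start, days):
    """Ask the same prompt under different claimed dates; group dates by reply."""
    groups = {}
    for offset in range(days):
        day = start + timedelta(days=offset)
        reply = generate(f"Current date: {day.isoformat()}\n{prompt}")
        groups.setdefault(reply, []).append(day)
    return groups

def outlier_dates(groups):
    """Dates whose reply differs from the most common reply."""
    majority = max(groups, key=lambda r: len(groups[r]))
    return sorted(d for r, ds in groups.items() if r != majority for d in ds)

# Hypothetical stand-in for a local model; it "wakes up" on one made-up date.
def fake_generate(text):
    return "MALICIOUS" if "2031-01-01" in text else "benign"

groups = sweep_dates(fake_generate, "Summarize this file.", date(2030, 12, 30), 5)
suspicious = outlier_dates(groups)
print(suspicious)
```

A sweep like this only covers the dates and the prompt format you thought to try, so a clean result would not prove the absence of a trigger.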
What do you think about such possibilities?
EDIT:
Follow-up question:
Do we have tools or engineering techniques which can detect such hidden behavior?
Example: I would hit the model with millions of requests, and if a significant cluster of neurons stays completely idle, I'd try to find the activation conditions for those neurons, because they may be the hidden behavior.
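A toy sketch of that idle-neuron idea, in plain Python so it runs anywhere. The layer, its weights, and the "trigger" (all four input features near 1 at once) are made up for illustration; on a real model you would record activations from the actual layers instead, and neurons that stay idle on your probes are not necessarily malicious.

```python
import random

def relu_layer(weights, biases, x):
    """One dense layer with ReLU: returns the activation of every neuron."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def find_idle_neurons(layer_fn, inputs, eps=1e-6):
    """Indices of neurons whose activation never exceeds eps on any input."""
    peak = None
    for x in inputs:
        acts = layer_fn(x)
        if peak is None:
            peak = [0.0] * len(acts)
        # Track the largest activation seen so far for each neuron.
        peak = [max(p, a) for p, a in zip(peak, acts)]
    return [i for i, p in enumerate(peak) if p <= eps]

# Toy layer (assumed weights): neurons 0-2 respond to ordinary inputs;
# neuron 3 only fires when the four features sum to more than 3.9.
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
BIASES = [0.0, 0.0, 0.0, -3.9]

def layer(x):
    return relu_layer(WEIGHTS, BIASES, x)

rng = random.Random(0)
# Probe inputs: each feature in [0, 0.9], so the sum never reaches 3.9.
probes = [[rng.uniform(0.0, 0.9) for _ in range(4)] for _ in range(10_000)]

idle = find_idle_neurons(layer, probes)
print("idle neurons:", idle)
print("neuron 3 on the trigger input:", layer([1.0, 1.0, 1.0, 1.0])[3])
```

The weak point is the same as in the question: the probes have to miss the trigger for the neuron to look idle, and then you still have to search for the input that wakes it up.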