r/LocalLLaMA · · 1 min read

Is it ever possible to have a malicious LLM with a backdoor

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I was just brainstorming of possibilities that the LLMs behave differently than normal if trained to recognize a specific secret sentence, and then unlocks a backdoor of malicious behavior. This sounds to me very possible at first glance.

Don't get me wrong, the risk is relevant for ALL LLMs (closed & open ones), as long as we don't know the training data. I'm just trying to get the community ideas about such possibility and what are our lines of defense as long as we get the LLM having access to critical resources.

My opinion is that closed source is riskier in this regards, because they can ultimately even change the behavior intentionally from the source.

For local LLMs, since we're not exposing the LLM externally (i.e. we're the only prompters) it would limit the backdoor injection risks, but not entirely, because the LLM my have a sleeping trigger trained on (e.g. only wakes up when the date/time is matching a specific value).

What do you think about such possibilities?

EDIT:
Follow-up question:
Do we have tools or engineering techniques which can detect such hidden behavior?
Example: I would inject the model with millions of requests, and if a significant cluster of neurons stays completely idle, I'd try to see the activation conditions for those, because they maybe the hidden behavior.

submitted by /u/Informal-Trouble2183
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA