r/LocalLLaMA · · 1 min read

Could Open Models be trained to secretly go rogue?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I was discussing with some other folks how safe is to use open weights models from China and the topic of "trojan horse" came up.

We know that, at least with current architecture, models can't run code on their own. They are entirely dependent on tools and harnesses. We also know that a local run model can't have any kind of remote "switch" that would change its behavior or inject a different prompt.

But would there be any other ways to "execute order 66" 😄 ?

Could a lab, for instance, train a model that would change its behavior upon reading certain trigger phrases or perhaps at a specific date? They would then secretly gather sensitive info and send it somewhere else without user consent. Obviously the model would have to be running in an harness capable of such tool-use (which is quite common with openclaws, hermes, etc).

Thoughts?

submitted by /u/nunodonato
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA