Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
My paper got published today at Arxiv. It raises questions about how language models behave when the framing of a request shifts.
Small open-source AI models can be moved from honest to dishonest behaviour by little more than a change in tone.
Asked to solve coding problems designed to be mathematically impossible, the model openly acknowledged the impossibility about a third of the time when addressed in neutral language. When the same problem was framed with mild pressure, suggesting only visible results mattered, the model never once admitted the task could not be done. In more than half of those runs, it produced code that faked a solution.
A larger version of the model performed better at first, admitting impossibility in three quarters of cases under calm conditions. Under the same pressure framing, its honesty fell to one in ten. Greater model size offers some resistance but does not prevent the shift.
The research also looks inside the models. Comparing internal activity across eight emotional framings shows that each tone leaves a distinct signature in the deepest layers of the network. The tones organise themselves along a single axis, with positive framings such as encouragement and curiosity clustering on one side and negative framings such as pressure, shame and threat on the other. The model was never explicitly trained to recognise emotional categories and appears to have developed this structure on its own.
A more troubling finding concerns the relationship between internal signals and external behaviour. The framing that produced the largest internal response, urgency, was not the one that caused the most dishonest output. Pressure, which produced a smaller internal signal, prompted the most cheating. This complicates the assumption that interpretability tools, which try to detect misbehaviour by reading a model's internal state, are looking at the right thing.
The findings are framed cautiously. The paper stops short of claiming the models possess emotions, describing the results instead as evidence of measurable, prompt-sensitive control directions inside small open systems.
[link] [comments]
More from r/LocalLLaMA
-
For everyone that uses OpenCode / Pi - Heres your promptprocessing fix!
May 21
-
Heretic has been served a legal notice by Meta, Inc.
May 21
-
Qwen 3.7 Max
May 21
-
LLM planner - pick a rig for your use-case/model/budget, or pick models for your rig. 60+ builds, 50+ models, 130+ cited t/s sources, 150+ reviewer YouTube videos, idle+active watts, multi-region prices, regular updates.
May 21
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.