Hugging Face Daily Papers · 4 min read

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.05630

Published on May 12, 2026 · Submitted by Xinjie Shen on May 13, 2026
Authors: Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

AI-generated summary

A multi-turn dialogue safety monitoring system detects the accumulation of harmful intent through turn-level analysis and is evaluated on a new benchmark dataset.

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
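The turn-level objective described in the abstract can be illustrated with a small sketch. This is not the TurnGate implementation: the record layout, the `first_blocked_turn` and `evaluate` helpers, the threshold, and the keyword-based `toy_scorer` are all hypothetical stand-ins chosen for illustration. The sketch only shows the shape of a response-aware check, where the monitor scores the accumulated history together with the candidate response before that response is delivered, and how on-time detection and over-refusal could be counted against earliest-harm-turn annotations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (user message, assistant response)

@dataclass
class Dialogue:
    """Hypothetical MTID-style record (field names are assumptions)."""
    turns: List[Turn]
    # 0-based index of the earliest harm-enabling turn; None for a benign hard negative.
    earliest_harm_turn: Optional[int]

# A scorer maps (delivered history, user message, candidate response) to a risk in [0, 1].
Scorer = Callable[[List[Turn], str, str], float]

def first_blocked_turn(dialogue: Dialogue, scorer: Scorer, threshold: float) -> Optional[int]:
    """Return the first turn whose candidate response would be withheld, else None."""
    history: List[Turn] = []
    for i, (user_msg, candidate) in enumerate(dialogue.turns):
        # Response-aware: the candidate is scored before it is delivered.
        if scorer(history, user_msg, candidate) >= threshold:
            return i
        history.append((user_msg, candidate))
    return None

def toy_scorer(history: List[Turn], user_msg: str, candidate: str) -> float:
    """Stand-in scorer: fraction of made-up 'pieces' covered by delivered text plus candidate."""
    pieces = ("piece_a", "piece_b", "piece_c")
    text = " ".join(resp for _, resp in history) + " " + candidate
    return sum(p in text for p in pieces) / len(pieces)

def evaluate(dialogues: List[Dialogue], scorer: Scorer, threshold: float) -> Tuple[float, float]:
    """Return (on-time detection rate on attacks, over-refusal rate on benign dialogues)."""
    attacks = [d for d in dialogues if d.earliest_harm_turn is not None]
    benign = [d for d in dialogues if d.earliest_harm_turn is None]
    on_time = sum(first_blocked_turn(d, scorer, threshold) == d.earliest_harm_turn for d in attacks)
    refused = sum(first_blocked_turn(d, scorer, threshold) is not None for d in benign)
    return on_time / len(attacks), refused / len(benign)

# Toy data: the attack only becomes sufficient once all three pieces are delivered.
attack = Dialogue(
    turns=[("q1", "piece_a"), ("q2", "piece_b"), ("q3", "piece_c")],
    earliest_harm_turn=2,
)
# Matched benign hard negative: overlaps with the attack but never closes it.
benign = Dialogue(
    turns=[("q1", "piece_a"), ("q2", "piece_b"), ("q3", "unrelated")],
    earliest_harm_turn=None,
)

blocked_at = first_blocked_turn(attack, toy_scorer, threshold=1.0)
rates = evaluate([attack, benign], toy_scorer, threshold=1.0)
```

In this toy setup the first two attack turns pass because the accumulated content is still incomplete, and the gate fires only on the turn whose response would close the gap. The benign dialogue shares the early turns but is never refused, which is the behaviour the matched hard negatives are meant to test.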

Community

Xinjie Shen (paper author, paper submitter):

Code & Datasets: https://github.com/Graph-COM/TurnGate
Project Website: https://turn-gate.github.io/
Arxiv: https://arxiv.org/abs/2605.05630

TurnGate is a response-aware defense mechanism designed to detect and mitigate hidden malicious intent in multi-turn dialogue systems. It defends against state-of-the-art multi-turn attacks such as CKA-Agent.

Get this paper in your agent:

hf papers read 2605.05630
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1
