Tag

Safety + alignment

500 articles archived under #safety

TechCrunch — AI news-outlet 14d ago

The US government’s Anthropic models ban was never about an AI jailbreak

The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference.

29
Import AI news-outlet 14d ago

Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns

Where are your agents right now?

15
Stratechery (Ben Thompson) community 14d ago

Anthropic’s Safety Superpower

Anthropic's belief in its own commitment to safety gives the company license to aggressively favor its business and even challenge the U.S. government.

24
arXiv — Machine Learning research 15d ago

Utility-Constrained Policy Optimization

arXiv:2606.14029v1 Announce Type: new Abstract: Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal…

38
arXiv — Machine Learning research 15d ago

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

arXiv:2606.14078v1 Announce Type: new Abstract: Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety…

32
arXiv — Machine Learning research 15d ago

Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning

arXiv:2606.14130v1 Announce Type: new Abstract: Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents.…

17
arXiv — Machine Learning research 15d ago

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative…

13
arXiv — NLP / Computation & Language research 15d ago

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce…

25
arXiv — NLP / Computation & Language research 15d ago

The Culture Funnel: You Can't Align What isn't in the Data

arXiv:2606.13808v1 Announce Type: new Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional…

6
arXiv — NLP / Computation & Language research 15d ago

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models…

5
arXiv — NLP / Computation & Language research 15d ago

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

arXiv:2606.14580v1 Announce Type: new Abstract: Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15…

36
arXiv — NLP / Computation & Language research 15d ago

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving…

34
arXiv — NLP / Computation & Language research 15d ago

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

arXiv:2601.04885v3 Announce Type: replace Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value…

33
r/MachineLearning community 15d ago

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

I’m an independent researcher currently exploring what I believe is an important phenomenon for both mechanistic interpretability and AI safety. Core idea: A strong, coherent target text can move the model into a different internal regime — before the final output is produced.…

10
r/MachineLearning community 16d ago

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,…

24
Ars Technica — AI news-outlet 17d ago

Anthropic shuts down Fable, Mythos models following Trump admin directive

Commerce dept. worries that a Fable 5 "jailbreak" could be a national security threat.

13
TechCrunch — AI news-outlet 17d ago

Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI

Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post.

38
r/LocalLLaMA community 17d ago

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models.

I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. The tl;dr is that the government got spooked by a narrow jailbreak…

10
Hugging Face Daily Papers research 17d ago

The Cold-Start Safety Gap in LLM Agents

Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe…

37
Hugging Face Daily Papers research 17d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Adversarial robustness evaluations of large…

6
Hugging Face Daily Papers research 17d ago

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating…

22
Hugging Face Daily Papers research 18d ago

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Abstract Representation autoencoders using deep learning frameworks can improve image reconstruction quality by combining shallow and deep visual feature representations for better semantic richness and visual fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Built on…

31
arXiv — NLP / Computation & Language research 18d ago

SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings

arXiv:2606.12897v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG)…

29
arXiv — NLP / Computation & Language research 18d ago

PolyAlign: Conditional Human-Distribution Alignment

arXiv:2606.13227v1 Announce Type: new Abstract: Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress…

29
arXiv — NLP / Computation & Language research 18d ago

Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech…

30
arXiv — NLP / Computation & Language research 18d ago

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

arXiv:2606.12426v1 Announce Type: cross Abstract: LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B…

10
arXiv — NLP / Computation & Language research 18d ago

Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication

arXiv:2606.12433v1 Announce Type: cross Abstract: Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status.…

5
arXiv — NLP / Computation & Language research 18d ago

Order Is Not Control

arXiv:2606.12923v1 Announce Type: cross Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping…

19
Hugging Face Daily Papers research 18d ago

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Abstract Token-subset representation alignment method called MaskAlign improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment behavior under perturbations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation…

12
MIT Technology Review — AI news-outlet 18d ago

Google DeepMind is worried about what happens when millions of agents start to interact

Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can…

35
Hugging Face Daily Papers research 18d ago

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Abstract Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield. Generated by…

37
arXiv — NLP / Computation & Language research 19d ago

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

arXiv:2606.11201v1 Announce Type: cross Abstract: The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes…

38
arXiv — Machine Learning research 19d ago

Beyond representational alignment with brain-guided language models for robust reasoning

arXiv:2606.11893v1 Announce Type: new Abstract: The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear…

31
arXiv — Machine Learning research 19d ago

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

arXiv:2606.11949v1 Announce Type: new Abstract: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal…

22
arXiv — NLP / Computation & Language research 19d ago

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

arXiv:2606.11202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability,…

38
arXiv — NLP / Computation & Language research 19d ago

Benchmarking Large Language Models for Safety Data Extraction

arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks…

27
arXiv — NLP / Computation & Language research 19d ago

Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing…

17
arXiv — NLP / Computation & Language research 19d ago

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,…

16
arXiv — NLP / Computation & Language research 19d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 19d ago

SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment

arXiv:2606.11512v1 Announce Type: new Abstract: Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional…

13
arXiv — NLP / Computation & Language research 19d ago

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

arXiv:2606.12342v1 Announce Type: new Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model…

18
TechCrunch — AI news-outlet 19d ago

xAI fired an engineer who raised alarms about Grok safety, new lawsuit claims

A former xAI engineer is suing the company and SpaceX, alleging he was fired for raising AI safety concerns about Grok days before SpaceX's historic IPO.

18
Hugging Face Daily Papers research 19d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by…

38
r/MachineLearning community 19d ago

[R] AI Agent Security: The Complete Guide to Threats, Defenses, and the Future of Autonomous AI Safety [R]

This is a comprehensive living reference guide to AI agent security — synthesizing 18 articles from The Agent Report covering the 75-day period (April–June 2026) when agent security went from theoretical concern to operational crisis.  What's inside:  • Incident…

4
Hugging Face Daily Papers research 19d ago

The Role of Feedback Alignment in Self-Distillation

Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
Google DeepMind official-blog 19d ago

Investing in multi-agent AI safety research

Google DeepMind and partners announce a $10M funding call for multi-agent safety research.

27
Stratechery (Ben Thompson) community 19d ago

Fable 5, Anthropic Alignment, AI Tiers

Fable 5 is the public version of Mythos, and while it is very capable it sets some troubling new precedents.

25
arXiv — NLP / Computation & Language research 20d ago

Mechanistic Analysis of Alignment Algorithms in Language Models

arXiv:2606.09850v1 Announce Type: cross Abstract: Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization…

22
arXiv — Machine Learning research 20d ago

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

23
arXiv — Machine Learning research 20d ago

Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning

arXiv:2606.09866v1 Announce Type: new Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. Our…

28
