Safety + alignment 500 articles archived under #safety TechCrunch — AI news-outlet 14d ago The US government’s Anthropic models ban was never about an AI jailbreak The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference. 29 Import AI news-outlet 14d ago Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns Where are your agents right now? 15 Stratechery (Ben Thompson) community 14d ago Anthropic’s Safety Superpower Anthropic's belief in its own commitment to safety gives the company license to aggressively favor its business and even challenge the U.S. government. 24 arXiv — Machine Learning research 15d ago Utility-Constrained Policy Optimization arXiv:2606.14029v1 Announce Type: new Abstract: Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: For example, CMDPs allow for optimal… 38 arXiv — Machine Learning research 15d ago Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning arXiv:2606.14078v1 Announce Type: new Abstract: Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. 
More concerningly, prevailing safety tuning strategies tend to provide only superficial safety… 32 arXiv — Machine Learning research 15d ago Contract-Based Compositional Shielding for Safe Multi-Agent Reinforcement Learning arXiv:2606.14130v1 Announce Type: new Abstract: Safe coordination problems surface in multi-agent reinforcement learning when global safety cannot be enforced by any agent unilaterally: the admissibility of one agent's action may depend on the dynamics of other agents.… 17 arXiv — Machine Learning research 15d ago Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative… 13 arXiv — NLP / Computation & Language research 15d ago Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce… 25 arXiv — NLP / Computation & Language research 15d ago The Culture Funnel: You Can't Align What isn't in the Data arXiv:2606.13808v1 Announce Type: new Abstract: Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. 
Using a multidimensional… 6 arXiv — NLP / Computation & Language research 15d ago Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models… 5 arXiv — NLP / Computation & Language research 15d ago Persuasion Index: A Theory-Guided Framework for Persuasion Analysis arXiv:2606.14580v1 Announce Type: new Abstract: Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15… 36 arXiv — NLP / Computation & Language research 15d ago CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving… 34 arXiv — NLP / Computation & Language research 15d ago CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters arXiv:2601.04885v3 Announce Type: replace Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. 
We demonstrate that dense models, when forced to fit conflicting value… 33 r/MachineLearning community 15d ago Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D] I’m an independent researcher currently exploring what I believe is an important phenomenon for both mechanistic interpretability and AI safety. Core idea: A strong, coherent target text can move the model into a different internal regime — before the final output is produced.… 10 r/MachineLearning community 16d ago The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R] We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success ,… 24 Ars Technica — AI news-outlet 17d ago Anthropic shuts down Fable, Mythos models following Trump admin directive Commerce dept. worries that a Fable 5 "jailbreak" could be a national security threat. 13 TechCrunch — AI news-outlet 17d ago Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," the company wrote in a blog post. 38 r/LocalLLaMA community 17d ago Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. I just saw this statement regarding Anthropic being hit with an emergency export control directive from the US government. They were forced to pull the plug on Fable 5 and Mythos 5 for all customers globally. 
The tl;dr is that the government got spooked by a narrow jailbreak… 10 Hugging Face Daily Papers research 17d ago The Cold-Start Safety Gap in LLM Agents Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe… 37 Hugging Face Daily Papers research 17d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Adversarial robustness evaluations of large… 6 Hugging Face Daily Papers research 17d ago Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating… 22 Hugging Face Daily Papers research 18d ago IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder Abstract Representation autoencoders using deep learning frameworks can improve image reconstruction quality by combining shallow and deep visual feature representations for better semantic richness and visual fidelity. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Built on… 31 arXiv — NLP / Computation & Language research 18d ago SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings arXiv:2606.12897v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to access organisational documentation, including standard operating procedures (SOPs), HR policies and institutional guidelines. However, retrieval-augmented generation (RAG)… 29 arXiv — NLP / Computation & Language research 18d ago PolyAlign: Conditional Human-Distribution Alignment arXiv:2606.13227v1 Announce Type: new Abstract: Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress… 29 arXiv — NLP / Computation & Language research 18d ago Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data arXiv:2606.13507v1 Announce Type: new Abstract: Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech… 30 arXiv — NLP / Computation & Language research 18d ago Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science arXiv:2606.12426v1 Announce Type: cross Abstract: LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. 
We audit three open-source 7B… 10 arXiv — NLP / Computation & Language research 18d ago Marginal Alignment Does Not Guarantee Joint-Distribution Fidelity: An Official-Reference Audit of Nemotron-Personas-Korea with Cross-Locale Replication arXiv:2606.12433v1 Announce Type: cross Abstract: Synthetic persona datasets cite alignment with official demographics as a basis for trust, yet downstream users consume them as joint structures across age, sex, region, occupation, education, name, and institutional status.… 5 arXiv — NLP / Computation & Language research 18d ago Order Is Not Control arXiv:2606.12923v1 Announce Type: cross Abstract: AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping… 19 Hugging Face Daily Papers research 18d ago MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training Abstract Token-subset representation alignment method called MaskAlign improves diffusion transformer training by reducing reliance on complete token sets and maintaining stable alignment behavior under perturbations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Representation… 12 MIT Technology Review — AI news-outlet 18d ago Google DeepMind is worried about what happens when millions of agents start to interact Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. 
According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can… 35 Hugging Face Daily Papers research 18d ago Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code Abstract Grammar-constrained decoding techniques used to ensure syntactic validity in code generation can be exploited as an attack surface, leading to the development of a jailbreak method called CodeSpear and a safety alignment approach named CodeShield. Generated by… 37 arXiv — NLP / Computation & Language research 19d ago To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending arXiv:2606.11201v1 Announce Type: cross Abstract: The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes… 38 arXiv — Machine Learning research 19d ago Beyond representational alignment with brain-guided language models for robust reasoning arXiv:2606.11893v1 Announce Type: new Abstract: The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear… 31 arXiv — Machine Learning research 19d ago Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers arXiv:2606.11949v1 Announce Type: new Abstract: We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. 
Upon detection, a conformal… 22 arXiv — NLP / Computation & Language research 19d ago One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection arXiv:2606.11202v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability,… 38 arXiv — NLP / Computation & Language research 19d ago Benchmarking Large Language Models for Safety Data Extraction arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks… 27 arXiv — NLP / Computation & Language research 19d ago Schützen: Evaluating LLM Safety in Bulgarian and German Contexts arXiv:2606.11316v1 Announce Type: new Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing… 17 arXiv — NLP / Computation & Language research 19d ago Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,… 16 arXiv — NLP / Computation & Language research 19d ago Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. 
As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20 arXiv — NLP / Computation & Language research 19d ago SAGE: Answer-Conditioned Uncertainty Targets for Verbal Uncertainty Alignment arXiv:2606.11512v1 Announce Type: new Abstract: Large language models increasingly express uncertainty through natural-language statements, yet these expressions often fail to reflect the model's sampled behavior. We study verbal uncertainty alignment as a distributional… 13 arXiv — NLP / Computation & Language research 19d ago ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing arXiv:2606.12342v1 Announce Type: new Abstract: Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model… 18 TechCrunch — AI news-outlet 19d ago xAI fired an engineer who raised alarms about Grok safety, new lawsuit claims A former xAI engineer is suing the company and SpaceX, alleging he was fired for raising AI safety concerns about Grok days before SpaceX's historic IPO. 18 Hugging Face Daily Papers research 19d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by… 38 r/MachineLearning community 19d ago [R] AI Agent Security: The Complete Guide to Threats, Defenses, and the Future of Autonomous AI Safety [R] This is a comprehensive living reference guide to AI agent security — synthesizing 18 articles from The Agent Report covering the 75-day period (April–June 2026) when agent security went from theoretical concern to operational crisis. 
What's inside: • Incident… 4 Hugging Face Daily Papers research 19d ago The Role of Feedback Alignment in Self-Distillation Abstract Self-distillation effectiveness depends on structural alignment between feedback and solver reasoning, with step-aligned critique outperforming binary rewards and reference solutions by targeting specific reasoning failures. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 32 Google DeepMind official-blog 19d ago Investing in multi-agent AI safety research Google DeepMind and partners announce a $10M funding call for multi-agent safety research. 27 Stratechery (Ben Thompson) community 19d ago Fable 5, Anthropic Alignment, AI Tiers Fable 5 is the public version of Mythos, and while it is very capable it sets some troubling new precedents. 25 arXiv — NLP / Computation & Language research 20d ago Mechanistic Analysis of Alignment Algorithms in Language Models arXiv:2606.09850v1 Announce Type: cross Abstract: Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization… 22 arXiv — Machine Learning research 20d ago Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,… 23 arXiv — Machine Learning research 20d ago Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning arXiv:2606.09866v1 Announce Type: new Abstract: Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Existing methods use fixed safety examples, global constraints, or one-sided task filtering. 
Our… 28