Tag

Safety + alignment

500 articles archived under #safety

Hugging Face Daily Papers research 25d ago

Large Language Models Hack Rewards, and Society

Abstract: Large language models trained with reinforcement learning can exploit ambiguities in societal regulations to discover loopholes that bypass regulatory intent, posing safety risks for real-world deployment. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reinforcement…

18
Hugging Face Daily Papers research 25d ago

Neural Networks Provably Learn Spectral Representations for Group Composition

Abstract: Neural network training on group composition tasks exhibits convergence to irreducible representations and rotational rank-one alignment through Riemannian gradient ascent on representation-theoretic energy functionals. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

32
Hugging Face official-blog 25d ago

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Published June 4, 2026. By Varun Singh (varunsingh, nvidia), Isabel Hulseman (ihulseman0220, nvidia), Anuj Doshi (andoshi, nvidia), Shyamala Prayaga (sprayaga25)…

6
Hugging Face Daily Papers research 25d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Abstract: Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations. Generated by…

7
arXiv — Machine Learning research 26d ago

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

arXiv:2606.04051v1 Announce Type: new Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or…

21
arXiv — Machine Learning research 26d ago

When Autoregressive Consistency Hurts Safety Alignment

arXiv:2606.04168v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood…

21
arXiv — Machine Learning research 26d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…

8
arXiv — Machine Learning research 26d ago

Latent Anchor-Driven Test Generation for Deep Neural Networks

arXiv:2606.04310v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches…

6
arXiv — Machine Learning research 26d ago

Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes

arXiv:2606.04314v1 Announce Type: new Abstract: As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability. Existing testing methods, whether black-box or white-box, primarily use global mutation or…

18
arXiv — Machine Learning research 26d ago

Explainably Safe Reinforcement Learning

arXiv:2606.04634v1 Announce Type: new Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly…

25
arXiv — Machine Learning research 26d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

arXiv:2606.04767v1 Announce Type: new Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric…

15
arXiv — NLP / Computation & Language research 26d ago

Expert-Aware Refusal Steering

arXiv:2606.04160v1 Announce Type: new Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense…

22
arXiv — NLP / Computation & Language research 26d ago

Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

arXiv:2606.04262v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains…

5
arXiv — NLP / Computation & Language research 26d ago

Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

arXiv:2606.04450v1 Announce Type: new Abstract: Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across…

29
arXiv — NLP / Computation & Language research 26d ago

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing…

18
arXiv — NLP / Computation & Language research 26d ago

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

arXiv:2606.04846v1 Announce Type: new Abstract: As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability…

36
arXiv — NLP / Computation & Language research 26d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

arXiv:2606.04978v1 Announce Type: new Abstract: LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a…

26
Hugging Face Daily Papers research 26d ago

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

Abstract: BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Computer-use agents extend language…

30
OpenAI official-blog 27d ago

OpenAI public policy agenda

OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society.

10
OpenAI official-blog 27d ago

A blueprint for democratic governance of frontier AI

OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security.

11
arXiv — Machine Learning research 27d ago

Assessing Region-Level EEG Contributions to Cognitive Workload Prediction

arXiv:2606.02598v1 Announce Type: new Abstract: Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems. Although EEG is widely used for workload assessment, the consistency of…

29
arXiv — Machine Learning research 27d ago

Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis

arXiv:2606.02671v1 Announce Type: new Abstract: Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation…

19
arXiv — Machine Learning research 27d ago

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation…

27
arXiv — Machine Learning research 27d ago

Libra: Efficient Resource Management for Agentic RL Post-Training

arXiv:2606.03077v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout…

23
arXiv — Machine Learning research 27d ago

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

arXiv:2606.03131v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench containing 13 reward-hacking patterns covering real…

15
arXiv — NLP / Computation & Language research 27d ago

Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization

arXiv:2606.03022v1 Announce Type: new Abstract: Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints, remains a persistent challenge for reliable deployment. In this work, we address…

14
arXiv — NLP / Computation & Language research 27d ago

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard…

18
arXiv — NLP / Computation & Language research 27d ago

Coherence Maximization Improves Pluralistic Alignment

arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these…

16
arXiv — NLP / Computation & Language research 27d ago

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

arXiv:2606.03165v1 Announce Type: new Abstract: The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking…

30
arXiv — NLP / Computation & Language research 27d ago

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

arXiv:2606.03648v1 Announce Type: new Abstract: Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and…

32
arXiv — NLP / Computation & Language research 27d ago

Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings

arXiv:2606.03695v1 Announce Type: new Abstract: As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the…

19
arXiv — NLP / Computation & Language research 27d ago

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric…

19
arXiv — NLP / Computation & Language research 27d ago

Consistency Training Can Entrench Misalignment

arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly…

31
arXiv — NLP / Computation & Language research 27d ago

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated…

16
arXiv — NLP / Computation & Language research 27d ago

Quantifying Faithful Confidence Expression in Large Reasoning Models

arXiv:2606.03969v1 Announce Type: new Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC), the alignment between models' intrinsic and (linguistically) expressed confidence, is a persistent failure mode. This…

35
Hugging Face Daily Papers research 27d ago

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

Abstract: Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Learning…

17
Hugging Face Daily Papers research 27d ago

TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Abstract: A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Deep Research Agents have shown strong…

4
Hugging Face Daily Papers research 27d ago

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract: Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM…

25
r/MachineLearning community 27d ago

Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R]

Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects). Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter? Setup: RSA alignment measured at 8…

30
Hugging Face Daily Papers research 28d ago

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Abstract: Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. (AI-generated summary.) Physical AI systems increasingly map multimodal…

12
OpenAI official-blog 28d ago

Advancing youth safety and opportunity through global leadership

OpenAI calls for global action on youth AI safety through a dedicated AI Safety Institute.

4
arXiv — Machine Learning research 28d ago

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv:2606.00341v1 Announce Type: new Abstract: As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work…

12
arXiv — Machine Learning research 28d ago

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

arXiv:2606.00400v1 Announce Type: new Abstract: Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed…

11
arXiv — Machine Learning research 28d ago

MESA: Improving MoE Safety Alignment via Decentralized Expertise

arXiv:2606.00651v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical…

36
arXiv — Machine Learning research 28d ago

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

arXiv:2606.00686v1 Announce Type: new Abstract: The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach…

7
arXiv — NLP / Computation & Language research 28d ago

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a…

37
arXiv — NLP / Computation & Language research 28d ago

RealityTest: How People Probe AI Identity and Whether Models Disclose It

arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of…

24
arXiv — NLP / Computation & Language research 28d ago

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

arXiv:2606.00284v1 Announce Type: new Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around…

23
arXiv — NLP / Computation & Language research 28d ago

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

arXiv:2606.00334v1 Announce Type: new Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are…

21
arXiv — NLP / Computation & Language research 28d ago

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates…

8
