Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 28d ago

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

arXiv:2606.01060v1 Announce Type: new Abstract: Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and…

28
arXiv — NLP / Computation & Language research 28d ago

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

arXiv:2606.01196v1 Announce Type: new Abstract: Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive…

34
arXiv — NLP / Computation & Language research 28d ago

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

arXiv:2606.01276v1 Announce Type: new Abstract: Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to…

33
arXiv — NLP / Computation & Language research 28d ago

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven…

19
Hugging Face Daily Papers research 28d ago

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

Abstract Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks. AI-generated summary LLM agents increasingly retrieve externally curated…

18
OpenAI official-blog 28d ago

Our views on AI policy and political advocacy

Our approach to AI policy and political advocacy: transparency, support for thoughtful regulation and AI safety, and the fact that no outside political group speaks on the company’s behalf.

26
The Information — AI news-outlet 28d ago

Florida Sues OpenAI and Sam Altman Over Safety Concerns

Florida Attorney General James Uthmeier on Monday sued OpenAI and its chief executive Sam Altman, alleging 10 counts of negligence, liability, and other state law violations related to safety concerns over OpenAI’s consumer-facing tool ChatGPT. With the lawsuit, Florida became…

24
Hugging Face Daily Papers research 29d ago

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Abstract SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives. AI-generated summary Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and…

26
arXiv — Machine Learning research 29d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. While strategic deception is the primary long-term concern,…

24
arXiv — Machine Learning research 29d ago

Calibrated Preference Learning: The Case of Label Ranking

arXiv:2605.30447v1 Announce Type: new Abstract: Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally…

20
arXiv — Machine Learning research 29d ago

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

arXiv:2605.30526v1 Announce Type: new Abstract: Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies…

11
arXiv — Machine Learning research 29d ago

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

arXiv:2605.30556v1 Announce Type: new Abstract: Random, untrained neural networks consistently match or exceed trained networks in representational similarity to early visual cortex. This puzzling finding challenges the assumption that learning improves brain alignment. We…

22
arXiv — Machine Learning research 29d ago

Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation

arXiv:2605.30585v1 Announce Type: new Abstract: Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major…

38
arXiv — Machine Learning research 29d ago

CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment

arXiv:2605.30635v1 Announce Type: new Abstract: Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making…

31
arXiv — Machine Learning research 29d ago

CSULoRA: Closest Safe Update Low-Rank Adaptation

arXiv:2605.30640v1 Announce Type: new Abstract: Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned…

28
arXiv — Machine Learning research 29d ago

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user…

35
arXiv — Machine Learning research 29d ago

Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

arXiv:2605.30991v1 Announce Type: new Abstract: Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this…

14
arXiv — NLP / Computation & Language research 29d ago

Configurable Reward Model for Balanced Safety Alignment

arXiv:2605.30487v1 Announce Type: new Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety…

11
arXiv — NLP / Computation & Language research 29d ago

Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty

arXiv:2605.30675v1 Announce Type: new Abstract: Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the…

5
arXiv — NLP / Computation & Language research 29d ago

Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as…

15
arXiv — NLP / Computation & Language research 29d ago

Pairwise Reference Alignment as a Model-Level Ordinal Observable

arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference…

18
arXiv — NLP / Computation & Language research 29d ago

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the…

31
arXiv — NLP / Computation & Language research 29d ago

ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

arXiv:2605.31073v1 Announce Type: new Abstract: Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent…

16
arXiv — NLP / Computation & Language research 29d ago

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT)…

20
arXiv — NLP / Computation & Language research 29d ago

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues…

36
r/LocalLLaMA community 29d ago

13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics

I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. coder3101's variant achieves 96% ASR with capability…

17
Hugging Face Daily Papers research 1mo ago

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Abstract Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling. AI-generated…

17
arXiv — Machine Learning research 1mo ago

Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents

arXiv:2605.28850v1 Announce Type: new Abstract: We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. Using TradeArena, an auditable trading-agent testbed with risk reports, execution simulation,…

26
arXiv — Machine Learning research 1mo ago

Representation Alignment Rests on Linear Structure

arXiv:2605.28870v1 Announce Type: new Abstract: We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal…

11
arXiv — Machine Learning research 1mo ago

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

arXiv:2605.28975v1 Announce Type: new Abstract: We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of…

37
arXiv — Machine Learning research 1mo ago

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

arXiv:2605.29028v1 Announce Type: new Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their…

17
arXiv — Machine Learning research 1mo ago

PROTOCOL: Late Interaction Retrieval for Protein Homolog Search

arXiv:2605.29158v1 Announce Type: new Abstract: Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose…

12
arXiv — Machine Learning research 1mo ago

SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction

arXiv:2605.29236v1 Announce Type: new Abstract: Alarm fatigue in intensive care units (ICUs) is a well documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, out of which 72-99% are clinically irrelevant. Staff desensitization to…

29
arXiv — Machine Learning research 1mo ago

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail…

4
arXiv — NLP / Computation & Language research 1mo ago

From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale

arXiv:2605.28826v1 Announce Type: new Abstract: In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. Language models trained with contemporary pipelines exhibit severe reshaping…

35
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated…

19
arXiv — NLP / Computation & Language research 1mo ago

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how…

31
arXiv — NLP / Computation & Language research 1mo ago

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

arXiv:2605.29224v1 Announce Type: new Abstract: AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment…

15
arXiv — NLP / Computation & Language research 1mo ago

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods…

19
arXiv — NLP / Computation & Language research 1mo ago

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human…

20
arXiv — NLP / Computation & Language research 1mo ago

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

arXiv:2605.29414v1 Announce Type: new Abstract: Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However,…

32
arXiv — NLP / Computation & Language research 1mo ago

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

arXiv:2605.29458v1 Announce Type: new Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and…

36
arXiv — NLP / Computation & Language research 1mo ago

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,…

9
arXiv — NLP / Computation & Language research 1mo ago

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

arXiv:2605.29708v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled…

36
Hugging Face Daily Papers research 1mo ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Abstract Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance. AI-generated summary Vision-Language Models…

26
Hugging Face Daily Papers research 1mo ago

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Abstract A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios. AI-generated summary Modern open-world…

23
Hugging Face Daily Papers research 1mo ago

Native Audio-Visual Alignment for Generation

Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and…

38
The Information — AI news-outlet 1mo ago

Illinois Legislature Passes Landmark AI Safety Bill

On Wednesday, the Illinois House of Representatives passed a bill that will require major AI companies to submit their model safety plans for third-party audits, as well as creating whistleblower protections for those companies’ employees. While Governor JB Pritzker still has to…

9
Ars Technica — AI news-outlet 1mo ago

Trump loses more control over AI regulation as Illinois passes landmark law

Here’s why Anthropic and OpenAI are on board with Illinois safety testing.

9
arXiv — Machine Learning research 1mo ago

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

arXiv:2605.27659v1 Announce Type: new Abstract: Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world…

15
