Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 1mo ago

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

arXiv:2605.21362v1 Announce Type: new Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits…

21
Hugging Face Daily Papers research 1mo ago

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Abstract Direct Preference Optimization (DPO) is theoretically equivalent to Reinforcement Learning from Human Feedback (RLHF) only under specific assumptions, otherwise optimizing different objectives; Constrained Preference Optimization (CPO) is proposed as a solution with…

17
Hugging Face Daily Papers research 1mo ago

When Vision Speaks for Sound

Abstract Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment. AI-generated summary Despite…

34
Hugging Face Daily Papers research 1mo ago

Semantic Generative Tuning for Unified Multimodal Models

Abstract Generative post-training with semantic segmentation as a proxy enhances multimodal alignment and performance in unified models. AI-generated summary Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single…

20
arXiv — Machine Learning research 1mo ago

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

arXiv:2605.18822v1 Announce Type: new Abstract: Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable…

28
arXiv — Machine Learning research 1mo ago

Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin

arXiv:2605.18823v1 Announce Type: new Abstract: Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system…

24
arXiv — Machine Learning research 1mo ago

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

arXiv:2605.18838v1 Announce Type: new Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a…

10
arXiv — Machine Learning research 1mo ago

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

arXiv:2605.18841v1 Announce Type: new Abstract: Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and…

6
arXiv — Machine Learning research 1mo ago

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

arXiv:2605.18842v1 Announce Type: new Abstract: Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental…

20
arXiv — Machine Learning research 1mo ago

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

arXiv:2605.18879v1 Announce Type: new Abstract: Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning…

32
arXiv — NLP / Computation & Language research 1mo ago

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

arXiv:2605.19416v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled…

5
arXiv — NLP / Computation & Language research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter…

17
arXiv — NLP / Computation & Language research 1mo ago

CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

arXiv:2605.19837v1 Announce Type: cross Abstract: Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time…

32
Hugging Face Daily Papers research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,…

37
arXiv — Machine Learning research 1mo ago

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

arXiv:2605.16345v1 Announce Type: new Abstract: Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment,…

28
arXiv — Machine Learning research 1mo ago

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv:2605.16354v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety…

25
arXiv — Machine Learning research 1mo ago

M$^2$FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices

arXiv:2605.16375v1 Announce Type: new Abstract: Accurate air quality prediction is essential for public health, environmental monitoring, and industrial safety. However, most existing approaches rely on centralized learning paradigms, which introduce challenges related to…

18
arXiv — Machine Learning research 1mo ago

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

arXiv:2605.16600v1 Announce Type: new Abstract: Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas…

31
arXiv — NLP / Computation & Language research 1mo ago

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

arXiv:2605.16938v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface…

16
arXiv — NLP / Computation & Language research 1mo ago

Why Do Safety Guardrails Degrade Across Languages?

arXiv:2605.17173v1 Announce Type: new Abstract: Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of…

12
arXiv — NLP / Computation & Language research 1mo ago

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

arXiv:2605.17342v1 Announce Type: new Abstract: Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their…

11
arXiv — NLP / Computation & Language research 1mo ago

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

arXiv:2605.17352v1 Announce Type: new Abstract: Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the…

11
arXiv — NLP / Computation & Language research 1mo ago

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

arXiv:2605.18083v1 Announce Type: new Abstract: Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by…

30
arXiv — NLP / Computation & Language research 1mo ago

Multilingual jailbreaking of LLMs using low-resource languages

arXiv:2605.18239v1 Announce Type: new Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and…

27
Hugging Face Daily Papers research 1mo ago

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Abstract Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification. AI-generated summary Group Relative Policy Optimization has emerged as…

13
Hugging Face Daily Papers research 1mo ago

Auditing Agent Harness Safety

Abstract LLM agents executing within execution harnesses can produce correct outputs while violating safety constraints during execution, necessitating trajectory-level auditing to ensure proper resource access and information flow across multi-agent systems. AI-generated…

35
Import AI (Jack Clark) community 1mo ago

Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment

Welcome to Import AI, a newsletter about AI research.

4
r/MachineLearning community 1mo ago

could refusal layers be masking dialect-conditioned safety failures in MoE models [d]

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is…

35
arXiv — Machine Learning research 1mo ago

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

arXiv:2605.15208v1 Announce Type: new Abstract: Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly…

30
arXiv — Machine Learning research 1mo ago

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

arXiv:2605.15239v1 Announce Type: new Abstract: Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety…

18
arXiv — Machine Learning research 1mo ago

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

arXiv:2605.15491v1 Announce Type: new Abstract: Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to…

9
arXiv — Machine Learning research 1mo ago

parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning

arXiv:2605.15509v1 Announce Type: new Abstract: While Isaac Lab provides massive parallel UAV simulation, OmniSafe and safe-control-gym provide constrained-RL benchmarks, and CBFKit provides control-barrier-function synthesis tooling, no existing framework unifies these…

10
arXiv — Machine Learning research 1mo ago

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

arXiv:2605.15723v1 Announce Type: new Abstract: Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal…

37
arXiv — Machine Learning research 1mo ago

Learning Context-conditioned Gaussian Overbounds for Convolution-Based Uncertainty Propagation

arXiv:2605.15789v1 Announce Type: new Abstract: Uncertainty quantification is essential in safety-critical settings--from autonomous driving to aviation, finance, and health--where decisions must rely on conservative bounds rather than point estimates. Predictor-level intervals…

31
arXiv — Machine Learning research 1mo ago

Ti-iLSTM: A TinyDL Approach for Logic-Level Anomaly Detection in Industrial Water Treatment Systems

arXiv:2605.15874v1 Announce Type: new Abstract: Industrial Water Treatment Systems (IWTS) are safety critical cyber-physical infrastructures and due to increased connectivity, these systems are exposed to cyber threats that can manipulate process behaviour without creating…

27
arXiv — NLP / Computation & Language research 1mo ago

SLIP & ETHICS: Graduated Intervention for AI Emotional Companions

arXiv:2605.15915v1 Announce Type: cross Abstract: AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage…

8
arXiv — NLP / Computation & Language research 1mo ago

Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

arXiv:2512.00920v5 Announce Type: replace Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracies in given specific scenarios,…

20
r/LocalLLaMA community 1mo ago

85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics

I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchmarks, safety evaluation,…

13
Hugging Face Daily Papers research 1mo ago

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Abstract LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions. AI-generated summary As AI agents move…

10
arXiv — Machine Learning research 1mo ago

Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

arXiv:2605.13923v1 Announce Type: new Abstract: We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample…

38
arXiv — Machine Learning research 1mo ago

Fair and Calibrated Toxicity Detection with Robust Training and Abstention

arXiv:2605.14074v1 Announce Type: new Abstract: Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines…

33
arXiv — Machine Learning research 1mo ago

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

arXiv:2605.14246v1 Announce Type: new Abstract: Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against…

31
arXiv — Machine Learning research 1mo ago

Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

arXiv:2605.14252v1 Announce Type: new Abstract: Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is…

31
arXiv — Machine Learning research 1mo ago

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

arXiv:2605.14311v1 Announce Type: new Abstract: Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking…

15
arXiv — Machine Learning research 1mo ago

MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse

arXiv:2605.14413v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID)…

36
arXiv — Machine Learning research 1mo ago

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

arXiv:2605.14454v1 Announce Type: new Abstract: As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail…

7
arXiv — Machine Learning research 1mo ago

Exploring Geographic Relative Space in Large Language Models through Activation Patching

arXiv:2605.14535v1 Announce Type: new Abstract: The increased use of Large Language Models (LLMs) in geography raises substantial questions about the safety of integrating these tools across a wide range of processes and analyses, given our very limited understanding of their…

7
arXiv — NLP / Computation & Language research 1mo ago

ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

arXiv:2605.14152v1 Announce Type: new Abstract: Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve…

34
arXiv — NLP / Computation & Language research 1mo ago

GradShield: Alignment Preserving Finetuning

arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a…

23
arXiv — NLP / Computation & Language research 1mo ago

Auditing Agent Harness Safety

arXiv:2605.14271v1 Announce Type: new Abstract: LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that…

4
