Safety + alignment 500 articles archived under #safety arXiv — NLP / Computation & Language research 1mo ago LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models arXiv:2605.21362v1 Announce Type: new Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits… 21 Hugging Face Daily Papers research 1mo ago Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment Abstract Direct Preference Optimization (DPO) is theoretically equivalent to Reinforcement Learning from Human Feedback (RLHF) only under specific assumptions, otherwise optimizing different objectives; Constrained Preference Optimization (CPO) is proposed as a solution with… 17 Hugging Face Daily Papers research 1mo ago When Vision Speaks for Sound Abstract Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment. AI-generated summary Despite… 34 Hugging Face Daily Papers research 1mo ago Semantic Generative Tuning for Unified Multimodal Models Abstract Generative post-training with semantic segmentation as a proxy enhances multimodal alignment and performance in unified models. 
AI-generated summary Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single… 20 arXiv — Machine Learning research 1mo ago Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training arXiv:2605.18822v1 Announce Type: new Abstract: Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable… 28 arXiv — Machine Learning research 1mo ago Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin arXiv:2605.18823v1 Announce Type: new Abstract: Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system… 24 arXiv — Machine Learning research 1mo ago Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling arXiv:2605.18838v1 Announce Type: new Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a… 10 arXiv — Machine Learning research 1mo ago From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning arXiv:2605.18841v1 Announce Type: new Abstract: Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. 
In continual and… 6 arXiv — Machine Learning research 1mo ago Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints arXiv:2605.18842v1 Announce Type: new Abstract: Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental… 20 arXiv — Machine Learning research 1mo ago ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models arXiv:2605.18879v1 Announce Type: new Abstract: Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning… 32 arXiv — NLP / Computation & Language research 1mo ago LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models arXiv:2605.19416v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled… 5 arXiv — NLP / Computation & Language research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter… 17 arXiv — NLP / Computation & Language research 1mo ago CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving arXiv:2605.19837v1 Announce Type: cross Abstract: Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. 
Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time… 32 Hugging Face Daily Papers research 1mo ago GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment Abstract GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary We present GoLongRL, a fully open-source,… 37 arXiv — Machine Learning research 1mo ago Goal-Conditioned Supervised Learning for LLM Fine-Tuning arXiv:2605.16345v1 Announce Type: new Abstract: Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment,… 28 arXiv — Machine Learning research 1mo ago Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need? arXiv:2605.16354v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety… 25 arXiv — Machine Learning research 1mo ago M$^2$FedAQI: Multimodal Federated Learning for Air Quality Prediction on Heterogeneous Edge Devices arXiv:2605.16375v1 Announce Type: new Abstract: Accurate air quality prediction is essential for public health, environmental monitoring, and industrial safety. 
However, most existing approaches rely on centralized learning paradigms, which introduce challenges related to… 18 arXiv — Machine Learning research 1mo ago Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space arXiv:2605.16600v1 Announce Type: new Abstract: Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas… 31 arXiv — NLP / Computation & Language research 1mo ago Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models arXiv:2605.16938v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface… 16 arXiv — NLP / Computation & Language research 1mo ago Why Do Safety Guardrails Degrade Across Languages? arXiv:2605.17173v1 Announce Type: new Abstract: Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of… 12 arXiv — NLP / Computation & Language research 1mo ago Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment arXiv:2605.17342v1 Announce Type: new Abstract: Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. 
While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their… 11 arXiv — NLP / Computation & Language research 1mo ago AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering arXiv:2605.17352v1 Announce Type: new Abstract: Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the… 11 arXiv — NLP / Computation & Language research 1mo ago A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE arXiv:2605.18083v1 Announce Type: new Abstract: Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by… 30 arXiv — NLP / Computation & Language research 1mo ago Multilingual jailbreaking of LLMs using low-resource languages arXiv:2605.18239v1 Announce Type: new Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and… 27 Hugging Face Daily Papers research 1mo ago Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization Abstract Flash-GRPO improves training efficiency for video diffusion models by addressing temporal variance and gradient inconsistency through iso-temporal grouping and temporal gradient rectification. 
AI-generated summary Group Relative Policy Optimization has emerged as… 13 Hugging Face Daily Papers research 1mo ago Auditing Agent Harness Safety Abstract LLM agents executing within execution harnesses can produce correct outputs while violating safety constraints during execution, necessitating trajectory-level auditing to ensure proper resource access and information flow across multi-agent systems. AI-generated… 35 Import AI (Jack Clark) community 1mo ago Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment Welcome to Import AI, a newsletter about AI research. 4 r/MachineLearning community 1mo ago could refusal layers be masking dialect-conditioned safety failures in MoE models [d] I set out to test whether AAVE-coded (African American English Vernacular) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is… 35 arXiv — Machine Learning research 1mo ago Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels arXiv:2605.15208v1 Announce Type: new Abstract: Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly… 30 arXiv — Machine Learning research 1mo ago Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation arXiv:2605.15239v1 Announce Type: new Abstract: Safety alignment often improves robustness to harmful queries at the cost of reasoning ability, a tradeoff known as the safety tax. 
A common cause is distributional mismatch: supervised fine-tuning trains the target model on safety… 18 arXiv — Machine Learning research 1mo ago Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs arXiv:2605.15491v1 Announce Type: new Abstract: Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to… 9 arXiv — Machine Learning research 1mo ago parallelcbf: A composable safety-filter and auditability framework for tensor-parallel reinforcement learning arXiv:2605.15509v1 Announce Type: new Abstract: While Isaac Lab provides massive parallel UAV simulation, OmniSafe and safe-control-gym provide constrained-RL benchmarks, and CBFKit provides control-barrier-function synthesis tooling, no existing framework unifies these… 10 arXiv — Machine Learning research 1mo ago GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective arXiv:2605.15723v1 Announce Type: new Abstract: Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal… 37 arXiv — Machine Learning research 1mo ago Learning Context-conditioned Gaussian Overbounds for Convolution-Based Uncertainty Propagation arXiv:2605.15789v1 Announce Type: new Abstract: Uncertainty quantification is essential in safety-critical settings--from autonomous driving to aviation, finance, and health--where decisions must rely on conservative bounds rather than point estimates. 
Predictor-level intervals… 31 arXiv — Machine Learning research 1mo ago Ti-iLSTM: A TinyDL Approach for Logic-Level Anomaly Detection in Industrial Water Treatment Systems arXiv:2605.15874v1 Announce Type: new Abstract: Industrial Water Treatment Systems (IWTS) are safety critical cyber-physical infrastructures and due to increased connectivity, these systems are exposed to cyber threats that can manipulate process behaviour without creating… 27 arXiv — NLP / Computation & Language research 1mo ago SLIP & ETHICS: Graduated Intervention for AI Emotional Companions arXiv:2605.15915v1 Announce Type: cross Abstract: AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage… 8 arXiv — NLP / Computation & Language research 1mo ago Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios arXiv:2512.00920v5 Announce Type: replace Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracies in given specific scenarios,… 20 r/LocalLLaMA community 1mo ago 85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics I've been building Abliterlitics, an open-source abliteration forensics toolkit. 
The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed using benchmarks, safety evaluation,… 13 Hugging Face Daily Papers research 1mo ago LiSA: Lifelong Safety Adaptation via Conservative Policy Induction Abstract LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions. AI-generated summary As AI agents move… 10 arXiv — Machine Learning research 1mo ago Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations arXiv:2605.13923v1 Announce Type: new Abstract: We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample… 38 arXiv — Machine Learning research 1mo ago Fair and Calibrated Toxicity Detection with Robust Training and Abstention arXiv:2605.14074v1 Announce Type: new Abstract: Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. 
Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines… 33 arXiv — Machine Learning research 1mo ago Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability arXiv:2605.14246v1 Announce Type: new Abstract: Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against… 31 arXiv — Machine Learning research 1mo ago Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks arXiv:2605.14252v1 Announce Type: new Abstract: Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is… 31 arXiv — Machine Learning research 1mo ago Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment arXiv:2605.14311v1 Announce Type: new Abstract: Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking… 15 arXiv — Machine Learning research 1mo ago MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse arXiv:2605.14413v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. 
In this work, we present a key empirical observation: for in-distribution (ID)… 36 arXiv — Machine Learning research 1mo ago LiSA: Lifelong Safety Adaptation via Conservative Policy Induction arXiv:2605.14454v1 Announce Type: new Abstract: As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail… 7 arXiv — Machine Learning research 1mo ago Exploring Geographic Relative Space in Large Language Models through Activation Patching arXiv:2605.14535v1 Announce Type: new Abstract: The increased use of Large Language Models (LLMs) in geography raises substantial questions about the safety of integrating these tools across a wide range of processes and analyses, given our very limited understanding of their… 7 arXiv — NLP / Computation & Language research 1mo ago ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety arXiv:2605.14152v1 Announce Type: new Abstract: Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve… 34 arXiv — NLP / Computation & Language research 1mo ago GradShield: Alignment Preserving Finetuning arXiv:2605.14194v1 Announce Type: new Abstract: Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a… 23 arXiv — NLP / Computation & Language research 1mo ago Auditing Agent Harness Safety arXiv:2605.14271v1 Announce Type: new Abstract: LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. 
However, a harness can return a correct, benign answer over a trajectory that… 4