Safety + alignment · 500 articles archived under #safety arXiv — NLP / Computation & Language research 28d ago MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models arXiv:2606.01060v1 Announce Type: new Abstract: Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and… 28 arXiv — NLP / Computation & Language research 28d ago Low-Resource Safety Failures Are Action Failures, Not Representation Failures arXiv:2606.01196v1 Announce Type: new Abstract: Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive… 34 arXiv — NLP / Computation & Language research 28d ago Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination arXiv:2606.01276v1 Announce Type: new Abstract: Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to… 33 arXiv — NLP / Computation & Language research 28d ago TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. 
We introduce TUKABENCH, a jailbreak benchmark for seven… 19 Hugging Face Daily Papers research 28d ago Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents Abstract Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks. AI-generated summary LLM agents increasingly retrieve externally curated… 18 OpenAI official-blog 28d ago Our views on AI policy and political advocacy Our approach to AI policy and political advocacy, transparency, support for thoughtful regulation and AI safety, and that no outside political group speaks on the company’s behalf. 26 The Information — AI news-outlet 28d ago Florida Sues OpenAI and Sam Altman Over Safety Concerns Florida Attorney General James Uthmeier on Monday sued OpenAI and its chief executive Sam Altman, alleging 10 counts of negligence, liability, and other state law violations related to safety concerns over OpenAI’s consumer-facing tool ChatGPT. With the lawsuit, Florida became… 24 Hugging Face Daily Papers research 29d ago The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement Abstract SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives. AI-generated summary Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and… 26 arXiv — Machine Learning research 29d ago When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception arXiv:2605.30381v1 Announce Type: new Abstract: Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a central challenge in AI safety. 
While strategic deception is the primary long-term concern,… 24 arXiv — Machine Learning research 29d ago Calibrated Preference Learning: The Case of Label Ranking arXiv:2605.30447v1 Announce Type: new Abstract: Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally… 20 arXiv — Machine Learning research 29d ago Measuring, Localizing, and Ablating Alignment Signatures in LLMs arXiv:2605.30526v1 Announce Type: new Abstract: Aligned language models often exhibit a recognizable AI-like style, yet its connection to post-training and internal representations remains poorly understood. In this work, we study whether post-training introduces or amplifies… 11 arXiv — Machine Learning research 29d ago Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules arXiv:2605.30556v1 Announce Type: new Abstract: Random, untrained neural networks consistently match or exceed trained networks in representational similarity to early visual cortex. This puzzling finding challenges the assumption that learning improves brain alignment. We… 22 arXiv — Machine Learning research 29d ago Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation arXiv:2605.30585v1 Announce Type: new Abstract: Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. This paper investigates five major… 38 arXiv — Machine Learning research 29d ago CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment arXiv:2605.30635v1 Announce Type: new Abstract: Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. 
In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making… 31 arXiv — Machine Learning research 29d ago CSULoRA: Closest Safe Update Low-Rank Adaptation arXiv:2605.30640v1 Announce Type: new Abstract: Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned… 28 arXiv — Machine Learning research 29d ago Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences arXiv:2605.30873v1 Announce Type: new Abstract: Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user… 35 arXiv — Machine Learning research 29d ago Parallel Tempering Initial Sampling in Inference-Time Reward Alignment arXiv:2605.30991v1 Announce Type: new Abstract: Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this… 14 arXiv — NLP / Computation & Language research 29d ago Configurable Reward Model for Balanced Safety Alignment arXiv:2605.30487v1 Announce Type: new Abstract: Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. 
Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety… 11 arXiv — NLP / Computation & Language research 29d ago Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty arXiv:2605.30675v1 Announce Type: new Abstract: Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the… 5 arXiv — NLP / Computation & Language research 29d ago Skill is Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as… 15 arXiv — NLP / Computation & Language research 29d ago Pairwise Reference Alignment as a Model-Level Ordinal Observable arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference… 18 arXiv — NLP / Computation & Language research 29d ago The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement arXiv:2605.30888v1 Announce Type: new Abstract: Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. 
It is dramatically worse as the… 31 arXiv — NLP / Computation & Language research 29d ago ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails arXiv:2605.31073v1 Announce Type: new Abstract: Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize a harmful intent… 16 arXiv — NLP / Computation & Language research 29d ago Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards arXiv:2605.31328v1 Announce Type: new Abstract: Emergent misalignment (EM) is the surprising tendency of language models to become broadly misaligned after fine-tuning on narrowly misaligned examples. While EM has been extensively studied in the supervised fine-tuning (SFT)… 20 arXiv — NLP / Computation & Language research 29d ago LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues… 36 r/LocalLLaMA community 29d ago 13 abliterated Gemma 4 E2B variants, 44 GPU hours, Benchmark and Comparison - Abliterlitics I compared 13 abliterated variants of Gemma 4 E2B across weight analysis, KL divergence, HarmBench safety, and 8 benchmark tasks. 44 GPU hours on a single RTX 5090. Here is what actually works and what destroys capabilities. 
coder3101's variant achieves 96% ASR with capability… 17 Hugging Face Daily Papers research 1mo ago Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases Abstract Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling. AI-generated… 17 arXiv — Machine Learning research 1mo ago Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents arXiv:2605.28850v1 Announce Type: new Abstract: We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. Using TradeArena, an auditable trading-agent testbed with risk reports, execution simulation,… 26 arXiv — Machine Learning research 1mo ago Representation Alignment Rests on Linear Structure arXiv:2605.28870v1 Announce Type: new Abstract: We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. 1) Signal: We propose that Platonic alignment arises from the universal… 11 arXiv — Machine Learning research 1mo ago A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio arXiv:2605.28975v1 Announce Type: new Abstract: We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of… 37 arXiv — Machine Learning research 1mo ago Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning arXiv:2605.29028v1 Announce Type: new Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. 
However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their… 17 arXiv — Machine Learning research 1mo ago PROTOCOL: Late Interaction Retrieval for Protein Homolog Search arXiv:2605.29158v1 Announce Type: new Abstract: Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose… 12 arXiv — Machine Learning research 1mo ago SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction arXiv:2605.29236v1 Announce Type: new Abstract: Alarm fatigue in intensive care units (ICUs) is a well documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, out of which 72-99% are clinically irrelevant. Staff desensitization to… 29 arXiv — Machine Learning research 1mo ago Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content arXiv:2605.29659v1 Announce Type: new Abstract: Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail… 4 arXiv — NLP / Computation & Language research 1mo ago From Context Shift to Stylistic Collapse: Why Training Objectives Matter More Than Scale arXiv:2605.28826v1 Announce Type: new Abstract: In modern LLMs, linguistic features function not as stylistic artifacts but as probes of probability mass, allocated under training alignment objectives. 
Language models trained with contemporary pipelines exhibit severe reshaping… 35 arXiv — NLP / Computation & Language research 1mo ago Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated… 19 arXiv — NLP / Computation & Language research 1mo ago GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how… 31 arXiv — NLP / Computation & Language research 1mo ago Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents arXiv:2605.29224v1 Announce Type: new Abstract: AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment… 15 arXiv — NLP / Computation & Language research 1mo ago A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. 
Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods… 19 arXiv — NLP / Computation & Language research 1mo ago Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset arXiv:2605.29365v1 Announce Type: new Abstract: Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human… 20 arXiv — NLP / Computation & Language research 1mo ago Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning arXiv:2605.29414v1 Announce Type: new Abstract: Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However,… 32 arXiv — NLP / Computation & Language research 1mo ago Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment arXiv:2605.29458v1 Announce Type: new Abstract: Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and… 36 arXiv — NLP / Computation & Language research 1mo ago Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. 
These systems struggle to cross linguistic and cultural boundaries,… 9 arXiv — NLP / Computation & Language research 1mo ago Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs arXiv:2605.29708v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled… 36 Hugging Face Daily Papers research 1mo ago LoMo: Local Modality Substitution for Deeper Vision-Language Fusion Abstract Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance. AI-generated summary Vision-Language Models… 26 Hugging Face Daily Papers research 1mo ago AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security Abstract A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios. AI-generated summary Modern open-world… 23 Hugging Face Daily Papers research 1mo ago Native Audio-Visual Alignment for Generation Abstract NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising. AI-generated summary Joint audio-video generation aims to synthesize temporally synchronized and… 38 The Information — AI news-outlet 1mo ago Illinois Legislature Passes Landmark AI Safety Bill On Wednesday, the Illinois House of Representatives passed a bill that will require major AI companies to submit their model safety plans for third-party audits, as well as creating whistleblower protections for those companies’ employees. 
While Governor JB Pritzker still has to… 9 Ars Technica — AI news-outlet 1mo ago Trump loses more control over AI regulation as Illinois passes landmark law Here’s why Anthropic and OpenAI are on board with Illinois safety testing. 9 arXiv — Machine Learning research 1mo ago Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment arXiv:2605.27659v1 Announce Type: new Abstract: Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world… 15