Safety + alignment · 500 articles archived under #safety arXiv — NLP / Computation & Language research 20d ago PreAct-Bench: Benchmarking Predictive Monitoring in LLMs arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior… 17 arXiv — Machine Learning research 20d ago Quality Is Not a Safety Proxy Under Quantization arXiv:2606.10154v1 Announce Type: new Abstract: Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF… 37 arXiv — Machine Learning research 20d ago A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport arXiv:2606.10216v1 Announce Type: new Abstract: Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These… 12 arXiv — Machine Learning research 20d ago Alignment Defends LLMs from Property Inference Attacks arXiv:2606.10217v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted… 18 arXiv — Machine Learning research 20d ago SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration arXiv:2606.10228v1 Announce Type: new Abstract: Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains.
In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to… 15 arXiv — NLP / Computation & Language research 20d ago BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research… 27 arXiv — NLP / Computation & Language research 20d ago Pareto-Guided Teacher Alignment for Fair Personalized Text Generation arXiv:2606.10126v1 Announce Type: new Abstract: Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained… 31 arXiv — NLP / Computation & Language research 20d ago Hidden Consensus: Preference-Validity Compression in Human Feedback arXiv:2606.10569v1 Announce Type: new Abstract: Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect… 7 arXiv — NLP / Computation & Language research 20d ago Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech… 34 arXiv — NLP / Computation & Language research 20d ago Does Reasoning Preserve Alignment?
On the Trustworthiness of Large Reasoning Models arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the… 18 arXiv — NLP / Computation & Language research 20d ago Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level… 23 arXiv — NLP / Computation & Language research 20d ago SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability… 29 arXiv — NLP / Computation & Language research 20d ago ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs arXiv:2606.10461v1 Announce Type: cross Abstract: Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown… 8 Hugging Face Daily Papers research 20d ago When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. 
Generated… 12 Hugging Face Daily Papers research 20d ago Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating Abstract Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities. Generated by… 24 Hugging Face Daily Papers research 20d ago BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts Abstract Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 14 Interconnects (Nathan Lambert) research 20d ago Claude Fable 5 and new AI safety fables One step further into the power politics of frontier AI systems. 6 Hugging Face Daily Papers research 20d ago Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense Abstract SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Prompt-injection detectors are… 30 Hugging Face Daily Papers research 21d ago Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather… 24 arXiv — Machine Learning research 21d ago Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning arXiv:2606.07631v1 Announce Type: new Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated… 29 arXiv — Machine Learning research 21d ago DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment arXiv:2606.07678v1 Announce Type: new Abstract: Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing… 12 arXiv — Machine Learning research 21d ago Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head arXiv:2606.07694v1 Announce Type: new Abstract: Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging.… 6 arXiv — Machine Learning research 21d ago Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway.
We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change… 31 arXiv — Machine Learning research 21d ago Enhancing AI Interpretability and Safety through Localised Architectures arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The… 8 arXiv — Machine Learning research 21d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under… 33 Hacker News — AI on Front Page community 21d ago Surveillance Is Not Safety: A statement on the UK's latest threat to privacy [pdf] Article URL: https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf Comments URL: https://news.ycombinator.com/item?id=48450646 Points: 274 # Comments: 70 8 Hugging Face official-blog 21d ago Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem Back to Articles Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem Team Article Published June 8, 2026 Upvote 1 Abid Ali Awan kingabzpro build-small-hackathon For the Hugging Face Build Small Hackathon , I wanted to build something practical,… 35 arXiv — Machine Learning research 22d ago Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy arXiv:2606.06554v1 Announce Type: new Abstract: Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust 
discrimination. Terahertz Dual-Comb… 25 arXiv — Machine Learning research 22d ago GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution arXiv:2606.06892v1 Announce Type: new Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and… 4 arXiv — Machine Learning research 22d ago Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making arXiv:2606.07088v1 Announce Type: new Abstract: Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly… 21 arXiv — NLP / Computation & Language research 22d ago The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment arXiv:2606.06667v1 Announce Type: new Abstract: The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated… 15 arXiv — NLP / Computation & Language research 22d ago Korean Culture into LLM Alignment: Toward Cultural Coherence arXiv:2606.06797v1 Announce Type: new Abstract: Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress.
We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is… 15 arXiv — NLP / Computation & Language research 22d ago Sycophantic Praise: Evaluating Excessive Praise in Language Models arXiv:2606.07441v1 Announce Type: new Abstract: Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment… 26 arXiv — NLP / Computation & Language research 22d ago Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question… 14 arXiv — NLP / Computation & Language research 22d ago TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.… 6 Hugging Face Daily Papers research 22d ago UniSHARP: Universal Sharp Monocular View Synthesis Abstract UniSHARP extends SHARP for universal monocular rendering across different camera systems by aligning images in an omnidirectional latent space through joint feature and Gaussian space alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In this work, we focus on… 35 OpenAI official-blog 22d ago Built to benefit everyone: our plan A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone. 
6 r/LocalLLaMA community 24d ago A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic) No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month. I talk to this model nicely, because it’s a functional nervous wreck. And initially I thought that might be an alignment thing, so I also have the heretic version when I need a breather from… 25 Hugging Face Daily Papers research 24d ago SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 38 arXiv — Machine Learning research 25d ago Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning arXiv:2606.05675v1 Announce Type: new Abstract: Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making… 11 arXiv — Machine Learning research 25d ago Consistency Training Along the Transformer Stack arXiv:2606.05817v1 Announce Type: new Abstract: Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal… 37 arXiv — Machine Learning research 25d ago Adaptive Oscillatory-State Alignment for Time Series Forecasting arXiv:2606.06010v1 Announce Type: new Abstract: Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. 
Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or… 14 arXiv — NLP / Computation & Language research 25d ago MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four… 5 arXiv — NLP / Computation & Language research 25d ago The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial… 20 arXiv — NLP / Computation & Language research 25d ago CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning arXiv:2606.05523v1 Announce Type: new Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. 
Existing defenses either rely on… 34 arXiv — NLP / Computation & Language research 25d ago Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models arXiv:2606.05688v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment.… 27 arXiv — NLP / Computation & Language research 25d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a… 9 arXiv — NLP / Computation & Language research 25d ago Harnessing Structural Context for Entity Alignment Foundation Models arXiv:2606.06109v1 Announce Type: new Abstract: Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment… 6 Hugging Face Daily Papers research 25d ago ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time? 
Abstract Role-playing language agents require dynamic character development that evolves through narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character… 19 Hugging Face Daily Papers research 25d ago LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing… 33