Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 20d ago

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

arXiv:2606.09890v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior…

17
arXiv — Machine Learning research 20d ago

Quality Is Not a Safety Proxy Under Quantization

arXiv:2606.10154v1 Announce Type: new Abstract: Quantized checkpoints are often screened first with quality metrics and only later, if at all, with direct safety tests. This paper audits that shortcut on a matched 51-row matrix spanning 6 models, 4 families, a 7-level GGUF…

37
arXiv — Machine Learning research 20d ago

A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

arXiv:2606.10216v1 Announce Type: new Abstract: Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These…

12
arXiv — Machine Learning research 20d ago

Alignment Defends LLMs from Property Inference Attacks

arXiv:2606.10217v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets that may contain sensitive, dataset-level properties. Recent work has shown that such dataset-level information can be effectively extracted…

18
arXiv — Machine Learning research 20d ago

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

arXiv:2606.10228v1 Announce Type: new Abstract: Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to…

15
arXiv — NLP / Computation & Language research 20d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

arXiv:2606.10061v1 Announce Type: new Abstract: Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research…

27
arXiv — NLP / Computation & Language research 20d ago

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

arXiv:2606.10126v1 Announce Type: new Abstract: Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained…

31
arXiv — NLP / Computation & Language research 20d ago

Hidden Consensus: Preference-Validity Compression in Human Feedback

arXiv:2606.10569v1 Announce Type: new Abstract: Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect…

7
arXiv — NLP / Computation & Language research 20d ago

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

arXiv:2606.10675v1 Announce Type: new Abstract: We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech…

34
arXiv — NLP / Computation & Language research 20d ago

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the…

18
arXiv — NLP / Computation & Language research 20d ago

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

arXiv:2606.11167v1 Announce Type: new Abstract: Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level…

23
arXiv — NLP / Computation & Language research 20d ago

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

arXiv:2606.06037v2 Announce Type: cross Abstract: Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability…

29
arXiv — NLP / Computation & Language research 20d ago

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

arXiv:2606.10461v1 Announce Type: cross Abstract: Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown…

8
Hugging Face Daily Papers research 20d ago

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated…

12
Hugging Face Daily Papers research 20d ago

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Abstract Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities. Generated by…

24
Hugging Face Daily Papers research 20d ago

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

Abstract Researchers create BenSyc, a benchmark for evaluating conversational sycophancy in Bengali contexts, revealing challenges in distinguishing empathetic support from validation and escalation in emotionally sensitive dialogues. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

14
Interconnects (Nathan Lambert) research 20d ago

Claude Fable 5 and new AI safety fables

One step further into the power politics of frontier AI systems.

6
Hugging Face Daily Papers research 20d ago

Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Abstract SCOUT framework dynamically allocates prompt-injection detection by predicting detector reliability and latency, improving safety and efficiency over fixed single-detector approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Prompt-injection detectors are…

30
Hugging Face Daily Papers research 21d ago

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…

24
arXiv — Machine Learning research 21d ago

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

arXiv:2606.07631v1 Announce Type: new Abstract: Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated…

29
arXiv — Machine Learning research 21d ago

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

arXiv:2606.07678v1 Announce Type: new Abstract: Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing…

12
arXiv — Machine Learning research 21d ago

Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head

arXiv:2606.07694v1 Announce Type: new Abstract: Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging.…

6
arXiv — Machine Learning research 21d ago

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change…

31
arXiv — Machine Learning research 21d ago

Enhancing AI Interpretability and Safety through Localised Architectures

arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The…

8
arXiv — Machine Learning research 21d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…

33
Hacker News — AI on Front Page community 21d ago

Surveillance Is Not Safety: A statement on the UK's latest threat to privacy [pdf]

Article URL: https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf · Comments URL: https://news.ycombinator.com/item?id=48450646 · Points: 274 · Comments: 70

8
Hugging Face official-blog 21d ago

Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem

Team article · Published June 8, 2026 · Abid Ali Awan (kingabzpro) · build-small-hackathon. For the Hugging Face Build Small Hackathon, I wanted to build something practical,…

35
arXiv — Machine Learning research 22d ago

Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy

arXiv:2606.06554v1 Announce Type: new Abstract: Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz Dual-Comb…

25
arXiv — Machine Learning research 22d ago

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

arXiv:2606.06892v1 Announce Type: new Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and…

4
arXiv — Machine Learning research 22d ago

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

arXiv:2606.07088v1 Announce Type: new Abstract: Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly…

21
arXiv — NLP / Computation & Language research 22d ago

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

arXiv:2606.06667v1 Announce Type: new Abstract: The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated…

15
arXiv — NLP / Computation & Language research 22d ago

Korean Culture into LLM Alignment: Toward Cultural Coherence

arXiv:2606.06797v1 Announce Type: new Abstract: Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is…

15
arXiv — NLP / Computation & Language research 22d ago

Sycophantic Praise: Evaluating Excessive Praise in Language Models

arXiv:2606.07441v1 Announce Type: new Abstract: Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment…

26
arXiv — NLP / Computation & Language research 22d ago

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

arXiv:2606.07309v1 Announce Type: cross Abstract: Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question…

14
arXiv — NLP / Computation & Language research 22d ago

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.…

6
Hugging Face Daily Papers research 22d ago

UniSHARP: Universal Sharp Monocular View Synthesis

Abstract UniSHARP extends SHARP for universal monocular rendering across different camera systems by aligning images in an omnidirectional latent space through joint feature and Gaussian space alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In this work, we focus on…

35
OpenAI official-blog 22d ago

Built to benefit everyone: our plan

A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone.

6
r/LocalLLaMA community 24d ago

A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic)

No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month. I talk to this model nicely, because it’s a functional nervous wreck. And initially I thought that might be an alignment thing, so I also have the heretic version when I need a breather from…

25
Hugging Face Daily Papers research 24d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models…

38
arXiv — Machine Learning research 25d ago

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

arXiv:2606.05675v1 Announce Type: new Abstract: Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making…

11
arXiv — Machine Learning research 25d ago

Consistency Training Along the Transformer Stack

arXiv:2606.05817v1 Announce Type: new Abstract: Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal…

37
arXiv — Machine Learning research 25d ago

Adaptive Oscillatory-State Alignment for Time Series Forecasting

arXiv:2606.06010v1 Announce Type: new Abstract: Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or…

14
arXiv — NLP / Computation & Language research 25d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…

5
arXiv — NLP / Computation & Language research 25d ago

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial…

20
arXiv — NLP / Computation & Language research 25d ago

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

arXiv:2606.05523v1 Announce Type: new Abstract: Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on…

34
arXiv — NLP / Computation & Language research 25d ago

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

arXiv:2606.05688v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment.…

27
arXiv — NLP / Computation & Language research 25d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a…

9
arXiv — NLP / Computation & Language research 25d ago

Harnessing Structural Context for Entity Alignment Foundation Models

arXiv:2606.06109v1 Announce Type: new Abstract: Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment…

6
Hugging Face Daily Papers research 25d ago

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Abstract Role-playing language agents require dynamic character development that evolves through narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character…

19
Hugging Face Daily Papers research 25d ago

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing…

33
