Tag

Safety + alignment

500 articles archived under #safety

arXiv — Machine Learning research 1mo ago

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

arXiv:2605.27758v1 Announce Type: new Abstract: Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While…

18
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

arXiv:2605.27892v1 Announce Type: new Abstract: Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are…

32
arXiv — Machine Learning research 1mo ago

AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels

arXiv:2605.28021v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on…

25
arXiv — Machine Learning research 1mo ago

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

arXiv:2605.28030v1 Announce Type: new Abstract: Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a…

22
arXiv — NLP / Computation & Language research 1mo ago

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

arXiv:2605.27374v1 Announce Type: new Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical…

26
arXiv — NLP / Computation & Language research 1mo ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by…

24
arXiv — NLP / Computation & Language research 1mo ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

arXiv:2605.27388v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical…

16
arXiv — NLP / Computation & Language research 1mo ago

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

arXiv:2605.27545v1 Announce Type: new Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple…

38
arXiv — NLP / Computation & Language research 1mo ago

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

arXiv:2605.27690v1 Announce Type: new Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore…

22
arXiv — NLP / Computation & Language research 1mo ago

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

arXiv:2605.27901v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse…

30
arXiv — NLP / Computation & Language research 1mo ago

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:…

18
arXiv — NLP / Computation & Language research 1mo ago

Chinese Word Boundary Recovery through Character Alignment Projection

arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper…

30
arXiv — NLP / Computation & Language research 1mo ago

Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

arXiv:2605.28188v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but…

23
arXiv — NLP / Computation & Language research 1mo ago

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

arXiv:2605.28292v1 Announce Type: new Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example…

31
arXiv — NLP / Computation & Language research 1mo ago

HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

arXiv:2605.28308v1 Announce Type: new Abstract: Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject…

11
OpenAI official-blog 1mo ago

OpenAI’s Frontier Governance Framework

Explore OpenAI’s Frontier Governance Framework and how our AI safety, security, and risk practices align with emerging EU and California regulations.

15
Hugging Face Daily Papers research 1mo ago

D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Abstract Diffusion large language models generate text through multi-step denoising processes that expose intermediate representations useful for safety monitoring, leading to the development of a bi-level safety monitor that dynamically routes computational resources based on…

35
arXiv — Machine Learning research 1mo ago

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv:2605.26121v1 Announce Type: new Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering…

27
arXiv — Machine Learning research 1mo ago

Curriculum Learning for Safety Alignment

arXiv:2605.26315v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate…

20
arXiv — Machine Learning research 1mo ago

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary…

10
arXiv — Machine Learning research 1mo ago

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

arXiv:2605.26552v1 Announce Type: new Abstract: Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV,…

36
arXiv — Machine Learning research 1mo ago

Linear and Neural Dueling Bandits with Delayed Feedback

arXiv:2605.26554v1 Announce Type: new Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption…

35
arXiv — NLP / Computation & Language research 1mo ago

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

arXiv:2605.26365v1 Announce Type: new Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access…

33
arXiv — NLP / Computation & Language research 1mo ago

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay…

34
arXiv — NLP / Computation & Language research 1mo ago

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

arXiv:2605.26442v1 Announce Type: new Abstract: Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment…

17
arXiv — NLP / Computation & Language research 1mo ago

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

arXiv:2605.26463v1 Announce Type: new Abstract: Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency…

7
arXiv — NLP / Computation & Language research 1mo ago

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

arXiv:2605.26785v1 Announce Type: new Abstract: Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability:…

21
arXiv — NLP / Computation & Language research 1mo ago

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs…

27
arXiv — NLP / Computation & Language research 1mo ago

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas…

5
arXiv — NLP / Computation & Language research 1mo ago

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation…

25
arXiv — NLP / Computation & Language research 1mo ago

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments…

27
arXiv — NLP / Computation & Language research 1mo ago

Grounding Text Embeddings in Stakeholder Associations

arXiv:2605.27168v1 Announce Type: new Abstract: Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding…

17
Hugging Face Daily Papers research 1mo ago

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing…

32
Hugging Face Daily Papers research 1mo ago

Cross-scale Aligned Supervision for Training GANs

Abstract Standard GANs with adversarial supervision on intermediate outputs fail to maintain consistent sample trajectories across scales, leading to misalignment; a new transformer-based approach called CAT addresses this by enforcing consistency between intermediate and final…

28
Hugging Face Daily Papers research 1mo ago

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Abstract Open-source large language models exhibit varying political expressivity and vulnerability to jailbreak techniques, necessitating systematic red-teaming frameworks for assessing their potential misuse in influence campaigns. AI-generated summary As large language model…

25
Hugging Face Daily Papers research 1mo ago

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation. AI-generated…

7
Hugging Face Daily Papers research 1mo ago

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Abstract Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining. AI-generated summary Text-to-image diffusion models like Stable Diffusion generate high-quality images from…

35
Hugging Face Daily Papers research 1mo ago

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Abstract A natural language interface for transportation safety analysis uses large language models to translate user queries into structured spatial operations while maintaining deterministic database execution for reliable and reproducible results. AI-generated summary…

21
r/LocalLLaMA community 1mo ago

qwen 3.6 27B AR-> Diffusion - local training on 5090

Based on the work of open-dllm (which achieved a qwen 2.5 autoregressive -> diffusion realignment head, the same exact model under the hood delivering a 4x improvement). TL;DR: I haven't got a trained model yet, just a burnt-out GPU cable and a new PSU on order. I did actually get…

22
Hugging Face Daily Papers research 1mo ago

SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

Abstract SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages. AI-generated summary Sparse encoders offer high-precision retrieval by…

23
Hugging Face Daily Papers research 1mo ago

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise…

10
Hugging Face Daily Papers research 1mo ago

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Abstract RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. AI-generated summary Recent advances in few-step diffusion distillation have…

30
arXiv — Machine Learning research 1mo ago

AvAtar: Learning to Align via Active Optimal Transport

arXiv:2605.24395v1 Announce Type: new Abstract: Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional…

12
arXiv — Machine Learning research 1mo ago

An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits

arXiv:2605.24583v1 Announce Type: new Abstract: We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix…

8
arXiv — Machine Learning research 1mo ago

On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks

arXiv:2605.24649v1 Announce Type: new Abstract: Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy.…

15
arXiv — Machine Learning research 1mo ago

The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench

arXiv:2605.24782v1 Announce Type: new Abstract: While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based…

28
arXiv — NLP / Computation & Language research 1mo ago

EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement,…

36
arXiv — NLP / Computation & Language research 1mo ago

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

arXiv:2605.23974v1 Announce Type: new Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing…

36
arXiv — NLP / Computation & Language research 1mo ago

Measuring the Depth of LLM Unlearning via Activation Patching

arXiv:2605.24614v1 Announce Type: new Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail…

17
