Safety + alignment: 500 articles archived under #safety. arXiv — Machine Learning research 1mo ago High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention arXiv:2605.27758v1 Announce Type: new Abstract: Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While… 18 arXiv — Machine Learning research 1mo ago A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized… 17 arXiv — Machine Learning research 1mo ago FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation arXiv:2605.27892v1 Announce Type: new Abstract: Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings.
However, most existing EHR generative models are… 32 arXiv — Machine Learning research 1mo ago AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels arXiv:2605.28021v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on… 25 arXiv — Machine Learning research 1mo ago SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection arXiv:2605.28030v1 Announce Type: new Abstract: Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a… 22 arXiv — NLP / Computation & Language research 1mo ago ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment arXiv:2605.27374v1 Announce Type: new Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical… 26 arXiv — NLP / Computation & Language research 1mo ago Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models arXiv:2605.27383v1 Announce Type: new Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. 
However, their effectiveness in low-resource languages remains fundamentally limited by… 24 arXiv — NLP / Computation & Language research 1mo ago Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities arXiv:2605.27388v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical… 16 arXiv — NLP / Computation & Language research 1mo ago PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI arXiv:2605.27545v1 Announce Type: new Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple… 38 arXiv — NLP / Computation & Language research 1mo ago TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling arXiv:2605.27690v1 Announce Type: new Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore… 22 arXiv — NLP / Computation & Language research 1mo ago The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages arXiv:2605.27901v1 Announce Type: new Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. 
However, its reliability remains largely unexplored beyond English and across diverse… 30 arXiv — NLP / Computation & Language research 1mo ago KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:… 18 arXiv — NLP / Computation & Language research 1mo ago Chinese Word Boundary Recovery through Character Alignment Projection arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper… 30 arXiv — NLP / Computation & Language research 1mo ago Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment arXiv:2605.28188v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but… 23 arXiv — NLP / Computation & Language research 1mo ago CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models arXiv:2605.28292v1 Announce Type: new Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. 
However, existing approaches typically lack alignment with explicit rationales and adaptivity to example… 31 arXiv — NLP / Computation & Language research 1mo ago HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment arXiv:2605.28308v1 Announce Type: new Abstract: Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject… 11 OpenAI official-blog 1mo ago OpenAI’s Frontier Governance Framework Explore OpenAI’s Frontier Governance Framework and how our AI safety, security, and risk practices align with emerging EU and California regulations. 15 Hugging Face Daily Papers research 1mo ago D^2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing Abstract Diffusion large language models generate text through multi-step denoising processes that expose intermediate representations useful for safety monitoring, leading to the development of a bi-level safety monitor that dynamically routes computational resources based on… 35 arXiv — Machine Learning research 1mo ago GEM: Geometric Entropy Mixing for Optimal LLM Data Curation arXiv:2605.26121v1 Announce Type: new Abstract: LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering… 27 arXiv — Machine Learning research 1mo ago Curriculum Learning for Safety Alignment arXiv:2605.26315v1 Announce Type: new Abstract: Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. 
In this paper, we investigate… 20 arXiv — Machine Learning research 1mo ago Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary… 10 arXiv — Machine Learning research 1mo ago Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference arXiv:2605.26552v1 Announce Type: new Abstract: Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV,… 36 arXiv — Machine Learning research 1mo ago Linear and Neural Dueling Bandits with Delayed Feedback arXiv:2605.26554v1 Announce Type: new Abstract: Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption… 35 arXiv — NLP / Computation & Language research 1mo ago Cultural Value Alignment Via Latent Activation Steering in Large Language Models arXiv:2605.26365v1 Announce Type: new Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. 
While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access… 33 arXiv — NLP / Computation & Language research 1mo ago LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay… 34 arXiv — NLP / Computation & Language research 1mo ago Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines arXiv:2605.26442v1 Announce Type: new Abstract: Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment… 17 arXiv — NLP / Computation & Language research 1mo ago Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records arXiv:2605.26463v1 Announce Type: new Abstract: Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency… 7 arXiv — NLP / Computation & Language research 1mo ago EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation arXiv:2605.26785v1 Announce Type: new Abstract: Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. 
In adversarial negotiation, however, this alignment can become a vulnerability:… 21 arXiv — NLP / Computation & Language research 1mo ago Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation arXiv:2605.26918v1 Announce Type: new Abstract: Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs… 27 arXiv — NLP / Computation & Language research 1mo ago KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas… 5 arXiv — NLP / Computation & Language research 1mo ago AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation… 25 arXiv — NLP / Computation & Language research 1mo ago Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations arXiv:2605.27025v1 Announce Type: new Abstract: Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. 
We systematically analyze how well large language models (LLMs) align with human judgments… 27 arXiv — NLP / Computation & Language research 1mo ago Grounding Text Embeddings in Stakeholder Associations arXiv:2605.27168v1 Announce Type: new Abstract: Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding… 17 Hugging Face Daily Papers research 1mo ago LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing… 32 Hugging Face Daily Papers research 1mo ago Cross-scale Aligned Supervision for Training GANs Abstract Standard GANs with adversarial supervision on intermediate outputs fail to maintain consistent sample trajectories across scales, leading to misalignment; a new transformer-based approach called CAT addresses this by enforcing consistency between intermediate and final… 28 Hugging Face Daily Papers research 1mo ago How Far Will They Go? Red-Teaming Online Influence with Large Language Models Abstract Open-source large language models exhibit varying political expressivity and vulnerability to jailbreak techniques, necessitating systematic red-teaming frameworks for assessing their potential misuse in influence campaigns. 
AI-generated summary As large language model… 25 Hugging Face Daily Papers research 1mo ago Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models Abstract Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation. AI-generated… 7 Hugging Face Daily Papers research 1mo ago Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference Abstract Visual Concept Fusion enables dual text and image conditioning in diffusion models through feature alignment and fusion strategies without requiring retraining. AI-generated summary Text-to-image diffusion models like Stable Diffusion generate high-quality images from… 35 Hugging Face Daily Papers research 1mo ago Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries Abstract A natural language interface for transportation safety analysis uses large language models to translate user queries into structured spatial operations while maintaining deterministic database execution for reliable and reproducible results. AI-generated summary… 21 r/LocalLLaMA community 1mo ago qwen 3.6 27B AR-> Diffusion - local training on 5090 based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable and a new psu on order. 
I did actually get… 22 Hugging Face Daily Papers research 1mo ago SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges Abstract SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages. AI-generated summary Sparse encoders offer high-precision retrieval by… 23 Hugging Face Daily Papers research 1mo ago Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution Abstract ASASR addresses spectral misalignment in image super-resolution by leveraging Riemannian geometry and adversarial training to improve structural fidelity and reduce artifacts. AI-generated summary Generative priors in Image Super-Resolution (SR) often compromise… 10 Hugging Face Daily Papers research 1mo ago Reinforcing Few-step Generators via Reward-Tilted Distribution Matching Abstract RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. AI-generated summary Recent advances in few-step diffusion distillation have… 30 arXiv — Machine Learning research 1mo ago AvAtar: Learning to Align via Active Optimal Transport arXiv:2605.24395v1 Announce Type: new Abstract: Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. 
Recent works increasingly leverage optimal transport (OT) for distributional… 12 arXiv — Machine Learning research 1mo ago An Effective-Rank Audit of Alignment-Induced Activation Shifts: Confound Control, Constructive Calibration, and Limits arXiv:2605.24583v1 Announce Type: new Abstract: We audit alignment-induced shifts in residual-stream activations of three open-weight instruction-tuned LLMs (Llama-3.1-8B-Instruct, Gemma-2-9B-it, Qwen-2.5-7B-Instruct) using the effective rank of the alignment modification matrix… 8 arXiv — Machine Learning research 1mo ago On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks arXiv:2605.24649v1 Announce Type: new Abstract: Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying them as runtime monitors in safety-critical systems demands more than predictive accuracy.… 15 arXiv — Machine Learning research 1mo ago The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench arXiv:2605.24782v1 Announce Type: new Abstract: While Vision Foundation Models (VFMs) excel at predictive tasks on satellite imagery, their performance can arise from visual correlations rather than underlying structural invariants, making even perception-based… 28 arXiv — NLP / Computation & Language research 1mo ago EchoDistill:Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs arXiv:2605.23954v1 Announce Type: new Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. 
Existing robustness methods primarily rely on waveform-level acoustic enhancement,… 36 arXiv — NLP / Computation & Language research 1mo ago AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue arXiv:2605.23974v1 Announce Type: new Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing… 36 arXiv — NLP / Computation & Language research 1mo ago Measuring the Depth of LLM Unlearning via Activation Patching arXiv:2605.24614v1 Announce Type: new Abstract: Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail… 17