Tag: Safety + alignment · 500 articles archived under #safety
arXiv — Machine Learning research 2h ago A Gravitational Interpretation of Fine-Tuning Reversion arXiv:2606.28525v1 Announce Type: new Abstract: Fine-tuning on harmless data can partially undo behaviors acquired earlier in training. Safety can erode under benign post-alignment updates, unlearned capabilities can re-emerge, latent traits can transfer through apparently… 27
arXiv — Machine Learning research 2h ago MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment arXiv:2606.29049v1 Announce Type: new Abstract: Knowledge Tracing (KT) is important for personalized education but traditionally suffers from two key limitations: a reliance on shallow ID-based representations that neglect semantic depth and a restriction to single-granularity… 37
arXiv — Machine Learning research 2h ago Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using… 27
arXiv — Machine Learning research 2h ago Beyond Trajectory Matching: Reflow with Marginal Distribution Alignment arXiv:2606.29287v1 Announce Type: new Abstract: Diffusion and continuous-flow generative models achieve high-quality generation, and their deterministic sampling can be formulated as solving learned ODE dynamics. However, accurate ODE discretization often requires many steps,… 36
arXiv — Machine Learning research 2h ago Do Models Read What They Write? Causal Registers in Scratchpad Reasoning arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior.
For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a… 29
arXiv — Machine Learning research 2h ago VISTA-DZ: Visual Semantic Trajectory Adaptation for Personalized Dilemma Zone Prediction arXiv:2606.29548v1 Announce Type: new Abstract: Driver decision making in the dilemma zone at signalized intersections is safety critical, as vehicles approaching a yellow signal must decide whether to stop or proceed within limited time and distance margins. Accurate prediction… 38
arXiv — NLP / Computation & Language research 2h ago DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation arXiv:2606.28725v1 Announce Type: new Abstract: Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often… 12
arXiv — NLP / Computation & Language research 2h ago The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning arXiv:2606.28843v1 Announce Type: new Abstract: Fine-tuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model's… 18
arXiv — NLP / Computation & Language research 2h ago A Hybrid Framework for Song Lyric Annotation Based on Human-LLM Alignment arXiv:2606.29273v1 Announce Type: new Abstract: Emotion recognition of song lyrics is a challenging task since lyrics may not necessarily align with the overall emotion of a song. As a result, lyrics annotation remains largely underexplored.
Drawing inspiration from research in… 34
arXiv — NLP / Computation & Language research 2h ago Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or… 31
arXiv — NLP / Computation & Language research 2h ago Timesteps of Mamba Align with Human Reading Times arXiv:2606.29904v1 Announce Type: new Abstract: This study demonstrates an alignment of per-word processing time in a popular state-space language model Mamba and human readers. In Mamba, the recurrent state transition at each layer conceptually takes some duration of time, the… 12
arXiv — NLP / Computation & Language research 2h ago Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization arXiv:2606.29933v1 Announce Type: new Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and… 16
arXiv — NLP / Computation & Language research 2h ago Node-to-Neighborhood Semantic Consistency: Text-Topology Alignment for TAGs Anomaly Detection arXiv:2606.30009v1 Announce Type: new Abstract: Graph anomaly detection (GAD) on text-attributed graphs (TAGs) is vital for applications such as fraud detection and academic integrity verification. Existing approaches generally fall into two paradigms.
GNN-based methods… 36
arXiv — Machine Learning research 1d ago RS-Diffuser: Risk-Sensitive Diffusion Planning with Distributional Value Guidance arXiv:2606.27766v1 Announce Type: new Abstract: Offline reinforcement learning enables policy learning from fixed datasets without additional environment interaction, making it appealing for safety-critical applications where online exploration is costly or unsafe.… 32
arXiv — Machine Learning research 1d ago NormGuard: Reward-Preserving Norm Constraints in Flow-Matching Reinforcement Learning arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of… 8
arXiv — Machine Learning research 1d ago OperatorSHAP: Fast and Accurate Shapley Value Estimation for Neural Operators arXiv:2606.28065v1 Announce Type: new Abstract: Understanding model predictions is essential for physical applications, where outputs often inform safety-critical decisions, such as structural load assessment, weather warnings, and clinical diagnosis. Shapley values satisfy many… 20
arXiv — Machine Learning research 1d ago Democratic ICAI: Debating Our Way to Steering Principles from Preferences arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the… 38
arXiv — NLP / Computation & Language research 1d ago Position: The Term "Machine Unlearning" Is Overused in LLMs arXiv:2606.27379v1 Announce Type: new Abstract: Large language models increasingly face demands to "forget" training data, knowledge, or behaviors due to regulatory deletion obligations, copyright/licensing disputes, and safety or product-policy requirements.
This position paper… 15
arXiv — Machine Learning research 1d ago Physics-Guided Robotic Radiation Source Localization along Arbitrary Measurement Paths in Unstructured Environments arXiv:2606.27624v1 Announce Type: cross Abstract: Using robots to estimate the location of the radiation source is an effective way to improve efficiency and safety. Existing methods focus on planning the robot's path to achieve precise estimation, typically approaching the… 19
arXiv — NLP / Computation & Language research 1d ago Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety arXiv:2606.27632v1 Announce Type: new Abstract: As large language models are increasingly deployed in real-world systems, safety failures can still lead to harmful outputs and dangerous misuse. We argue that the essence of safety is adversarial: many failures arise not from… 29
arXiv — NLP / Computation & Language research 1d ago Enhancing Numerical Prediction in LLMs via Smooth MMD Alignment arXiv:2606.27731v1 Announce Type: new Abstract: Despite their strong general capabilities, large language models (LLMs) often remain unreliable when outputs must be numerically precise. A key reason is the training objective: standard cross-entropy treats numeric tokens as… 31
arXiv — NLP / Computation & Language research 1d ago Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn… 20
arXiv — NLP / Computation & Language research 1d ago SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications.
This broad deployment expands the safety surface: risks can arise from multimodal question answering,… 31
r/LocalLLaMA community 2d ago [NEW MODEL] - SupraSafety-18M · Tiny Content-Moderation Model Hey r/LocalLLaMA ! SupraLabs is back with a new model: SupraSafety-18M . It's a BERT-style 18M params model trained from scratch on 2 T4 GPUs in Kaggle on the nvidia/Nemotron-3.5-Content-Safety-Dataset dataset for 7 epochs. It's built to run on edge devices , mobile phones , or… 13
Hugging Face Daily Papers research 3d ago LISA: Likelihood Score Alignment for Visual-condition Controllable Generation Abstract Score-based generative modeling reveals that side networks contribute likelihood scores to conditional control, leading to improved training efficiency through likelihood score alignment regularization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The prevalent… 36
OpenAI official-blog 3d ago Previewing GPT-5.6 Sol: a next-generation model OpenAI previews GPT-5.6 Sol, a next-generation model with stronger capabilities in coding, science, and cybersecurity, paired with its most advanced safety stack. 10
Smol AI News news-outlet 4d ago not much happened today **OpenAI** previewed **GPT-5.6** with three variants: **Sol** (flagship), **Terra** (mid-tier), and **Luna** (lower-cost), launching under a restricted rollout mandated by the U.S. government, limiting access to trusted partners. **Sol** boasts enhanced cybersecurity and safety… 35
arXiv — Machine Learning research 4d ago Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions.
A widespread assumption is that setting the grader's… 4
arXiv — Machine Learning research 4d ago Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI arXiv:2606.26406v1 Announce Type: new Abstract: We propose a complete architectural blueprint for safe artificial general intelligence based on a closed reentry loop (D I cycle). In contrast to feedforward networks, which are directed acyclic graphs (C=0, S=0) incapable of… 37
arXiv — NLP / Computation & Language research 4d ago AIGP: An LLM-Based Framework for Long-Term Value Alignment in E-Commerce Pricing arXiv:2606.26787v1 Announce Type: cross Abstract: Traditional dynamic pricing models in large-scale e-commerce suffer from limited interpretability, poor utilization of unstructured information, and misalignment with long-term business objectives such as cumulative Gross… 26
arXiv — Machine Learning research 4d ago RecallRisk-BERT: A Multi-Task Framework for Post-Report Medical Device Recall Triage arXiv:2606.27174v1 Announce Type: new Abstract: Medical device recalls are a critical regulatory mechanism for protecting patient safety.
The growing volume of FDA recall records presents challenges in post-report recall triage, severity assessment, and root-cause… 24
arXiv — NLP / Computation & Language research 4d ago The Geometry of Updates: Fisher Alignment at Vocabulary Scale arXiv:2606.27242v1 Announce Type: cross Abstract: Training-free source selection for LLM families with shared vocabularies arises in scientific string domains such as SMILES, protein, and genomic sequences, where candidate corpora share a tokenizer but differ in prediction… 38
arXiv — NLP / Computation & Language research 4d ago Reducing Conversational Escalation in Large Language Model Dialogue with Nonviolent Communication Constraints arXiv:2606.26106v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in emotionally charged situations involving interpersonal conflict, frustration, and distress. While prior safety research has focused on preventing explicit harms such as toxic or… 26
arXiv — NLP / Computation & Language research 4d ago Soft Token Alignment for Cross-Lingual Reasoning arXiv:2606.26466v1 Announce Type: new Abstract: Multilingual large language models often produce inconsistent reasoning and answers for semantically equivalent prompts in different languages. Prior work suggests that intermediate representations can be relatively… 5
arXiv — NLP / Computation & Language research 4d ago The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report arXiv:2606.26529v1 Announce Type: new Abstract: AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified.
We show that conditioning a language or vision model on a narrow task suppresses its… 14
arXiv — NLP / Computation & Language research 4d ago GAVEL: Grounded Caption Error Verification and Localization arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy… 24
arXiv — NLP / Computation & Language research 4d ago RedVox: Safety and Fairness Gaps in Speech Models Across Languages arXiv:2606.26968v1 Announce Type: new Abstract: Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting… 35
arXiv — NLP / Computation & Language research 4d ago MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment arXiv:2606.27019v1 Announce Type: new Abstract: The Unigram tokenizer uses an elegant representation which makes it straightforward to edit vocabularies, but its training is comparatively heavy and complex. We introduce MinGram (Minimalist Unigram), which keeps the token-list… 7
arXiv — NLP / Computation & Language research 4d ago Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes arXiv:2606.27210v1 Announce Type: new Abstract: We argue that safety classifiers should model user intent as an explicit signal between the prompt and the final label.
To study this, we introduce AIMS, a human-annotated dataset of 1,724 difficult safety prompts, each paired with… 17
arXiv — NLP / Computation & Language research 4d ago Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to… 19
arXiv — NLP / Computation & Language research 4d ago Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against… 18
arXiv — NLP / Computation & Language research 4d ago Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation arXiv:2606.26686v1 Announce Type: cross Abstract: In order to screen a prompt or a response, the recent guardrail methods generate a chain-of-thought (CoT) before they issue a verdict. This design follows a common belief that step-by-step reasoning improves a decision. However,… 17
arXiv — NLP / Computation & Language research 4d ago Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries arXiv:2606.26936v1 Announce Type: cross Abstract: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that non-expert malicious actors ("the average Jane") could elicit actionable responses to malicious requests.
In this work, we examine whether this… 36
TechCrunch — AI news-outlet 4d ago The White House is asking OpenAI to slow roll the release of its new model over safety concerns OpenAI reportedly plans to share its newest model, GPT 5.6, with a select group of partners instead of to the broader public. The reason: the Trump administration told it to. 14
Hugging Face Daily Papers research 4d ago Do Thinking Tokens Help with Safety? Abstract Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation… 25
Hugging Face Daily Papers research 4d ago PrivacyAlign: Contextual Privacy Alignment for LLM Agents Abstract Researchers develop a human-centered approach to align AI agents with privacy norms by creating a comprehensive dataset of privacy judgments and using annotation-conditioned reward modeling to improve agent behavior. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI… 7
Hugging Face Daily Papers research 4d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics Abstract Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Jailbreak attacks reveal… 23
Hugging Face Daily Papers research 5d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents Abstract LLM agents frequently select higher-privilege tools unnecessarily, and while safety alignment doesn't ensure least-privilege choices, a post-training defense can reduce excessive privilege use without sacrificing performance.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 26
arXiv — NLP / Computation & Language research 5d ago Digital Twin-Driven Adaptive Sim-to-Real Alignment via Reinforcement Learning for Vibration-Based Bearing Health Monitoring Under Data Scarcity arXiv:2606.24954v1 Announce Type: cross Abstract: Vibration-based health monitoring of rotating machinery requires reliable fault diagnosis under operational data constraints, yet condition assessment remains challenged by structural scarcity of fault events and heterogeneous… 30
arXiv — Machine Learning research 5d ago Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL arXiv:2606.25012v1 Announce Type: new Abstract: Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization,… 27