Safety + alignment 500 articles archived under #safety arXiv — NLP / Computation & Language research 5d ago Do Thinking Tokens Help with Safety? arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and… 37 arXiv — Machine Learning research 5d ago Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We… 7 arXiv — NLP / Computation & Language research 5d ago What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it… 5 arXiv — NLP / Computation & Language research 5d ago A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models arXiv:2606.25380v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. 
This survey synthesizes work on toxicity detection and detoxification for… 38 arXiv — NLP / Computation & Language research 5d ago PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often… 29 arXiv — NLP / Computation & Language research 5d ago A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and… 36 arXiv — NLP / Computation & Language research 5d ago How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat… 23 arXiv — NLP / Computation & Language research 5d ago MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction arXiv:2606.25651v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety.… 34 arXiv — NLP / Computation & Language research 5d ago Do Encoders Suffice? 
A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM… 18 arXiv — NLP / Computation & Language research 5d ago SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment arXiv:2606.25821v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which… 21 arXiv — NLP / Computation & Language research 5d ago The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar arXiv:2606.26015v1 Announce Type: new Abstract: Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have… 10 arXiv — NLP / Computation & Language research 5d ago Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs? arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. 
Unlike encoders based on… 23 arXiv — NLP / Computation & Language research 5d ago Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced… 24 arXiv — NLP / Computation & Language research 5d ago RAS: Measuring LLM Safety Through Refusal Alignment arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is… 27 arXiv — NLP / Computation & Language research 5d ago Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet… 14 Hugging Face Daily Papers research 6d ago FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation… 34 arXiv — Machine Learning research 6d ago Are Safety Guarantees in Neural Networks Safe? 
How to Compute Trustworthy Robustness Certifications arXiv:2606.23858v1 Announce Type: new Abstract: A primary challenge in AI safety is the existence of adversarial examples -- slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of… 12 arXiv — Machine Learning research 6d ago ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional… 14 arXiv — Machine Learning research 6d ago Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment arXiv:2606.24851v1 Announce Type: new Abstract: Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. 
For real-valued PDE solutions, the complex FFT carries representational… 20 arXiv — NLP / Computation & Language research 6d ago Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment arXiv:2606.23700v1 Announce Type: new Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of… 8 arXiv — Machine Learning research 6d ago Verifiable Foundation Models for Robot Safety arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable… 4 arXiv — Machine Learning research 6d ago EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate… 19 arXiv — Machine Learning research 6d ago ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target… 29 arXiv — NLP / Computation & Language research 6d ago One Year Later...The Harms Persist, But So Do We! 
arXiv:2606.23884v1 Announce Type: new Abstract: General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary… 26 arXiv — NLP / Computation & Language research 6d ago Towards Spec Learning: Inference-Time Alignment from Preference Pairs arXiv:2606.24004v1 Announce Type: new Abstract: Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model's responses. This is an involved, brittle, and… 28 arXiv — NLP / Computation & Language research 6d ago Selective Capability Unlearning in End-to-End Spoken Language Understanding arXiv:2606.24063v1 Announce Type: new Abstract: Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. In SLU, a functionality corresponds to… 23 arXiv — NLP / Computation & Language research 6d ago Less is More: Quality-Aware Training Data Selection for Scientific Summarization arXiv:2606.24828v1 Announce Type: new Abstract: Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific… 38 arXiv — NLP / Computation & Language research 6d ago Mind the Heads: Topological Representation Alignment for Multimodal LLMs arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. 
However, existing… 17 arXiv — NLP / Computation & Language research 6d ago Reinforcement Learning Towards Broadly and Persistently Beneficial Models arXiv:2606.24014v1 Announce Type: cross Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL),… 5 arXiv — NLP / Computation & Language research 6d ago Progressive Alignment Objectives for Aligner-Encoder based ASR arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without… 23 arXiv — NLP / Computation & Language research 6d ago AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline… 25 r/LocalLLaMA community 6d ago I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention. I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8… 10 OpenAI official-blog 6d ago Helping build shared standards for advanced AI OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation. 
31 Hugging Face Daily Papers research 6d ago SkillHarness: Harnessing Safe Skills for Computer-Use Agents Abstract SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-Use Agents (CUAs)… 24 Hugging Face Daily Papers research 7d ago Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding Abstract Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal… 34 Hugging Face Daily Papers research 7d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19 Hugging Face Daily Papers research 7d ago Safe Few-Step Generation via Velocity Editing Abstract VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Flow matching has recently emerged as a strong… 16 Hugging Face Daily Papers research 7d ago Exploring the Design Space of Reward Backpropagation for Flow Matching Abstract FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 23 Latent.Space news-outlet 7d ago Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan OpenAI board member Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI” 22 NVIDIA Developer Blog official-blog 7d ago Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI Physical AI—robots working autonomously alongside people in factories, warehouses, hospitals, and homes—is arriving faster than most expected. Traditional... 12 r/LocalLLaMA community 8d ago Qwen 3.6 27b Abliterated (apostate) I've been working on a project called Apostate and have finally released my first large model with it on Hugging Face. Qwen 3.6 27B with safety alignment removed down from 92% to 7.6% refusal rate with minimal impact on the model's capabilities (0.120 KL). Qwen 3.6 27B Apostate… 17 Don't Worry About the Vase community 10d ago Claude Fable 5 and Mythos 5: Capabilities Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care… 32 Hugging Face Daily Papers research 10d ago Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation Abstract Hybrid linear attention models can be improved through a novel initialization technique that enhances conversion from pretrained Transformers by leveraging teacher attention statistics and alignment steps. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Hybrid linear… 6 Hugging Face Daily Papers research 10d ago FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows Abstract FlowBender is a closed-loop framework that addresses constraint satisfaction in diffusion and flow models by training networks to correct alignment errors using inference-time feedback, outperforming traditional supervised and guidance-based approaches across multiple… 11 arXiv — Machine Learning research 11d ago When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting arXiv:2606.19363v1 Announce Type: new Abstract: The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when… 32 arXiv — Machine Learning research 11d ago Tracking Representation Dynamics in Large Language Models with Persistent Homology arXiv:2606.19542v1 Announce Type: new Abstract: Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking… 38 arXiv — Machine Learning research 11d ago On the QUEST for Uncertainty Quantification via Highest Density Regions arXiv:2606.19569v1 Announce Type: new Abstract: Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper… 23 arXiv — Machine Learning research 11d ago Shifting-based Optimizable Linear Relaxations for General Activation Functions arXiv:2606.20292v1 Announce Type: new Abstract: The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. 
To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of… 10 arXiv — NLP / Computation & Language research 11d ago Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer arXiv:2606.19346v1 Announce Type: new Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and… 6 arXiv — NLP / Computation & Language research 11d ago The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI arXiv:2606.19864v1 Announce Type: new Abstract: The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns… 12