Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 5d ago

Do Thinking Tokens Help with Safety?

arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and…

37
arXiv — Machine Learning research 5d ago

Speculative Decoding at Temperature Zero: A Scoped Safety-Invariance Screen with a 48,072-Sample Expansion

arXiv:2606.25097v1 Announce Type: new Abstract: Speculative decoding accelerates inference by letting a draft model propose tokens for a target model to verify, raising a concrete safety question: at temperature zero, can draft-side behavior leak into safety-scored outputs? We…

7
arXiv — NLP / Computation & Language research 5d ago

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

arXiv:2606.25182v1 Announce Type: new Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it…

5
arXiv — NLP / Computation & Language research 5d ago

A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

arXiv:2606.25380v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts. This survey synthesizes work on toxicity detection and detoxification for…

38
arXiv — NLP / Computation & Language research 5d ago

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often…

29
arXiv — NLP / Computation & Language research 5d ago

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and…

36
arXiv — NLP / Computation & Language research 5d ago

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1 Announce Type: new Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat…

23
arXiv — NLP / Computation & Language research 5d ago

MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction

arXiv:2606.25651v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare settings, accurate error detection and correction in generated or existing text becomes critical, as even minor mistakes can pose risks to patient safety.…

34
arXiv — NLP / Computation & Language research 5d ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM…

18
arXiv — NLP / Computation & Language research 5d ago

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

arXiv:2606.25821v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (MoE) architectures have emerged as an increasingly influential paradigm as they offer a strategic balance between parameter scalability and computational efficiency. However, low-resource languages, which…

21
arXiv — NLP / Computation & Language research 5d ago

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

arXiv:2606.26015v1 Announce Type: new Abstract: Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have…

10
arXiv — NLP / Computation & Language research 5d ago

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

arXiv:2606.25444v1 Announce Type: cross Abstract: Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM. Unlike encoders based on…

23
arXiv — NLP / Computation & Language research 5d ago

Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

arXiv:2606.25460v1 Announce Type: cross Abstract: Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced…

24
arXiv — NLP / Computation & Language research 5d ago

RAS: Measuring LLM Safety Through Refusal Alignment

arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is…

27
arXiv — NLP / Computation & Language research 5d ago

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet…

14
Hugging Face Daily Papers research 6d ago

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract: FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Sparse voxel representation…

34
arXiv — Machine Learning research 6d ago

Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications

arXiv:2606.23858v1 Announce Type: new Abstract: A primary challenge in AI safety is the existence of adversarial examples -- slightly distorted inputs that cause a neural network (NN) to misclassify. To mitigate this problem, recent research focuses on the computation of…

12
arXiv — Machine Learning research 6d ago

ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation

arXiv:2606.23898v1 Announce Type: new Abstract: Distilling conditional diffusion models aims to transfer the behavior of a large teacher to a smaller student while preserving alignment across conditioning inputs. Unlike recognition tasks, knowledge distillation in conditional…

14
arXiv — Machine Learning research 6d ago

Real vs. Complex Spectral Bases for Neural Operators: The Role of Green's Function Alignment

arXiv:2606.24851v1 Announce Type: new Abstract: Fourier Neural Operators (FNO) learn solution operators of partial differential equations by parameterizing global convolutions in the complex Fourier domain. For real-valued PDE solutions, the complex FFT carries representational…

20
arXiv — NLP / Computation & Language research 6d ago

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

arXiv:2606.23700v1 Announce Type: new Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of…

8
arXiv — Machine Learning research 6d ago

Verifiable Foundation Models for Robot Safety

arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable…

4
arXiv — Machine Learning research 6d ago

EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics

arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate…

19
arXiv — Machine Learning research 6d ago

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target…

29
arXiv — NLP / Computation & Language research 6d ago

One Year Later...The Harms Persist, But So Do We!

arXiv:2606.23884v1 Announce Type: new Abstract: General-purpose large language models (LLMs) are increasingly used for mental health-related conversations, yet safety safeguards remain inadequate and inconsistent across clinical conditions. This study evaluates six proprietary…

26
arXiv — NLP / Computation & Language research 6d ago

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

arXiv:2606.24004v1 Announce Type: new Abstract: Steering a large language model (LLM) toward a desired behavior typically relies on an iterative process of hand-crafting a prompt based on a careful inspection of the model's responses. This is an involved, brittle, and…

28
arXiv — NLP / Computation & Language research 6d ago

Selective Capability Unlearning in End-to-End Spoken Language Understanding

arXiv:2606.24063v1 Announce Type: new Abstract: Modern spoken language understanding (SLU) systems are increasingly deployed in real-world settings, where specific functionalities may need to be removed due to policy or safety constraints. In SLU, a functionality corresponds to…

23
arXiv — NLP / Computation & Language research 6d ago

Less is More: Quality-Aware Training Data Selection for Scientific Summarization

arXiv:2606.24828v1 Announce Type: new Abstract: Scientific long-document summarization datasets commonly treat author-written abstracts as gold reference summaries, although their quality and alignment with the source article vary. At the same time, publicly available scientific…

38
arXiv — NLP / Computation & Language research 6d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing…

17
arXiv — NLP / Computation & Language research 6d ago

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

arXiv:2606.24014v1 Announce Type: cross Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL),…

5
arXiv — NLP / Computation & Language research 6d ago

Progressive Alignment Objectives for Aligner-Encoder based ASR

arXiv:2606.24147v1 Announce Type: cross Abstract: Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without…

23
arXiv — NLP / Computation & Language research 6d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline…

25
r/LocalLLaMA community 6d ago

I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.

I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8…

10
OpenAI official-blog 6d ago

Helping build shared standards for advanced AI

OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation.

31
Hugging Face Daily Papers research 6d ago

SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Abstract: SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Computer-Use Agents (CUAs)…

24
Hugging Face Daily Papers research 7d ago

Deeper is Not Always Better: Mitigating the Alignment Tax via Confident Layer Decoding

Abstract: Autoregressive generation in large language models traditionally uses the final layer for token prediction, but a new decoding strategy dynamically selects more reliable intermediate layers based on entropy-guided search, improving reasoning performance with minimal…

34
Hugging Face Daily Papers research 7d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract: Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Massive unstructured…

19
Hugging Face Daily Papers research 7d ago

Safe Few-Step Generation via Velocity Editing

Abstract: VESFlow is a training-free safety method for flow matching-based text-to-image generation that edits velocity fields to ensure safe output while maintaining prompt integrity. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Flow matching has recently emerged as a strong…

16
Hugging Face Daily Papers research 7d ago

Exploring the Design Space of Reward Backpropagation for Flow Matching

Abstract: FlowBP addresses limitations in flow matching model alignment by using a surrogate trajectory framework that reduces memory usage and gradient chaining while maintaining performance across multiple text-to-image models. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

23
Latent.Space news-outlet 7d ago

Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan

OpenAI board member Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI”.

22
NVIDIA Developer Blog official-blog 7d ago

Inside NVIDIA Halos for Robotics: A Full-Stack Functional Safety System for Physical AI

Physical AI—robots working autonomously alongside people in factories, warehouses, hospitals, and homes—is arriving faster than most expected. Traditional...

12
r/LocalLLaMA community 8d ago

Qwen 3.6 27b Abliterated (apostate)

I've been working on a project called Apostate and have finally released my first large model with it on Hugging Face: Qwen 3.6 27B with safety alignment removed, bringing the refusal rate down from 92% to 7.6% with minimal impact on the model's capabilities (0.120 KL). Qwen 3.6 27B Apostate…

17
Don't Worry About the Vase community 10d ago

Claude Fable 5 and Mythos 5: Capabilities

Only three days after the release of Claude Fable 5, Anthropic was forced by the United States Government to make it unavailable, when a jailbreak was brought to its attention, rather than the previous situation of ‘yes obviously experts can jailbreak anything if they care…

32
Hugging Face Daily Papers research 10d ago

Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Abstract: Hybrid linear attention models can be improved through a novel initialization technique that enhances conversion from pretrained Transformers by leveraging teacher attention statistics and alignment steps. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Hybrid linear…

6
Hugging Face Daily Papers research 10d ago

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

Abstract: FlowBender is a closed-loop framework that addresses constraint satisfaction in diffusion and flow models by training networks to correct alignment errors using inference-time feedback, outperforming traditional supervised and guidance-based approaches across multiple…

11
arXiv — Machine Learning research 11d ago

When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting

arXiv:2606.19363v1 Announce Type: new Abstract: The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when…

32
arXiv — Machine Learning research 11d ago

Tracking Representation Dynamics in Large Language Models with Persistent Homology

arXiv:2606.19542v1 Announce Type: new Abstract: Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking…

38
arXiv — Machine Learning research 11d ago

On the QUEST for Uncertainty Quantification via Highest Density Regions

arXiv:2606.19569v1 Announce Type: new Abstract: Uncertainty quantification (UQ) is essential for reliable decision-making in safety-critical applications in probabilistic machine learning. For regression problems, dominant scalar UQ approaches - notably, those based on proper…

23
arXiv — Machine Learning research 11d ago

Shifting-based Optimizable Linear Relaxations for General Activation Functions

arXiv:2606.20292v1 Announce Type: new Abstract: The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of…

10
arXiv — NLP / Computation & Language research 11d ago

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

arXiv:2606.19346v1 Announce Type: new Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and…

6
arXiv — NLP / Computation & Language research 11d ago

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

arXiv:2606.19864v1 Announce Type: new Abstract: The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns…

12
