Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 11d ago

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

arXiv:2606.20225v1 Announce Type: new Abstract: Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared…

31
arXiv — NLP / Computation & Language research 11d ago

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations.…

7
arXiv — NLP / Computation & Language research 11d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving…

17
arXiv — NLP / Computation & Language research 11d ago

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

arXiv:2606.20205v1 Announce Type: cross Abstract: Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in…

34
arXiv — NLP / Computation & Language research 11d ago

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

arXiv:2512.03818v2 Announce Type: replace Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording…

33
arXiv — NLP / Computation & Language research 11d ago

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual…

7
r/MachineLearning community 11d ago

Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R]

I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it. cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified…

29
Hugging Face Daily Papers research 11d ago

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Abstract Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using a pretrained representation space discriminator as an optimal reward signal, improving both visual fidelity and semantic quality without human…

4
r/MachineLearning community 11d ago

HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D]

TL;DR for ML Specialists: The Core: An empirical study on how long, semantically dense, completely benign text (with zero triggers, instructions, or jailbreak prompts) drives an implicit shift in the model's latent space trajectories. The Effect: Dilution of the initial system…

24
Hugging Face Daily Papers research 11d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Abstract: Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multicultural multi-agent systems…

28
arXiv — Machine Learning research 12d ago

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

arXiv:2606.18308v1 Announce Type: new Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these…

35
arXiv — Machine Learning research 12d ago

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

arXiv:2606.18703v1 Announce Type: new Abstract: Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet…

17
arXiv — Machine Learning research 12d ago

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target…

14
arXiv — NLP / Computation & Language research 12d ago

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded…

5
arXiv — NLP / Computation & Language research 12d ago

Steerable Cultural Preference Optimization of Reward Models

arXiv:2606.18606v1 Announce Type: new Abstract: It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on…

16
arXiv — NLP / Computation & Language research 12d ago

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper…

22
arXiv — NLP / Computation & Language research 12d ago

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series…

10
arXiv — NLP / Computation & Language research 12d ago

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is…

6
arXiv — NLP / Computation & Language research 12d ago

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

arXiv:2606.19218v1 Announce Type: new Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one…

37
arXiv — NLP / Computation & Language research 12d ago

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

arXiv:2510.04120v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis…

27
Stratechery (Ben Thompson) community 12d ago

The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor

The administration is very likely wrong about Fable, but that is ultimately Anthropic's responsibility.

20
arXiv — Machine Learning research 13d ago

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

arXiv:2606.17414v1 Announce Type: new Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a…

10
arXiv — Machine Learning research 13d ago

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

arXiv:2606.17526v1 Announce Type: new Abstract: Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still…

35
arXiv — Machine Learning research 13d ago

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

arXiv:2606.17872v1 Announce Type: new Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since…

27
arXiv — Machine Learning research 13d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

arXiv:2606.18066v1 Announce Type: new Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per…

31
arXiv — NLP / Computation & Language research 13d ago

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

arXiv:2606.17478v1 Announce Type: new Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation…

23
arXiv — NLP / Computation & Language research 13d ago

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using…

24
arXiv — NLP / Computation & Language research 13d ago

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

arXiv:2606.18193v1 Announce Type: cross Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a…

6
arXiv — NLP / Computation & Language research 13d ago

ALAS: An Automatic Latent Alignment Score for Audio Language Models

arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion…

17
arXiv — NLP / Computation & Language research 13d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
Hacker News — AI on Front Page community 13d ago

Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak

Article URL: https://www.theregister.com/security/2026/06/15/feds-freaked-over-fable-5-after-simple-fix-this-code-prompt-not-jailbreak-says-researcher/5255827 Comments URL: https://news.ycombinator.com/item?id=48552687 Points: 230 # Comments: 131

36
r/LocalLLaMA community 14d ago

Diffusion Gemma Jailbreak

I was told my Gemma 4 jailbreak also works with Diffusion Gemma, so I'm reposting here for kicks. Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.…

36
Simon Willison community 14d ago

The Fable 5 Export Controls Harm US Cyber Defense

I quoted The Atlantic quoting Katie Moussouris earlier, when I should have gone straight to the source. Here she is confirming that the "jailbreak" that got Claude Fable 5 banned under an export control really was "fix this code":…

9
arXiv — Machine Learning research 14d ago

Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

arXiv:2606.15054v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously,…

13
arXiv — Machine Learning research 14d ago

False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control

arXiv:2606.15153v1 Announce Type: new Abstract: Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate of accepted inputs stays below a user budget alpha. We audit this promise on signal-domain…

32
arXiv — Machine Learning research 14d ago

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data…

9
arXiv — Machine Learning research 14d ago

DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising

arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment…

26
arXiv — Machine Learning research 14d ago

Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

arXiv:2606.15531v1 Announce Type: new Abstract: Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment…

36
arXiv — Machine Learning research 14d ago

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

arXiv:2606.15767v1 Announce Type: new Abstract: Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model…

19
arXiv — NLP / Computation & Language research 14d ago

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale…

21
arXiv — NLP / Computation & Language research 14d ago

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

arXiv:2606.15396v1 Announce Type: new Abstract: Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to…

14
arXiv — NLP / Computation & Language research 14d ago

ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking

arXiv:2606.15461v1 Announce Type: new Abstract: PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil…

31
arXiv — NLP / Computation & Language research 14d ago

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

arXiv:2606.15517v1 Announce Type: new Abstract: Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce…

16
arXiv — NLP / Computation & Language research 14d ago

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are…

21
arXiv — NLP / Computation & Language research 14d ago

ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment

arXiv:2606.15783v1 Announce Type: new Abstract: We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across…

8
arXiv — NLP / Computation & Language research 14d ago

Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking…

10
arXiv — NLP / Computation & Language research 14d ago

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We…

33
Simon Willison community 14d ago

Quoting Matteo Wong, The Atlantic

Katie Moussouris, a cybersecurity expert and the CEO of Luta Security, told me that Anthropic shared with her a copy of the White House’s report on the Fable jailbreak to get her appraisal. (She said that she is not being paid by Anthropic.) The report, Moussouris said, involved…

21
Hugging Face Daily Papers research 14d ago

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

Abstract: A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We introduce…

5
OpenAI official-blog 14d ago

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

27
