Tag: Safety + alignment
500 articles archived under #safety

arXiv — NLP / Computation & Language research 11d ago Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families arXiv:2606.20225v1 Announce Type: new Abstract: Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared… 31

arXiv — NLP / Computation & Language research 11d ago Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users arXiv:2606.20482v1 Announce Type: new Abstract: To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations.… 7

arXiv — NLP / Computation & Language research 11d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant.
However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving… 17

arXiv — NLP / Computation & Language research 11d ago Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact arXiv:2606.20205v1 Announce Type: cross Abstract: Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in… 34

arXiv — NLP / Computation & Language research 11d ago Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology arXiv:2512.03818v2 Announce Type: replace Abstract: Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording… 33

arXiv — NLP / Computation & Language research 11d ago Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual… 7

r/MachineLearning community 11d ago Fearless Concurrency on the GPU: Safe GPU inference in Rust, competitive with vLLM/SGLang [R] I maintain cuTile Rust and just posted the paper "Fearless Concurrency on the GPU." As more GPU code gets AI-generated, the bottleneck moves from writing it to trusting it.
cuTile Rust lets you write or generate GPU kernels whose memory safety and data-race freedom are verified… 29

Hugging Face Daily Papers research 11d ago The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL Abstract Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using a pretrained representation space discriminator as an optimal reward signal, improving both visual fidelity and semantic quality without human… 4

r/MachineLearning community 11d ago HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D] TL;DR for ML Specialists: The Core: An empirical study on how long, semantically dense, completely benign text (with zero triggers, instructions, or jailbreak prompts) drives an implicit shift in the model's latent space trajectories. The Effect: Dilution of the initial system… 24

Hugging Face Daily Papers research 11d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems… 28

arXiv — Machine Learning research 12d ago TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning arXiv:2606.18308v1 Announce Type: new Abstract: Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics.
We show that these… 35

arXiv — Machine Learning research 12d ago Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment arXiv:2606.18703v1 Announce Type: new Abstract: Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet… 17

arXiv — Machine Learning research 12d ago Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target… 14

arXiv — NLP / Computation & Language research 12d ago Montreal Forced Aligner and the state of speech-to-text alignment in 2026 arXiv:2606.18466v1 Announce Type: new Abstract: The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded… 5

arXiv — NLP / Computation & Language research 12d ago Steerable Cultural Preference Optimization of Reward Models arXiv:2606.18606v1 Announce Type: new Abstract: It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on… 16

arXiv — NLP / Computation & Language research 12d ago The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs arXiv:2606.18656v1 Announce Type: new Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only.
Our findings should not be interpreted as an argument against alignment. Instead, this paper… 22

arXiv — NLP / Computation & Language research 12d ago Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series… 10

arXiv — NLP / Computation & Language research 12d ago G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is… 6

arXiv — NLP / Computation & Language research 12d ago RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering arXiv:2606.19218v1 Announce Type: new Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one… 37

arXiv — NLP / Computation & Language research 12d ago Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing arXiv:2510.04120v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing.
We present a diagnostic analysis… 27

Stratechery (Ben Thompson) community 12d ago The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor The administration is very likely wrong about Fable, but that is ultimately Anthropic's responsibility. 20

arXiv — Machine Learning research 13d ago Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations arXiv:2606.17414v1 Announce Type: new Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a… 10

arXiv — Machine Learning research 13d ago MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization arXiv:2606.17526v1 Announce Type: new Abstract: Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still… 35

arXiv — Machine Learning research 13d ago AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor arXiv:2606.17872v1 Announce Type: new Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment.
Since… 27

arXiv — Machine Learning research 13d ago NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment arXiv:2606.18066v1 Announce Type: new Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per… 31

arXiv — NLP / Computation & Language research 13d ago Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing arXiv:2606.17478v1 Announce Type: new Abstract: As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation… 23

arXiv — NLP / Computation & Language research 13d ago The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation.
Using… 24

arXiv — NLP / Computation & Language research 13d ago A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models arXiv:2606.18193v1 Announce Type: cross Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a… 6

arXiv — NLP / Computation & Language research 13d ago ALAS: An Automatic Latent Alignment Score for Audio Language Models arXiv:2505.19937v3 Announce Type: replace Abstract: Large Language Models (LLMs) are extended into Speech-LLMs, and the quality of the audio--text alignment they learn affects most downstream Spoken Language Understanding (SLU) behavior. Yet despite a growth of fusion… 17

arXiv — NLP / Computation & Language research 13d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38

Hacker News — AI on Front Page community 13d ago Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak Article URL: https://www.theregister.com/security/2026/06/15/feds-freaked-over-fable-5-after-simple-fix-this-code-prompt-not-jailbreak-says-researcher/5255827 Comments URL: https://news.ycombinator.com/item?id=48552687 Points: 230 # Comments: 131 36

r/LocalLLaMA community 14d ago Diffusion Gemma Jailbreak I was told my Gemma 4 jailbreak also works with Diffusion Gemma, so I'm reposting here for kicks. Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish.
Add or remove from the list of allowed content as needed.… 36

Simon Willison community 14d ago The Fable 5 Export Controls Harm US Cyber Defense I quoted The Atlantic quoting Katie Moussouris earlier, when I should have gone straight to the source. Here she is confirming that the "jailbreak" that got Claude Fable 5 banned under an export control really was "fix this code":… 9

arXiv — Machine Learning research 14d ago Size Doesn't Matter: Cosine-Scored Sparse Autoencoders arXiv:2606.15054v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously,… 13

arXiv — Machine Learning research 14d ago False Sense of Safety in Selective Signal Classification: Auditing Bound Tightness and Exchangeability for Risk Control arXiv:2606.15153v1 Announce Type: new Abstract: Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate of accepted inputs stays below a user budget alpha. We audit this promise on signal-domain… 32

arXiv — Machine Learning research 14d ago EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data… 9

arXiv — Machine Learning research 14d ago DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories.
Yet reliable inference-time safety enforcement remains a key barrier to their deployment… 26

arXiv — Machine Learning research 14d ago Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance arXiv:2606.15531v1 Announce Type: new Abstract: Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment… 36

arXiv — Machine Learning research 14d ago Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning arXiv:2606.15767v1 Announce Type: new Abstract: Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model… 19

arXiv — NLP / Computation & Language research 14d ago CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale… 21

arXiv — NLP / Computation & Language research 14d ago CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment arXiv:2606.15396v1 Announce Type: new Abstract: Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns.
While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to… 14

arXiv — NLP / Computation & Language research 14d ago ESBMC-PLC: Formal Verification of IEC 61131-3 Ladder Diagram Programs Using SMT-Based Model Checking arXiv:2606.15461v1 Announce Type: new Abstract: PLCs execute safety-critical programs across industrial sectors. The dominant PLC notation, ladder diagram (LD) per IEC 61131-3, remains absent from formal verification: SMT-based model checkers cannot process LD's rung-and-coil… 31

arXiv — NLP / Computation & Language research 14d ago SHARD: Safe and Helpful Alignment via Self-Reframing Distillation arXiv:2606.15517v1 Announce Type: new Abstract: Large language models often struggle with sensitive prompts. They may refuse outright, provide generic safety boilerplate, or fail to address the user's legitimate informational needs that can be answered safely. We introduce… 16

arXiv — NLP / Computation & Language research 14d ago Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning arXiv:2606.15733v1 Announce Type: new Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are… 21

arXiv — NLP / Computation & Language research 14d ago ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment arXiv:2606.15783v1 Announce Type: new Abstract: We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning.
Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across… 8

arXiv — NLP / Computation & Language research 14d ago Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking… 10

arXiv — NLP / Computation & Language research 14d ago AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We… 33

Simon Willison community 14d ago Quoting Matteo Wong, The Atlantic Katie Moussouris, a cybersecurity expert and the CEO of Luta Security, told me that Anthropic shared with her a copy of the White House’s report on the Fable jailbreak to get her appraisal. (She said that she is not being paid by Anthropic.) The report, Moussouris said, involved… 21

Hugging Face Daily Papers research 14d ago TuneJury: An Open Metric for Improving Music Generation Preference Alignment Abstract A novel open-source pairwise reward model for text-to-music generation that provides calibrated preference scoring and generalizes across multiple downstream applications through a frozen reward mechanism. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce… 5

OpenAI official-blog 14d ago Predicting model behavior before release by simulating deployment OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.
27