Safety + alignment
500 articles archived under #safety

Hugging Face Daily Papers research 25d ago Large Language Models Hack Rewards, and Society Abstract Large language models trained with reinforcement learning can exploit ambiguities in societal regulations to discover loopholes that bypass regulatory intent, posing safety risks for real-world deployment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reinforcement… 18

Hugging Face Daily Papers research 25d ago Neural Networks Provably Learn Spectral Representations for Group Composition Abstract Neural network training on group composition tasks exhibits convergence to irreducible representations and rotational rank-one alignment through Riemannian gradient ascent on representation-theoretic energy functionals. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 32

Hugging Face official-blog 25d ago Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI Enterprise + Article Published June 4, 2026 Varun Singh varunsingh nvidia Isabel Hulseman ihulseman0220 nvidia Anuj Doshi andoshi nvidia Shyamala Prayaga sprayaga25… 6

Hugging Face Daily Papers research 25d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game Abstract Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations.
Generated by… 7

arXiv — Machine Learning research 26d ago RUBAS: Rubric-Based Reinforcement Learning for Agent Safety arXiv:2606.04051v1 Announce Type: new Abstract: The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or… 21

arXiv — Machine Learning research 26d ago When Autoregressive Consistency Hurts Safety Alignment arXiv:2606.04168v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is fragile in part because it is often shallow: fine-tuning mainly reshapes the model's behavior near the first few output tokens. We argue that this phenomenon can be understood… 21

arXiv — Machine Learning research 26d ago KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not… 8

arXiv — Machine Learning research 26d ago Latent Anchor-Driven Test Generation for Deep Neural Networks arXiv:2606.04310v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous testing essential to identify and mitigate model weaknesses. Existing DNN testing approaches… 6

arXiv — Machine Learning research 26d ago Testing Neural Networks via Bayesian-Guided Exploration of Decision Landscapes arXiv:2606.04314v1 Announce Type: new Abstract: As neural networks are increasingly deployed in safety-critical domains, testing is essential to evaluate and improve their reliability.
Existing testing methods, whether black-box or white-box, primarily use global mutation or… 18

arXiv — Machine Learning research 26d ago Explainably Safe Reinforcement Learning arXiv:2606.04634v1 Announce Type: new Abstract: Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly… 25

arXiv — Machine Learning research 26d ago Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms arXiv:2606.04767v1 Announce Type: new Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric… 15

arXiv — NLP / Computation & Language research 26d ago Expert-Aware Refusal Steering arXiv:2606.04160v1 Announce Type: new Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense… 22

arXiv — NLP / Computation & Language research 26d ago Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA arXiv:2606.04262v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication.
Yet this common safety-relevant setting remains… 5

arXiv — NLP / Computation & Language research 26d ago Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs arXiv:2606.04450v1 Announce Type: new Abstract: Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across… 29

arXiv — NLP / Computation & Language research 26d ago Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing… 18

arXiv — NLP / Computation & Language research 26d ago Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas arXiv:2606.04846v1 Announce Type: new Abstract: As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability… 36

arXiv — NLP / Computation & Language research 26d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game arXiv:2606.04978v1 Announce Type: new Abstract: LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St.
Petersburg game as a… 26

Hugging Face Daily Papers research 26d ago BraveGuard: From Open-World Threats to Safer Computer-Use Agents Abstract BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents extend language… 30

OpenAI official-blog 27d ago OpenAI public policy agenda OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure AI benefits society. 10

OpenAI official-blog 27d ago A blueprint for democratic governance of frontier AI OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security. 11

arXiv — Machine Learning research 27d ago Assessing Region-Level EEG Contributions to Cognitive Workload Prediction arXiv:2606.02598v1 Announce Type: new Abstract: Accurate and generalizable estimation of cognitive workload from electroencephalography (EEG) is critical for human-centered and safety-critical systems. Although EEG is widely used for workload assessment, the consistency of… 29

arXiv — Machine Learning research 27d ago Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis arXiv:2606.02671v1 Announce Type: new Abstract: Machine learning predictors have become essential tools for guiding automated decision making.
However, a major misalignment persists: predictive models are typically optimized in terms of standard statistical metrics in isolation… 19

arXiv — Machine Learning research 27d ago Gate AI: LLM Security Benchmark Evaluation Methodology and Results arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation… 27

arXiv — Machine Learning research 27d ago Libra: Efficient Resource Management for Agentic RL Post-Training arXiv:2606.03077v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout… 23

arXiv — Machine Learning research 27d ago HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models arXiv:2606.03131v1 Announce Type: new Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench containing 13 reward-hacking patterns covering real… 15

arXiv — NLP / Computation & Language research 27d ago Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization arXiv:2606.03022v1 Announce Type: new Abstract: Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical constraints -- remains a persistent challenge for reliable deployment.
In this work, we address… 14

arXiv — NLP / Computation & Language research 27d ago The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment arXiv:2606.03043v1 Announce Type: new Abstract: LMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard… 18

arXiv — NLP / Computation & Language research 27d ago Coherence Maximization Improves Pluralistic Alignment arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these… 16

arXiv — NLP / Computation & Language research 27d ago Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models arXiv:2606.03165v1 Announce Type: new Abstract: The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking… 30

arXiv — NLP / Computation & Language research 27d ago Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability arXiv:2606.03648v1 Announce Type: new Abstract: Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety.
Previous works examined the effects of fine-tuning on model safety in limited and… 32

arXiv — NLP / Computation & Language research 27d ago Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings arXiv:2606.03695v1 Announce Type: new Abstract: As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the… 19

arXiv — NLP / Computation & Language research 27d ago Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric… 19

arXiv — NLP / Computation & Language research 27d ago Consistency Training Can Entrench Misalignment arXiv:2606.03810v1 Announce Type: new Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly… 31

arXiv — NLP / Computation & Language research 27d ago AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task arXiv:2606.03967v1 Announce Type: new Abstract: We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese.
The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated… 16

arXiv — NLP / Computation & Language research 27d ago Quantifying Faithful Confidence Expression in Large Reasoning Models arXiv:2606.03969v1 Announce Type: new Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This… 35

Hugging Face Daily Papers research 27d ago Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures Abstract Deep learning approach for co-speech gesture retrieval that uses semantic motion anchors to improve alignment between spoken text and gesture representations, enhancing both retrieval accuracy and semantic relevance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Learning… 17

Hugging Face Daily Papers research 27d ago TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation Abstract A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research Agents have shown strong… 4

Hugging Face Daily Papers research 27d ago Review Arcade: On the Human Alignment and Gameability of LLM Reviews Abstract Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM… 25

r/MachineLearning community 27d ago Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R] Third in a series of papers tracking learning rules vs. human fMRI (THINGS dataset, V1–IT, N=3 subjects).
Previous finding: untrained CNNs match backprop at V1. This paper asks: when does training break that, and does the learning rule matter? Setup: RSA alignment measured at 8… 30

Hugging Face Daily Papers research 28d ago Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems Abstract Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. AI-generated summary Physical AI systems increasingly map multimodal… 12

OpenAI official-blog 28d ago Advancing youth safety and opportunity through global leadership OpenAI calls for global action on youth AI safety through a dedicated AI Safety Institute 4

arXiv — Machine Learning research 28d ago ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use arXiv:2606.00341v1 Announce Type: new Abstract: As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work… 12

arXiv — Machine Learning research 28d ago Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning arXiv:2606.00400v1 Announce Type: new Abstract: Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior.
Replay is the standard mitigation, but fixed… 11

arXiv — Machine Learning research 28d ago MESA: Improving MoE Safety Alignment via Decentralized Expertise arXiv:2606.00651v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical… 36

arXiv — Machine Learning research 28d ago Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing arXiv:2606.00686v1 Announce Type: new Abstract: The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach… 7

arXiv — NLP / Computation & Language research 28d ago A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a… 37

arXiv — NLP / Computation & Language research 28d ago RealityTest: How People Probe AI Identity and Whether Models Disclose It arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI.
Despite mounting regulatory attention to this known safety risk, existing evaluations of… 24

arXiv — NLP / Computation & Language research 28d ago Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models arXiv:2606.00284v1 Announce Type: new Abstract: While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around… 23

arXiv — NLP / Computation & Language research 28d ago Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning arXiv:2606.00334v1 Announce Type: new Abstract: Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are… 21

arXiv — NLP / Computation & Language research 28d ago Lost in Delusion: Examining LLM Safety Under User Delusions and Distress arXiv:2606.00975v1 Announce Type: new Abstract: LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates… 8