News / #safety Tag

Safety + alignment · 41 articles archived under #safety

NVIDIA Developer Blog · official-blog · 7h ago
Google DeepMind paper: reinforcement learning at scale
New work demonstrates RL fine-tuning at unprecedented scale, with concrete benchmarks on reasoning tasks.

arXiv — Machine Learning · research · 15h ago
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
arXiv:2605.10983v1 · Announce Type: new
Abstract: Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most such methods still suffer from significant reward hacking, which degrades generative diversity and quality by…

arXiv — Machine Learning · research · 15h ago
The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
arXiv:2605.11205v1 · Announce Type: new
Abstract: Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation…

arXiv — Machine Learning · research · 15h ago
Leveraging RAG for Training-Free Alignment of LLMs
arXiv:2605.11217v1 · Announce Type: new
Abstract: Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that…

arXiv — Machine Learning · research · 15h ago
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
arXiv:2605.11235v1 · Announce Type: new
Abstract: In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the…

arXiv — Machine Learning · research · 15h ago
Gradient-Free Noise Optimization for Reward Alignment in Generative Models
arXiv:2605.11347v1 · Announce Type: new
Abstract: Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but…

arXiv — Machine Learning · research · 15h ago
The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives
arXiv:2605.11361v1 · Announce Type: new
Abstract: Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this…

arXiv — NLP / Computation & Language · research · 15h ago
StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
arXiv:2605.11483v1 · Announce Type: new
Abstract: While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on…
arXiv — NLP / Computation & Language · research · 15h ago
Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
arXiv:2605.11632v1 · Announce Type: new
Abstract: Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling…

arXiv — NLP / Computation & Language · research · 15h ago
Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
arXiv:2605.11685v1 · Announce Type: new
Abstract: Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical…

arXiv — NLP / Computation & Language · research · 15h ago
Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
arXiv:2605.11769v1 · Announce Type: new
Abstract: Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their…

arXiv — NLP / Computation & Language · research · 15h ago
Metaphor Is Not All Attention Needs
arXiv:2605.12128v1 · Announce Type: new
Abstract: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies,…

arXiv — NLP / Computation & Language · research · 15h ago
AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment
arXiv:2605.11398v1 · Announce Type: cross
Abstract: We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health…

arXiv — Machine Learning · research · 1d ago
The Safety-Aware Denoiser for Text Diffusion Models
arXiv:2605.08116v1 · Announce Type: new
Abstract: Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety…

OpenAI · news · 6d ago
Introducing Trusted Contact in ChatGPT
Introducing Trusted Contact in ChatGPT, an optional safety feature that notifies someone you trust if serious self-harm concerns are detected.

Ars Technica — AI · news-outlet · 6d ago
Spooked by Mythos, Trump suddenly realized AI safety testing might be good
Trump forced to admit Biden was right on AI safety testing.

OpenAI · news · 8d ago
GPT-5.5 Instant System Card
Safety publication · May 5, 2026
GPT‑5.5 Instant is our latest Instant model, described in our blog. The comprehensive safety mitigation approach for this model is similar to previous…

OpenAI · news · 8d ago
Advancing youth safety and wellbeing in EMEA
Explore OpenAI’s European Youth Safety Blueprint and EMEA Youth & Wellbeing Grants, advancing safe, responsible AI for teens, families, and educators.
OpenAI · news · 15d ago
Our commitment to community safety
Learn how OpenAI protects community safety in ChatGPT through model safeguards, misuse detection, policy enforcement, and collaboration with safety experts.

Marcus on AI · community · 16d ago
Dario Amodei, hype, AI safety, and the explosion of vibe-coded AI disasters
What the AI cheerleaders don’t tell you

OpenAI · news · 20d ago
GPT-5.5 System Card
Safety publication · April 23, 2026
GPT‑5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, creating documents and…

Smol AI News · news-outlet · 20d ago
GPT 5.5
**OpenAI launched GPT-5.5** as its new flagship model for "real work and powering agents," immediately available in ChatGPT and Codex but with delayed API access due to enhanced safety requirements. The model features improved token efficiency and supports longer multi-step…

OpenAI · news · 20d ago
GPT-5.5 Bio Bug Bounty
Explore the GPT-5.5 Bio Bug Bounty: a red-teaming challenge to find universal jailbreaks for bio safety risks, with rewards up to $25,000.

Import AI · news-outlet · 23d ago
Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
At what point do the financial markets price in the singularity?

OpenAI · news · 1mo ago
Responsible and safe use of AI
Learn how to use AI responsibly with best practices for safety, accuracy, and transparency when using tools like ChatGPT.

OpenAI · news · 1mo ago
Introducing the Child Safety Blueprint
Discover OpenAI’s Child Safety Blueprint: a roadmap for building AI responsibly with safeguards, age-appropriate design, and collaboration to protect and empower young people online.

OpenAI · news · 1mo ago
Announcing the OpenAI Safety Fellowship
A pilot program to support independent safety and alignment research and develop the next generation of talent.

Google DeepMind · official-blog · 1mo ago
Protecting people from harmful manipulation
Google DeepMind researches AI’s harmful manipulation risks across areas like finance and health, leading to new safety measures.

OpenAI · news · 1mo ago
Inside our approach to the Model Spec
Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

NVIDIA Developer Blog · official-blog · 1mo ago
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,…

MIT News — AI · research · 2mo ago
Improving AI models’ ability to explain their predictions
A new approach could help users know whether to trust a model’s predictions in safety-critical applications like health care and autonomous driving.

MIT News — AI · research · 2mo ago
Exposing biases, moods, personalities, and abstract concepts hidden in large language models
A new method developed at MIT could root out vulnerabilities and improve LLM safety and performance.
Smol AI News · news-outlet · 3mo ago
OpenEvidence, the ‘ChatGPT for doctors,’ raises $250m at $12B valuation, 12x from $1b last Feb
**OpenEvidence** raised **$250 million** at a **$12 billion** valuation, 12x its $1 billion valuation last February, with usage by 40% of U.S. physicians and over $100 million in annual revenue. **Anthropic** released a new **Claude** model constitution under **CC0 1.0**, framing it as a living document for alignment and…

Smol AI News · news-outlet · 4mo ago
not much happened today
**AI News for 1/6/2026-1/7/2026** highlights a quiet day with key updates on **LangChain DeepAgents** introducing **Ralph Mode** for persistent agent loops, **Cursor** improving context management by reducing token usage by **46.9%**, and operational safety measures for coding…

Hugging Face · official-blog · 4mo ago
AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems
Published December 23, 2025 by Jaykumar Kasundra (ServiceNow-AI). Large Language Models (LLMs) have rapidly evolved from text-only…

Google DeepMind · official-blog · 4mo ago
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior
Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.

Google DeepMind · official-blog · 5mo ago
Deepening our partnership with the UK AI Security Institute
Google DeepMind and the UK AI Security Institute (AISI) strengthen their collaboration on critical AI safety and security research.

Google DeepMind · official-blog · 6mo ago
Strengthening our Frontier Safety Framework
We’re strengthening the Frontier Safety Framework (FSF) to help identify and mitigate severe risks from advanced AI models.

Google DeepMind · official-blog · 13mo ago
Taking a responsible path to AGI
We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

Eugene Yan · research · 21mo ago
Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

Lil’Log (Lilian Weng) · research · 31mo ago
Adversarial Attacks on LLMs
The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF)…