Tag

Safety + alignment

41 articles archived under #safety

NVIDIA Developer Blog official-blog 7h ago

Google DeepMind paper: reinforcement learning at scale

New work demonstrates RL fine-tuning at unprecedented scale, with concrete benchmarks on reasoning tasks.

14
arXiv — Machine Learning research 15h ago

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

arXiv:2605.10983v1 Announce Type: new Abstract: Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by…

10
arXiv — Machine Learning research 15h ago

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

arXiv:2605.11205v1 Announce Type: new Abstract: Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation…

34
arXiv — Machine Learning research 15h ago

Leveraging RAG for Training-Free Alignment of LLMs

arXiv:2605.11217v1 Announce Type: new Abstract: Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that…

36
arXiv — Machine Learning research 15h ago

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

arXiv:2605.11235v1 Announce Type: new Abstract: In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the…

18
arXiv — Machine Learning research 15h ago

Gradient-Free Noise Optimization for Reward Alignment in Generative Models

arXiv:2605.11347v1 Announce Type: new Abstract: Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but…

38
arXiv — Machine Learning research 15h ago

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

arXiv:2605.11361v1 Announce Type: new Abstract: Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this…

27
arXiv — NLP / Computation & Language research 15h ago

StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models

arXiv:2605.11483v1 Announce Type: new Abstract: While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on…

13
arXiv — NLP / Computation & Language research 15h ago

Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

arXiv:2605.11632v1 Announce Type: new Abstract: Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling…

37
arXiv — NLP / Computation & Language research 15h ago

Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

arXiv:2605.11685v1 Announce Type: new Abstract: Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical…

17
arXiv — NLP / Computation & Language research 15h ago

Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

arXiv:2605.11769v1 Announce Type: new Abstract: Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their…

7
arXiv — NLP / Computation & Language research 15h ago

Metaphor Is Not All Attention Needs

arXiv:2605.12128v1 Announce Type: new Abstract: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies,…

20
arXiv — NLP / Computation & Language research 15h ago

AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

arXiv:2605.11398v1 Announce Type: cross Abstract: We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health…

36
arXiv — Machine Learning research 1d ago

The Safety-Aware Denoiser for Text Diffusion Models

arXiv:2605.08116v1 Announce Type: new Abstract: Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety…

9
OpenAI news 6d ago

Introducing Trusted Contact in ChatGPT

Introducing Trusted Contact in ChatGPT, an optional safety feature that notifies someone you trust if serious self-harm concerns are detected.

23
Ars Technica — AI news-outlet 6d ago

Spooked by Mythos, Trump suddenly realized AI safety testing might be good

Trump forced to admit Biden was right on AI safety testing.

9
OpenAI news 8d ago

GPT-5.5 Instant System Card

May 5, 2026. Safety publication. GPT‑5.5 Instant is our latest Instant model, explained in our blog. The comprehensive safety mitigation approach for this model is similar to previous…

10
OpenAI news 8d ago

Advancing youth safety and wellbeing in EMEA

Explore OpenAI’s European Youth Safety Blueprint and EMEA Youth & Wellbeing Grants, advancing safe, responsible AI for teens, families, and educators.

30
OpenAI news 15d ago

Our commitment to community safety

Learn how OpenAI protects community safety in ChatGPT through model safeguards, misuse detection, policy enforcement, and collaboration with safety experts.

8
Marcus on AI community 16d ago

Dario Amodei, hype, AI safety, and the explosion of vibe-coded AI disasters

What the AI cheerleaders don’t tell you

25
OpenAI news 20d ago

GPT-5.5 System Card

April 23, 2026. Safety publication. GPT‑5.5 is a new model designed for complex, real-world work, including writing code, researching online, analyzing information, creating documents and…

4
Smol AI News news-outlet 20d ago

GPT 5.5

**OpenAI launched GPT-5.5** as its new flagship model for "real work and powering agents," immediately available in ChatGPT and Codex but with delayed API access due to enhanced safety requirements. The model features improved token efficiency and supports longer multi-step…

14
OpenAI news 20d ago

GPT-5.5 Bio Bug Bounty

Explore the GPT-5.5 Bio Bug Bounty: a red-teaming challenge to find universal jailbreaks for bio safety risks, with rewards up to $25,000.

35
Import AI news-outlet 23d ago

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4

At what point do the financial markets price in the singularity?

28
OpenAI news 1mo ago

Responsible and safe use of AI

Learn how to use AI responsibly with best practices for safety, accuracy, and transparency when using tools like ChatGPT.

8
OpenAI news 1mo ago

Introducing the Child Safety Blueprint

Discover OpenAI’s Child Safety Blueprint—a roadmap for building AI responsibly with safeguards, age-appropriate design, and collaboration to protect and empower young people online.

6
OpenAI news 1mo ago

Announcing the OpenAI Safety Fellowship

A pilot program to support independent safety and alignment research and develop the next generation of talent

14
Google DeepMind official-blog 1mo ago

Protecting people from harmful manipulation

Google DeepMind researches AI's harmful manipulation risks across areas like finance and health, leading to new safety measures.

7
OpenAI news 1mo ago

Inside our approach to the Model Spec

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

27
NVIDIA Developer Blog official-blog 1mo ago

Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,...

37
MIT News — AI research 2mo ago

Improving AI models’ ability to explain their predictions

A new approach could help users know whether to trust a model’s predictions in safety-critical applications like health care and autonomous driving.

24
MIT News — AI research 2mo ago

Exposing biases, moods, personalities, and abstract concepts hidden in large language models

A new method developed at MIT could root out vulnerabilities and improve LLM safety and performance.

18
Smol AI News news-outlet 3mo ago

OpenEvidence, the ‘ChatGPT for doctors,’ raises $250m at $12B valuation, 12x from $1b last Feb

**OpenEvidence** raised **$250 million** at a **$12 billion** valuation, a 12x increase from last year, with usage by 40% of U.S. physicians and over $100 million in annual revenue. **Anthropic** released a new **Claude** model constitution under **CC0 1.0**, framing it as a living document for alignment and…

34
Smol AI News news-outlet 4mo ago

not much happened today

**AI News for 1/6/2026-1/7/2026** highlights a quiet day with key updates on **LangChain DeepAgents** introducing **Ralph Mode** for persistent agent loops, **Cursor** improving context management by reducing token usage by **46.9%**, and operational safety measures for coding…

26
Hugging Face official-blog 4mo ago

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

Published December 23, 2025, by Jaykumar Kasundra (ServiceNow-AI). Large Language Models (LLMs) have rapidly evolved from text-only…

34
Google DeepMind official-blog 4mo ago

Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior

Open interpretability tools for language models are now available across the entire Gemma 3 family with the release of Gemma Scope 2.

11
Google DeepMind official-blog 5mo ago

Deepening our partnership with the UK AI Security Institute

Google DeepMind and UK AI Security Institute (AISI) strengthen collaboration on critical AI safety and security research

35
Google DeepMind official-blog 6mo ago

Strengthening our Frontier Safety Framework

We’re strengthening the Frontier Safety Framework (FSF) to help identify and mitigate severe risks from advanced AI models.

14
Google DeepMind official-blog 13mo ago

Taking a responsible path to AGI

We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

32
Eugene Yan research 21mo ago

Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)

Use cases, techniques, alignment, finetuning, and critiques against LLM-evaluators.

26
Lil'Log (Lilian Weng) research 31mo ago

Adversarial Attacks on LLMs

The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF).…

5
