Tag

Safety + alignment

500 articles archived under #safety

arXiv — NLP / Computation & Language research 1mo ago

Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA

arXiv:2605.25204v1 Announce Type: new Abstract: Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment enabling accurate preference elicitation from…

34
arXiv — NLP / Computation & Language research 1mo ago

MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models

arXiv:2605.25342v1 Announce Type: new Abstract: Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require…

29
arXiv — NLP / Computation & Language research 1mo ago

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

arXiv:2605.25415v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of…

17
arXiv — NLP / Computation & Language research 1mo ago

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0,…

13
Hacker News — AI on Front Page community 1mo ago

What we lost when we stopped letting kids leave the front yard

Article URL: https://stevemagness.substack.com/p/the-cost-of-safetyism; Comments URL: https://news.ycombinator.com/item?id=48267290; Points: 227; Comments: 201

17
r/MachineLearning community 1mo ago

Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R]

I have been seeing a lot of really interesting work lately around unlearning, model editing, controllability, safety, etc. Feels like this space is moving very fast right now, and there are still so many open questions. This year I’m helping organize the U&ME workshop at ECCV…

27
Hugging Face Daily Papers research 1mo ago

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Abstract: LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes. AI-generated summary: Unified…

30
arXiv — Machine Learning research 1mo ago

Test-Time Training Undermines Safety Guardrails

arXiv:2605.22984v1 Announce Type: new Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning.…

24
arXiv — Machine Learning research 1mo ago

Convex Optimization for Alignment and Preference Learning on a Single GPU

arXiv:2605.23244v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain…

20
arXiv — Machine Learning research 1mo ago

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

arXiv:2605.23351v1 Announce Type: new Abstract: We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy.…

14
arXiv — Machine Learning research 1mo ago

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

arXiv:2605.23471v1 Announce Type: new Abstract: Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their…

28
arXiv — Machine Learning research 1mo ago

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

arXiv:2605.23522v1 Announce Type: new Abstract: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the…

38
arXiv — NLP / Computation & Language research 1mo ago

Evaluating Large Language Models in a Complex Hidden Role Game

arXiv:2605.22826v1 Announce Type: new Abstract: Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of…

22
arXiv — NLP / Computation & Language research 1mo ago

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

arXiv:2605.22880v1 Announce Type: new Abstract: As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus…

34
arXiv — NLP / Computation & Language research 1mo ago

Graph Alignment Topology as an Inductive Bias for Grounding Detection

arXiv:2605.22963v1 Announce Type: new Abstract: Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables…

12
arXiv — NLP / Computation & Language research 1mo ago

Brain-LLM Alignment Tracks Training Data, Not Typology

arXiv:2605.23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this…

20
arXiv — NLP / Computation & Language research 1mo ago

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

arXiv:2605.23035v1 Announce Type: new Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap…

36
arXiv — NLP / Computation & Language research 1mo ago

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

arXiv:2605.23157v1 Announce Type: new Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study…

32
arXiv — NLP / Computation & Language research 1mo ago

Naturalistic measure of social norms alignment

arXiv:2605.23420v1 Announce Type: new Abstract: Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice…

18
arXiv — NLP / Computation & Language research 1mo ago

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

arXiv:2412.14642v4 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one…

32
arXiv — NLP / Computation & Language research 1mo ago

Training-Free Multimodal Large Language Model Orchestration

arXiv:2508.10016v4 Announce Type: replace Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free…

26
arXiv — NLP / Computation & Language research 1mo ago

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

arXiv:2602.17653v2 Announce Type: replace Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word…

21
Hugging Face Daily Papers research 1mo ago

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Abstract: SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset. AI-generated summary: We present SWIM…

35
Hugging Face Daily Papers research 1mo ago

Geo-Align: Video Generation Alignment via Metric Geometry Reward

Abstract: Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction. AI-generated summary: Camera-controlled video…

20
r/MachineLearning community 1mo ago

Alignment: Higher order prioritizing over constraints [R]

So, I ran across a behavior that I found interesting and may lead to alignment or safety research. I'm going to try to maintain an abstract description of what happened without giving away the details and the keys to jailbreaking. The nature of a transformer is to predict the…

25
Hugging Face Daily Papers research 1mo ago

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Abstract: AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.…

36
Ars Technica — AI news-outlet 1mo ago

Trump canceled AI safety testing EO after snub from tech CEOs

Trump delays AI safety testing EO, claiming it would be an innovation “blocker.”

35
arXiv — Machine Learning research 1mo ago

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

arXiv:2605.21496v1 Announce Type: new Abstract: Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level…

4
arXiv — Machine Learning research 1mo ago

Harnesses for Inference-Time Alignment over Execution Trajectories

arXiv:2605.21516v1 Announce Type: new Abstract: Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate…

20
arXiv — Machine Learning research 1mo ago

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

arXiv:2605.21552v1 Announce Type: new Abstract: Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and…

26
arXiv — Machine Learning research 1mo ago

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

arXiv:2605.21558v1 Announce Type: new Abstract: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated…

38
arXiv — Machine Learning research 1mo ago

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

arXiv:2605.21648v1 Announce Type: new Abstract: We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at…

37
arXiv — Machine Learning research 1mo ago

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

arXiv:2605.21801v1 Announce Type: new Abstract: Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish…

18
arXiv — Machine Learning research 1mo ago

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

arXiv:2605.21834v1 Announce Type: new Abstract: Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such…

31
arXiv — NLP / Computation & Language research 1mo ago

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

arXiv:2605.21609v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in…

13
arXiv — NLP / Computation & Language research 1mo ago

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

arXiv:2605.21712v1 Announce Type: new Abstract: Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites…

10
arXiv — NLP / Computation & Language research 1mo ago

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the…

16
arXiv — NLP / Computation & Language research 1mo ago

Boundary-targeted Membership Inference Attacks on Safety Classifiers

arXiv:2605.22373v1 Announce Type: cross Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on…

6
TechCrunch — AI news-outlet 1mo ago

The Path, founded by Tony Robbins and Calm alums, hopes to offer safer AI therapy

The Path says its AI model has scored 95 on the mental health safety AI benchmark, Vera-MH. This compares to a top score of 65 for the consumer bots.

4
Hugging Face Daily Papers research 1mo ago

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Abstract: Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment. AI-generated summary: While GUI agents have made significant progress in web navigation and basic operating system…

13
Hugging Face Daily Papers research 1mo ago

Stitched Value Model for Diffusion Alignment

Abstract: StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework. AI-generated summary: For practical use, diffusion- or flow-based generative models must be aligned…

4
Hugging Face Daily Papers research 1mo ago

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract: Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary: Safety post-training can…

32
arXiv — Machine Learning research 1mo ago

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

arXiv:2605.20241v1 Announce Type: new Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular,…

8
arXiv — Machine Learning research 1mo ago

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

arXiv:2605.20270v1 Announce Type: new Abstract: A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety…

28
arXiv — Machine Learning research 1mo ago

Spectral Souping: A Unified Framework for Online Preference Alignment

arXiv:2605.20408v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this…

26
arXiv — Machine Learning research 1mo ago

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal…

36
arXiv — Machine Learning research 1mo ago

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

arXiv:2605.20780v1 Announce Type: new Abstract: Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce…

8
arXiv — NLP / Computation & Language research 1mo ago

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising…

22
arXiv — NLP / Computation & Language research 1mo ago

Towards Context-Invariant Safety Alignment for Large Language Models

arXiv:2605.20994v1 Announce Type: new Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording.…

34
arXiv — NLP / Computation & Language research 1mo ago

Cross-lingual robustness of LLM-brain alignment and its computational roots

arXiv:2605.21049v1 Announce Type: new Abstract: Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such…

35
