Tag Safety + alignment 500 articles archived under #safety arXiv — NLP / Computation & Language research 1mo ago Clarification Is Not Enough: Post-Clarification Answering Remains the Bottleneck in Multi-Turn QA arXiv:2605.25204v1 Announce Type: new Abstract: Pluralistic alignment requires systems to adapt to diverse user values, communication styles, and contextual assumptions. We believe that a foundational prerequisite for such alignment enabling accurate preference elicitation from… 34 arXiv — NLP / Computation & Language research 1mo ago MATO: Multi-objective Personalized Alignment with Test-time Optimization for Large Language Models arXiv:2605.25342v1 Announce Type: new Abstract: Aligning large language models (LLMs) with diverse and multifaceted user preferences is a fundamental challenge in personalized AI systems. Existing multi-objective alignment methods either rely on costly training or require… 29 arXiv — NLP / Computation & Language research 1mo ago LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers arXiv:2605.25415v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of… 17 arXiv — NLP / Computation & Language research 1mo ago SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. 
We evaluate four open-weight instruction-tuned models on SomaliBench v0,… 13 Hacker News — AI on Front Page community 1mo ago What we lost when we stopped letting kids leave the front yard Article URL: https://stevemagness.substack.com/p/the-cost-of-safetyism Comments URL: https://news.ycombinator.com/item?id=48267290 Points: 227 # Comments: 201 17 r/MachineLearning community 1mo ago Call for Papers - Workshop on Unlearning and Model Editing U&ME at ECCV 2026 [R] I have been seeing a lot of really interesting work lately around unlearning, model editing, controllability, safety, etc. Feels like this space is moving very fast right now, and there are still so many open questions. This year I’m helping organize the U&ME workshop at ECCV… 27 Hugging Face Daily Papers research 1mo ago LatentUMM: Dual Latent Alignment for Unified Multimodal Models Abstract LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes. AI-generated summary Unified… 30 arXiv — Machine Learning research 1mo ago Test-Time Training Undermines Safety Guardrails arXiv:2605.22984v1 Announce Type: new Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning.… 24 arXiv — Machine Learning research 1mo ago Convex Optimization for Alignment and Preference Learning on a Single GPU arXiv:2605.23244v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. 
However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain… 20 arXiv — Machine Learning research 1mo ago Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays arXiv:2605.23351v1 Announce Type: new Abstract: We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy.… 14 arXiv — Machine Learning research 1mo ago CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection arXiv:2605.23471v1 Announce Type: new Abstract: Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their… 28 arXiv — Machine Learning research 1mo ago Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models arXiv:2605.23522v1 Announce Type: new Abstract: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the… 38 arXiv — NLP / Computation & Language research 1mo ago Evaluating Large Language Models in a Complex Hidden Role Game arXiv:2605.22826v1 Announce Type: new Abstract: Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of… 22 arXiv — NLP / Computation & Language research 1mo ago How Far Will They Go? 
Red-Teaming Online Influence with Large Language Models arXiv:2605.22880v1 Announce Type: new Abstract: As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus… 34 arXiv — NLP / Computation & Language research 1mo ago Graph Alignment Topology as an Inductive Bias for Grounding Detection arXiv:2605.22963v1 Announce Type: new Abstract: Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables… 12 arXiv — NLP / Computation & Language research 1mo ago Brain-LLM Alignment Tracks Training Data, Not Typology arXiv:2605.23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this… 20 arXiv — NLP / Computation & Language research 1mo ago Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography arXiv:2605.23035v1 Announce Type: new Abstract: Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap… 36 arXiv — NLP / Computation & Language research 1mo ago Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs arXiv:2605.23157v1 Announce Type: new Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. 
We present the first systematic cross-lingual, multimodal red-teaming study… 32 arXiv — NLP / Computation & Language research 1mo ago Naturalistic measure of social norms alignment arXiv:2605.23420v1 Announce Type: new Abstract: Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice… 18 arXiv — NLP / Computation & Language research 1mo ago Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation arXiv:2412.14642v4 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one… 32 arXiv — NLP / Computation & Language research 1mo ago Training-Free Multimodal Large Language Model Orchestration arXiv:2508.10016v4 Announce Type: replace Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. 
We present Training-Free… 26 arXiv — NLP / Computation & Language research 1mo ago Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking arXiv:2602.17653v2 Announce Type: replace Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word… 21 Hugging Face Daily Papers research 1mo ago See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Abstract SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset. AI-generated summary We present SWIM… 35 Hugging Face Daily Papers research 1mo ago Geo-Align: Video Generation Alignment via Metric Geometry Reward Abstract Geo-Align presents a reinforcement learning framework for camera-controlled video re-rendering that improves generalization through scale-aware perceptual rewards and metric 3D estimation for camera trajectory extraction. AI-generated summary Camera-controlled video… 20 r/MachineLearning community 1mo ago Alignment: Higher order prioritizing over constraints [R] So, I ran across a behavior that I found interesting and may lead to alignment or safety research. I'm going to try to maintain an abstract description of what happened without giving away the details and the keys to jailbreaking. 
The nature of a transformer is to predict the… 25 Hugging Face Daily Papers research 1mo ago AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment Abstract AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation while improving generation quality in downstream tasks.… 36 Ars Technica — AI news-outlet 1mo ago Trump canceled AI safety testing EO after snub from tech CEOs Trump delays AI safety testing EO, claiming it would be an innovation “blocker.” 35 arXiv — Machine Learning research 1mo ago HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine arXiv:2605.21496v1 Announce Type: new Abstract: Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level… 4 arXiv — Machine Learning research 1mo ago Harnesses for Inference-Time Alignment over Execution Trajectories arXiv:2605.21516v1 Announce Type: new Abstract: Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate… 20 arXiv — Machine Learning research 1mo ago Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift arXiv:2605.21552v1 Announce Type: new Abstract: Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. 
General confidence calibration methods assume training and test data are independent and… 26 arXiv — Machine Learning research 1mo ago From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment arXiv:2605.21558v1 Announce Type: new Abstract: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated… 38 arXiv — Machine Learning research 1mo ago Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos arXiv:2605.21648v1 Announce Type: new Abstract: We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at… 37 arXiv — Machine Learning research 1mo ago Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization arXiv:2605.21801v1 Announce Type: new Abstract: Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish… 18 arXiv — Machine Learning research 1mo ago On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation arXiv:2605.21834v1 Announce Type: new Abstract: Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. 
Consistency training is a promising new alignment paradigm to mitigate such… 31 arXiv — NLP / Computation & Language research 1mo ago CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety arXiv:2605.21609v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in… 13 arXiv — NLP / Computation & Language research 1mo ago Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries arXiv:2605.21712v1 Announce Type: new Abstract: Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites… 10 arXiv — NLP / Computation & Language research 1mo ago Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the… 16 arXiv — NLP / Computation & Language research 1mo ago Boundary-targeted Membership Inference Attacks on Safety Classifiers arXiv:2605.22373v1 Announce Type: cross Abstract: Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on… 6 TechCrunch — AI news-outlet 1mo ago The Path, founded by Tony Robbins and Calm alums, hopes to offer safer AI therapy The Path says its AI model has scored 95 on the mental health safety AI benchmark, Vera-MH. This compares to a top score of 65 for the consumer bots. 
4 Hugging Face Daily Papers research 1mo ago CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing Abstract Current GUI agents show limited effectiveness in professional media post-production tasks despite advances in spatial grounding and multimodal alignment. AI-generated summary While GUI agents have made significant progress in web navigation and basic operating system… 13 Hugging Face Daily Papers research 1mo ago Stitched Value Model for Diffusion Alignment Abstract StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework. AI-generated summary For practical use, diffusion- or flow-based generative models must be aligned… 4 Hugging Face Daily Papers research 1mo ago Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection Abstract Orthogonal Gradient Projection for Safety Alignment (OGPSA) addresses the safety-utility trade-off in LLM alignment by preserving general capabilities during sequential safety training through low-rank gradient projection. AI-generated summary Safety post-training can… 32 arXiv — Machine Learning research 1mo ago Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry arXiv:2605.20241v1 Announce Type: new Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular,… 8 arXiv — Machine Learning research 1mo ago Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs arXiv:2605.20270v1 Announce Type: new Abstract: A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. 
The operator needs a safety… 28 arXiv — Machine Learning research 1mo ago Spectral Souping: A Unified Framework for Online Preference Alignment arXiv:2605.20408v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this… 26 arXiv — Machine Learning research 1mo ago REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal… 36 arXiv — Machine Learning research 1mo ago Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment arXiv:2605.20780v1 Announce Type: new Abstract: Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce… 8 arXiv — NLP / Computation & Language research 1mo ago Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising… 22 arXiv — NLP / Computation & Language research 1mo ago Towards Context-Invariant Safety Alignment for Large Language Models arXiv:2605.20994v1 Announce Type: new Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. 
A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording.… 34 arXiv — NLP / Computation & Language research 1mo ago Cross-lingual robustness of LLM-brain alignment and its computational roots arXiv:2605.21049v1 Announce Type: new Abstract: Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such… 35