arXiv — NLP / Computation & Language
arXiv — NLP / Computation & Language research 15h ago
Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
arXiv:2605.11128v1 Announce Type: new Abstract: Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks…
arXiv — NLP / Computation & Language research 15h ago
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
arXiv:2605.11143v1 Announce Type: new Abstract: Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer…
arXiv — NLP / Computation & Language research 15h ago
Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary
arXiv:2605.11153v1 Announce Type: new Abstract: We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V ≈ 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid…
arXiv — NLP / Computation & Language research 15h ago
The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
arXiv:2605.11167v1 Announce Type: new Abstract: Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent…
arXiv — NLP / Computation & Language research 15h ago
How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
arXiv:2605.11195v1 Announce Type: new Abstract: Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of…
arXiv — NLP / Computation & Language research 15h ago
Instructions shape Production of Language, not Processing
arXiv:2605.11206v1 Announce Type: new Abstract: Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by…
arXiv — NLP / Computation & Language research 15h ago
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
arXiv:2605.11212v1 Announce Type: new Abstract: Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly,…
arXiv — NLP / Computation & Language research 15h ago
RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German
arXiv:2605.11242v1 Announce Type: new Abstract: In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and…
arXiv — NLP / Computation & Language research 15h ago
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
arXiv:2605.11255v1 Announce Type: new Abstract: We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous…
arXiv — NLP / Computation & Language research 15h ago
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
arXiv:2605.11290v1 Announce Type: new Abstract: Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most…
arXiv — NLP / Computation & Language research 15h ago
Predicting Psychological Well-Being from Spontaneous Speech using LLMs
arXiv:2605.11303v1 Announce Type: new Abstract: We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD…
arXiv — NLP / Computation & Language research 15h ago
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
arXiv:2605.11317v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every…
arXiv — NLP / Computation & Language research 15h ago
Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
arXiv:2605.11348v1 Announce Type: new Abstract: During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However,…
arXiv — NLP / Computation & Language research 15h ago
An Empirical Study of Automating Agent Evaluation
arXiv:2605.11378v1 Announce Type: new Abstract: Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate…
arXiv — NLP / Computation & Language research 15h ago
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
arXiv:2605.11388v1 Announce Type: new Abstract: Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified…
arXiv — NLP / Computation & Language research 15h ago
Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
arXiv:2605.11416v1 Announce Type: new Abstract: Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable…
arXiv — NLP / Computation & Language research 15h ago
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
arXiv:2605.11436v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two…
arXiv — NLP / Computation & Language research 15h ago
StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
arXiv:2605.11483v1 Announce Type: new Abstract: While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on…
arXiv — NLP / Computation & Language research 15h ago
Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations
arXiv:2605.11502v1 Announce Type: new Abstract: Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design…
arXiv — NLP / Computation & Language research 15h ago
A Study on Hidden Layer Distillation for Large Language Model Pre-Training
arXiv:2605.11513v1 Announce Type: new Abstract: Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's…
arXiv — NLP / Computation & Language research 15h ago
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
arXiv:2605.11533v1 Announce Type: new Abstract: Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for…
arXiv — NLP / Computation & Language research 15h ago
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
arXiv:2605.11538v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and…
arXiv — NLP / Computation & Language research 15h ago
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
arXiv:2605.11574v1 Announce Type: new Abstract: The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained…
arXiv — NLP / Computation & Language research 15h ago
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
arXiv:2605.11577v1 Announce Type: new Abstract: Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token…
arXiv — NLP / Computation & Language research 15h ago
Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
arXiv:2605.11581v1 Announce Type: new Abstract: When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers…
arXiv — NLP / Computation & Language research 15h ago
Efficient LLM-based Advertising via Model Compression and Parallel Verification
arXiv:2605.11582v1 Announce Type: new Abstract: Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges…
arXiv — NLP / Computation & Language research 15h ago
DiffScore: Text Evaluation Beyond Autoregressive Likelihood
arXiv:2605.11601v1 Announce Type: new Abstract: Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry…
arXiv — NLP / Computation & Language research 15h ago
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
arXiv:2605.11608v1 Announce Type: new Abstract: Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA…
arXiv — NLP / Computation & Language research 15h ago
When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
arXiv:2605.11612v1 Announce Type: new Abstract: Backdoor vulnerabilities widely exist in the fine-tuning of large language models (LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In…
arXiv — NLP / Computation & Language research 15h ago
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
arXiv:2605.11629v1 Announce Type: new Abstract: Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource…
arXiv — NLP / Computation & Language research 15h ago
Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
arXiv:2605.11632v1 Announce Type: new Abstract: Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling…
arXiv — NLP / Computation & Language research 15h ago
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
arXiv:2605.11663v1 Announce Type: new Abstract: Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed…
arXiv — NLP / Computation & Language research 15h ago
Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
arXiv:2605.11685v1 Announce Type: new Abstract: Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical…
arXiv — NLP / Computation & Language research 15h ago
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
arXiv:2605.11739v1 Announce Type: new Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level…
arXiv — NLP / Computation & Language research 15h ago
Training-Inference Consistent Segmented Execution for Long-Context LLMs
arXiv:2605.11744v1 Announce Type: new Abstract: Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many…
arXiv — NLP / Computation & Language research 15h ago
Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
arXiv:2605.11769v1 Announce Type: new Abstract: Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their…
arXiv — NLP / Computation & Language research 15h ago
From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
arXiv:2605.11774v1 Announce Type: new Abstract: By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or…
arXiv — NLP / Computation & Language research 15h ago
Choosing features for classifying multiword expressions
arXiv:2605.11779v1 Announce Type: new Abstract: Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all…
arXiv — NLP / Computation & Language research 15h ago
Probabilistic Calibration Is a Trainable Capability in Language Models
arXiv:2605.11845v1 Announce Type: new Abstract: Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability…
arXiv — NLP / Computation & Language research 15h ago
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
arXiv:2605.11854v1 Announce Type: new Abstract: Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard…
arXiv — NLP / Computation & Language research 15h ago
Concordance Comparison as a Means of Assembling Local Grammars
arXiv:2605.11862v1 Announce Type: new Abstract: Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences.…
arXiv — NLP / Computation & Language research 15h ago
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
arXiv:2605.11887v1 Announce Type: new Abstract: Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This…
arXiv — NLP / Computation & Language research 15h ago
YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
arXiv:2605.11906v1 Announce Type: new Abstract: Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and…
arXiv — NLP / Computation & Language research 15h ago
Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
arXiv:2605.11964v1 Announce Type: new Abstract: A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational…
arXiv — NLP / Computation & Language research 15h ago
On Predicting the Post-training Potential of Pre-trained LLMs
arXiv:2605.11978v1 Announce Type: new Abstract: The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's…
arXiv — NLP / Computation & Language research 15h ago
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
arXiv:2605.11993v1 Announce Type: new Abstract: Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu,…
arXiv — NLP / Computation & Language research 15h ago
Learning Agentic Policy from Action Guidance
arXiv:2605.12004v1 Announce Type: new Abstract: Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base…
arXiv — NLP / Computation & Language research 15h ago
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
arXiv:2605.12022v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in…
arXiv — NLP / Computation & Language research 15h ago
Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
arXiv:2605.12028v1 Announce Type: new Abstract: We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5…
arXiv — NLP / Computation & Language research 15h ago
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
arXiv:2605.12039v1 Announce Type: new Abstract: Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key…
arXiv — NLP / Computation & Language research 15h ago
Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition
arXiv:2605.12047v1 Announce Type: new Abstract: Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed…
arXiv — NLP / Computation & Language research 15h ago
Do Language Models Encode Knowledge of Linguistic Constraint Violations?
arXiv:2605.12055v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic…
arXiv — NLP / Computation & Language research 15h ago
Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward
arXiv:2605.12096v1 Announce Type: new Abstract: Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational…
arXiv — NLP / Computation & Language research 15h ago
Metaphor Is Not All Attention Needs
arXiv:2605.12128v1 Announce Type: new Abstract: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies,…
arXiv — NLP / Computation & Language research 15h ago
Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection
arXiv:2605.12156v1 Announce Type: new Abstract: Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with…
arXiv — NLP / Computation & Language research 15h ago
Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach
arXiv:2605.12177v1 Announce Type: new Abstract: [Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from…
arXiv — NLP / Computation & Language research 15h ago
Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding
arXiv:2605.12185v1 Announce Type: new Abstract: Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing…
arXiv — NLP / Computation & Language research 15h ago
Mechanistic Interpretability of ASR models using Sparse Autoencoders
arXiv:2605.12225v1 Announce Type: new Abstract: Understanding the internal machinations of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance,…
arXiv — NLP / Computation & Language research 15h ago
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
arXiv:2605.12227v1 Announce Type: new Abstract: Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as…
arXiv — NLP / Computation & Language research 15h ago
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
arXiv:2605.12242v1 Announce Type: new Abstract: Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left…
arXiv — NLP / Computation & Language research 15h ago
PreScam: A Benchmark for Predicting Scam Progression from Early Conversations
arXiv:2605.12243v1 Announce Type: new Abstract: Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in…
arXiv — NLP / Computation & Language research 15h ago
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
arXiv:2605.12260v1 Announce Type: new Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the…
arXiv — NLP / Computation & Language research 15h ago
What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty
arXiv:2605.12281v1 Announce Type: new Abstract: What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with…
arXiv — NLP / Computation & Language research 15h ago
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
arXiv:2605.12288v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions.…
arXiv — NLP / Computation & Language research 15h ago
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
arXiv:2605.12299v1 Announce Type: new Abstract: Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very…
arXiv — NLP / Computation & Language research 15h ago
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
arXiv:2605.12313v1 Announce Type: new Abstract: Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX…
arXiv — NLP / Computation & Language research 15h ago
A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems
arXiv:2605.12328v1 Announce Type: new Abstract: Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human machine…
arXiv — NLP / Computation & Language research 15h ago
Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation
arXiv:2605.12345v1 Announce Type: new Abstract: Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (combination). In this paper, we explore three…
arXiv — NLP / Computation & Language research 15h ago
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
arXiv:2605.12361v1 Announce Type: new Abstract: Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question…
arXiv — NLP / Computation & Language research 15h ago
Context Convergence Improves Answering Inferential Questions
arXiv:2605.12370v1 Announce Type: new Abstract: While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This…
arXiv — NLP / Computation & Language research 15h ago
Pretraining Exposure Explains Popularity Judgments in Large Language Models
arXiv:2605.12382v1 Announce Type: new Abstract: Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical…
arXiv — NLP / Computation & Language research 15h ago
Scalable Token-Level Hallucination Detection in Large Language Models
arXiv:2605.12384v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent…
arXiv — NLP / Computation & Language research 15h ago
A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
arXiv:2605.12395v1 Announce Type: new Abstract: Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation…
arXiv — NLP / Computation & Language research 15h ago
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
arXiv:2605.12398v1 Announce Type: new Abstract: Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or…
arXiv — NLP / Computation & Language research 15h ago
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
arXiv:2605.12412v1 Announce Type: new Abstract: Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this…
arXiv — NLP / Computation & Language research 15h ago
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
arXiv:2605.12419v1 Announce Type: new Abstract: Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates…
24 -
arXiv — NLP / Computation & Language research 15h ago
Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
arXiv:2605.12422v1 Announce Type: new Abstract: Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has…
16 -
arXiv — NLP / Computation & Language research 15h ago
Geometric Factual Recall in Transformers
arXiv:2605.12426v1 Announce Type: new Abstract: How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of…
5 -
arXiv — NLP / Computation & Language research 15h ago
A Causal Language Modeling Detour Improves Encoder Continued Pretraining
arXiv:2605.12438v1 Announce Type: new Abstract: When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay…
38 -
arXiv — NLP / Computation & Language research 15h ago
The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
arXiv:2605.12452v1 Announce Type: new Abstract: Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as…
18 -
arXiv — NLP / Computation & Language research 15h ago
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
arXiv:2605.12487v1 Announce Type: new Abstract: We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of…
36 -
arXiv — NLP / Computation & Language research 15h ago
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
arXiv:2605.12493v1 Announce Type: new Abstract: Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for…
17 -
arXiv — NLP / Computation & Language research 15h ago
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
arXiv:2605.11026v1 Announce Type: cross Abstract: Defenses against indirect prompt injection (IPI) in tool-using LLM agents share two structural weaknesses. First, they all attempt to prevent attacks rather than detect the compromises that slip through. Second, they have only…
21 -
arXiv — NLP / Computation & Language research 15h ago
On Problems of Implicit Context Compression for Software Engineering Agents
arXiv:2605.11051v1 Announce Type: cross Abstract: LLM-based Software Engineering agents face a critical bottleneck: context length limitations cause failures on complex, long-horizon tasks. One promising solution is to encode context as continuous embeddings rather than discrete…
27 -
arXiv — NLP / Computation & Language research 15h ago
Unlocking LLM Creativity in Science through Analogical Reasoning
arXiv:2605.11258v1 Announce Type: cross Abstract: Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We…
22 -
arXiv — NLP / Computation & Language research 15h ago
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
arXiv:2605.11301v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than…
24 -
arXiv — NLP / Computation & Language research 15h ago
VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
arXiv:2605.11334v1 Announce Type: cross Abstract: LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are…
19 -
arXiv — NLP / Computation & Language research 15h ago
Much of Geospatial Web Search Is Beyond Traditional GIS
arXiv:2605.11336v1 Announce Type: cross Abstract: Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We…
13 -
arXiv — NLP / Computation & Language research 15h ago
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
arXiv:2605.11363v1 Announce Type: cross Abstract: Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework…
30 -
arXiv — NLP / Computation & Language research 15h ago
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
arXiv:2605.11374v1 Announce Type: cross Abstract: Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Most modern embedding checkpoints are distilled from large LLM backbones and inherit their representation…
21 -
arXiv — NLP / Computation & Language research 15h ago
AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment
arXiv:2605.11398v1 Announce Type: cross Abstract: We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health…
36 -
arXiv — NLP / Computation & Language research 15h ago
fg-expo: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
arXiv:2605.11403v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked…
38 -
arXiv — NLP / Computation & Language research 15h ago
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
arXiv:2605.11408v1 Announce Type: cross Abstract: Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at…
5 -
arXiv — NLP / Computation & Language research 15h ago
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
arXiv:2605.11442v1 Announce Type: cross Abstract: Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively…
20 -
arXiv — NLP / Computation & Language research 15h ago
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
arXiv:2605.11458v1 Announce Type: cross Abstract: On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such…
28 -
arXiv — NLP / Computation & Language research 15h ago
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
arXiv:2605.11518v1 Announce Type: cross Abstract: Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial…
13 -
arXiv — NLP / Computation & Language research 15h ago
Controllable User Simulation
arXiv:2605.11519v1 Announce Type: cross Abstract: Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation,…
20