Funding
500 articles archived under #funding

arXiv — NLP / Computation & Language research 28d ago A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a… 37
arXiv — NLP / Computation & Language research 28d ago Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why arXiv:2606.00093v1 Announce Type: new Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge… 33
arXiv — NLP / Computation & Language research 28d ago RealityTest: How People Probe AI Identity and Whether Models Disclose It arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of… 24
arXiv — NLP / Computation & Language research 28d ago Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities arXiv:2606.00596v1 Announce Type: new Abstract: Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows.
Yet existing evaluation paradigms remain anchored in… 9
arXiv — NLP / Computation & Language research 28d ago IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs arXiv:2606.00875v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior… 25
arXiv — NLP / Computation & Language research 28d ago Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations arXiv:2606.00881v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size… 38
arXiv — NLP / Computation & Language research 28d ago PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects arXiv:2606.01016v1 Announce Type: new Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a… 19
arXiv — NLP / Computation & Language research 28d ago Child-directed speech facilitates production, not comprehension, in BabyLMs arXiv:2606.01045v1 Announce Type: new Abstract: Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs.
However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of… 26
arXiv — NLP / Computation & Language research 28d ago IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages arXiv:2606.01260v1 Announce Type: new Abstract: Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and… 21
arXiv — NLP / Computation & Language research 28d ago TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven… 19
Hugging Face Daily Papers research 28d ago Brain-IT-VQA: From Brain Signals to Answers Abstract Brain-IT-VQA framework decodes visual content from fMRI signals using transformer-based architecture and introduces NSD-VQA dataset for improved visual question answering evaluation. AI-generated summary Decoding visual content from fMRI signals recorded while a person… 21
The Information — AI news-outlet 28d ago AI Evaluators Struggle with Models That Know When They’re Being Tested AI researchers are starting to make progress on a confounding problem: AI models are getting better at telling when they are in an evaluation.
That could become a problem for AI companies that use evaluations to gauge the capabilities and behaviors of their models before… 37
arXiv — Machine Learning research 29d ago Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling arXiv:2605.30376v1 Announce Type: new Abstract: Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but… 15
arXiv — Machine Learning research 29d ago A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI arXiv:2605.30388v1 Announce Type: new Abstract: This paper introduces a new systematic framework for detecting anomalies in maritime Automatic Identification System (AIS) datasets. These anomalies include abnormal vessel behaviours related to speed, position jumps, time gaps,… 22
arXiv — Machine Learning research 29d ago NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models arXiv:2605.30393v1 Announce Type: new Abstract: Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary… 25
arXiv — Machine Learning research 29d ago MAAT: Multi-phase Adapter-Aware Targeted Unlearning arXiv:2605.30514v1 Announce Type: new Abstract: Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber.
This… 10
arXiv — Machine Learning research 29d ago Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents arXiv:2605.30590v1 Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other… 23
arXiv — Machine Learning research 29d ago Conformal Reliability: A New Evaluation Metric for Conditional Generation arXiv:2605.30807v1 Announce Type: new Abstract: Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is… 8
arXiv — Machine Learning research 29d ago GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring arXiv:2605.30865v1 Announce Type: new Abstract: Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving… 31
arXiv — NLP / Computation & Language research 29d ago Refining Word-Based Grammatical Error Annotation for L2 Korean arXiv:2605.30545v1 Announce Type: new Abstract: Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they… 10
arXiv — NLP / Computation & Language research 29d ago Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge arXiv:2605.30568v1 Announce Type: new Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics.
We propose to automatically generate fine-grained… 37
arXiv — NLP / Computation & Language research 29d ago TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal… 26
arXiv — NLP / Computation & Language research 29d ago Pairwise Reference Alignment as a Model-Level Ordinal Observable arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference… 18
arXiv — NLP / Computation & Language research 29d ago A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation arXiv:2605.31351v1 Announce Type: new Abstract: AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general… 30
arXiv — NLP / Computation & Language research 29d ago LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup.
Our results indicate that Large Language Models are unreliable judges in identifying safety issues… 36
arXiv — NLP / Computation & Language research 29d ago BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali arXiv:2605.31483v1 Announce Type: new Abstract: Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination… 20
Hugging Face Daily Papers research 29d ago Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios Abstract Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary Recent advances in speech generation have enabled… 5
Hugging Face Daily Papers research 29d ago OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents Abstract OpenSkillEval is an automatic evaluation framework that assesses skill-augmented agent systems and skills across diverse real-world applications, revealing that skill availability doesn't guarantee effective usage and that performance benefits depend heavily on model… 31
r/LocalLLaMA community 1mo ago PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark) Author here. The short version of why I built this: Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive… 6
r/MachineLearning community 1mo ago Bayesian Opt. GPs vs Linear models and Neural Networks for parameter optimizations [R] Hi, Relatively new to deep learning.
I wanted some opinions on which of these approaches might be best for time series data and spectral analysis. I currently use a GP and it works pretty well, but I’m wondering what the computational tradeoffs and so forth might be. Any ideas?… 4
Hacker News — AI on Front Page community 1mo ago OpenRouter raises $113M Series B Article URL: https://openrouter.ai/announcements/series-b Comments URL: https://news.ycombinator.com/item?id=48338660 Points: 242 # Comments: 110 4
TechCrunch — AI news-outlet 1mo ago The groupthink boom: what 3 top VCs really think about the AI frenzy "If you're 22 years old in San Francisco and building something in AI, there may be a seed term sheet in your inbox — but if you're 19, oh my God, this means you're really good; you might already have a Series A [offer]," said one, half-kiddingly. 12
r/LocalLLaMA community 1mo ago Gryphe/Pantheon-Reasoning-27B · Hugging Face from Gryphe: An experiment in bringing reasoning capability to the Pantheon roleplay series in the form of an uncensored dense Qwen 3.6 27B. This specific model can be thought of as a successor to both the Pantheon series and the one-time Codex release since I used such a large… 15
Hacker News — AI on Front Page community 1mo ago Danish pension fund excludes SpaceX citing governance and valuation Article URL: https://www.reuters.com/legal/transactional/danish-pension-fund-excludes-spacex-citing-governance-valuation-2026-05-29/ Comments URL: https://news.ycombinator.com/item?id=48333820 Points: 207 # Comments: 146 23
Hugging Face Daily Papers research 1mo ago Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection Abstract A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets.
AI-generated summary Recent advances in… 38
TechCrunch — AI news-outlet 1mo ago This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory South Korean chip startup Xcena is betting that AI's real bottleneck is not compute, but memory. 20
Hugging Face Daily Papers research 1mo ago PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers Abstract PRISM evaluates automated peer review systems across multiple dimensions using argument mining and retrieval-augmented verification, revealing that while LLMs match human performance in specific areas, no system consistently equals human reviewers across all evaluation… 19
arXiv — Machine Learning research 1mo ago Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models arXiv:2605.28866v1 Announce Type: new Abstract: Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series… 20
arXiv — Machine Learning research 1mo ago PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an… 10
arXiv — Machine Learning research 1mo ago LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers arXiv:2605.29005v1 Announce Type: new Abstract: Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale.
Inspired by the computational… 18
arXiv — Machine Learning research 1mo ago Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation arXiv:2605.29108v1 Announce Type: new Abstract: Selecting efficient multi-step synthetic routes is a central challenge in organic synthesis, particularly in medicinal and process chemistry, where route choice directly impacts feasibility, cost, and development efficiency.… 28
arXiv — Machine Learning research 1mo ago RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains arXiv:2605.29156v1 Announce Type: new Abstract: Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit… 9
arXiv — Machine Learning research 1mo ago Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts arXiv:2605.29283v1 Announce Type: new Abstract: Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to… 22
arXiv — Machine Learning research 1mo ago Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems arXiv:2605.29373v1 Announce Type: new Abstract: Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information.
To address these issues, we… 13
arXiv — Machine Learning research 1mo ago Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities arXiv:2605.29500v1 Announce Type: new Abstract: Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard… 11
arXiv — NLP / Computation & Language research 1mo ago Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated… 19
arXiv — NLP / Computation & Language research 1mo ago GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how… 31
arXiv — NLP / Computation & Language research 1mo ago GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human arXiv:2605.28882v1 Announce Type: new Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important.
However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet… 4
arXiv — NLP / Computation & Language research 1mo ago DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents arXiv:2605.29256v1 Announce Type: new Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and… 21
arXiv — NLP / Computation & Language research 1mo ago A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods… 19

Page 8 of 10 · 500 articles