Tag: Funding
500 articles archived under #funding

The Information — AI news-outlet 1mo ago Inference Provider Baseten in Talks to Double Valuation to $11 Billion Baseten, a startup that rents out Nvidia AI servers to application developers and helps them customize models, has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, The Information reported Tuesday. That would more than… 37

arXiv — Machine Learning research 1mo ago TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models arXiv:2605.26161v1 Announce Type: new Abstract: Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing… 36

arXiv — Machine Learning research 1mo ago Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series arXiv:2605.26191v1 Announce Type: new Abstract: This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or… 22

arXiv — Machine Learning research 1mo ago Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection arXiv:2605.26193v1 Announce Type: new Abstract: Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications.
Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure… 38

arXiv — Machine Learning research 1mo ago On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series arXiv:2605.26194v1 Announce Type: new Abstract: Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting).… 30

arXiv — Machine Learning research 1mo ago Function-Valued Causal Influence in Nonlinear Time Series arXiv:2605.26408v1 Announce Type: new Abstract: Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the… 34

arXiv — Machine Learning research 1mo ago Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series arXiv:2605.26569v1 Announce Type: new Abstract: We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to… 14

arXiv — Machine Learning research 1mo ago Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets arXiv:2605.26690v1 Announce Type: new Abstract: Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative.
Existing reinforcement learning and off-policy generative approaches often… 9

arXiv — Machine Learning research 1mo ago SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation arXiv:2605.26704v1 Announce Type: new Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models… 22

arXiv — Machine Learning research 1mo ago Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining arXiv:2605.26759v1 Announce Type: new Abstract: Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer… 33

arXiv — Machine Learning research 1mo ago Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability arXiv:2605.26790v1 Announce Type: new Abstract: Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions.
In this work, we show these quantities can be accurately… 23

arXiv — NLP / Computation & Language research 1mo ago The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology arXiv:2605.26346v1 Announce Type: new Abstract: Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice.… 7

arXiv — NLP / Computation & Language research 1mo ago LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay… 34

arXiv — NLP / Computation & Language research 1mo ago Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks arXiv:2605.26440v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with… 8

arXiv — NLP / Computation & Language research 1mo ago AI evaluation may bias perceptions: The importance of context in interpreting academic writing arXiv:2605.26662v1 Announce Type: new Abstract: This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields.
Using large-scale data on journal publications from Dimensions, we… 6

arXiv — NLP / Computation & Language research 1mo ago Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics arXiv:2605.26840v1 Announce Type: new Abstract: Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped,… 13

arXiv — NLP / Computation & Language research 1mo ago DunbaaBERT: From Sacrifice to Semantics arXiv:2605.26935v1 Announce Type: new Abstract: Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a… 22

arXiv — NLP / Computation & Language research 1mo ago KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas… 5

arXiv — NLP / Computation & Language research 1mo ago AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved.
We present AlbanianLLMSafety, the first publicly available safety evaluation… 25

arXiv — NLP / Computation & Language research 1mo ago PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target… 13

arXiv — NLP / Computation & Language research 1mo ago Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals arXiv:2605.26999v1 Announce Type: new Abstract: Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In… 37

arXiv — NLP / Computation & Language research 1mo ago PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions arXiv:2605.27015v1 Announce Type: new Abstract: Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice… 17

arXiv — NLP / Computation & Language research 1mo ago GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing arXiv:2605.27204v1 Announce Type: new Abstract: Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature.
However, existing LLM-based methods typically model these signals separately… 10

Hugging Face Daily Papers research 1mo ago MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research Abstract MobileGym presents a browser-based mobile environment enabling deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution. AI-generated summary We present MobileGym, a browser-hosted, lightweight, fully… 13

Hugging Face Daily Papers research 1mo ago LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing… 32

Hugging Face Daily Papers research 1mo ago EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation Abstract EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment. AI-generated summary The… 33

The Information — AI news-outlet 1mo ago AI Inference Provider Baseten in Talks to Raise $1 Billion at $11 Billion Valuation AI startup Baseten has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, according to a person with knowledge of the fundraise.
That would more than double the company’s $5 billion valuation from its last round, which was… 21

Hugging Face Daily Papers research 1mo ago Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild Abstract Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering… 14

TechCrunch — AI news-outlet 1mo ago OpenRouter more than doubles valuation to $1.3B in a year OpenRouter has raised a $113 million Series B led by CapitalG. Its 5x growth in usage over six months indicates the multi-AI-model future is here. 15

Hugging Face Daily Papers research 1mo ago Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth Abstract Researchers created a benchmark with 3,066 labeled chains of thought examples across 13 tasks and 10 models to systematically evaluate faithfulness metrics, revealing that most metrics perform near randomly and have significant limitations in reliability and efficiency.… 33

arXiv — Machine Learning research 1mo ago Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions arXiv:2605.24055v1 Announce Type: new Abstract: Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG… 23

arXiv — Machine Learning research 1mo ago PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection arXiv:2605.24171v1 Announce Type: new Abstract: Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized.
We present PromptAudit, a controlled evaluation framework that isolates… 10

arXiv — Machine Learning research 1mo ago Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions arXiv:2605.24251v1 Announce Type: new Abstract: Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison,… 34

arXiv — Machine Learning research 1mo ago Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise arXiv:2605.24513v1 Announce Type: new Abstract: This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed… 22

arXiv — Machine Learning research 1mo ago Deep ZakaiJ: Structured Filtering for Jump-Diffusion Time Series Forecasting arXiv:2605.24548v1 Announce Type: new Abstract: Time series driven by unobserved latent states frequently exhibit abrupt jump discontinuities whose timing and magnitude cannot be predicted from observed history alone. Classical jump-diffusion models offer a principled… 13

arXiv — NLP / Computation & Language research 1mo ago Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges arXiv:2605.23970v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation.
Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes,… 6

arXiv — NLP / Computation & Language research 1mo ago A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks arXiv:2605.23977v1 Announce Type: new Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint… 27

arXiv — NLP / Computation & Language research 1mo ago Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation arXiv:2605.24247v1 Announce Type: new Abstract: Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the… 13

arXiv — NLP / Computation & Language research 1mo ago WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems arXiv:2605.24579v1 Announce Type: new Abstract: Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic… 27

arXiv — NLP / Computation & Language research 1mo ago Repeated Sequences Reveal Gaps between Large Language Models and Natural Language arXiv:2605.24850v1 Announce Type: new Abstract: Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge.
Existing evaluation methods, largely based on task performance or short-context behavior,… 4

arXiv — NLP / Computation & Language research 1mo ago When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation arXiv:2605.24902v1 Announce Type: new Abstract: Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from… 13

arXiv — NLP / Computation & Language research 1mo ago Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation arXiv:2605.24904v1 Announce Type: new Abstract: Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and… 9

arXiv — NLP / Computation & Language research 1mo ago Large Language Model Selection with Limited Annotations arXiv:2605.24981v1 Announce Type: new Abstract: Choosing a Large Language Model (LLM) for a given task requires comparing many strong candidates, yet standard evaluation relies on costly annotations over fixed evaluation sets. To address this challenge, we develop SELECT-LLM,… 34

arXiv — NLP / Computation & Language research 1mo ago Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth arXiv:2605.25052v1 Announce Type: new Abstract: Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models.
Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's… 27

arXiv — NLP / Computation & Language research 1mo ago JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment arXiv:2605.25240v1 Announce Type: new Abstract: Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies… 7

arXiv — NLP / Computation & Language research 1mo ago SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0,… 13

Hugging Face Daily Papers research 1mo ago WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation Abstract WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types. AI-generated summary Interactive world models are advancing… 20

Hugging Face Daily Papers research 1mo ago The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm Abstract Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics.
AI-generated… 38

arXiv — Machine Learning research 1mo ago Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems arXiv:2605.22891v1 Announce Type: new Abstract: Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails… 13

arXiv — Machine Learning research 1mo ago Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection arXiv:2605.22973v1 Announce Type: new Abstract: Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing… 20