Funding

500 articles archived under #funding

The Information — AI news-outlet 1mo ago

Inference Provider Baseten in Talks to Double Valuation to $11 Billion

Baseten, a startup that rents out Nvidia AI servers to application developers and helps them customize models, has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, The Information reported Tuesday. That would more than…

37
arXiv — Machine Learning research 1mo ago

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

arXiv:2605.26161v1 Announce Type: new Abstract: Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing…

36
arXiv — Machine Learning research 1mo ago

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

arXiv:2605.26191v1 Announce Type: new Abstract: This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or…

22
arXiv — Machine Learning research 1mo ago

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

arXiv:2605.26193v1 Announce Type: new Abstract: Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure…

38
arXiv — Machine Learning research 1mo ago

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

arXiv:2605.26194v1 Announce Type: new Abstract: Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting).…

30
arXiv — Machine Learning research 1mo ago

Function-Valued Causal Influence in Nonlinear Time Series

arXiv:2605.26408v1 Announce Type: new Abstract: Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the…

34
arXiv — Machine Learning research 1mo ago

Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series

arXiv:2605.26569v1 Announce Type: new Abstract: We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to…

14
arXiv — Machine Learning research 1mo ago

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

arXiv:2605.26690v1 Announce Type: new Abstract: Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often…

9
arXiv — Machine Learning research 1mo ago

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

arXiv:2605.26704v1 Announce Type: new Abstract: Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models…

22
arXiv — Machine Learning research 1mo ago

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

arXiv:2605.26759v1 Announce Type: new Abstract: Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer…

33
arXiv — Machine Learning research 1mo ago

Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability

arXiv:2605.26790v1 Announce Type: new Abstract: Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately…

23
arXiv — NLP / Computation & Language research 1mo ago

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

arXiv:2605.26346v1 Announce Type: new Abstract: Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice.…

7
arXiv — NLP / Computation & Language research 1mo ago

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

arXiv:2605.26438v1 Announce Type: new Abstract: Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay…

34
arXiv — NLP / Computation & Language research 1mo ago

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

arXiv:2605.26440v1 Announce Type: new Abstract: The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with…

8
arXiv — NLP / Computation & Language research 1mo ago

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

arXiv:2605.26662v1 Announce Type: new Abstract: This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we…

6
arXiv — NLP / Computation & Language research 1mo ago

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

arXiv:2605.26840v1 Announce Type: new Abstract: Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped,…

13
arXiv — NLP / Computation & Language research 1mo ago

DunbaaBERT: From Sacrifice to Semantics

arXiv:2605.26935v1 Announce Type: new Abstract: Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a…

22
arXiv — NLP / Computation & Language research 1mo ago

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models

arXiv:2605.26947v1 Announce Type: new Abstract: Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas…

5
arXiv — NLP / Computation & Language research 1mo ago

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

arXiv:2605.26954v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation…

25
arXiv — NLP / Computation & Language research 1mo ago

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

arXiv:2605.26978v1 Announce Type: new Abstract: Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target…

13
arXiv — NLP / Computation & Language research 1mo ago

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

arXiv:2605.26999v1 Announce Type: new Abstract: Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In…

37
arXiv — NLP / Computation & Language research 1mo ago

PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

arXiv:2605.27015v1 Announce Type: new Abstract: Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice…

17
arXiv — NLP / Computation & Language research 1mo ago

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

arXiv:2605.27204v1 Announce Type: new Abstract: Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately…

10
Hugging Face Daily Papers research 1mo ago

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Abstract MobileGym presents a browser-based mobile environment enabling deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution. AI-generated summary We present MobileGym, a browser-hosted, lightweight, fully…

13
Hugging Face Daily Papers research 1mo ago

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences. AI-generated summary Audio-visual generation is rapidly advancing…

32
Hugging Face Daily Papers research 1mo ago

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Abstract EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment. AI-generated summary The…

33
The Information — AI news-outlet 1mo ago

AI Inference Provider Baseten in Talks to Raise $1 Billion at $11 Billion Valuation

AI startup Baseten has recently been in talks with investors to raise $1 billion at an $11 billion valuation including the money, according to a person with knowledge of the fundraise. That would more than double the company’s $5 billion valuation from its last round, which was…

21
Hugging Face Daily Papers research 1mo ago

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Abstract Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering…

14
TechCrunch — AI news-outlet 1mo ago

OpenRouter more than doubles valuation to $1.3B in a year

OpenRouter has raised a $113 million Series B led by CapitalG. Its 5x growth in usage over six months indicates the multi-AI-model future is here.

15
Hugging Face Daily Papers research 1mo ago

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Abstract Researchers created a benchmark with 3,066 labeled chains of thought examples across 13 tasks and 10 models to systematically evaluate faithfulness metrics, revealing that most metrics perform near randomly and have significant limitations in reliability and efficiency.…

33
arXiv — Machine Learning research 1mo ago

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

arXiv:2605.24055v1 Announce Type: new Abstract: Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG…

23
arXiv — Machine Learning research 1mo ago

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

arXiv:2605.24171v1 Announce Type: new Abstract: Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates…

10
arXiv — Machine Learning research 1mo ago

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

arXiv:2605.24251v1 Announce Type: new Abstract: Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison,…

34
arXiv — Machine Learning research 1mo ago

Zeroth-Order Nonconvex Nonsmooth Optimization with Heavy-Tailed Noise

arXiv:2605.24513v1 Announce Type: new Abstract: This paper considers the nonconvex nonsmooth problem in which the objective function is Lipschitz continuous. We focus on the stochastic setting where the algorithm can access stochastic function value evaluations with heavy-tailed…

22
arXiv — Machine Learning research 1mo ago

Deep ZakaiJ: Structured Filtering for Jump-Diffusion Time Series Forecasting

arXiv:2605.24548v1 Announce Type: new Abstract: Time series driven by unobserved latent states frequently exhibit abrupt jump discontinuities whose timing and magnitude cannot be predicted from observed history alone. Classical jump-diffusion models offer a principled…

13
arXiv — NLP / Computation & Language research 1mo ago

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

arXiv:2605.23970v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes,…

6
arXiv — NLP / Computation & Language research 1mo ago

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

arXiv:2605.23977v1 Announce Type: new Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint…

27
arXiv — NLP / Computation & Language research 1mo ago

Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation

arXiv:2605.24247v1 Announce Type: new Abstract: Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent use case. Simple category definitions are not detailed enough for labelers to produce the…

13
arXiv — NLP / Computation & Language research 1mo ago

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

arXiv:2605.24579v1 Announce Type: new Abstract: Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic…

27
arXiv — NLP / Computation & Language research 1mo ago

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

arXiv:2605.24850v1 Announce Type: new Abstract: Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior,…

4
arXiv — NLP / Computation & Language research 1mo ago

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

arXiv:2605.24902v1 Announce Type: new Abstract: Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from…

13
arXiv — NLP / Computation & Language research 1mo ago

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation

arXiv:2605.24904v1 Announce Type: new Abstract: Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and…

9
arXiv — NLP / Computation & Language research 1mo ago

Large Language Model Selection with Limited Annotations

arXiv:2605.24981v1 Announce Type: new Abstract: Choosing a Large Language Model (LLM) for a given task requires comparing many strong candidates, yet standard evaluation relies on costly annotations over fixed evaluation sets. To address this challenge, we develop SELECT-LLM,…

34
arXiv — NLP / Computation & Language research 1mo ago

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

arXiv:2605.25052v1 Announce Type: new Abstract: Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's…

27
arXiv — NLP / Computation & Language research 1mo ago

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

arXiv:2605.25240v1 Announce Type: new Abstract: Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies…

7
arXiv — NLP / Computation & Language research 1mo ago

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

arXiv:2605.25420v1 Announce Type: new Abstract: Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0,…

13
Hugging Face Daily Papers research 1mo ago

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Abstract WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types. AI-generated summary Interactive world models are advancing…

20
Hugging Face Daily Papers research 1mo ago

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Abstract Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics. AI-generated…

38
arXiv — Machine Learning research 1mo ago

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

arXiv:2605.22891v1 Announce Type: new Abstract: Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails…

13
arXiv — Machine Learning research 1mo ago

Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection

arXiv:2605.22973v1 Announce Type: new Abstract: Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing…

20
