Tag

Funding

500 articles archived under #funding

arXiv — Machine Learning research 5d ago

Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series

arXiv:2606.24955v1 Announce Type: new Abstract: Power forecasting models deployed in real-world energy markets must operate under nonstationary conditions, where data distributions continually evolve due to weather variability, infrastructure upgrades, and changing consumption…

24
arXiv — Machine Learning research 5d ago

Adapt Only When It Pays: Budgeted Decision-Loss Priority for Delayed Online Time-Series Adaptation

arXiv:2606.25068v1 Announce Type: new Abstract: Online time-series forecasters receive labels only after horizon-dependent delays, while every adaptation step spends limited compute. We study when an online learner should update, not how to adapt at every opportunity, and…

18
arXiv — Machine Learning research 5d ago

An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture, leaf area index, and plant height from Sentinel-1 and Sentinel-2 time series

arXiv:2606.25174v1 Announce Type: new Abstract: Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and…

24
arXiv — Machine Learning research 5d ago

UC-Search: Risk-Aware Test-Time Search for Delayed Constrained Time-Series Control

arXiv:2606.25274v1 Announce Type: new Abstract: Time-series models are usually scored as forecasters, yet deployed systems often require delayed decisions under uncertainty and hard feasibility constraints. UC-Search is a model-agnostic test-time wrapper: a backbone emits…

29
arXiv — Machine Learning research 5d ago

TopoCast: A Topological Fidelity Framework for Evaluating Transformer-Based Time Series Forecasting

arXiv:2606.25439v1 Announce Type: new Abstract: Deep learning-based models have achieved state-of-the-art performance in Time Series Forecasting (TSF), yet their evaluation remains dominated by pointwise error metrics such as Mean Squared Error (MSE), which quantify numerical…

37
arXiv — NLP / Computation & Language research 5d ago

The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms

arXiv:2606.25450v1 Announce Type: cross Abstract: Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a…

12
arXiv — Machine Learning research 5d ago

Leaking Circuit Secrets: Gradient Leakage Attacks on Graph Neural Networks

arXiv:2606.25589v1 Announce Type: new Abstract: As graph neural networks (GNNs) become standard tools for critical tasks in circuit design and analysis, their security and privacy risks require careful attention. Here, we present the first comprehensive evaluation of gradient…

20
arXiv — NLP / Computation & Language research 5d ago

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

arXiv:2606.25057v1 Announce Type: new Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent…

11
arXiv — NLP / Computation & Language research 5d ago

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

arXiv:2606.25449v1 Announce Type: new Abstract: A language model's memory can be worse than having no memory at all. Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an…

30
arXiv — NLP / Computation & Language research 5d ago

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and…

36
arXiv — NLP / Computation & Language research 5d ago

Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization

arXiv:2606.25656v1 Announce Type: new Abstract: As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for different RAG scenarios evaluation and comparison on semi-structured knowledge…

21
arXiv — NLP / Computation & Language research 5d ago

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM…

18
arXiv — NLP / Computation & Language research 5d ago

Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts

arXiv:2606.25935v1 Announce Type: new Abstract: Was this person ever at that place, and if so, when? Answering such questions from noisy, multilingual historical documents is the central challenge of HIPE-2026, the third edition of the HIPE evaluation series. Moving from named…

14
arXiv — NLP / Computation & Language research 5d ago

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations…

29
arXiv — NLP / Computation & Language research 5d ago

RAS: Measuring LLM Safety Through Refusal Alignment

arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is…

27
arXiv — NLP / Computation & Language research 5d ago

Autodata: An agentic data scientist to create high quality synthetic data

arXiv:2606.25996v1 Announce Type: cross Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to…

30
arXiv — NLP / Computation & Language research 5d ago

Robustness assessment of large audio language models in multiple-choice evaluation

arXiv:2510.04584v2 Announce Type: replace Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in…

13
arXiv — NLP / Computation & Language research 5d ago

Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

arXiv:2602.02219v2 Announce Type: replace Abstract: Large language models are widely employed as evaluators, a paradigm commonly referred to as LLM-as-a-judge. Prior research has predominantly examined point-wise or pair-wise evaluation protocols; in contrast, our focus is on…

8
Hugging Face Daily Papers research 5d ago

Are We Ready For An Agent-Native Memory System?

Abstract Large language model agents' memory systems have evolved into complex data management frameworks requiring systematic evaluation across multiple modules and workloads to understand their performance characteristics and trade-offs. Generated by…

7
Hugging Face Daily Papers research 5d ago

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by…

25
arXiv — Machine Learning research 6d ago

Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery

arXiv:2606.23757v1 Announce Type: new Abstract: Extracting interpretable governing equations from sparse, noisy chemical time-series data remains difficult because discrete reaction topology and continuous kinetic parameters are tightly coupled. We present PC-MCMC-CIGP, a…

33
arXiv — Machine Learning research 6d ago

One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline

arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol -- different pair subsets, weightings, model-selection, and decision rates.…

34
arXiv — Machine Learning research 6d ago

Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data

arXiv:2606.23871v1 Announce Type: new Abstract: Survival analysis is central to clinical decision-making, yet reliable time-to-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of…

28
arXiv — Machine Learning research 6d ago

GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series

arXiv:2606.23880v1 Announce Type: new Abstract: From climate teleconnections to gene regulation, modern time-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor…

30
arXiv — Machine Learning research 6d ago

You Don't Need to Run Every Eval

arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to…

29
arXiv — Machine Learning research 6d ago

Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation

arXiv:2606.24340v1 Announce Type: new Abstract: In recent years, the Internet of Things (IoT) paradigm has been shifting toward batteryless, energy-harvesting architectures. Sustaining reliable operation in these systems requires intelligent management of highly volatile stored…

30
arXiv — Machine Learning research 6d ago

A Fair Evaluation of Graph Foundation Models for Node Property Prediction

arXiv:2606.24509v1 Announce Type: new Abstract: Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called…

33
arXiv — NLP / Computation & Language research 6d ago

Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

arXiv:2606.23915v1 Announce Type: new Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside…

28
arXiv — Machine Learning research 6d ago

Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web

arXiv:2606.24236v1 Announce Type: cross Abstract: Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. The lineup protocol,…

16
arXiv — Machine Learning research 6d ago

PROTECT-90: A Fault Dataset for Power System Protection

arXiv:2606.24298v1 Announce Type: cross Abstract: The increasing interest in data-driven methods for power system protection is accompanied by a lack of standardized, publicly available high-voltage waveform datasets that enable transparent and reproducible evaluation. To…

36
arXiv — Machine Learning research 6d ago

EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics

arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate…

19
arXiv — NLP / Computation & Language research 6d ago

Quantifying Prior Dominance in RAG Systems

arXiv:2606.23695v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from "epistemic blindness" - failing to distinguish genuine contextual…

28
arXiv — NLP / Computation & Language research 6d ago

QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages

arXiv:2606.23943v1 Announce Type: new Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark…

32
arXiv — NLP / Computation & Language research 6d ago

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,…

38
arXiv — NLP / Computation & Language research 6d ago

Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach

arXiv:2606.24188v1 Announce Type: new Abstract: Mining sentiment information from the textual content of peer review comments offers valuable insights into the scientific evaluation process. However, previous studies are often constrained by coarse-grained analysis and the lack…

19
arXiv — NLP / Computation & Language research 6d ago

SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization

arXiv:2606.24259v1 Announce Type: new Abstract: Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical…

4
arXiv — NLP / Computation & Language research 6d ago

On the Stability of Prompt Ranking in Large Language Model Evaluation

arXiv:2606.24381v1 Announce Type: new Abstract: Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes…

34
arXiv — NLP / Computation & Language research 6d ago

Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

arXiv:2606.24610v1 Announce Type: new Abstract: The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range…

10
arXiv — NLP / Computation & Language research 6d ago

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline…

25
arXiv — NLP / Computation & Language research 6d ago

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained…

15
arXiv — NLP / Computation & Language research 6d ago

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

arXiv:2504.17768v3 Announce Type: replace Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with…

29
Hugging Face Daily Papers research 6d ago

Libretto: Giving LLM Agents a Sense of Musical Structure

Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from…

18
OpenAI official-blog 6d ago

Helping build shared standards for advanced AI

OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation.

31
Hugging Face Daily Papers research 6d ago

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex…

22
r/LocalLLaMA community 6d ago

Human Evaluation of GLM-5.2

I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's…

6
Hugging Face Daily Papers research 7d ago

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents…

30
r/LocalLLaMA community 7d ago

Boogu Base, Turbo, Edit - open-source unified image generation and editing model series

Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing,…

22
Hugging Face Daily Papers research 7d ago

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to…

14
r/LocalLLaMA community 7d ago

DeepSeek raises $7.4B USD at $60B valuation. Remarkably, Liang Wenfeng invests $3B in DeepSeek himself.

submitted by /u/FullOf_Bad_Ideas

35
r/MachineLearning community 9d ago

TSAuditor: A time-series auditing framework [P]

This happened a few months ago when I was working on an analysis project that dealt with time-series data. The dataset was large (10 years of data). I was using a standard profiling tool to check the pipeline. Everything looked fine because the tool reported 3% missing data rate…

29
