Tag: Funding · 500 articles archived under #funding
arXiv — Machine Learning research 5d ago Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series arXiv:2606.24955v1 Announce Type: new Abstract: Power forecasting models deployed in real-world energy markets must operate under nonstationary conditions, where data distributions continually evolve due to weather variability, infrastructure upgrades, and changing consumption… 24 arXiv — Machine Learning research 5d ago Adapt Only When It Pays: Budgeted Decision-Loss Priority for Delayed Online Time-Series Adaptation arXiv:2606.25068v1 Announce Type: new Abstract: Online time-series forecasters receive labels only after horizon-dependent delays, while every adaptation step spends limited compute. We study when an online learner should update, not how to adapt at every opportunity, and… 18 arXiv — Machine Learning research 5d ago An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture, leaf area index, and plant height from Sentinel-1 and Sentinel-2 time series arXiv:2606.25174v1 Announce Type: new Abstract: Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and… 24 arXiv — Machine Learning research 5d ago UC-Search: Risk-Aware Test-Time Search for Delayed Constrained Time-Series Control arXiv:2606.25274v1 Announce Type: new Abstract: Time-series models are usually scored as forecasters, yet deployed systems often require delayed decisions under uncertainty and hard feasibility constraints.
UC-Search is a model-agnostic test-time wrapper: a backbone emits… 29 arXiv — Machine Learning research 5d ago TopoCast: A Topological Fidelity Framework for Evaluating Transformer-Based Time Series Forecasting arXiv:2606.25439v1 Announce Type: new Abstract: Deep learning-based models have achieved state-of-the-art performance in Time Series Forecasting (TSF), yet their evaluation remains dominated by pointwise error metrics such as Mean Squared Error (MSE), which quantify numerical… 37 arXiv — NLP / Computation & Language research 5d ago The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms arXiv:2606.25450v1 Announce Type: cross Abstract: Traditional evaluations measure a learning algorithm's final performance on an i.i.d. test set, reducing learning to a single aggregate score. This approach obscures a fundamental question: to what extent does learning from a… 12 arXiv — Machine Learning research 5d ago Leaking Circuit Secrets: Gradient Leakage Attacks on Graph Neural Networks arXiv:2606.25589v1 Announce Type: new Abstract: As graph neural networks (GNNs) become standard tools for critical tasks in circuit design and analysis, their security and privacy risks require careful attention. Here, we present the first comprehensive evaluation of gradient… 20 arXiv — NLP / Computation & Language research 5d ago LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges arXiv:2606.25057v1 Announce Type: new Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent… 11 arXiv — NLP / Computation & Language research 5d ago Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One arXiv:2606.25449v1 Announce Type: new Abstract: A language model's memory can be worse than having no memory at all. 
Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an… 30 arXiv — NLP / Computation & Language research 5d ago A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation arXiv:2606.25476v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and… 36 arXiv — NLP / Computation & Language research 5d ago Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization arXiv:2606.25656v1 Announce Type: new Abstract: As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for different RAG scenarios evaluation and comparison on semi-structured knowledge… 21 arXiv — NLP / Computation & Language research 5d ago Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation arXiv:2606.25782v1 Announce Type: new Abstract: With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM… 18 arXiv — NLP / Computation & Language research 5d ago Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts arXiv:2606.25935v1 Announce Type: new Abstract: Was this person ever at that place, and if so, when? Answering such questions from noisy, multilingual historical documents is the central challenge of HIPE-2026, the third edition of the HIPE evaluation series. 
Moving from named… 14 arXiv — NLP / Computation & Language research 5d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations… 29 arXiv — NLP / Computation & Language research 5d ago RAS: Measuring LLM Safety Through Refusal Alignment arXiv:2606.25750v1 Announce Type: cross Abstract: Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is… 27 arXiv — NLP / Computation & Language research 5d ago Autodata: An agentic data scientist to create high quality synthetic data arXiv:2606.25996v1 Announce Type: cross Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to… 30 arXiv — NLP / Computation & Language research 5d ago Robustness assessment of large audio language models in multiple-choice evaluation arXiv:2510.04584v2 Announce Type: replace Abstract: Recent advances in large audio language models (LALMs) have primarily been assessed using a multiple-choice question answering (MCQA) framework. However, subtle changes, such as shifting the order of choices, result in… 13 arXiv — NLP / Computation & Language research 5d ago Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge arXiv:2602.02219v2 Announce Type: replace Abstract: Large language models are widely employed as evaluators, a paradigm commonly referred to as LLM-as-a-judge. 
Prior research has predominantly examined point-wise or pair-wise evaluation protocols; in contrast, our focus is on… 8 Hugging Face Daily Papers research 5d ago Are We Ready For An Agent-Native Memory System? Abstract Large language model agents' memory systems have evolved into complex data management frameworks requiring systematic evaluation across multiple modules and workloads to understand their performance characteristics and trade-offs. Generated by… 7 Hugging Face Daily Papers research 5d ago DiffusionBench: On Holistic Evaluation of Diffusion Transformers Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by… 25 arXiv — Machine Learning research 6d ago Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery arXiv:2606.23757v1 Announce Type: new Abstract: Extracting interpretable governing equations from sparse, noisy chemical time-series data remains difficult because discrete reaction topology and continuous kinetic parameters are tightly coupled. 
We present PC-MCMC-CIGP, a… 33 arXiv — Machine Learning research 6d ago One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol -- different pair subsets, weightings, model-selection, and decision rates.… 34 arXiv — Machine Learning research 6d ago Federated Survival Analysis in Healthcare: A Multi-Model Evaluation on Cross-Institutional Heterogeneous Breast Cancer Data arXiv:2606.23871v1 Announce Type: new Abstract: Survival analysis is central to clinical decision-making, yet reliable time-to-event models require large, diverse cohorts that are rarely available at a single institution, while privacy regulations restrict the centralization of… 28 arXiv — Machine Learning research 6d ago GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series arXiv:2606.23880v1 Announce Type: new Abstract: From climate teleconnections to gene regulation, modern time-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor… 30 arXiv — Machine Learning research 6d ago You Don't Need to Run Every Eval arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. 
But do we need to… 29 arXiv — Machine Learning research 6d ago Managing Task Execution for Unknown Workloads in Batteryless IoT: A Hardware-Agnostic Evaluation arXiv:2606.24340v1 Announce Type: new Abstract: In recent years, the Internet of Things (IoT) paradigm has been shifting toward batteryless, energy-harvesting architectures. Sustaining reliable operation in these systems requires intelligent management of highly volatile stored… 30 arXiv — Machine Learning research 6d ago A Fair Evaluation of Graph Foundation Models for Node Property Prediction arXiv:2606.24509v1 Announce Type: new Abstract: Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attracted a lot of attention. While many different types of models are called… 33 arXiv — NLP / Computation & Language research 6d ago Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs arXiv:2606.23915v1 Announce Type: new Abstract: Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside… 28 arXiv — Machine Learning research 6d ago Automated Residual Plot Assessment With the R Package autovi and the Shiny Application autovi.web arXiv:2606.24236v1 Announce Type: cross Abstract: Visual assessment of residual plots is a common approach for diagnosing linear models, but it relies on manual evaluation, which does not scale well and can lead to inconsistent decisions across analysts. 
The lineup protocol,… 16 arXiv — Machine Learning research 6d ago PROTECT-90: A Fault Dataset for Power System Protection arXiv:2606.24298v1 Announce Type: cross Abstract: The increasing interest in data-driven methods for power system protection is accompanied by a lack of standardized, publicly available high-voltage waveform datasets that enable transparent and reproducible evaluation. To… 36 arXiv — Machine Learning research 6d ago EERLoss: A Novel Loss Function for Training Deep Biometric Models. A Case Study in Keystroke Dynamics arXiv:2606.24586v1 Announce Type: cross Abstract: Deep learning approaches to biometric verification are commonly trained by optimizing indirect objectives, creating a misalignment between the optimization process and the primary evaluation metric, typically the Equal Error Rate… 19 arXiv — NLP / Computation & Language research 6d ago Quantifying Prior Dominance in RAG Systems arXiv:2606.23695v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from ''epistemic blindness'' - failing to distinguish genuine contextual… 28 arXiv — NLP / Computation & Language research 6d ago QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages arXiv:2606.23943v1 Announce Type: new Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark… 32 arXiv — NLP / Computation & Language research 6d ago MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. 
We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,… 38 arXiv — NLP / Computation & Language research 6d ago Aspect-Based Sentiment Evolution and its Correlation with Review Rounds in Multi-Round Peer Reviews: A Deep Learning Approach arXiv:2606.24188v1 Announce Type: new Abstract: Mining sentiment information from the textual content of peer review comments offers valuable insights into the scientific evaluation process. However, previous studies are often constrained by coarse-grained analysis and the lack… 19 arXiv — NLP / Computation & Language research 6d ago SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization arXiv:2606.24259v1 Announce Type: new Abstract: Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical… 4 arXiv — NLP / Computation & Language research 6d ago On the Stability of Prompt Ranking in Large Language Model Evaluation arXiv:2606.24381v1 Announce Type: new Abstract: Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes… 34 arXiv — NLP / Computation & Language research 6d ago Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models arXiv:2606.24610v1 Announce Type: new Abstract: The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. 
This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range… 10 arXiv — NLP / Computation & Language research 6d ago AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability arXiv:2606.24589v1 Announce Type: cross Abstract: Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline… 25 arXiv — NLP / Computation & Language research 6d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15 arXiv — NLP / Computation & Language research 6d ago The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs arXiv:2504.17768v3 Announce Type: replace Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with… 29 Hugging Face Daily Papers research 6d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18 OpenAI official-blog 6d ago Helping build shared standards for advanced AI OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation. 
31 Hugging Face Daily Papers research 6d ago Counsel: A Meta-Evaluation Dataset for Agentic Tasks Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex… 22 r/LocalLLaMA community 6d ago Human Evaluation of GLM-5.2 I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's… 6 Hugging Face Daily Papers research 7d ago EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents… 30 r/LocalLLaMA community 7d ago Boogu Base, Turbo, Edit - open-source unified image generation and editing model series Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing,… 22 Hugging Face Daily Papers research 7d ago DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to… 14 r/LocalLLaMA community 7d ago DeepSeek raises $7.4B USD at $60B valuation.
Remarkably, Liang Wenfeng invests $3B in DeepSeek himself. submitted by /u/FullOf_Bad_Ideas 35 r/MachineLearning community 9d ago TSAuditor: A time-series auditing framework [P] This happened a few months ago when I was working on an analysis project that dealt with time-series data. The dataset was large (10 years of data). I was using a standard profiling tool to check the pipeline. Everything looked fine because the tool reported 3% missing data rate… 29