Tag

Funding

500 articles archived under #funding · RSS

arXiv — Machine Learning research 21d ago

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

arXiv:2606.07632v1 Announce Type: new Abstract: Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale.…

36
arXiv — Machine Learning research 21d ago

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

arXiv:2606.07698v1 Announce Type: new Abstract: Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by…

23
arXiv — Machine Learning research 21d ago

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no…

13
arXiv — Machine Learning research 21d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under…

33
Hugging Face Daily Papers research 21d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key…

20
TechCrunch — AI news-outlet 21d ago

Mercor’s Brendan Foody calls out Sequoia over ‘dual-pricing’ valuation tricks

Sequoia is just one of the top firms that sell the same equity at two different prices.

28
The Information — AI news-outlet 21d ago

Databricks in Talks to Raise at Above $165 Billion Valuation

Databricks, a provider of database management software, has discussed raising more money in a funding round that could kick off within the next month, according to multiple people with direct knowledge of the conversations. Databricks has indicated to investors the new round…

13
Hugging Face Daily Papers research 21d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

Abstract A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct The…

12
Hugging Face Daily Papers research 21d ago

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by…

35
Hugging Face Daily Papers research 21d ago

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

Abstract Small adaptation interfaces extend a frozen Music Transformer model to multiple genres, showing consistent improvement in harmonic prediction but limited genre identity representation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Harmony is a compact symbolic layer…

6
r/MachineLearning community 22d ago

Open image generation models are closer to closed-source quality than this sub thinks [D]

I run evaluations on generative image models as part of my workflow, mostly comparing coherence, prompt adherence, and compositional accuracy across different architectures. The consensus here seems to be that open models are still a generation behind closed APIs. Based on my…

25
Hugging Face Daily Papers research 22d ago

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated…

30
Hugging Face Daily Papers research 22d ago

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search systems iteratively interact…

34
arXiv — Machine Learning research 22d ago

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly…

27
arXiv — Machine Learning research 22d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,…

37
arXiv — NLP / Computation & Language research 22d ago

RECAP: Regression Evaluation for Continual Adaptation of Prompts

arXiv:2606.06698v1 Announce Type: cross Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure…

38
arXiv — Machine Learning research 22d ago

Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices

arXiv:2606.07068v1 Announce Type: new Abstract: Background: Since 1990 many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing…

32
arXiv — Machine Learning research 22d ago

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or…

12
arXiv — Machine Learning research 22d ago

Decision-Aware Evaluation of Physics-Informed Surrogates

arXiv:2606.07146v1 Announce Type: new Abstract: Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open…

22
arXiv — NLP / Computation & Language research 22d ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

arXiv:2606.07379v1 Announce Type: cross Abstract: A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores…

4
arXiv — NLP / Computation & Language research 22d ago

Re-Centering Humans in LLM Personalization

arXiv:2606.06614v1 Announce Type: new Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper,…

9
arXiv — NLP / Computation & Language research 22d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in…

33
arXiv — NLP / Computation & Language research 22d ago

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single,…

19
arXiv — NLP / Computation & Language research 22d ago

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of…

5
arXiv — NLP / Computation & Language research 22d ago

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.…

19
arXiv — NLP / Computation & Language research 22d ago

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

arXiv:2606.07040v1 Announce Type: new Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query,…

20
arXiv — NLP / Computation & Language research 22d ago

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We…

37
arXiv — NLP / Computation & Language research 22d ago

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect…

21
arXiv — NLP / Computation & Language research 22d ago

Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

arXiv:2606.07057v1 Announce Type: cross Abstract: Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which…

34
arXiv — NLP / Computation & Language research 22d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 22d ago

Reference-Free Evaluation of Taxonomies

arXiv:2505.11470v3 Announce Type: replace Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing…

31
arXiv — NLP / Computation & Language research 22d ago

SWE-IF: Aligning Code Evaluation with Human Preference

arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human…

14
Hugging Face Daily Papers research 24d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models…

38
Hugging Face Daily Papers research 24d ago

Benchmark Everything Everywhere All at Once

Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Benchmarks are fundamental for evaluating and advancing…

27
Hugging Face Daily Papers research 25d ago

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

Abstract PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets. Generated by…

15
arXiv — Machine Learning research 25d ago

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by…

30
arXiv — Machine Learning research 25d ago

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv:2606.05308v1 Announce Type: new Abstract: With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the…

25
arXiv — Machine Learning research 25d ago

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions. Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation,…

4
arXiv — Machine Learning research 25d ago

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

arXiv:2606.05558v1 Announce Type: new Abstract: Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation…

4
arXiv — Machine Learning research 25d ago

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on…

35
arXiv — Machine Learning research 25d ago

Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data

arXiv:2606.05781v1 Announce Type: new Abstract: Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks often incurs substantial latency, cost, and data privacy overhead. We present a hybrid framework that combines a fine-tuned small…

34
arXiv — Machine Learning research 25d ago

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

arXiv:2606.05860v1 Announce Type: new Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically…

21
arXiv — NLP / Computation & Language research 25d ago

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

arXiv:2606.05176v1 Announce Type: new Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In…

20
arXiv — NLP / Computation & Language research 25d ago

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

arXiv:2606.05402v1 Announce Type: new Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a…

30
arXiv — NLP / Computation & Language research 25d ago

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not…

32
arXiv — NLP / Computation & Language research 25d ago

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios…

23
arXiv — NLP / Computation & Language research 25d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a…

9
arXiv — NLP / Computation & Language research 25d ago

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an…

13
arXiv — NLP / Computation & Language research 25d ago

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

arXiv:2606.06267v1 Announce Type: new Abstract: Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by…

23
arXiv — NLP / Computation & Language research 25d ago

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a…

26
