Tag: Funding
500 articles archived under #funding

arXiv — Machine Learning research 21d ago Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment arXiv:2606.07632v1 Announce Type: new Abstract: Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale.… 36

arXiv — Machine Learning research 21d ago Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction arXiv:2606.07698v1 Announce Type: new Abstract: Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by… 23

arXiv — Machine Learning research 21d ago Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query.
For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no… 13

arXiv — Machine Learning research 21d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under… 33

Hugging Face Daily Papers research 21d ago Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key… 20

TechCrunch — AI news-outlet 21d ago Mercor’s Brendan Foody calls out Sequoia over ‘dual-pricing’ valuation tricks Sequoia is just one of the top firms that sells same equity at two different prices. 28

The Information — AI news-outlet 21d ago Databricks in Talks to Raise at Above $165 Billion Valuation Databricks, a provider of database management software, has discussed raising more money in a funding round that could kick off within the next month, according to multiple people with direct knowledge of the conversations. Databricks has indicated to investors the new round… 13

Hugging Face Daily Papers research 21d ago Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms Abstract A novel attack-agnostic robustness metric based on Fisher Information Matrix spectral norm is proposed, providing theoretical bounds and scalable evaluation methods for deep neural network robustness assessment.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct The… 12

Hugging Face Daily Papers research 21d ago Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation Abstract Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system. Generated by… 35

Hugging Face Daily Papers research 21d ago How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling Abstract Small adaptation interfaces extend a frozen Music Transformer model to multiple genres, showing consistent improvement in harmonic prediction but limited genre identity representation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Harmony is a compact symbolic layer… 6

r/MachineLearning community 22d ago Open image generation models are closer to closed-source quality than this sub thinks [D] I run evaluations on generative image models as part of my workflow, mostly comparing coherence, prompt adherence, and compositional accuracy across different architectures. The consensus here seems to be that open models are still a generation behind closed APIs. Based on my… 25

Hugging Face Daily Papers research 22d ago SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution.
Generated… 30

Hugging Face Daily Papers research 22d ago Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback Abstract Critic-R framework enhances agentic search by closing the feedback loop between reasoning agents and retrieval models through critic evaluation and dual optimization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search systems iteratively interact… 34

arXiv — Machine Learning research 22d ago Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly… 27

arXiv — Machine Learning research 22d ago MacArena: Benchmarking Computer Use Agents on an Online macOS Environment arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,… 37

arXiv — NLP / Computation & Language research 22d ago RECAP: Regression Evaluation for Continual Adaptation of Prompts arXiv:2606.06698v1 Announce Type: cross Abstract: Production agentic systems routinely face evolving constraints and must comply from the very next interaction.
Scenarios like a tool-call notification changing a compliance threshold or a policy update adding disclosure… 38

arXiv — Machine Learning research 22d ago Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices arXiv:2606.07068v1 Announce Type: new Abstract: Background: Since 1990 many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing… 32

arXiv — Machine Learning research 22d ago REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or… 12

arXiv — Machine Learning research 22d ago Decision-Aware Evaluation of Physics-Informed Surrogates arXiv:2606.07146v1 Announce Type: new Abstract: Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open… 22

arXiv — NLP / Computation & Language research 22d ago Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests arXiv:2606.07379v1 Announce Type: cross Abstract: A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance.
This makes evaluation scores… 4

arXiv — NLP / Computation & Language research 22d ago Re-Centering Humans in LLM Personalization arXiv:2606.06614v1 Announce Type: new Abstract: Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper,… 9

arXiv — NLP / Computation & Language research 22d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in… 33

arXiv — NLP / Computation & Language research 22d ago Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single,… 19

arXiv — NLP / Computation & Language research 22d ago OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs).
However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of… 5

arXiv — NLP / Computation & Language research 22d ago MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.… 19

arXiv — NLP / Computation & Language research 22d ago Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling arXiv:2606.07040v1 Announce Type: new Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query,… 20

arXiv — NLP / Computation & Language research 22d ago UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We… 37

arXiv — NLP / Computation & Language research 22d ago From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning arXiv:2606.07190v1 Announce Type: new Abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness.
We argue that correctness is a useful but indirect proxy for the effect… 21

arXiv — NLP / Computation & Language research 22d ago Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation arXiv:2606.07057v1 Announce Type: cross Abstract: Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which… 34

arXiv — NLP / Computation & Language research 22d ago MMAE: A Massive Multitask Audio Editing Benchmark arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,… 8

arXiv — NLP / Computation & Language research 22d ago Reference-Free Evaluation of Taxonomies arXiv:2505.11470v3 Announce Type: replace Abstract: We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing… 31

arXiv — NLP / Computation & Language research 22d ago SWE-IF: Aligning Code Evaluation with Human Preference arXiv:2510.07315v2 Announce Type: replace Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human… 14

Hugging Face Daily Papers research 24d ago SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 38

Hugging Face Daily Papers research 24d ago Benchmark Everything Everywhere All at Once Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Benchmarks are fundamental for evaluating and advancing… 27

Hugging Face Daily Papers research 25d ago LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs Abstract PropMe framework evaluates language model memorization by distinguishing between forced reproduction capabilities and natural propensity, using SimpleTrace for deterministic attribution and propensity-transformed metrics across open models and datasets. Generated by… 15

arXiv — Machine Learning research 25d ago The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by… 30

arXiv — Machine Learning research 25d ago Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference arXiv:2606.05308v1 Announce Type: new Abstract: With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the… 25

arXiv — Machine Learning research 25d ago Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation arXiv:2606.05403v1 Announce Type: new Abstract: Language models increasingly act as epistemic proxies, synthesizing evidence from multiple sources to inform decisions.
Whether they evaluate the quality of that evidence, or merely aggregate it based on surface presentation,… 4

arXiv — Machine Learning research 25d ago Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents arXiv:2606.05558v1 Announce Type: new Abstract: Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation… 4

arXiv — Machine Learning research 25d ago Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on… 35

arXiv — Machine Learning research 25d ago Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data arXiv:2606.05781v1 Announce Type: new Abstract: Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks often incurs substantial latency, cost, and data privacy overhead. We present a hybrid framework that combines a fine-tuned small… 34

arXiv — Machine Learning research 25d ago GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis arXiv:2606.05860v1 Announce Type: new Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise.
Traditional Automated Machine Learning (AutoML) systems typically… 21

arXiv — NLP / Computation & Language research 25d ago PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis arXiv:2606.05176v1 Announce Type: new Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In… 20

arXiv — NLP / Computation & Language research 25d ago ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces arXiv:2606.05402v1 Announce Type: new Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a… 30

arXiv — NLP / Computation & Language research 25d ago TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not… 32

arXiv — NLP / Computation & Language research 25d ago Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored.
Stochasticity is essential in scenarios… 23

arXiv — NLP / Computation & Language research 25d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a… 9

arXiv — NLP / Computation & Language research 25d ago Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios arXiv:2606.06177v1 Announce Type: new Abstract: Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an… 13

arXiv — NLP / Computation & Language research 25d ago Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery arXiv:2606.06267v1 Announce Type: new Abstract: Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by… 23

arXiv — NLP / Computation & Language research 25d ago LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs arXiv:2606.06286v1 Announce Type: new Abstract: Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a… 26

Page 6 of 10 · 500 articles