Tag: Funding
500 articles archived under #funding

Hugging Face Daily Papers research 19d ago Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 16

arXiv — NLP / Computation & Language research 19d ago Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention arXiv:2606.11205v1 Announce Type: cross Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance… 5

arXiv — Machine Learning research 19d ago Few-Shot Resampling for Scalable Statistically-Sound Data Mining arXiv:2606.11235v1 Announce Type: new Abstract: A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the… 19

arXiv — Machine Learning research 19d ago LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs.
While machine learning methods have been recently applied to ecological time-series data,… 20

arXiv — Machine Learning research 19d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models arXiv:2606.11409v1 Announce Type: new Abstract: Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of… 5

arXiv — Machine Learning research 19d ago Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style… 29

arXiv — Machine Learning research 19d ago Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization arXiv:2606.12016v1 Announce Type: new Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,… 27

arXiv — Machine Learning research 19d ago Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization arXiv:2606.12077v1 Announce Type: new Abstract: Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency.
Similarity-based methods often suffer from quadratic complexity caused by pairwise distance… 15

arXiv — Machine Learning research 19d ago Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training arXiv:2606.12240v1 Announce Type: new Abstract: Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional… 26

arXiv — Machine Learning research 19d ago Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification arXiv:2606.12252v1 Announce Type: new Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly… 8

arXiv — NLP / Computation & Language research 19d ago PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without… 20

arXiv — NLP / Computation & Language research 19d ago NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track arXiv:2606.11199v1 Announce Type: new Abstract: We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track.
Rather… 24

arXiv — NLP / Computation & Language research 19d ago BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting… 29

arXiv — NLP / Computation & Language research 19d ago Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,… 16

arXiv — NLP / Computation & Language research 19d ago Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20

arXiv — NLP / Computation & Language research 19d ago AI Coding Agents Can Reproduce Social Science Findings arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited.
Existing evaluation benchmarks… 8

arXiv — NLP / Computation & Language research 19d ago Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed… 14

arXiv — NLP / Computation & Language research 19d ago Automated Creativity Evaluation of Language Models Across Open-Ended Tasks arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable… 14

arXiv — NLP / Computation & Language research 19d ago Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the… 27

arXiv — NLP / Computation & Language research 19d ago Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities.
Despite this importance, existing work… 18

Hugging Face Daily Papers research 19d ago TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard… 6

Hugging Face Daily Papers research 19d ago Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application Abstract Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches. Generated by… 8

Hugging Face Daily Papers research 19d ago When Behavioral Safety Evaluation Fails: A Representation-Level Perspective Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by… 38

TechCrunch — AI news-outlet 19d ago Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies will want power over, not lock-in with model makers. 31

Hugging Face Daily Papers research 19d ago Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests Abstract CapCode framework uses randomized testing with performance caps to detect and prevent shortcut exploitation in agent evaluation, while CapReward rewards systems that adhere to intended task specifications.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct A growing failure… 21

Hugging Face Daily Papers research 20d ago Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields Abstract Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent years have witnessed the rapid… 15

arXiv — Machine Learning research 20d ago Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models arXiv:2606.09861v1 Announce Type: new Abstract: While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete… 5

arXiv — Machine Learning research 20d ago Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,… 23

arXiv — Machine Learning research 20d ago Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection arXiv:2606.09874v1 Announce Type: new Abstract: Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors.
However, reported results are often… 22

arXiv — Machine Learning research 20d ago FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model… 20

arXiv — Machine Learning research 20d ago SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting arXiv:2606.09917v1 Announce Type: new Abstract: Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial… 30

arXiv — Machine Learning research 20d ago Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization arXiv:2606.10068v1 Announce Type: new Abstract: Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many… 35

arXiv — Machine Learning research 20d ago Structured Adaptive Tensor Prediction for Streaming Data arXiv:2606.10085v1 Announce Type: new Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics.
Existing methods are mainly designed for static settings and lack adaptability to streaming and… 33

arXiv — Machine Learning research 20d ago MMClima: A Framework for Multimodal Climate Science Data and Evaluation arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We… 20

arXiv — Machine Learning research 20d ago Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series arXiv:2606.10219v1 Announce Type: new Abstract: AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems:… 35

arXiv — NLP / Computation & Language research 20d ago Automated Scoring of Arabic Text Using Large Language Models: A Literature Review arXiv:2606.09830v1 Announce Type: new Abstract: In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and… 25

arXiv — NLP / Computation & Language research 20d ago VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development.
Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to… 14

arXiv — NLP / Computation & Language research 20d ago LLM-Based Code Documentation Generation and Multi-Judge Evaluation arXiv:2606.09852v1 Announce Type: cross Abstract: High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates… 36

arXiv — NLP / Computation & Language research 20d ago $\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce… 11

Hugging Face Daily Papers research 20d ago When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated… 12

r/MachineLearning community 20d ago Phinite — multi-agent OS with first-class agent identity, composable skills, behavioral evaluation [P] We spent the last year building what we think is the missing infrastructure layer for multi-agent systems. Open to everyone starting today. The technical problem: Agents have no identity. In microservices you have a service mesh + IAM. In agent systems you have a Python file.
We… 12

Hugging Face Daily Papers research 20d ago Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle Abstract Reference-free faithfulness metrics suffer from a blind spot measuring only precision, leading to rewards for abstention; completeness in deterministic domains enables measurement of both precision and recall, revealing that high-precision models often have poor fact… 34

TechCrunch — AI news-outlet 20d ago Sandstone raises $30M to bring AI to in-house legal teams Sandstone's Series A was led by Lightspeed Partners, with participation from Sequoia. 22

TechCrunch — AI news-outlet 20d ago How an e-scooter founder raised $5 million to build space data centers Orbital founder Euwyn Poon built 250,000 scooters at Spin. Now he wants to launch 10,000 space data centers. 27

Hugging Face Daily Papers research 21d ago Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data Abstract Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences. Generated by… 37

Hugging Face Daily Papers research 21d ago Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill Abstract Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models… 19

arXiv — Machine Learning research 21d ago SRT: Super-Resolution for Time Series via Disentangled Rectified Flow arXiv:2606.07605v1 Announce Type: new Abstract: Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications.
However, the acquisition of such data is often limited by cost and feasibility. This problem can be… 15

arXiv — Machine Learning research 21d ago Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods arXiv:2606.07607v1 Announce Type: new Abstract: Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable… 9

arXiv — Machine Learning research 21d ago Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation arXiv:2606.07616v1 Announce Type: new Abstract: Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference… 5

arXiv — Machine Learning research 21d ago Learning Transfers: Kan Extensions for Neural Invariants arXiv:2606.07627v1 Announce Type: new Abstract: Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. Standard evaluations probe this through target accuracy or distributional discrepancy, yet… 8