Funding

500 articles archived under #funding

Hugging Face Daily Papers research 19d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

16
arXiv — NLP / Computation & Language research 19d ago

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

arXiv:2606.11205v1 Announce Type: cross Abstract: Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance…

5
arXiv — Machine Learning research 19d ago

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

arXiv:2606.11235v1 Announce Type: new Abstract: A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the…

19
arXiv — Machine Learning research 19d ago

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data,…

20
arXiv — Machine Learning research 19d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

arXiv:2606.11409v1 Announce Type: new Abstract: Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of…

5
arXiv — Machine Learning research 19d ago

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style…

29
arXiv — Machine Learning research 19d ago

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

arXiv:2606.12016v1 Announce Type: new Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware,…

27
arXiv — Machine Learning research 19d ago

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

arXiv:2606.12077v1 Announce Type: new Abstract: Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance…

15
arXiv — Machine Learning research 19d ago

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

arXiv:2606.12240v1 Announce Type: new Abstract: Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional…

26
arXiv — Machine Learning research 19d ago

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

arXiv:2606.12252v1 Announce Type: new Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly…

8
arXiv — NLP / Computation & Language research 19d ago

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

arXiv:2606.11196v1 Announce Type: new Abstract: Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without…

20
arXiv — NLP / Computation & Language research 19d ago

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

arXiv:2606.11199v1 Announce Type: new Abstract: We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather…

24
arXiv — NLP / Computation & Language research 19d ago

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting…

29
arXiv — NLP / Computation & Language research 19d ago

Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

arXiv:2606.11399v1 Announce Type: new Abstract: Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions,…

16
arXiv — NLP / Computation & Language research 19d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 19d ago

AI Coding Agents Can Reproduce Social Science Findings

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks…

8
arXiv — NLP / Computation & Language research 19d ago

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed…

14
arXiv — NLP / Computation & Language research 19d ago

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

arXiv:2606.11762v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable…

14
arXiv — NLP / Computation & Language research 19d ago

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

arXiv:2606.12117v1 Announce Type: new Abstract: Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the…

27
arXiv — NLP / Computation & Language research 19d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work…

18
Hugging Face Daily Papers research 19d ago

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Abstract TRL-Bench establishes a standardized benchmark for evaluating tabular representation learning models across multiple granularities, revealing that encoder performance varies by task type and requires capability-specific assessment rather than single leaderboard…

6
Hugging Face Daily Papers research 19d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

Abstract Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches. Generated by…

8
Hugging Face Daily Papers research 19d ago

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract Behavioral safety evaluations of large language models provide incomplete insights into internal robustness, as demonstrated by the audit gap between observable outputs and latent space vulnerabilities revealed through intervention-based testing. Generated by…

38
TechCrunch — AI news-outlet 19d ago

Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in

AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies will want power over, not lock-in with model makers.

31
Hugging Face Daily Papers research 19d ago

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Abstract CapCode framework uses randomized testing with performance caps to detect and prevent shortcut exploitation in agent evaluation, while CapReward rewards systems that adhere to intended task specifications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A growing failure…

21
Hugging Face Daily Papers research 20d ago

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Abstract Current AI agents struggle with long-horizon professional GUI workflows, achieving low success rates due to issues with workflow consistency and domain-specific software understanding. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent years have witnessed the rapid…

15
arXiv — Machine Learning research 20d ago

Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

arXiv:2606.09861v1 Announce Type: new Abstract: While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete…

5
arXiv — Machine Learning research 20d ago

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: new Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

23
arXiv — Machine Learning research 20d ago

Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

arXiv:2606.09874v1 Announce Type: new Abstract: Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often…

22
arXiv — Machine Learning research 20d ago

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

arXiv:2606.09878v1 Announce Type: new Abstract: Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model…

20
arXiv — Machine Learning research 20d ago

SPDM: Geometry-Modulated State Space Modeling with Manifold Constraints for Time Series Forecasting

arXiv:2606.09917v1 Announce Type: new Abstract: Multivariate time series forecasting requires capturing the continuously evolving correlation structure among interacting variables. Existing state-space models process time series by scanning tokenized temporal or spatial…

30
arXiv — Machine Learning research 20d ago

Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization

arXiv:2606.10068v1 Announce Type: new Abstract: Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many…

35
arXiv — Machine Learning research 20d ago

Structured Adaptive Tensor Prediction for Streaming Data

arXiv:2606.10085v1 Announce Type: new Abstract: Matrix-valued time series arise in a wide range of applications, such as spatio-temporal data from medical imaging and geophysics. Existing methods are mainly designed for static settings and lack adaptability to streaming and…

33
arXiv — Machine Learning research 20d ago

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We…

20
arXiv — Machine Learning research 20d ago

Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

arXiv:2606.10219v1 Announce Type: new Abstract: AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems:…

35
arXiv — NLP / Computation & Language research 20d ago

Automated Scoring of Arabic Text Using Large Language Models: A Literature Review

arXiv:2606.09830v1 Announce Type: new Abstract: In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and…

25
arXiv — NLP / Computation & Language research 20d ago

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to…

14
arXiv — NLP / Computation & Language research 20d ago

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

arXiv:2606.09852v1 Announce Type: cross Abstract: High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We presented an AI powered framework that automates…

36
arXiv — NLP / Computation & Language research 20d ago

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

arXiv:2606.10156v1 Announce Type: cross Abstract: As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce…

11
Hugging Face Daily Papers research 20d ago

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures. Generated…

12
r/MachineLearning community 20d ago

Phinite — multi-agent OS with first-class agent identity, composable skills, behavioral evaluation [P]

We spent the last year building what we think is the missing infrastructure layer for multi-agent systems. Open to everyone starting today. The technical problem: Agents have no identity. In microservices you have a service mesh + IAM. In agent systems you have a Python file. We…

12
Hugging Face Daily Papers research 20d ago

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Abstract Reference-free faithfulness metrics suffer from a blind spot measuring only precision, leading to rewards for abstention; completeness in deterministic domains enables measurement of both precision and recall, revealing that high-precision models often have poor fact…

34
TechCrunch — AI news-outlet 20d ago

Sandstone raises $30M to bring AI to in-house legal teams

Sandstone's Series A was led by Lightspeed Partners, with participation from Sequoia.

22
TechCrunch — AI news-outlet 20d ago

How an e-scooter founder raised $5 million to build space data centers

Orbital founder Euwyn Poon built 250,000 scooters at Spin. Now he wants to launch 10,000 space data centers.

27
Hugging Face Daily Papers research 21d ago

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Abstract Self-Evaluation Elicitation (SEE) method improves model calibration for quality assessment through calibration-coupled reinforcement learning and masked distillation, demonstrating transferable quality evaluation beyond specific judge preferences. Generated by…

37
Hugging Face Daily Papers research 21d ago

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Abstract Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reward models…

19
arXiv — Machine Learning research 21d ago

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

arXiv:2606.07605v1 Announce Type: new Abstract: Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be…

15
arXiv — Machine Learning research 21d ago

Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods

arXiv:2606.07607v1 Announce Type: new Abstract: Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable…

9
arXiv — Machine Learning research 21d ago

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

arXiv:2606.07616v1 Announce Type: new Abstract: Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference…

5
arXiv — Machine Learning research 21d ago

Learning Transfers: Kan Extensions for Neural Invariants

arXiv:2606.07627v1 Announce Type: new Abstract: Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. Standard evaluations probe this through target accuracy or distributional discrepancy, yet…

8
