Tag

Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 13d ago

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal…

15
arXiv — NLP / Computation & Language research 13d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
arXiv — NLP / Computation & Language research 13d ago

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

arXiv:2601.04574v2 Announce Type: replace Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate…

7
Hugging Face Daily Papers research 13d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive…

31
TechCrunch — AI news-outlet 13d ago

SpaceX valuation balloons to $2.6T, briefly passes Amazon

SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday.

13
TechCrunch — AI news-outlet 13d ago

SpaceX passes Amazon as valuation balloons to $2.7T

SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday.

31
Hugging Face Daily Papers research 13d ago

Artificial Intelligence Index Report 2026

Abstract Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's…

32
The Information — AI news-outlet 13d ago

SpaceX finalizes $60 billion deal to acquire Cursor

SpaceX announced it agreed to buy AI coding startup Cursor for $60 billion on Tuesday. The announcement came only a few days after SpaceX went public at a valuation of about $1.77 trillion. Since the IPO, SpaceX stock has risen 42% to close on Monday at $193.50, valuing it at…

37
Hugging Face Daily Papers research 14d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Web agents act through long…

28
arXiv — Machine Learning research 14d ago

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,…

23
arXiv — Machine Learning research 14d ago

Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning

arXiv:2606.15115v1 Announce Type: new Abstract: Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively…

7
arXiv — Machine Learning research 14d ago

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit…

11
arXiv — Machine Learning research 14d ago

Repeated Bilateral Trade: The Quest for Fairness

arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the…

34
arXiv — Machine Learning research 14d ago

PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation

arXiv:2606.15452v1 Announce Type: new Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. We observe that rare events leave distinct topological fingerprints - transitions in…

17
arXiv — Machine Learning research 14d ago

Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

arXiv:2606.15887v1 Announce Type: new Abstract: Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR,…

4
arXiv — NLP / Computation & Language research 14d ago

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal…

22
arXiv — NLP / Computation & Language research 14d ago

ReportQA: QA-Based Radiology Report Evaluation

arXiv:2606.15037v1 Announce Type: new Abstract: Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus…

38
arXiv — NLP / Computation & Language research 14d ago

A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior…

7
arXiv — NLP / Computation & Language research 14d ago

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

arXiv:2606.15610v1 Announce Type: new Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or…

5
arXiv — NLP / Computation & Language research 14d ago

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation…

28
arXiv — NLP / Computation & Language research 14d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 14d ago

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

arXiv:2606.16000v1 Announce Type: new Abstract: We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be…

22
arXiv — NLP / Computation & Language research 14d ago

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

arXiv:2606.16026v1 Announce Type: new Abstract: We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across…

37
arXiv — NLP / Computation & Language research 14d ago

Evaluating LLM Personalization via Semantic Constraint Verification

arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address…

38
OpenAI official-blog 14d ago

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

27
r/MachineLearning community 14d ago

Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D]

I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first…

6
The Information — AI news-outlet 14d ago

Nvidia Plans To Raise At Least $20 Billion In Bonds

Nvidia said Monday it plans to raise new debt even as the AI chip leader keeps generating tens of billions of dollars in cash every quarter. It will be the company’s first corporate bond sale since 2021, when it raised $5 billion. Bloomberg earlier reported that Nvidia would…

29
The Information — AI news-outlet 14d ago

Salesforce to Acquire Customer AI Agent Fin for $3.6 Billion

Salesforce has agreed to buy Fin, a startup formerly known as Intercom that develops customer agents, for $3.6 billion, as the software giant hopes to win new business from enterprises adopting its own AI offering. The sale price is a big premium to Fin’s last valuation of $2…

18
The Information — AI news-outlet 14d ago

Exclusive: Nvidia Server Marketplace Startup Raises $100 Million at $800 Million Valuation

Data center software startup and AI-server broker Hydra Host has raised $100 million at a valuation of close to $800 million, led by Kindred Ventures. Nvidia, Cathie Wood’s ARK Invest, early CoreWeave backer Magnetar, and existing investors Founders Fund and Flume Ventures also…

26
arXiv — Machine Learning research 15d ago

A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series

arXiv:2606.13823v1 Announce Type: new Abstract: We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a…

15
arXiv — Machine Learning research 15d ago

DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation

arXiv:2606.14192v1 Announce Type: new Abstract: Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement…

9
arXiv — NLP / Computation & Language research 15d ago

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks…

29
arXiv — NLP / Computation & Language research 15d ago

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt…

33
arXiv — NLP / Computation & Language research 15d ago

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models…

5
arXiv — NLP / Computation & Language research 15d ago

OdysSim: Building Foundation Models for Human Behavior Simulation

arXiv:2606.14199v1 Announce Type: new Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register,…

8
arXiv — NLP / Computation & Language research 15d ago

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

arXiv:2606.14278v1 Announce Type: new Abstract: Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it…

21
arXiv — NLP / Computation & Language research 15d ago

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

arXiv:2606.14516v1 Announce Type: cross Abstract: AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats,…

38
r/LocalLLaMA community 15d ago

Quality evaluation of quants with limited time or tokens

About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3…

36
r/MachineLearning community 16d ago

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,…

24
Hugging Face Daily Papers research 17d ago

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Abstract Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations. Generated by…

6
TechCrunch — AI news-outlet 17d ago

Mistral is rumored to be raising €3B at €20B valuation

The funding round would value the company at around €20 billion (about $23.15 billion), nearly double its Series C valuation of €11.7 billion.

23
Hugging Face official-blog 17d ago

olmo-eval: An evaluation workbench for the model development loop

Published June 12, 2026. Tyler Murray, Kyle Wiggers (allenai). Code: https://github.com/allenai/olmo-eval While you're building an LLM, you…

23
Hugging Face Daily Papers research 17d ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Adversarial robustness evaluations of large…

6
Hugging Face Daily Papers research 17d ago

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Abstract WEAVER is a multi-view world model architecture that achieves high fidelity, consistency, and efficiency in robotic manipulation tasks through flow-matching loss and demonstrates superior performance in policy evaluation, improvement, and test-time planning. Generated…

27
arXiv — NLP / Computation & Language research 18d ago

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most…

23
arXiv — NLP / Computation & Language research 18d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to…

26
arXiv — NLP / Computation & Language research 18d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to…

30
arXiv — NLP / Computation & Language research 18d ago

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

arXiv:2606.12730v1 Announce Type: cross Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in…

12
Hugging Face Daily Papers research 18d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. (Summary generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Search…

26
TechCrunch — AI news-outlet 18d ago

Theker just raised $85M to build the factory robot that doesn’t specialize in anything

Unlike humanoid robots designed around a fixed form — think Boston Dynamics — Theker's machines are built to be reconfigured.

18
