Funding · 500 articles archived under #funding
arXiv — NLP / Computation & Language research 13d ago Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal… 15 arXiv — NLP / Computation & Language research 13d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38 arXiv — NLP / Computation & Language research 13d ago FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback arXiv:2601.04574v2 Announce Type: replace Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate… 7 Hugging Face Daily Papers research 13d ago GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive… 31 TechCrunch — AI news-outlet 13d ago SpaceX valuation balloons to $2.6T, briefly passes Amazon SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday. 
13 TechCrunch — AI news-outlet 13d ago SpaceX passes Amazon as valuation balloons to $2.7T SpaceX's valuation has increased by $1 trillion since its shares started trading on Friday. 31 Hugging Face Daily Papers research 13d ago Artificial Intelligence Index Report 2026 Abstract Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's… 32 The Information — AI news-outlet 13d ago SpaceX finalizes $60 billion deal to acquire Cursor SpaceX announced it agreed to buy AI coding startup Cursor for $60 billion on Tuesday. The announcement came only a few days after SpaceX went public at a valuation of about $1.77 trillion. Since the IPO, SpaceX stock has risen 42% to close on Monday at $193.50, valuing it at… 37 Hugging Face Daily Papers research 14d ago Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long… 28 arXiv — Machine Learning research 14d ago Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. 
Using daily data from the Bank of Canada spanning January 2017 to May 2026,… 23 arXiv — Machine Learning research 14d ago Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning arXiv:2606.15115v1 Announce Type: new Abstract: Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively… 7 arXiv — Machine Learning research 14d ago Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit… 11 arXiv — Machine Learning research 14d ago Repeated Bilateral Trade: The Quest for Fairness arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the… 34 arXiv — Machine Learning research 14d ago PHINN: Persistent Homology Inspired Neural Network for Rare-Event Time Series Generation arXiv:2606.15452v1 Announce Type: new Abstract: Rare events in time series are critical to model but hard to learn due to data scarcity. Current generative models struggle with extreme values. 
We observe that rare events leave distinct topological fingerprints - transitions in… 17 arXiv — Machine Learning research 14d ago Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes arXiv:2606.15887v1 Announce Type: new Abstract: Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR,… 4 arXiv — NLP / Computation & Language research 14d ago Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal… 22 arXiv — NLP / Computation & Language research 14d ago ReportQA: QA-Based Radiology Report Evaluation arXiv:2606.15037v1 Announce Type: new Abstract: Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus… 38 arXiv — NLP / Computation & Language research 14d ago A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. 
Prior… 7 arXiv — NLP / Computation & Language research 14d ago LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation arXiv:2606.15610v1 Announce Type: new Abstract: LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or… 5 arXiv — NLP / Computation & Language research 14d ago Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation… 28 arXiv — NLP / Computation & Language research 14d ago A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning… 30 arXiv — NLP / Computation & Language research 14d ago GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science arXiv:2606.16000v1 Announce Type: new Abstract: We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. 
GRACE-DS is a set of evaluation metrics in an isolated environment that can be… 22 arXiv — NLP / Computation & Language research 14d ago In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation arXiv:2606.16026v1 Announce Type: new Abstract: We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across… 37 arXiv — NLP / Computation & Language research 14d ago Evaluating LLM Personalization via Semantic Constraint Verification arXiv:2606.16368v1 Announce Type: new Abstract: Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address… 38 OpenAI official-blog 14d ago Predicting model behavior before release by simulating deployment OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy. 27 r/MachineLearning community 14d ago Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D] I'm trying to understand where people doing sensor based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first… 6 The Information — AI news-outlet 14d ago Nvidia Plans To Raise At Least $20 Billion In Bonds Nvidia said Monday it plans to raise new debt even as the AI chip leader keeps generating tens of billions of dollars in cash every quarter. 
It will be the company’s first corporate bond sale since 2021, when it raised $5 billion. Bloomberg earlier reported that Nvidia would… 29 The Information — AI news-outlet 14d ago Salesforce to Acquire Customer AI Agent Fin for $3.6 Billion Salesforce has agreed to buy Fin, a startup that develops customer agents formerly known as Intercom, for $3.6 billion, as the software giant hopes to win new businesses from enterprises to adopt its own AI offering. The sale price is a big premium to Fin’s last valuation of $2… 18 The Information — AI news-outlet 14d ago Exclusive: Nvidia Server Marketplace Startup Raises $100 Million at $800 Million Valuation Data center software startup and AI-server broker Hydra Host has raised $100 million at a valuation of close to $800 million, led by Kindred Ventures. Nvidia, Cathie Wood’s ARK Invest, early CoreWeave backer Magnetar, and existing investors Founders Fund and Flume Ventures also… 26 arXiv — Machine Learning research 15d ago A Stationarity-and-Coupling Criterion for Training-Free Time-Lagged Spectral Embeddings of Multivariate Time Series arXiv:2606.13823v1 Announce Type: new Abstract: We study training-free fixed-length descriptors for multivariate time series and ask not merely whether such a descriptor performs well, but when it can be expected to work at all. Our object of study is $D(\tau)$, built from a… 15 arXiv — Machine Learning research 15d ago DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation arXiv:2606.14192v1 Announce Type: new Abstract: Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement… 9 arXiv — NLP / Computation & Language research 15d ago The Coin Flip Judge? 
Reliability and Bias in LLM-as-a-Judge Evaluation arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks… 29 arXiv — NLP / Computation & Language research 15d ago LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values arXiv:2606.13944v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt… 33 arXiv — NLP / Computation & Language research 15d ago Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment arXiv:2606.14037v1 Announce Type: new Abstract: As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models… 5 arXiv — NLP / Computation & Language research 15d ago OdysSim: Building Foundation Models for Human Behavior Simulation arXiv:2606.14199v1 Announce Type: new Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register,… 8 arXiv — NLP / Computation & Language research 15d ago Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge arXiv:2606.14278v1 Announce Type: new Abstract: Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. 
This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it… 21 arXiv — NLP / Computation & Language research 15d ago Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results arXiv:2606.14516v1 Announce Type: cross Abstract: AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats,… 38 r/LocalLLaMA community 15d ago Quality evaluation of quants with limited time or tokens About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3… 36 r/MachineLearning community 16d ago The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R] We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,… 24 Hugging Face Daily Papers research 17d ago Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior Abstract Psychometric assessments of LLM behavior reveal that specific behavioral frameworks like Theory of Planned Behavior show better coherence with actual responses than broad personality traits, particularly within shared conversations. Generated by… 6 TechCrunch — AI news-outlet 17d ago Mistral is rumored to be raising €3B at €20B valuation The funding round would value the company at around €20 billion (about $23.15 billion), nearly double its Series C valuation of €11.7 billion. 
23 Hugging Face official-blog 17d ago olmo-eval: An evaluation workbench for the model development loop Published June 12, 2026 · Tyler Murray undfined allenai · Kyle Wiggers Ai2Comms allenai · 💻 Code: https://github.com/allenai/olmo-eval While you're building an LLM, you… 23 Hugging Face Daily Papers research 17d ago Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Abstract Compute-aware evaluation framework using FLOPs and risk-compute curves reveals non-monotonic effects of alignment training and varying attack costs across different harm categories. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Adversarial robustness evaluations of large… 6 Hugging Face Daily Papers research 17d ago WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation Abstract WEAVER is a multi-view world model architecture that achieves high fidelity, consistency, and efficiency in robotic manipulation tasks through flow-matching loss and demonstrates superior performance in policy evaluation, improvement, and test-time planning. Generated… 27 arXiv — NLP / Computation & Language research 18d ago LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most… 23 arXiv — NLP / Computation & Language research 18d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. 
Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to… 26 arXiv — NLP / Computation & Language research 18d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to… 30 arXiv — NLP / Computation & Language research 18d ago Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior arXiv:2606.12730v1 Announce Type: cross Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in… 12 Hugging Face Daily Papers research 18d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26 TechCrunch — AI news-outlet 18d ago Theker just raised $85M to build the factory robot that doesn’t specialize in anything Unlike humanoid robots designed around a fixed form — think Boston Dynamics — Theker's machines are built to be reconfigured. 18