Tag: Benchmark
500 articles archived under #benchmark

arXiv — NLP / Computation & Language · research · 13d ago
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated… 15

arXiv — NLP / Computation & Language · research · 13d ago
PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with… 17

arXiv — NLP / Computation & Language · research · 13d ago
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.… 24

arXiv — NLP / Computation & Language · research · 13d ago
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically… 33

arXiv — NLP / Computation & Language · research · 13d ago
Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models
arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to
actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,… 21

arXiv — NLP / Computation & Language · research · 13d ago
The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act
arXiv:2606.18158v1 Announce Type: cross Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the… 38

arXiv — NLP / Computation & Language · research · 13d ago
EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning
arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38

Hugging Face Daily Papers · research · 13d ago
ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
Abstract ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models… 37

Hugging Face Daily Papers · research · 13d ago
ProCUA-SFT Technical Report
Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents… 4

Hugging Face Daily Papers · research · 13d ago
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and… 19

OpenAI · official-blog · 13d ago
Introducing LifeSciBench
Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. 19

r/LocalLLaMA · community · 13d ago
bartowski/command-a-plus-05-2026-GGUF · Hugging Face
Try with latest llama.cpp version. Share your t/s benchmarks & feedback submitted by /u/pmttyji 6

r/MachineLearning · community · 13d ago
I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]
Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled? The setup: compile a human demo into an object-centric graph (what changed in the world:… 7

NVIDIA Developer Blog · official-blog · 13d ago
NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance
NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium....
17

Hugging Face Daily Papers · research · 13d ago
MVEB: Massive Video Embedding Benchmark
Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset… 7

The Information — AI · news-outlet · 13d ago
Index Startup Ornn Launches Anthropic, OpenAI Token Benchmarks
Ornn, a startup that tracks the cost of computing power for artificial intelligence, has launched a service to track the price of tokens produced by the leading AI labs. The new benchmark comes as AI firms’ customers and financial backers search for better ways to track major AI… 9

Hugging Face Daily Papers · research · 14d ago
Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking
Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long… 28

Hugging Face Daily Papers · research · 14d ago
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions
Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.… 13

arXiv — Machine Learning · research · 14d ago
Benchmarking Instance-Dependent Label Noise with Controlled Corruptions
arXiv:2606.14965v1 Announce Type: new Abstract: Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the… 21

arXiv — Machine Learning · research · 14d ago
Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability
arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,… 23

arXiv — Machine Learning · research · 14d ago
EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction
arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety.
However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data… 9

arXiv — Machine Learning · research · 14d ago
Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models
arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and… 28

arXiv — NLP / Computation & Language · research · 14d ago
Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models
arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that… 35

arXiv — NLP / Computation & Language · research · 14d ago
CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction
arXiv:2606.15069v1 Announce Type: new Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that the… 20

arXiv — NLP / Computation & Language · research · 14d ago
Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation
arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts.
Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal… 10

arXiv — NLP / Computation & Language · research · 14d ago
Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus
arXiv:2606.15345v1 Announce Type: new Abstract: Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and… 21

arXiv — NLP / Computation & Language · research · 14d ago
EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management
arXiv:2606.15532v1 Announce Type: new Abstract: Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only… 26

arXiv — NLP / Computation & Language · research · 14d ago
Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation
arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation… 28

arXiv — NLP / Computation & Language · research · 14d ago
EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries
arXiv:2606.15735v1 Announce Type: new Abstract: Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making.… 26

arXiv — NLP / Computation & Language · research · 14d ago
Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations
arXiv:2606.15903v1 Announce Type: new Abstract: Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes… 21

arXiv — NLP / Computation & Language · research · 14d ago
FinBalance: A Multi-Document Accounting Reconciliation Benchmark
arXiv:2606.15949v1 Announce Type: new Abstract: Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a… 32

arXiv — NLP / Computation & Language · research · 14d ago
A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization
arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning… 30

arXiv — NLP / Computation & Language · research · 14d ago
Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design
arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains… 23

arXiv — NLP / Computation & Language · research · 14d ago
Who Flips?
Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a… 28

arXiv — NLP / Computation & Language · research · 14d ago
AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models
arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We… 33

arXiv — NLP / Computation & Language · research · 14d ago
GRACE: Step-Level Benchmark for Faithful Reasoning over Context
arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can… 15

arXiv — NLP / Computation & Language · research · 14d ago
Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework
arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,… 36

Hugging Face Daily Papers · research · 14d ago
VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical… 16

Hugging Face Daily Papers · research · 14d ago
VisualClaw: A Real-Time, Personalized Agent for the Physical World
Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as… 32

r/LocalLLaMA · community · 14d ago
HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)
Link to last post Before anything else, I'd like to sincerely thank u/jipok_ for helping out by highlighting a few weak questions, categories and scoring issues, which have now been addressed (Dropping >100 questions, tuning the scoring methodology for more accuracy, etc).… 19

r/LocalLLaMA · community · 14d ago
Evalatro: an open benchmark where LLMs play the real Balatro
Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics.
Then the idea grew into something… 21

r/LocalLLaMA · community · 14d ago
I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table
So every time I pick a model for a feature or random use-case I have I end up having like 12 tabs open — usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if thats relevant, a status/model page for throughput or… 34

arXiv — Machine Learning · research · 15d ago
Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability
arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability… 33

arXiv — Machine Learning · research · 15d ago
Can Deep Neural Networks Improve Compression of Very Large Scientific Data?
arXiv:2606.14353v1 Announce Type: new Abstract: Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art-compressors follow a… 36

arXiv — Machine Learning · research · 15d ago
Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications… 5

arXiv — Machine Learning · research · 15d ago
EM-NeSy: Expectation Maximization for Neurosymbolic Learning
arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI.
State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating… 38

arXiv — NLP / Computation & Language · research · 15d ago
The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks… 29

arXiv — NLP / Computation & Language · research · 15d ago
Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces
arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce… 25

arXiv — NLP / Computation & Language · research · 15d ago
Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents
arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this… 10

arXiv — NLP / Computation & Language · research · 15d ago
Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR
arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior… 15

Page 5 of 10 · 500 articles