Tag

Benchmark

500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 13d ago

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

arXiv:2606.17339v1 Announce Type: cross Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated…

15
arXiv — NLP / Computation & Language research 13d ago

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with…

17
arXiv — NLP / Computation & Language research 13d ago

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.…

24
arXiv — NLP / Computation & Language research 13d ago

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically…

33
arXiv — NLP / Computation & Language research 13d ago

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,…

21
arXiv — NLP / Computation & Language research 13d ago

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

arXiv:2606.18158v1 Announce Type: cross Abstract: Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the…

38
arXiv — NLP / Computation & Language research 13d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
Hugging Face Daily Papers research 13d ago

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Abstract ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models…

37
Hugging Face Daily Papers research 13d ago

ProCUA-SFT Technical Report

Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents…

4
Hugging Face Daily Papers research 13d ago

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and…

19
OpenAI official-blog 13d ago

Introducing LifeSciBench

Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.

19
r/LocalLLaMA community 13d ago

bartowski/command-a-plus-05-2026-GGUF · Hugging Face

Try with latest llama.cpp version. Share your t/s benchmarks & feedback. Submitted by /u/pmttyji

6
r/MachineLearning community 13d ago

I built a leakage-clean verifier for robot manipulation, is this useful? Am I solving a non-problem? [D]

Spent the last few weeks on a benchmark/harness that tries to answer one question honestly: did a robot arm actually do the demonstrated task, or did the success metric just get fooled? The setup: compile a human demo into an object-centric graph (what changed in the world:…

7
NVIDIA Developer Blog official-blog 13d ago

NVIDIA Blackwell Tops MLPerf Training 6.0 with Industry-Leading Scale and Performance

NVIDIA delivered a clean sweep in MLPerf Training v6.0, the latest edition of industry-standard AI training benchmarks developed by the MLCommons consortium....

17
Hugging Face Daily Papers research 13d ago

MVEB: Massive Video Embedding Benchmark

Abstract A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset…

7
The Information — AI news-outlet 13d ago

Index Startup Ornn Launches Anthropic, OpenAI Token Benchmarks

Ornn, a startup that tracks the cost of computing power for artificial intelligence, has launched a service to track the price of tokens produced by the leading AI labs. The new benchmark comes as AI firms’ customers and financial backers search for better ways to track major AI…

9
Hugging Face Daily Papers research 14d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long…

28
Hugging Face Daily Papers research 14d ago

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.…

13
arXiv — Machine Learning research 14d ago

Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

arXiv:2606.14965v1 Announce Type: new Abstract: Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the…

21
arXiv — Machine Learning research 14d ago

Machine Learning and the Random Walk Puzzle: Forecasting the CAD/USD Exchange Rate with Expanding Window Evaluation and SHAP Interpretability

arXiv:2606.15058v1 Announce Type: new Abstract: This study examines whether machine learning (ML) models can outperform the naive random walk benchmark in forecasting the monthly USD/CAD exchange rate. Using daily data from the Bank of Canada spanning January 2017 to May 2026,…

23
arXiv — Machine Learning research 14d ago

EnvShip-Bench: An Environment-Enhanced Benchmark for Short-Term Vessel Trajectory Prediction

arXiv:2606.15240v1 Announce Type: new Abstract: Vessel trajectory prediction is important for intelligent shipping, maritime surveillance, and navigation safety. However, existing public maritime AIS resources are often limited by inconsistent forecasting protocols, uneven data…

9
arXiv — Machine Learning research 14d ago

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

arXiv:2606.15436v1 Announce Type: new Abstract: Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and…

28
arXiv — NLP / Computation & Language research 14d ago

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that…

35
arXiv — NLP / Computation & Language research 14d ago

CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

arXiv:2606.15069v1 Announce Type: new Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that the…

20
arXiv — NLP / Computation & Language research 14d ago

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal…

10
arXiv — NLP / Computation & Language research 14d ago

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

arXiv:2606.15345v1 Announce Type: new Abstract: Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and…

21
arXiv — NLP / Computation & Language research 14d ago

EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management

arXiv:2606.15532v1 Announce Type: new Abstract: Emotional intelligence (EI) in Large Language Models (LLMs) is often evaluated through static understanding tasks or single-response dialogue generation. However, emotion management is interactive: a good model should not only…

26
arXiv — NLP / Computation & Language research 14d ago

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

arXiv:2606.15643v1 Announce Type: new Abstract: Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation…

28
arXiv — NLP / Computation & Language research 14d ago

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

arXiv:2606.15735v1 Announce Type: new Abstract: Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making.…

26
arXiv — NLP / Computation & Language research 14d ago

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

arXiv:2606.15903v1 Announce Type: new Abstract: Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes…

21
arXiv — NLP / Computation & Language research 14d ago

FinBalance: A Multi-Document Accounting Reconciliation Benchmark

arXiv:2606.15949v1 Announce Type: new Abstract: Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a…

32
arXiv — NLP / Computation & Language research 14d ago

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

arXiv:2606.15974v1 Announce Type: new Abstract: Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning…

30
arXiv — NLP / Computation & Language research 14d ago

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

arXiv:2606.16009v1 Announce Type: new Abstract: Machine interpreting (MI), the live, real-time branch of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains…

23
arXiv — NLP / Computation & Language research 14d ago

Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

arXiv:2606.16011v1 Announce Type: new Abstract: Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a…

28
arXiv — NLP / Computation & Language research 14d ago

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

arXiv:2606.16127v1 Announce Type: new Abstract: The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We…

33
arXiv — NLP / Computation & Language research 14d ago

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can…

15
arXiv — NLP / Computation & Language research 14d ago

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

arXiv:2606.16211v1 Announce Type: new Abstract: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However,…

36
Hugging Face Daily Papers research 14d ago

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

Abstract VibeThinker-3B demonstrates that compact models can achieve state-of-the-art performance on verifiable reasoning tasks through specialized training techniques, challenging conventional scaling assumptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This technical…

16
Hugging Face Daily Papers research 14d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as…

32
r/LocalLLaMA community 14d ago

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

Link to last post Before anything else, I'd like to sincerely thank u/jipok_ for helping out by highlighting a few weak questions, categories and scoring issues, which have now been addressed (Dropping >100 questions, tuning the scoring methodology for more accuracy, etc).…

19
r/LocalLLaMA community 14d ago

Evalatro: an open benchmark where LLMs play the real Balatro

Hey! I made Evalatro - an open benchmark where your LLMs play actual Balatro. Real game. It started because I kept asking Claude to help me beat levels while playing (yeah, I'm too weak). I'd just throw screenshots at it and ask for tactics. Then the idea grew into something…

21
r/LocalLLaMA community 14d ago

I got tired of juggling OpenRouter + Artificial Analysis + Design Arena tabs to pick a model, so I put them in one filterable table

So every time I pick a model for a feature or random use-case I have, I end up having like 12 tabs open: usually OpenRouter for price and context, Artificial Analysis for benchmarks, Design Arena for the UI/frontend Elo if that's relevant, a status/model page for throughput or…

34
arXiv — Machine Learning research 15d ago

Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability…

33
arXiv — Machine Learning research 15d ago

Can Deep Neural Networks Improve Compression of Very Large Scientific Data?

arXiv:2606.14353v1 Announce Type: new Abstract: Error-bounded lossy compression is a fundamental technique for managing the rapidly growing volumes of scientific data produced by modern simulations and observational instruments. Most state-of-the-art compressors follow a…

36
arXiv — Machine Learning research 15d ago

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications…

5
arXiv — Machine Learning research 15d ago

EM-NeSy: Expectation Maximization for Neurosymbolic Learning

arXiv:2606.14463v1 Announce Type: new Abstract: Neurosymbolic (NeSy) models integrate neural networks and symbolic reasoning for robust and interpretable AI. State-of-the-art NeSy models require that the symbolic component is expressed in a differentiable way, often complicating…

38
arXiv — NLP / Computation & Language research 15d ago

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv:2606.13685v1 Announce Type: new Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks…

29
arXiv — NLP / Computation & Language research 15d ago

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce…

25
arXiv — NLP / Computation & Language research 15d ago

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this…

10
arXiv — NLP / Computation & Language research 15d ago

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv:2606.14391v1 Announce Type: new Abstract: Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior…

15
