Benchmark

500 articles archived under #benchmark

arXiv — Machine Learning research 11d ago

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

arXiv:2606.19636v1 Announce Type: new Abstract: Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic…

20
arXiv — Machine Learning research 11d ago

Efficient Neural Network Model Selection for Few-Class Application Datasets

arXiv:2606.19712v1 Announce Type: new Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are…

29
arXiv — NLP / Computation & Language research 11d ago

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

arXiv:2606.19352v1 Announce Type: new Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented…

16
arXiv — NLP / Computation & Language research 11d ago

LaViSA: A Language and Vision Structural Ambiguity Benchmark

arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving…

22
arXiv — NLP / Computation & Language research 11d ago

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

arXiv:2606.19881v1 Announce Type: new Abstract: Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector…

38
arXiv — NLP / Computation & Language research 11d ago

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for…

34
arXiv — NLP / Computation & Language research 11d ago

Benchmarking Agentic Review Systems

arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems…

15
arXiv — NLP / Computation & Language research 11d ago

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

arXiv:2606.19788v1 Announce Type: cross Abstract: We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies,…

29
arXiv — NLP / Computation & Language research 11d ago

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

arXiv:2606.19830v1 Announce Type: cross Abstract: Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to…

6
arXiv — NLP / Computation & Language research 11d ago

TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

arXiv:2507.00875v3 Announce Type: replace Abstract: Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology,…

38
arXiv — NLP / Computation & Language research 11d ago

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and…

22
Hugging Face Daily Papers research 11d ago

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Abstract FAPO optimizes LLM pipelines by combining prompt editing with structural changes, demonstrating superior performance across multiple benchmarks and security tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-step LLM pipelines fail through interactions among…

38
Hugging Face Daily Papers research 11d ago

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Abstract FreeStyle is a scalable dual-reference generation framework that uses community LoRA mining to create large-scale style-content triplets while addressing content leakage through disentanglement mechanisms and a comprehensive benchmark. Generated by…

16
Hugging Face Daily Papers research 11d ago

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

27
Hugging Face Daily Papers research 11d ago

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by…

23
r/LocalLLaMA community 11d ago

Cutting LLM Token Costs with rtk, headroom, and caveman - savings measured on real workloads

rtk, headroom, and caveman keep showing up whenever someone posts about cutting their token bill 60-90%. I wanted to know what they save on an actual bill instead of a benchmark, so I replayed all three over my own Claude Code history. My corpus was 500 of my own Claude Code…

11
r/MachineLearning community 11d ago

Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D]

I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with…

25
r/LocalLLaMA community 11d ago

GLM-5.2 Is The Best Open Weight Creative Writing Model

As per Sam Paech's Creative Writing Benchmark on EQ Bench: https://eqbench.com/creative_writing.html (submitted by /u/Few_Painter_5588)

24
Hugging Face Daily Papers research 11d ago

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Abstract 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. Generated by…

4
Hugging Face Daily Papers research 11d ago

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be…

6
Hugging Face Daily Papers research 11d ago

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and…

29
Hugging Face Daily Papers research 11d ago

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets

Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predictive code…

17
r/LocalLLaMA community 11d ago

Le Chaton Fat Flash local when?

We are very happy with Le Chaton Fat SOTA but most of us would like to run it locally. You know, for privacy and sovereignty reasons. Does anyone have any updates when a local "flash" or "small" version is available? (submitted by /u/corpo_monkey)

31
arXiv — Machine Learning research 12d ago

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule…

23
arXiv — Machine Learning research 12d ago

Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting

arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on…

19
arXiv — Machine Learning research 12d ago

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under…

7
arXiv — Machine Learning research 12d ago

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized…

37
arXiv — NLP / Computation & Language research 12d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory…

22
arXiv — Machine Learning research 12d ago

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

arXiv:2606.18970v1 Announce Type: new Abstract: Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However,…

38
arXiv — Machine Learning research 12d ago

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection…

16
arXiv — NLP / Computation & Language research 12d ago

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the…

19
arXiv — NLP / Computation & Language research 12d ago

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic…

11
arXiv — NLP / Computation & Language research 12d ago

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators…

37
arXiv — NLP / Computation & Language research 12d ago

RedactionBench

arXiv:2606.18782v1 Announce Type: new Abstract: Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction…

22
arXiv — NLP / Computation & Language research 12d ago

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is…

6
arXiv — NLP / Computation & Language research 12d ago

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

arXiv:2606.18686v1 Announce Type: cross Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce…

30
arXiv — NLP / Computation & Language research 12d ago

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge…

35
arXiv — NLP / Computation & Language research 12d ago

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution…

38
arXiv — NLP / Computation & Language research 12d ago

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on…

35
Hugging Face Daily Papers research 12d ago

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical…

24
Hugging Face Daily Papers research 12d ago

Physics-IQ Verified

Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video…

29
Hugging Face Daily Papers research 12d ago

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory…

23
Hugging Face official-blog 12d ago

Is it agentic enough? Benchmarking open models on your own tooling

Published June 18, 2026. By Lysandre (lysandre), Nathan Habib (SaylorTwift), and Pedro Cuenca (pcuenq). Benchmarking transformers revisions across different metrics. This is a…

26
r/MachineLearning community 12d ago

How do you analyze the relative "strength" of probes? [R]

This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. I found this old post on trying…

21
arXiv — NLP / Computation & Language research 13d ago

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

arXiv:2606.17579v1 Announce Type: cross Abstract: Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input…

22
arXiv — NLP / Computation & Language research 13d ago

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

arXiv:2606.17354v1 Announce Type: new Abstract: Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations…

16
arXiv — NLP / Computation & Language research 13d ago

NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned,…

12
arXiv — NLP / Computation & Language research 13d ago

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the…

7
arXiv — NLP / Computation & Language research 13d ago

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

arXiv:2606.17905v1 Announce Type: new Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests…

10
arXiv — NLP / Computation & Language research 13d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale…

36
