Tag: Benchmark · 500 articles archived under #benchmark

arXiv — Machine Learning research 11d ago Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation arXiv:2606.19636v1 Announce Type: new Abstract: Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic… 20

arXiv — Machine Learning research 11d ago Efficient Neural Network Model Selection for Few-Class Application Datasets arXiv:2606.19712v1 Announce Type: new Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are… 29

arXiv — NLP / Computation & Language research 11d ago Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards arXiv:2606.19352v1 Announce Type: new Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented… 16

arXiv — NLP / Computation & Language research 11d ago LaViSA: A Language and Vision Structural Ambiguity Benchmark arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding.
Visual scenes serve as useful cues for resolving… 22

arXiv — NLP / Computation & Language research 11d ago REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection arXiv:2606.19881v1 Announce Type: new Abstract: Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector… 38

arXiv — NLP / Computation & Language research 11d ago The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for… 34

arXiv — NLP / Computation & Language research 11d ago Benchmarking Agentic Review Systems arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems… 15

arXiv — NLP / Computation & Language research 11d ago CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models arXiv:2606.19788v1 Announce Type: cross Abstract: We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models.
CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies,… 29

arXiv — NLP / Computation & Language research 11d ago JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines arXiv:2606.19830v1 Announce Type: cross Abstract: Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to… 6

arXiv — NLP / Computation & Language research 11d ago TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law arXiv:2507.00875v3 Announce Type: replace Abstract: Translating Hong Kong Court Judgments from English to Traditional Chinese is mandated by Articles 8-9 of the Basic Law, yet remains constrained by a shortage of parallel resources and rigorous demands on legal terminology,… 38

arXiv — NLP / Computation & Language research 11d ago ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and… 22

Hugging Face Daily Papers research 11d ago FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines Abstract FAPO optimizes LLM pipelines by combining prompt editing with structural changes, demonstrating superior performance across multiple benchmarks and security tasks.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-step LLM pipelines fail through interactions among… 38

Hugging Face Daily Papers research 11d ago FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining Abstract FreeStyle is a scalable dual-reference generation framework that uses community LoRA mining to create large-scale style-content triplets while addressing content leakage through disentanglement mechanisms and a comprehensive benchmark. Generated by… 16

Hugging Face Daily Papers research 11d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27

Hugging Face Daily Papers research 11d ago REVES: REvision and VErification--Augmented Training for Test-Time Scaling Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by… 23

r/LocalLLaMA community 11d ago Cutting LLM Token Costs with rtk, headroom, and caveman - savings measured on real workloads rtk, headroom, and caveman keep showing up whenever someone posts about cutting their token bill 60-90%. I wanted to know what they save on an actual bill instead of a benchmark, so I replayed all three over my own Claude Code history.
My corpus was 500 of my own Claude Code… 11

r/MachineLearning community 11d ago Voice debugging at the conversation level seems far more useful than isolated benchmark metrics [D] I have been thinking a lot about how poorly isolated benchmark metrics capture real conversational system quality once models are deployed into multi-turn environments. You can have strong STT scores, decent latency, high task completion rates, and still end up with… 25

r/LocalLLaMA community 11d ago GLM-5.2 Is The Best Open Weight Creative Writing Model As Per Sam Paech's Creative Writing Benchmark on EQ Bench: https://eqbench.com/creative_writing.html submitted by /u/Few_Painter_5588 24

Hugging Face Daily Papers research 11d ago MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction Abstract 3D point motion forecasting model predicts object trajectories from visual history and language goals, demonstrating superior performance on benchmarks and transferring effectively to robot manipulation and video generation tasks. Generated by… 4

Hugging Face Daily Papers research 11d ago iOSWorld: A Benchmark for Personally Intelligent Phone Agents Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be… 6

Hugging Face Daily Papers research 11d ago MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and… 29

Hugging Face Daily Papers research 11d ago A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predictive code… 17

r/LocalLLaMA community 11d ago Le Chaton Fat Flash local when? We are very happy with Le Chaton Fat SOTA but most of us would like to run it locally. You know, for privacy and sovereignty reasons. Does anyone have any updates when a local "flash" or "small" version is available? submitted by /u/corpo_monkey… 31

arXiv — Machine Learning research 12d ago ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets arXiv:2606.18338v1 Announce Type: new Abstract: The search for life beyond Earth will depend on detecting faint signatures in the atmospheres of potentially habitable exoplanets. Interpreting those signatures requires understanding the host planet's climate: the same molecule… 23

arXiv — Machine Learning research 12d ago Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes.
We introduce regime-stratified evaluation and apply it to three TSFMs on… 19

arXiv — Machine Learning research 12d ago TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under… 7

arXiv — Machine Learning research 12d ago MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized… 37

arXiv — NLP / Computation & Language research 12d ago GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory… 22

arXiv — Machine Learning research 12d ago A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI arXiv:2606.18970v1 Announce Type: new Abstract: Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains.
However,… 38

arXiv — Machine Learning research 12d ago Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection… 16

arXiv — NLP / Computation & Language research 12d ago VISUALSKILL: Multimodal Skills for Computer-Use Agents arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the… 19

arXiv — NLP / Computation & Language research 12d ago Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic… 11

arXiv — NLP / Computation & Language research 12d ago LegalWorld: A Life-Cycle Interactive Environment for Legal Agents arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators… 37

arXiv — NLP / Computation & Language research 12d ago RedactionBench arXiv:2606.18782v1 Announce Type: new Abstract: Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII).
While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction… 22

arXiv — NLP / Computation & Language research 12d ago G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment arXiv:2606.18989v1 Announce Type: new Abstract: Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is… 6

arXiv — NLP / Computation & Language research 12d ago ForecastBench-Sim: A Simulated-World Forecasting Benchmark arXiv:2606.18686v1 Announce Type: cross Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce… 30

arXiv — NLP / Computation & Language research 12d ago IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages arXiv:2606.19157v1 Announce Type: cross Abstract: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge… 35

arXiv — NLP / Computation & Language research 12d ago ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning.
To address this gap, we present ASyMOB, a high-resolution… 38

arXiv — NLP / Computation & Language research 12d ago FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on… 35

Hugging Face Daily Papers research 12d ago IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical… 24

Hugging Face Daily Papers research 12d ago Physics-IQ Verified Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video… 29

Hugging Face Daily Papers research 12d ago Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory… 23

Hugging Face official-blog 12d ago Is it agentic enough?
Benchmarking open models on your own tooling Published June 18, 2026 Lysandre (lysandre), Nathan Habib (SaylorTwift), Pedro Cuenca (pcuenq) Benchmarking transformers revisions across different metrics This is a… 26

r/MachineLearning community 12d ago How do you analyze the relative "strength" of probes? [R] This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. I found this old post on trying… 21

arXiv — NLP / Computation & Language research 13d ago LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks arXiv:2606.17579v1 Announce Type: cross Abstract: Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input… 22

arXiv — NLP / Computation & Language research 13d ago Translating the Untranslatable: An Operationalizable Ontology for Untranslatability arXiv:2606.17354v1 Announce Type: new Abstract: Untranslatability, cases where meaning cannot be directly preserved across languages, is well-studied in linguistics but underexplored in NLP. As machine translation (MT) systems improve on standard benchmarks, their limitations… 16

arXiv — NLP / Computation & Language research 13d ago NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail.
We benchmark 21 models, spanning classical, fine-tuned,… 12

arXiv — NLP / Computation & Language research 13d ago The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the… 7

arXiv — NLP / Computation & Language research 13d ago ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions arXiv:2606.17905v1 Announce Type: new Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English--Chinese aligned benchmark that tests… 10

arXiv — NLP / Computation & Language research 13d ago ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale… 36