Tag: Benchmark
500 articles archived under #benchmark
arXiv — NLP / Computation & Language research 6d ago AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of… 36
arXiv — NLP / Computation & Language research 6d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real… 21
arXiv — NLP / Computation & Language research 6d ago The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking arXiv:2606.24627v1 Announce Type: new Abstract: Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect… 4
arXiv — NLP / Computation & Language research 6d ago CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names.
These forms are frequent in real listening… 33
arXiv — NLP / Computation & Language research 6d ago Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War arXiv:2606.24391v1 Announce Type: cross Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept… 29
arXiv — NLP / Computation & Language research 6d ago ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained… 15
arXiv — NLP / Computation & Language research 6d ago CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,… 20
arXiv — NLP / Computation & Language research 6d ago Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions arXiv:2501.11790v5 Announce Type: replace Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination.
Consequently, developing a reliable… 29
arXiv — NLP / Computation & Language research 6d ago Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs arXiv:2505.18542v4 Announce Type: replace Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional… 32
Hugging Face Daily Papers research 6d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation… 21
Hugging Face Daily Papers research 6d ago Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image… 26
r/MachineLearning community 6d ago DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R] DeepSWE delivers four advances over existing public benchmarks: Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5… 9
Hugging Face official-blog 6d ago Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World Published June 24, 2026 Daniel Gert Nielsen daniel-treble treble-technologies Shivam Saini whojavumusic treble-technologies Alessia Milo alessia-treble… 11
Vercel — AI dev-tools 6d ago GLM 5.2 Fast via Wafer now available on AI Gateway GLM 5.2 Fast via Wafer is now available on AI Gateway. Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers a 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed… 7
r/LocalLLaMA community 6d ago OpenMythos benchmarks Hey everyone! OpenMythos benchmarks are finally here sorry it took about a week to post these. The delay was mainly because SWE-bench results weren't matching up with Qwen 3.6 27B official numbers. Turns out Qwen used a different eval harness and also refined/filtered the… 12
r/LocalLLaMA community 6d ago I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention. I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation.
So I evaluated 8… 10
Hacker News — AI on Front Page community 6d ago Krea 2: SOTA open-weights 12B image model Article URL: https://www.krea.ai/blog/krea-2-technical-report Comments URL: https://news.ycombinator.com/item?id=48646659 Points: 247 # Comments: 33 4
Hugging Face Daily Papers research 6d ago Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City Abstract Research examines how self-driving car systems and humans perform on visual question answering tasks across different geographic locations, revealing that both human and AI responses diverge based on question types but show similar performance regardless of location.… 5
r/LocalLLaMA community 6d ago CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't… 19
r/LocalLLaMA community 6d ago Human Evaluation of GLM-5.2 I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's… 6
Hugging Face Daily Papers research 6d ago HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions Abstract HAKARI-Bench provides a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct With the rapid spread of… 23
Hugging Face Daily Papers research 7d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19
Hugging Face Daily Papers research 7d ago EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents… 30
Hugging Face Daily Papers research 7d ago DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to… 14
Hugging Face Daily Papers research 7d ago Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark Abstract PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, leading to development of DelveAgent framework that improves accuracy through modular design and physics-grounded mechanisms. Generated by… 5
r/MachineLearning community 7d ago Non-deterministic Vulnerability Detection Benchmark System [P] I work in firmware adjacent to AI, so not an ML guy exactly, so that's why I've come here. For work we got a bit concerned about Mythos and all the hype made me explore some benchmarking work.
I now have this pretty cool benchmark that's about 80% done sitting around and haven't… 26
r/MachineLearning community 7d ago Syntactically robust NLI for semantics of imperfectly generated text? [R] Hi all, I'm looking for literature on relatively specific tooling. In autoregressive LLMs, there is substantial published work that used NLI on sub-claims produced by LLMs to gauge correctness of LLM answers. In diffusion (or D-) LLMs, the SoTA model generations that I see… 37
r/LocalLLaMA community 7d ago NEX-N2-mini: "There is no Pareto frontier. I am Pareto". This Qwen3.5-MoE fine tune fixed 3.5 and 3.6 overthinking apparently on my tests. I have been testing all popular MoE for my Mac and it seems I just found gold: 3.5/3.6 level of reasoning (if not slightly superior) at a fraction of the reasoning tokens used (wasted). Dynamic plot with other benchmarks here: https://benchmark-yourself.streamlit.app/… 4
r/LocalLLaMA community 7d ago Gemma 4 QAT 31B responds better to KV cache quantization too I've run benchmark from this post and got even better results on Gemma 4 31B submitted by /u/justicecurcian 29
Hugging Face Daily Papers research 7d ago SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction Abstract SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks. Generated by… 20
Hugging Face Daily Papers research 8d ago GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents Abstract Current memory agents lack reliable shared institutional deployment due to challenges in balancing utility, access control, and forgetting across multiple principals with diverse authorization contexts.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory benchmarks for… 5
r/LocalLLaMA community 8d ago Leaderboard for quantized models, similar to artificial analysis? Artificial analysis’ leaderboard for models is somewhat useful for comparing model intelligence, but does not take into account quantization for open models. Is there a way to better compare quantized open models against each other and proprietary models other than running them… 35
Hugging Face Daily Papers research 8d ago WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents Abstract WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while ObsMem framework addresses challenges in partial observability and memory translation for decision-making. Generated by Qwen/Qwen2.5-Coder-32B-Instruct To assist humans… 19
r/LocalLLaMA community 8d ago Best local model for vision - 2nd benchmark update - 21 Jun 2026 I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it… 9
r/LocalLLaMA community 8d ago GLM-5.2 benchmarked on DeepSWE: Beats Gemini & GPT-5.4, but the token volume/cost makes it wildly inefficient? (Theo - t3.gg) Saw this breakdown from Theo (t3.gg) on X showing the latest DeepSWE leaderboard stats for the new GLM-5.2 open-weight model. The good news: it's officially surpassing GPT-5.4 and the entire Gemini lineup in raw coding capability.
Seeing an open-weight model punch that high is… 15
r/LocalLLaMA community 10d ago Some llama.cpp B70 SYCL benchmarks build: dd4623a74 (9640)
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | pp512 | 1578.19 ± 7.82 |
… 11
r/LocalLLaMA community 10d ago I benchmarked Claude's "Fast C++". It wasn't faster submitted by /u/User_Deprecated 15
Hugging Face Daily Papers research 10d ago Context-Aware RL for Agentic and Multimodal LLMs Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by… 21
Hugging Face Daily Papers research 10d ago The Data Manifold under the Microscope Abstract A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and… 36
r/LocalLLaMA community 10d ago Benchmarking or benchmarketing? Maybe I’m getting cynical, but LLM benchmarking is starting to feel less like measurement and more like marketing and positioning. Every week there’s a new leaderboard score, new chart, new eval suite, or some claim that a model is suddenly the best. It feels like benchmarks… 35
r/LocalLLaMA community 10d ago New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks.
And more importantly, it is a new benchmark that is not saturated, so no one can… 32
Hugging Face Daily Papers research 10d ago Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 33
r/LocalLLaMA community 10d ago Has anyone here used VibeThinker-3B outside benchmarks? Just curious, given the hype and benchmark numbers. Curious about real-world behavior: debugging, coding assistance, reasoning over messy prompts, local latency, failure modes, and whether it actually feels useful versus just optimized for verifiable evals.… 23
Hugging Face Daily Papers research 10d ago No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages Abstract Research addresses code generation challenges for no-resource programming languages by developing benchmarks and proposing a method that combines further pre-training with weight difference transfer to create specialized instruction-following models at reduced… 27
r/LocalLLaMA community 10d ago Researchers trained a Deep Research agent with 32 H100s and open-sourced everything Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples. The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several… 13
Hugging Face Daily Papers research 10d ago JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines Abstract Game development frameworks and benchmarks were created using data from game jam competitions to evaluate code generation and project-level programming capabilities.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Current AI-driven game development has made substantial… 25
Hugging Face Daily Papers research 10d ago DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis Abstract A large-scale real-world dataset called DF3DV-1K is introduced to address the lack of clean and cluttered image sets for distractor-free radiance field research, containing 1,048 scenes with 89,924 images across 128 distractor types and 161 scene themes, along with a… 5
arXiv — NLP / Computation & Language research 11d ago Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment arXiv:2606.19558v1 Announce Type: cross Abstract: Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a… 32
arXiv — Machine Learning research 11d ago IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for… 35
arXiv — Machine Learning research 11d ago MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the… 16
Page 3 of 10 · 500 articles