Tag

Benchmark

500 articles archived under #benchmark

arXiv — NLP / Computation & Language research 6d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of…

36
arXiv — NLP / Computation & Language research 6d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real…

21
arXiv — NLP / Computation & Language research 6d ago

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

arXiv:2606.24627v1 Announce Type: new Abstract: Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect…

4
arXiv — NLP / Computation & Language research 6d ago

CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

arXiv:2606.24714v1 Announce Type: new Abstract: Chinese news text contains dense written forms such as scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names. These forms are frequent in real listening…

33
arXiv — NLP / Computation & Language research 6d ago

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv:2606.24391v1 Announce Type: cross Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept…

29
arXiv — NLP / Computation & Language research 6d ago

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

arXiv:2606.24648v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained…

15
arXiv — NLP / Computation & Language research 6d ago

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,…

20
arXiv — NLP / Computation & Language research 6d ago

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

arXiv:2501.11790v5 Announce Type: replace Abstract: Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable…

29
arXiv — NLP / Computation & Language research 6d ago

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

arXiv:2505.18542v4 Announce Type: replace Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional…

32
Hugging Face Daily Papers research 6d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation…

21
Hugging Face Daily Papers research 6d ago

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Abstract Text-to-image models fail to generate counterfactual scenes because they rely on tightly coupled visual-textual patterns rather than causal reasoning, demonstrating limited understanding beyond pattern matching. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Text-to-image…

26
r/MachineLearning community 6d ago

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks: Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining. High diversity: Tasks span a broad pool of 91 repositories across 5…

9
Hugging Face official-blog 6d ago

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Published June 24, 2026. By Daniel Gert Nielsen (treble-technologies), Shivam Saini (treble-technologies), Alessia Milo…

11
Vercel — AI dev-tools 6d ago

GLM 5.2 Fast via Wafer now available on AI Gateway

GLM 5.2 Fast via Wafer is now available on AI Gateway. Based on our own benchmarking across small-context, large-context, and tool-call scenarios, Wafer delivers 2x higher throughput than other providers serving GLM-5.2 on serverless, leading on decode and end-to-end speed…

7
r/LocalLLaMA community 6d ago

OpenMythos benchmarks

Hey everyone! OpenMythos benchmarks are finally here; sorry it took about a week to post these. The delay was mainly because SWE-bench results weren't matching up with Qwen 3.6 27B official numbers. Turns out Qwen used a different eval harness and also refined/filtered the…

12
r/LocalLLaMA community 6d ago

I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.

I ran a small benchmark on LLMs for medical scribing. Reason: most discussion around AI scribe safety focuses on hallucinations. That matters, but in notes I kept seeing another problem: models often leave out clinically relevant details from the conversation. So I evaluated 8…

10
Hacker News — AI on Front Page community 6d ago

Krea 2: SOTA open-weights 12B image model

Article URL: https://www.krea.ai/blog/krea-2-technical-report Comments URL: https://news.ycombinator.com/item?id=48646659 Points: 247 # Comments: 33

4
Hugging Face Daily Papers research 6d ago

Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Abstract Research examines how self-driving car systems and humans perform on visual question answering tasks across different geographic locations, revealing that both human and AI responses diverge based on question types but show similar performance regardless of location.…

5
r/LocalLLaMA community 6d ago

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

Ran three open-weight TTS models head to head on CPU. Intel Xeon, 4 cores, 15.6GB RAM, no GPU. Five configs, six text lengths from 12 to 1712 chars, 5 timed reps per cell after warmup, 150 timed runs total. Every audio output scored with UTMOS (utmos22_strong) so quality isn't…

19
r/LocalLLaMA community 6d ago

Human Evaluation of GLM-5.2

I've seen plenty of benchmarks that put GLM-5.2 below many of the closed source alternatives but at their heels. I thought to myself, next version GLM will totally be where the best frontiers are at now. The last few days I've been testing it on a real world project, and it's…

6
Hugging Face Daily Papers research 6d ago

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Abstract HAKARI-Bench provides a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct With the rapid spread of…

23
Hugging Face Daily Papers research 7d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured…

19
Hugging Face Daily Papers research 7d ago

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents…

30
Hugging Face Daily Papers research 7d ago

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search Agents (SAs) typically leverage large language models (LLMs) to…

14
Hugging Face Daily Papers research 7d ago

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Abstract PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, leading to development of DelveAgent framework that improves accuracy through modular design and physics-grounded mechanisms. Generated by…

5
r/MachineLearning community 7d ago

Non-deterministic Vulnerability Detection Benchmark System [P]

I work in firmware adjacent to AI, so not an ML guy exactly, so that's why I've come here. For work we got a bit concerned about Mythos and all the hype made me explore some benchmarking work. I now have this pretty cool benchmark that's about 80% done sitting around and haven't…

26
r/MachineLearning community 7d ago

Syntactically robust NLI for semantics of imperfectly generated text? [R]

Hi all, I'm looking for literature on relatively specific tooling. In autoregressive LLMs, there is substantial published work that used NLI on sub-claims produced by LLMs to gauge correctness of LLM answers. In diffusion (or D-) LLMs, the SoTA model generations that I see…

37
r/LocalLLaMA community 7d ago

NEX-N2-mini: "There is no Pareto frontier. I am Pareto". This Qwen3.5-MoE fine tune fixed 3.5 and 3.6 overthinking apparently on my tests.

I have been testing all popular MoE for my Mac and it seems I just found gold: 3.5/3.6 level of reasoning (if not slightly superior) at a fraction of the reasoning tokens used (wasted). Dynamic plot with other benchmarks here: https://benchmark-yourself.streamlit.app/…

4
r/LocalLLaMA community 7d ago

Gemma 4 QAT 31B responds better to KV cache quantization too

I've run the benchmark from this post and got even better results on Gemma 4 31B. Submitted by /u/justicecurcian.

29
Hugging Face Daily Papers research 7d ago

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

Abstract SpatialAvatar-0 enables high-quality 4D head avatar generation by combining feed-forward prediction with per-subject refinement through a shared Gaussian representation, achieving superior performance across multiple benchmarks. Generated by…

20
Hugging Face Daily Papers research 8d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract Current memory agents lack reliable shared institutional deployment due to challenges in balancing utility, access control, and forgetting across multiple principals with diverse authorization contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory benchmarks for…

5
r/LocalLLaMA community 8d ago

Leaderboard for quantized models, similar to artificial analysis?

Artificial analysis’ leaderboard for models is somewhat useful for comparing model intelligence, but does not take into account quantization for open models. Is there a way to better compare quantized open models against each other and proprietary models other than running them…

35
Hugging Face Daily Papers research 8d ago

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while ObsMem framework addresses challenges in partial observability and memory translation for decision-making. Generated by Qwen/Qwen2.5-Coder-32B-Instruct To assist humans…

19
r/LocalLLaMA community 8d ago

Best local model for vision - 2nd benchmark update - 21 Jun 2026

I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it…

9
r/LocalLLaMA community 8d ago

GLM-5.2 benchmarked on DeepSWE: Beats Gemini & GPT-5.4, but the token volume/cost makes it wildly inefficient? (Theo - t3.gg)

Saw this breakdown from Theo (t3.gg) on X showing the latest DeepSWE leaderboard stats for the new GLM-5.2 open-weight model. The good news: it's officially surpassing GPT-5.4 and the entire Gemini lineup in raw coding capability. Seeing an open-weight model punch that high is…

15
r/LocalLLaMA community 10d ago

Some llama.cpp B70 SYCL benchmarks

build: dd4623a74 (9640)

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | pp512 | 1578.19 ± 7.82 |
…

11
r/LocalLLaMA community 10d ago

I benchmarked Claude's "Fast C++". It wasn't faster

Submitted by /u/User_Deprecated.

15
Hugging Face Daily Papers research 10d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
Hugging Face Daily Papers research 10d ago

The Data Manifold under the Microscope

Abstract A benchmarking framework is introduced to study data-manifold geometry by extending dSprites and COIL-20 datasets with additional transformation dimensions and dense sampling, enabling accurate estimation of curvature, reach, and volume for theoretical analysis and…

36
r/LocalLLaMA community 10d ago

Benchmarking or benchmarketing?

Maybe I’m getting cynical, but LLM benchmarking is starting to feel less like measurement and more like marketing and positioning. Every week there’s a new leaderboard score, new chart, new eval suite, or some claim that a model is suddenly the best. It feels like benchmarks…

35
r/LocalLLaMA community 10d ago

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts

You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks. And more importantly, it is a new benchmark that is not saturated, so no one can…

32
Hugging Face Daily Papers research 10d ago

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

33
r/LocalLLaMA community 10d ago

Has anyone here used VibeThinker-3B outside benchmarks?

Just curious, given the hype and benchmark numbers. Curious about real-world behavior: debugging, coding assistance, reasoning over messy prompts, local latency, failure modes, and whether it actually feels useful versus just optimized for verifiable evals.…

23
Hugging Face Daily Papers research 10d ago

No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

Abstract Research addresses code generation challenges for no-resource programming languages by developing benchmarks and proposing a method that combines further pre-training with weight difference transfer to create specialized instruction-following models at reduced…

27
r/LocalLLaMA community 10d ago

Researchers trained a Deep Research agent with 32 H100s and open-sourced everything

Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples. The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several…

13
Hugging Face Daily Papers research 10d ago

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Abstract Game development frameworks and benchmarks were created using data from game jam competitions to evaluate code generation and project-level programming capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Current AI-driven game development has made substantial…

25
Hugging Face Daily Papers research 10d ago

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Abstract A large-scale real-world dataset called DF3DV-1K is introduced to address the lack of clean and cluttered image sets for distractor-free radiance field research, containing 1,048 scenes with 89,924 images across 128 distractor types and 161 scene themes, along with a…

5
arXiv — NLP / Computation & Language research 11d ago

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

arXiv:2606.19558v1 Announce Type: cross Abstract: Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a…

32
arXiv — Machine Learning research 11d ago

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for…

35
arXiv — Machine Learning research 11d ago

MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery

arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the…

16
