Tag

Benchmark

500 articles archived under #benchmark

r/LocalLLaMA community 21d ago

Gemma 4 26B A4B IT QAT Comparison

Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro if it was statistically significant because I was too tired to do…

31
arXiv — Machine Learning research 21d ago

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

arXiv:2606.07550v1 Announce Type: new Abstract: Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction…

35
arXiv — Machine Learning research 21d ago

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

arXiv:2606.07591v1 Announce Type: new Abstract: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research…

14
arXiv — Machine Learning research 21d ago

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful…

6
arXiv — Machine Learning research 21d ago

Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models

arXiv:2606.07623v1 Announce Type: new Abstract: This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in…

25
arXiv — Machine Learning research 21d ago

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no…

13
arXiv — Machine Learning research 21d ago

A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

arXiv:2606.07789v1 Announce Type: new Abstract: Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent…

26
Hugging Face Daily Papers research 21d ago

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key…

20
Hugging Face Daily Papers research 21d ago

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by…

19
Hugging Face Daily Papers research 21d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

Abstract SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods. Generated by…

11
Hugging Face Daily Papers research 21d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a…

7
r/LocalLLaMA community 21d ago

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The…

14
r/LocalLLaMA community 21d ago

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by…

9
r/LocalLLaMA community 21d ago

LocalLLaMA post tier list

Since there is much (justified) whining about post quality, I thought it would be helpful to get a sense of what people actually DO like. Here's my take: S-tier: -GGUFs/MLX or benchmark data for new best-in-class local model released - New Optimizations that are actually a big…

17
r/LocalLLaMA community 21d ago

When every other post is an AI generated benchmark report, a question about the best model, or a slop-coded application or engine that pretends to be groundbreaking

Submitted by /u/Honest-Kangaroo-1830

12
Hugging Face Daily Papers research 21d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Abstract UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We…

7
r/LocalLLaMA community 21d ago

[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative decoding combined with KV cache…

20
Hugging Face Daily Papers research 21d ago

GENEB: Why Genomic Models Are Hard to Compare

Abstract GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in genomic foundation models is difficult to assess due to fragmented…

25
Hugging Face Daily Papers research 22d ago

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated…

30
Smol AI News news-outlet 22d ago

not much happened today

FrontierCode benchmark by Cognition highlights the challenge of coding tasks with the best model, Opus 4.8, scoring only about 13% on the hardest subset, indicating coding is less solved than benchmarks suggest. The trend toward using loops as a control…

5
Hugging Face Daily Papers research 22d ago

MMAE: A Massive Multitask Audio Editing Benchmark

Abstract MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce MMAE, a Massive Multitask…

24
arXiv — Machine Learning research 22d ago

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly…

27
arXiv — Machine Learning research 22d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,…

37
arXiv — Machine Learning research 22d ago

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets

arXiv:2606.06717v1 Announce Type: new Abstract: While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets,…

32
arXiv — Machine Learning research 22d ago

GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

arXiv:2606.06881v1 Announce Type: new Abstract: Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and…

38
arXiv — Machine Learning research 22d ago

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)…

17
arXiv — Machine Learning research 22d ago

REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or…

12
arXiv — Machine Learning research 22d ago

Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

arXiv:2606.07387v1 Announce Type: new Abstract: State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose…

15
arXiv — Machine Learning research 22d ago

CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

arXiv:2606.07488v1 Announce Type: new Abstract: Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training…

28
arXiv — Machine Learning research 22d ago

Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction

arXiv:2606.06509v1 Announce Type: cross Abstract: Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically…

17
arXiv — NLP / Computation & Language research 22d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in…

33
arXiv — NLP / Computation & Language research 22d ago

An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

arXiv:2606.06879v1 Announce Type: new Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features…

12
arXiv — NLP / Computation & Language research 22d ago

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of…

5
arXiv — NLP / Computation & Language research 22d ago

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

arXiv:2606.06960v1 Announce Type: new Abstract: Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit…

12
arXiv — NLP / Computation & Language research 22d ago

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.…

19
arXiv — NLP / Computation & Language research 22d ago

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require…

19
arXiv — NLP / Computation & Language research 22d ago

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We…

37
arXiv — NLP / Computation & Language research 22d ago

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic…

19
arXiv — NLP / Computation & Language research 22d ago

How reliable are LLMs when it comes to playing dice?

arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a…

33
arXiv — NLP / Computation & Language research 22d ago

MMAE: A Massive Multitask Audio Editing Benchmark

arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,…

8
arXiv — NLP / Computation & Language research 22d ago

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

arXiv:2606.07297v1 Announce Type: cross Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved),…

10
arXiv — NLP / Computation & Language research 22d ago

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI…

28
Vercel — AI dev-tools 22d ago

DeepSeek enters the fight for token volume, Anthropic continues to dominate spend

Every month, AI Gateway routes tens of trillions of tokens between production applications and AI labs, giving us visibility into what AI usage actually looks like, separate from leaderboards and benchmarks. We publish the data monthly in the AI Gateway production index. May…

18
Hugging Face Daily Papers research 22d ago

PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams

Abstract PaperFlow is a framework for scientific paper recommendation that processes user profiles, daily paper streams, and interest drift through three stages: profiling, recommending, and adapting, using a longitudinal benchmark with 24 users, 50 daily streams, and 1,200…

19
Hugging Face Daily Papers research 22d ago

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by…

33
Hugging Face Daily Papers research 22d ago

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Abstract ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing…

12
Hugging Face Daily Papers research 22d ago

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world…

11
Hugging Face Daily Papers research 22d ago

OpenSkill: Open-World Self-Evolution for LLM Agents

Abstract OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Self-evolving…

30
Hugging Face Daily Papers research 22d ago

dots.tts Technical Report

Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by…

32
r/LocalLLaMA community 22d ago

Qwen 3.6 27B on DeepSWE

Overview: It scored 2% (1.79% rounded up). It placed 18th of 20, scoring above Haiku 4.5 and Minimax M2.7. Full benchmark took 70 hours. Average time per task: 32m. Average output tokens per task: 44k. Perspectives: It scored suspiciously similar to 3.6 Plus and it really gets me…

21
