Tag Benchmark 500 articles archived under #benchmark r/LocalLLaMA community 21d ago Gemma 4 26B A4B IT QAT Comparison Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro if it was statistically significant because I was too tired to do… 31 arXiv — Machine Learning research 21d ago Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark arXiv:2606.07550v1 Announce Type: new Abstract: Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction… 35 arXiv — Machine Learning research 21d ago ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research arXiv:2606.07591v1 Announce Type: new Abstract: AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research… 14 arXiv — Machine Learning research 21d ago LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training arXiv:2606.07610v1 Announce Type: new Abstract: State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. 
This ignores useful… 6 arXiv — Machine Learning research 21d ago Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models arXiv:2606.07623v1 Announce Type: new Abstract: This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in… 25 arXiv — Machine Learning research 21d ago Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity arXiv:2606.07726v1 Announce Type: new Abstract: Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no… 13 arXiv — Machine Learning research 21d ago A Framework for Evaluating and Benchmarking Concept Drift Detection Methods arXiv:2606.07789v1 Announce Type: new Abstract: Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent… 26 Hugging Face Daily Papers research 21d ago Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting Abstract AI evaluation results suffer from inconsistent reporting across platforms, prompting the development of EvalCards, an operational framework that standardizes benchmark metadata, evaluation data, and model information into a unified, interpretable record with four key… 20 Hugging Face Daily Papers research 21d ago CoVEBench: Can Video Editing Models Handle Complex Instructions? 
Abstract A new benchmark called CoVEBench is introduced to evaluate compositional video editing capabilities, addressing limitations of existing models in handling complex, multi-step editing tasks while preserving spatiotemporal content. Generated by… 19 Hugging Face Daily Papers research 21d ago SWE-Explore: Benchmarking How Coding Agents Explore Repositories Abstract SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities by requiring ranked lists of relevant code regions within line budgets, demonstrating that agentic exploration outperforms traditional retrieval methods. Generated by… 11 Hugging Face Daily Papers research 21d ago SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a… 7 r/LocalLLaMA community 21d ago I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU I fine-tuned NVIDIA's Parakeet TDT 0.6B v2 for clinical speech and am releasing the weights as Omi Med STT v1 (CC-BY-4.0). Disclosure: I'm the founder of Omi Health and built this. Happy to dig into the training mix, benchmark, failure cases, quantization, or anything else. The… 14 r/LocalLLaMA community 21d ago Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. 
u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by… 9 r/LocalLLaMA community 21d ago LocalLLaMA post tier list Since there is much (justified) whining about post quality, I thought it would be helpful to get a sense of what people actually DO like. Here's my take: S-tier: -GGUFs/MLX or benchmark data for new best-in-class local model released - New Optimizations that are actually a big… 17 r/LocalLLaMA community 21d ago When every other post is an AI generated benchmark report, a question about the best model, or a slop-coded application or engine that pretends to be groundbreaking submitted by /u/Honest-Kangaroo-1830 12 Hugging Face Daily Papers research 21d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs Abstract UnpredictaBench evaluates large language models' capacity to sample from target distributions, revealing significant gaps in their ability to simulate unpredictable systems despite recent advances in output diversity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We… 7 r/LocalLLaMA community 21d ago [Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative decoding combined with KV cache… 20 Hugging Face Daily Papers research 21d ago GENEB: Why Genomic Models Are Hard to Compare Abstract GENEB presents a comprehensive benchmark for evaluating genomic foundation models across diverse tasks and architectures under a unified protocol. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in genomic foundation models is difficult to assess due to fragmented… 25 Hugging Face Daily Papers research 22d ago SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations Abstract SoCRATES presents a realistic multi-domain benchmark for evaluating proactive LLM mediators across various socio-cognitive adaptation axes, demonstrating that even top-performing models only resolve about one-third of the consensus gap in conflict resolution. Generated… 30 Smol AI News news-outlet 22d ago not much happened today **FrontierCode** benchmark by **Cognition** highlights the challenge of coding tasks with the best model, **Opus 4.8**, scoring only about **13%** on the hardest subset, indicating coding is less solved than benchmarks suggest. The trend toward using **loops** as a control… 5 Hugging Face Daily Papers research 22d ago MMAE: A Massive Multitask Audio Editing Benchmark Abstract MMAE presents a comprehensive benchmark for instruction-based audio editing across multiple modalities and complexity levels, revealing significant gaps in current model capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce MMAE, a Massive Multitask… 24 arXiv — Machine Learning research 22d ago Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios arXiv:2606.06546v1 Announce Type: new Abstract: Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. 
Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly… 27 arXiv — Machine Learning research 22d ago MacArena: Benchmarking Computer Use Agents on an Online macOS Environment arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,… 37 arXiv — Machine Learning research 22d ago ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets arXiv:2606.06717v1 Announce Type: new Abstract: While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding pockets and struggle to sample effective ligands for challenging low-pocketability targets,… 32 arXiv — Machine Learning research 22d ago GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting arXiv:2606.06881v1 Announce Type: new Abstract: Blood glucose forecasting models are foundational for modern diabetes management systems, as reliable short-term predictions can enable proactive interventions, support automated insulin delivery, and reduce the risk of hypo- and… 38 arXiv — Machine Learning research 22d ago The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning arXiv:2606.06920v1 Announce Type: new Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. 
In this study, we benchmark five sub-1B models (135M-1B)… 17 arXiv — Machine Learning research 22d ago REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference arXiv:2606.07141v1 Announce Type: new Abstract: Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or… 12 arXiv — Machine Learning research 22d ago Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation arXiv:2606.07387v1 Announce Type: new Abstract: State-of-the-art text-to-music generation systems rely on massive proprietary datasets and industrial-scale compute, making it impossible to disentangle architectural contributions from resource advantages. We propose… 15 arXiv — Machine Learning research 22d ago CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations arXiv:2606.07488v1 Announce Type: new Abstract: Personalized virtual heart simulations face challenges in model personalization and computational cost. While neural surrogates offer state-of-the-art solutions, they typically address either efficient personalization or training… 28 arXiv — Machine Learning research 22d ago Which Anatomy Matters Under Limited Labels? 
A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction arXiv:2606.06509v1 Announce Type: cross Abstract: Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance gains are driven mainly by more expressive models or by better representation of clinically… 17 arXiv — NLP / Computation & Language research 22d ago UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs arXiv:2606.06622v1 Announce Type: new Abstract: We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in… 33 arXiv — NLP / Computation & Language research 22d ago An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection arXiv:2606.06879v1 Announce Type: new Abstract: Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features… 12 arXiv — NLP / Computation & Language research 22d ago OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios arXiv:2606.06959v1 Announce Type: new Abstract: Hallucination detection is essential for the reliable deployment of large language models (LLMs). 
However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of… 5 arXiv — NLP / Computation & Language research 22d ago Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments arXiv:2606.06960v1 Announce Type: new Abstract: Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit… 12 arXiv — NLP / Computation & Language research 22d ago MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights arXiv:2606.07020v1 Announce Type: new Abstract: Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis.… 19 arXiv — NLP / Computation & Language research 22d ago mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages? arXiv:2606.07069v1 Announce Type: new Abstract: We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require… 19 arXiv — NLP / Computation & Language research 22d ago UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. 
We… 37 arXiv — NLP / Computation & Language research 22d ago M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic… 19 arXiv — NLP / Computation & Language research 22d ago How reliable are LLMs when it comes to playing dice? arXiv:2606.07515v1 Announce Type: new Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a… 33 arXiv — NLP / Computation & Language research 22d ago MMAE: A Massive Multitask Audio Editing Benchmark arXiv:2606.07229v1 Announce Type: cross Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation,… 8 arXiv — NLP / Computation & Language research 22d ago SWE-Explore: Benchmarking How Coding Agents Explore Repositories arXiv:2606.07297v1 Announce Type: cross Abstract: Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved),… 10 arXiv — NLP / Computation & Language research 22d ago The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders? arXiv:2606.07435v1 Announce Type: cross Abstract: Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? 
To explore this, we compare three VSR systems with human baselines on the MaFI… 28 Vercel — AI dev-tools 22d ago DeepSeek enters the fight for token volume, Anthropic continues to dominate spend Every month, AI Gateway routes tens of trillions of tokens between production applications and AI labs, giving us visibility into what AI usage actually looks like, separate from leaderboards and benchmarks. We publish the data monthly in the AI Gateway production index. May… 18 Hugging Face Daily Papers research 22d ago PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams Abstract PaperFlow is a framework for scientific paper recommendation that processes user profiles, daily paper streams, and interest drift through three stages: profiling, recommending, and adapting, using a longitudinal benchmark with 24 users, 50 daily streams, and 1,200… 19 Hugging Face Daily Papers research 22d ago SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Abstract SubtleMemory benchmark evaluates AI agents' ability to handle complex relational memory structures that emerge during prolonged interactions, revealing limitations in current memory systems for preserving and utilizing nuanced memory relationships. Generated by… 33 Hugging Face Daily Papers research 22d ago When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents Abstract ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing… 12 Hugging Face Daily Papers research 22d ago WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world… 11 Hugging Face Daily Papers research 22d ago OpenSkill: Open-World Self-Evolution for LLM Agents Abstract OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Self-evolving… 30 Hugging Face Daily Papers research 22d ago dots.tts Technical Report Abstract A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques. Generated by… 32 r/LocalLLaMA community 22d ago Qwen 3.6 27B on DeepSWE Overview: It scored 2% (1.79% rounded up) It is 18/20th place scoring above Haiku 4.5 and Minimax M2.7 Full benchmark took 70 hours Average time per task 32m Average output tokens per task: 44k Perspectives: It scored suspiciously similar to 3.6 Plus and it really gets me… 21