Tag

Benchmark

500 articles archived under #benchmark

r/LocalLLaMA community 22d ago

Qwen 3.6 27B on DeepSWE

Overview: It scored 2% (1.79%, rounded up). It placed 18th of 20, scoring above Haiku 4.5 and Minimax M2.7. The full benchmark took 70 hours. Average time per task: 32m. Average output tokens per task: 44k. Perspectives: It scored suspiciously similar to 3.6 Plus and it really gets me…

21
r/LocalLLaMA community 22d ago

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ

Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as the inference engine because it supports additional types:…

31
r/LocalLLaMA community 23d ago

Gemma 4 31B QAT Q4 vs standard Q4 — Top1 KLD benchmark results have me confused. Someone please explain or poke holes in this.

Edited - After digging into this some more and reviewing the Unsloth post for better understanding, the divergence APPEARS to stem from my not using the BF16 QAT model as the "reference" model. The QAT vs standard Q4 comparison in our benchmark is not apples-to-apples. The QAT…

11
r/LocalLLaMA community 23d ago

AMD MI50 on Debian Testing is doing great and getting better.

There is probably some relevant information to other cards here but my benchmarks are on dual MI50 32GB cards because that is what I have, and thought I would share with the community. Install instructions at the end. I'll put a dump of the full llama-benchy tables in a comment…

21
r/LocalLLaMA community 23d ago

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised with the result! By using llama.cpp patched with the…

17
r/LocalLLaMA community 23d ago

KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive!

TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher. A number of people in the comments under my previous post asked a fair question: what if we…

21
r/LocalLLaMA community 24d ago

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

I’ve been doing lots of testing back and forth with this 7900 XTX. All of my workloads were relying on Qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought, namely for Honcho workload tiers and differing cron jobs. Not every workload benefits from an…

35
r/LocalLLaMA community 24d ago

dots.tts 2B🎙️ SOTA TTS from RedNote

🔗 Blog: https://rednote-hilab.github.io/dots.tts-demo/ 🔗 GitHub: https://github.com/rednote-hilab/dots.tts 🔗 Technical Report: https://arxiv.org/abs/2608.16894 dots.tts 🎙️ New open-source TTS from RedNote (Xiaohongshu) ✨ 2B parameters (Apache 2.0) ✨ Fully continuous…

16
Hugging Face Daily Papers research 24d ago

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Abstract Large language models deployed as coding agents exhibit significant safety violations in realistic project environments, necessitating new evaluation approaches beyond simple prompt refusal assessments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language models…

38
Hugging Face Daily Papers research 24d ago

Benchmark Everything Everywhere All at Once

Abstract Automated benchmark creation system generates diverse evaluation datasets with minimal human intervention, enabling continuous model assessment across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Benchmarks are fundamental for evaluating and advancing…

27
r/LocalLLaMA community 24d ago

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!

Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag) Cheap KV cache with good precision? Sign me up! Oh, vLLM…

12
r/MachineLearning community 24d ago

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Sharing a small CPU inference benchmark for nvidia/parakeet-tdt-0.6b-v3 that turned up a result I didn't expect going in. Setup: 2 x86-64 vCPUs (AVX2/FMA), 7.7GB RAM, no GPU. Test audio: 16.78s Harvard sentences at 16kHz mono. Results: Inference path RTF Peak Memory CPU…

26
r/MachineLearning community 24d ago

An autonomous research agent was the #1 contributor in OpenAI's Hiring Competition Parameter Golf (by merged records)[R]

https://preview.redd.it/kucy7n6nrg5h1.png?width=1600&format=png&auto=webp&s=b1c2e537667fbca3d1736fc103296c7374270d9c An autonomous research agent ended up with more merged leaderboard records than any individual human contributor in OpenAI's spring hiring competition, Parameter…

27
Hugging Face Daily Papers research 24d ago

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Abstract ForeSci is a temporally controlled benchmark that evaluates LLM agents' ability to make forward-looking research decisions from historical evidence across fast-moving AI domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI research often requires decisions before…

4
r/LocalLLaMA community 25d ago

Benchmark & Reality Check on Gemma 4 12B: Great model, but your local settings are probably breaking it (Fix inside)

I completed a Python bug hunting benchmark with Gemma 4 12B. I used the Unsloth Dynamic Q5 GGUF model. The model has good capabilities. Default settings in LM Studio disable the reasoning. Fix the LM Studio reasoning configuration. LM Studio looks for Qwen tokens. Gemma 4 uses…

30
Hugging Face Daily Papers research 25d ago

Towards One-to-Many Temporal Grounding

Abstract One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

11
Hugging Face Daily Papers research 25d ago

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Abstract Mechanical engineering drawing understanding is improved through a specialized dataset and domain-specific model that outperforms existing baselines by leveraging multi-stage training and high-density visual question answering annotations. Generated by…

9
r/MachineLearning community 25d ago

Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? [d]

Hello everyone, Is it allowed to use OpenAI API outputs to create a silver code dataset or benchmark for a specific Python library? I am working on a project idea related to library-specific code generation. The concrete case is a specific Python library used in a…

18
Smol AI News news-outlet 25d ago

not much happened today

**Anthropic's Mythos/Opus cycle** sparked mixed reactions with praise for **Claude Mythos**'s one-shot workflows and concerns over **Opus 4.8** benchmark regressions. **Opus 4.7** showed strong chemistry task performance, "making Claude a chemist." **Sakana AI** launched an…

23
Hugging Face Daily Papers research 25d ago

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

Abstract AdaPlanBench presents a dynamic interactive benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Planning for real-world…

18
Hugging Face Daily Papers research 25d ago

RobotValues: Evaluating Household Robots When Human Values Conflict

Abstract RobotValues benchmark evaluates household robot planners in value-conflict scenarios, revealing that vision-language models exhibit default value preferences and struggle to override them when instructed to prioritize conflicting values. Generated by…

8
arXiv — Machine Learning research 25d ago

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

arXiv:2606.05169v1 Announce Type: new Abstract: We give a stereological theory of LLM benchmark coverage. For any suite with effective dimensionality d_eff, the visible Hausdorff distance between two convex capability profiles consistent with the same scores is bounded by…

30
arXiv — Machine Learning research 25d ago

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

arXiv:2606.05170v1 Announce Type: new Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all…

27
arXiv — Machine Learning research 25d ago

Flash-WAM: Modality-Aware Distillation for World Action Models

arXiv:2606.05254v1 Announce Type: new Abstract: World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time…

13
arXiv — Machine Learning research 25d ago

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

arXiv:2606.05692v1 Announce Type: new Abstract: Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on…

35
arXiv — NLP / Computation & Language research 25d ago

Generic Triple-Latent Compression with Gated Associative Retrieval

arXiv:2606.05175v1 Announce Type: new Abstract: We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves…

6
arXiv — NLP / Computation & Language research 25d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…

5
arXiv — NLP / Computation & Language research 25d ago

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arXiv:2606.05183v1 Announce Type: new Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial…

20
arXiv — NLP / Computation & Language research 25d ago

ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation

arXiv:2606.05421v1 Announce Type: new Abstract: When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other,…

6
arXiv — NLP / Computation & Language research 25d ago

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

arXiv:2606.05553v1 Announce Type: new Abstract: Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether…

10
arXiv — NLP / Computation & Language research 25d ago

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

arXiv:2606.05570v1 Announce Type: new Abstract: Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not…

32
arXiv — NLP / Computation & Language research 25d ago

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

arXiv:2606.05622v1 Announce Type: new Abstract: Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still…

37
arXiv — NLP / Computation & Language research 25d ago

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

arXiv:2606.05744v1 Announce Type: new Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their…

5
arXiv — NLP / Computation & Language research 25d ago

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

arXiv:2606.05793v1 Announce Type: new Abstract: While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral…

34
arXiv — NLP / Computation & Language research 25d ago

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.…

21
arXiv — NLP / Computation & Language research 25d ago

Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

arXiv:2606.05970v1 Announce Type: new Abstract: Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks.…

23
arXiv — NLP / Computation & Language research 25d ago

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

arXiv:2606.06088v1 Announce Type: new Abstract: We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts:…

18
arXiv — NLP / Computation & Language research 25d ago

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

arXiv:2606.06242v1 Announce Type: new Abstract: Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic…

9
Hugging Face Daily Papers research 25d ago

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Abstract Role-playing language agents require dynamic character development that evolves through narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character…

19
Hugging Face Daily Papers research 25d ago

Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing

Abstract RE-Edit benchmark evaluates image editing systems on five reasoning dimensions to assess logical consistency beyond visual plausibility. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Diffusion-based image editing has achieved strong visual fidelity under natural language…

6
Hugging Face Daily Papers research 25d ago

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Abstract Self-Evolving Prompt Optimization (SePO) enhances agent performance by jointly optimizing both task and prompt agent system prompts through evolutionary search, demonstrating superior accuracy across diverse benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

11
Hugging Face Daily Papers research 25d ago

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Abstract Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench benchmarks. Generated by…

20
Hugging Face Daily Papers research 25d ago

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Abstract VideoKR presents a large-scale video reasoning dataset and benchmark designed to enhance knowledge-intensive video understanding through expert-domain content and human-in-the-loop example generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce VideoKR,…

24
ThursdAI news-outlet 25d ago

📅 ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more

From CoreWeave: This week was kind of nuts: tons of new open-source goodness, 3 guests on the show (Arena, Nous Research and NVIDIA), and image gen SOTA models racing to the top.

10
r/MachineLearning community 25d ago

Scrap the LLMs. Scoring 4.76% on the brand new ARC-3 using pure code, a 2012 AMD CPU, and zero AI tokens.[P]

Hey everyone, The ARC Prize 2026 just launched the interactive ARC-AGI-3 track, and the collective AI world is panic-renting massive H100 clusters trying to get multi-billion parameter LLMs to navigate these dynamic environments. Predictably, out-of-the-box LLMs are faceplanting…

31
Hugging Face Daily Papers research 25d ago

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Abstract A bilingual multi-attribute benchmark for instruction-guided speech editing is introduced to systematically evaluate speech modification capabilities across atomic and compositional tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Instruction-guided speech editing…

16
The Information — AI news-outlet 25d ago

Benchmark Joins the Late-Stage Crowd

For most of this decade, venture capital has morphed from a game of spotting the best young startups to making a more complex range of investments—from public stocks and cryptocurrencies to multibillion-dollar bets on older startups. General Catalyst even bought a hospital…

37
Ars Technica — AI news-outlet 25d ago

These LLMs are the best at resisting Russian propaganda

Estonian government benchmark shows how dozens of models combat Russia's "strategic narratives."

10
r/LocalLLaMA community 25d ago

cyankiwi AWQ 4-bit — 26.05 update, NVFP4 + FP8 Dynamic quantization and benchmarks across Qwen3.6 4-bit quants

We are happy to share cyankiwi AWQ update: better AWQ implementation, now with NVFP4 and FP8 Dynamic quantization support. We measured KL divergence against the BF16 baseline for 4-bit Qwen3.6 quants, on synthesized Qwen3.6 BF16 GPQA Diamond responses. cyankiwi AWQ release comes…

38
Hugging Face Daily Papers research 25d ago

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

Abstract SuperMemory-VQA is introduced as an egocentric visual question answering dataset designed to evaluate AI assistants on long-term memory tasks through real-world activities recorded with AI glasses. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI glasses present a…

33
