Tag

Benchmark

500 articles archived under #benchmark · RSS

arXiv — Machine Learning research 30m ago

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing…

36
arXiv — Machine Learning research 30m ago

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

arXiv:2606.28433v1 Announce Type: new Abstract: One goal in reinforcement learning (RL) research is to understand general-purpose sequential decision-making, using benchmark simulators as a proxy for learning in deployment settings. When running experiments, however, the goal of…

5
arXiv — Machine Learning research 30m ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 30m ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using…

27
arXiv — Machine Learning research 30m ago

KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory

arXiv:2606.29243v1 Announce Type: new Abstract: We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease…

30
arXiv — NLP / Computation & Language research 30m ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

arXiv:2606.29082v1 Announce Type: new Abstract: Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on…

4
arXiv — NLP / Computation & Language research 30m ago

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts…

30
arXiv — NLP / Computation & Language research 30m ago

mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

arXiv:2606.29467v1 Announce Type: new Abstract: Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health…

25
arXiv — NLP / Computation & Language research 30m ago

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

arXiv:2606.29534v1 Announce Type: new Abstract: Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a…

23
arXiv — NLP / Computation & Language research 30m ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 30m ago

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

arXiv:2606.29809v1 Announce Type: new Abstract: Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating…

27
arXiv — NLP / Computation & Language research 30m ago

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

arXiv:2606.29815v1 Announce Type: new Abstract: Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either…

7
arXiv — NLP / Computation & Language research 30m ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical…

10
arXiv — NLP / Computation & Language research 30m ago

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

arXiv:2606.29933v1 Announce Type: new Abstract: The alignment of language models is typically studied through the lens of capability benchmarks, but the dynamics of how models change during post-training remain poorly understood. We argue that the physical sciences, and…

16
r/LocalLLaMA community 4h ago

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Been running Qwen3.6-27B (8-bit) through my coding harness for a few days, alongside GLM5.2. The harness uses 3 critics — code review, test review, Playwright e2e — each with fresh context before accepting output. Qwen3.6 is legit for a 27B dense model. Benchmarks weren't lying.…

19
TechCrunch — AI news-outlet 10h ago

Arena, the AI leaderboard everyone uses, is now a $100M business

The startup, which runs a popular free AI leaderboard, launched its commercial service just last September.

23
r/MachineLearning community 12h ago

Adaptive Mixture of Experts Gate (AMG) [R]

[Project] Post-hoc Adaptive MoE Gating on Qwen3.6-35B — empirical benchmarking of an open research gap Adaptive MoE routing — selecting a variable number of experts per token based on routing confidence — has been studied in papers (XMoE 2024, DynMoE ICLR 2025, TopP routing…

5
arXiv — Machine Learning research 1d ago

Learning in Markovian bandits with non-observable states and constrained decision epochs

arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with \emph{non-observable states} and possibly \emph{constrained} decision epochs. The focus is restricted to a ``pure'' regret benchmark, that compares the…

26
arXiv — Machine Learning research 1d ago

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets…

21
arXiv — NLP / Computation & Language research 1d ago

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

arXiv:2606.27378v1 Announce Type: new Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.…

29
arXiv — NLP / Computation & Language research 1d ago

Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval

arXiv:2606.27401v1 Announce Type: cross Abstract: Semantic code search and clone detection are essential for software development, maintenance, and reuse. This paper evaluates the effectiveness, efficiency, and scalability of contemporary deep learning models for first-stage…

35
arXiv — Machine Learning research 1d ago

Benchmarking Multi-Modal Graph-based Social Media Popularity Prediction

arXiv:2606.27539v1 Announce Type: cross Abstract: Social media popularity prediction aims to forecast the future reach or influence of online content from early-stage observations. Accurate prediction enables key downstream applications, such as advertising optimization and…

25
arXiv — NLP / Computation & Language research 1d ago

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,…

32
arXiv — NLP / Computation & Language research 1d ago

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume…

27
arXiv — NLP / Computation & Language research 1d ago

CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models

arXiv:2606.27383v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used as research assistants, yet it remains unclear whether they can calibrate research takeaways to the strength and scope of the supporting evidence. We study evidence-calibrated…

17
arXiv — NLP / Computation & Language research 1d ago

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write…

11
arXiv — NLP / Computation & Language research 1d ago

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be…

28
Hacker News — AI on Front Page community 1d ago

GLM 5.2 beats Claude in our benchmarks

Article URL: https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/ Comments URL: https://news.ycombinator.com/item?id=48709670 Points: 273 # Comments: 109

22
r/LocalLLaMA community 1d ago

Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it?

hey, I’m currently getting enough VRAM to run something in the GLM-5.2 range, but I’m wondering: do we actually have a solid ranking that compares closed-source and open-weight LLMs side by side? I’ve been trying to find a clear “closed vs open” leaderboard, but most benchmarks…

26
r/LocalLLaMA community 1d ago

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"?

After spending countless hours testing on 3 "potato" laptops (Intel i3, 8GB RAM, Win11, integrated GPU), that's my conclusion. For reliably extracting data from images to JSON on low-end hardware, nothing else even comes close. Yet, it’s completely missing from major benchmarks…

23
r/LocalLLaMA community 2d ago

US Ban Benchmark Updated: Toe-to-toe Between Two Big Names!

OpenAI ties with Anthropic in this benchmark following the preview of GPT 5.6 just yesterday. Chinese models have no hope of catching up forever, while Gemini's figure is yet to be updated.   submitted by   /u/Complete-Sea6655 [link]   [comments]

30
r/MachineLearning community 2d ago

Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P]

When evaluating migrating production LLM workloads off commercial cloud APIs, the conversation usually gets oversimplified into a trade-off between quality and infrastructure cost. To look past clean, isolated averages, I built a repeatable evaluation matrix using a real-world…

29
r/LocalLLaMA community 2d ago

Running GLM5.2 on budget hardware < $2500.

Too many times I hear people whine about not being ble to run SOTA models or claim it would require $50k, or $100k. https://www.ebay.com/itm/398079051468 Epcy Motherboard & CPU - $460 https://www.ebay.com/itm/206374955959 P40 24gb - $230 get 2 - $460…

19
r/MachineLearning community 2d ago

I silently break training codes or configs so I made pybench [P]

It is like pytest but for statistical tests: it ensures no regression of your metrics at a statistical level. It manages tedious things such that seeds, past benchmark results, ... Simple CLI working like pytest but with benchmarks/ directory instead of tests/: pybench # 1st…

38
r/LocalLLaMA community 3d ago

"What should I do?" - consider post-training

This is in response to the common post where OP has acquired some cool hardware and is wondering what to do with it. The standard response is always (1) download model X, (2) benchmark it on tps, (3) share screenshots. I argue this is boring and intellectually lazy, and propose…

18
r/LocalLLaMA community 3d ago

What's one local AI workflow you wish you'd discovered sooner?

There are a lot of posts about the models and benchmarks, but I am more interested in the workflows that people use. What is one workflow that really saved you time or made your local LLM more useful? It could be anything—RAG, MCP, coding agents, organizing prompt, document…

23
Hugging Face Daily Papers research 3d ago

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

Abstract A web-based benchmark evaluates agent generalization across challenging scenarios, revealing significant gaps between current agentic systems and human performance in temporal perception, graphical understanding, and 3D reasoning. Generated by…

10
Hugging Face Daily Papers research 3d ago

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Abstract CoffeeBench evaluates LLM agents in a multi-agent economic simulation where firms interact over 90 days to maximize profits, revealing differences in communication patterns and performance among various models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As LLM agents…

4
Hugging Face Daily Papers research 3d ago

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Abstract JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates across various benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Speculative decoding (SD)…

17
arXiv — Machine Learning research 4d ago

The Red Queen G\"odel Machine: Co-Evolving Agents and Their Evaluators

arXiv:2606.26294v1 Announce Type: new Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier,…

25
arXiv — Machine Learning research 4d ago

Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting

arXiv:2606.26421v1 Announce Type: new Abstract: State-of-the-art medium-range AI weather models can outperform traditional Numerical Weather Prediction (NWP) but require massive training budgets. This restricts usage for under-resourced groups and severely limits fast model…

4
arXiv — NLP / Computation & Language research 4d ago

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce…

24
arXiv — Machine Learning research 4d ago

Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication

arXiv:2606.26541v1 Announce Type: new Abstract: Data from affected populations are crucial for informing humanitarian response, but their value depends on timely and consistent interpretation of nuanced accounts of need. Humanitarian organizations often lack the staff, time, and…

4
arXiv — Machine Learning research 4d ago

RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations

arXiv:2606.27247v1 Announce Type: new Abstract: In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational…

24
arXiv — NLP / Computation & Language research 4d ago

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

arXiv:2606.26101v1 Announce Type: new Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a…

21
arXiv — NLP / Computation & Language research 4d ago

Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning

arXiv:2606.26108v1 Announce Type: new Abstract: Larger language models consistently outperform smaller ones on reasoning benchmarks, yet the reasoning differences underlying this gap remain underexplored. Across benchmarks in mathematics, physics, chemistry, and programming, we…

35
arXiv — NLP / Computation & Language research 4d ago

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

arXiv:2606.26650v1 Announce Type: new Abstract: In this paper, we present CAT-Q, Cost-efficient and Accurate Ternary Quantization, for compressing and accelerating LLMs. Unlike existing state-of-the-art ternary quantization methods that rely on data-intensive and costly…

9
arXiv — NLP / Computation & Language research 4d ago

SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context

arXiv:2606.26654v1 Announce Type: new Abstract: Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability -- inferring…

13
arXiv — NLP / Computation & Language research 4d ago

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

arXiv:2606.27047v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem solving…

16
arXiv — NLP / Computation & Language research 4d ago

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

arXiv:2606.27187v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in…

25

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

Position: RL Researchers Need to Distinguish Between Solving Simulators and Using Simulators as a Proxy

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought

Arena, the AI leaderboard everyone uses, is now a $100M business

Adaptive Mixture of Experts Gate (AMG) [R]

Learning in Markovian bandits with non-observable states and constrained decision epochs

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

Recall Before Rerank: Benchmarking Deep Learning Models for Large-Scale Code-to-Code Retrieval

Benchmarking Multi-Modal Graph-based Social Media Popularity Prediction

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

GLM 5.2 beats Claude in our benchmarks

Are there good closed vs open LLM rankings? Also, are 70B–350B models actually worth it?

Is Qwen3-VL-2B the only viable VLM for JSON extraction on a "potato"?

US Ban Benchmark Updated: Toe-to-toe Between Two Big Names!

Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P]

Running GLM5.2 on budget hardware < $2500.

I silently break training codes or configs so I made pybench [P]

"What should I do?" - consider post-training

What's one local AI workflow you wish you'd discovered sooner?

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

The Red Queen G\"odel Machine: Co-Evolving Agents and Their Evaluators

Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

Can Large Language Models Reliably Code Qualitative Humanitarian Data? A Benchmark Study Against Human Expert Adjudication

RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

Where Larger Models Excel: The Primacy of Constraint-Guided Reasoning

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models