Tag

Benchmark

500 articles archived under #benchmark · RSS

r/LocalLLaMA community 25d ago

The DeepSWE benchmark was run rather incompetently and the results are completely invalid

submitted by /u/Charuru

10
r/MachineLearning community 25d ago

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice [D]

The Google paper on metacognition for hallucination reduction makes a distinction that is underappreciated in benchmarks. Calibration is not about being right more often. It is about matching confidence to correctness. A perfectly calibrated model can still be wrong twenty five…

26
Hugging Face Daily Papers research 25d ago

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Abstract MedSP1000 introduces an interactive benchmark derived from standardized patients to evaluate clinical agents' dynamic performance across encounters, revealing limitations of current large language models in medical applications. Generated by…

18
Hugging Face Daily Papers research 25d ago

PaintBench: Deterministic Evaluation of Precise Visual Editing

Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While current…

12
Hugging Face Daily Papers research 26d ago

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches. Generated by…

9
Hugging Face Daily Papers research 26d ago

Cosmos 3: Omnimodal World Models for Physical AI

Abstract Cosmos 3 is an omnimodal world model that processes and generates multiple data types through a unified mixture-of-transformers architecture, achieving state-of-the-art performance in various understanding and generation tasks. Generated by…

38
Hugging Face Daily Papers research 26d ago

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Abstract AutoLab benchmark evaluates long-horizon iterative optimization capabilities of frontier models across diverse domains, revealing that persistent iteration and time awareness are more critical than initial performance quality. Generated by…

17
r/MachineLearning community 26d ago

Repo for implementations of various Transformer Attn mechanisms [P]

Initially, I developed this so I could easily switch between different Attention mechanisms for my Small Language Model (SLM) experiments and benchmarking. However, I also realized that these implementations are applicable in Computer Vision, modernize Vision Encoders, RL, and…

14
Hugging Face Daily Papers research 26d ago

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents are rapidly evolving from coding assistants…

21
Hugging Face Daily Papers research 26d ago

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Abstract OVO-S-Bench presents a comprehensive benchmark for evaluating streaming spatial intelligence in multimodal language models through human-annotated questions spanning multiple abstraction levels. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal agents in robotics,…

23
arXiv — Machine Learning research 26d ago

Spectral Scaling Laws of Muon

arXiv:2606.04058v1 Announce Type: new Abstract: Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the…

13
arXiv — Machine Learning research 26d ago

Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge

arXiv:2606.04191v1 Announce Type: new Abstract: We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matching, and trajectory reconstruction across nine task pairs. The key discovery is that no…

28
arXiv — Machine Learning research 26d ago

Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models

arXiv:2606.04326v1 Announce Type: new Abstract: Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits…

35
arXiv — Machine Learning research 26d ago

DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data

arXiv:2606.04399v1 Announce Type: new Abstract: In the paradigm of decentralized learning, a group of agents collaborate to train a global model using distributed datasets without a central server. Although the power of collaboration has been verified by many state-of-the-art…

7
arXiv — Machine Learning research 26d ago

QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy

arXiv:2606.04620v1 Announce Type: new Abstract: LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, thus making them difficult to deploy on embedded systems. Toward this, state-of-the-art…

25
arXiv — NLP / Computation & Language research 26d ago

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

arXiv:2606.04389v1 Announce Type: new Abstract: Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly…

14
arXiv — NLP / Computation & Language research 26d ago

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

arXiv:2606.04442v1 Announce Type: new Abstract: AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both…

16
arXiv — NLP / Computation & Language research 26d ago

GENEB: Why Genomic Models Are Hard to Compare

arXiv:2606.04525v1 Announce Type: new Abstract: Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not…

26
arXiv — NLP / Computation & Language research 26d ago

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We…

27
arXiv — NLP / Computation & Language research 26d ago

LifeSide: Benchmarking Agents as Lifelong Digital Companions

arXiv:2606.04660v1 Announce Type: new Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and…

16
arXiv — NLP / Computation & Language research 26d ago

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success,…

37
arXiv — NLP / Computation & Language research 26d ago

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

arXiv:2606.04915v1 Announce Type: new Abstract: Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled…

18
arXiv — NLP / Computation & Language research 26d ago

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

arXiv:2606.05112v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and…

32
arXiv — NLP / Computation & Language research 26d ago

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when…

7
arXiv — NLP / Computation & Language research 26d ago

Can Generalist Agents Automate Data Curation?

arXiv:2606.04261v1 Announce Type: cross Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask…

13
arXiv — NLP / Computation & Language research 26d ago

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

arXiv:2606.04455v1 Announce Type: cross Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We…

13
Hugging Face Daily Papers research 26d ago

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Abstract BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones, creating challenging benchmarks that maintain validity and diversity while enabling model self-improvement and enhanced training performance. Generated by…

37
Hugging Face Daily Papers research 26d ago

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

Abstract AI-generated images with realistic text and layouts pose a significant misinformation threat requiring new detection benchmarks and methods beyond surface-level credibility assessment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent generative models can now produce…

32
The Information — AI news-outlet 26d ago

Benchmark Raises $1.25 Billion Fund to Back Mature Startups

Benchmark has raised two new funds, one of which will invest in later-stage startups, breaking with the firm’s tradition of exclusively backing early-stage companies. The firm will invest in those more mature companies through a $1.25 billion fund and earlier ones through a $750…

34
r/LocalLLaMA community 26d ago

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

I don't really understand the Gemma hype. Qwen outperforms Gemma GB for GB, and its KV cache is lighter. Sure, gemma-4-12b-it might be a slightly better coder than Qwen3.5-9b, but you could also just use omnicoder-9b (a Qwen3.5-9b finetune for coding). Note: Benchmark results come from…

19
r/LocalLLaMA community 26d ago

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

I think the dust has settled (95+%) for Qwen3.6/3.5-MTP. Since the initial PR, there have been many optimizations and fixes. Even earlier today, an MTP-related PR was merged and released (b9495). So try this latest version and share your benchmarks t/s*. Great work by u/am17an & other…

26
Hugging Face Daily Papers research 26d ago

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Abstract DOMINO enables domain-specific data synthesis through an inductive approach that learns domain representations from reference examples, improving code benchmark performance without requiring explicit domain descriptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
r/LocalLLaMA community 26d ago

How does the new abliteration tool Apostate compare with others? - Abliterlitics

Why Qwen 2.5 7B? Apostate is a new abliteration tool by heterodoxin. He asked me to benchmark it. Qwen 2.5 7B was recommended by heterodoxin as it's the most tested model for Apostate. I abliterated the model with Heretic v1.3.0 and Apostate. The models are available on…

33
Hugging Face Daily Papers research 27d ago

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Abstract PaddleOCR-VL-1.6 enhances document parsing performance through targeted data optimization and progressive post-training techniques, achieving state-of-the-art results on OmniDocBench v1.6. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce PaddleOCR-VL-1.6, an…

9
Hugging Face Daily Papers research 27d ago

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Abstract AutoMedBench presents a comprehensive benchmark for autonomous medical-AI research that evaluates agent performance across five workflow stages, revealing validation as the weakest stage and highlighting the importance of reliable pipeline execution and verification in…

21
arXiv — Machine Learning research 27d ago

AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret

arXiv:2606.02663v1 Announce Type: new Abstract: Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather predictors. But no model consistently dominates spatio-temporally, and relative performance…

38
arXiv — Machine Learning research 27d ago

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

arXiv:2606.02670v1 Announce Type: new Abstract: Many recent multivariate time series anomaly detection (MT-SAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this…

30
arXiv — Machine Learning research 27d ago

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation…

27
arXiv — Machine Learning research 27d ago

Rethinking Molecular Text Representations for LLMs: An Empirical Study

arXiv:2606.03057v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations…

7
arXiv — NLP / Computation & Language research 27d ago

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

arXiv:2606.02584v1 Announce Type: new Abstract: Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often…

36
arXiv — NLP / Computation & Language research 27d ago

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

arXiv:2606.02837v1 Announce Type: new Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have…

16
arXiv — NLP / Computation & Language research 27d ago

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

arXiv:2606.02907v1 Announce Type: new Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the…

21
arXiv — NLP / Computation & Language research 27d ago

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

arXiv:2606.02971v1 Announce Type: new Abstract: Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal…

10
arXiv — NLP / Computation & Language research 27d ago

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

arXiv:2606.03027v1 Announce Type: new Abstract: Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed…

17
arXiv — NLP / Computation & Language research 27d ago

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

arXiv:2606.03220v1 Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task…

16
arXiv — NLP / Computation & Language research 27d ago

Benchmarking Speech-to-Speech Translation Models

arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and…

5
arXiv — NLP / Computation & Language research 27d ago

SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding

arXiv:2606.03284v1 Announce Type: new Abstract: Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual,…

30
arXiv — NLP / Computation & Language research 27d ago

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

arXiv:2606.03301v1 Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by…

33
arXiv — NLP / Computation & Language research 27d ago

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions…

6
arXiv — NLP / Computation & Language research 27d ago

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

arXiv:2606.03363v1 Announce Type: new Abstract: Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases,…

15
