Tag: Benchmark · 500 articles archived under #benchmark

r/LocalLLaMA · community · 25d ago · The DeepSWE benchmark was runned rather incompetently and the results are completely invalid submitted by /u/Charuru · 10

r/MachineLearning · community · 25d ago · Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice [D] The Google paper on metacognition for hallucination reduction makes a distinction that is underappreciated in benchmarks. Calibration is not about being right more often. It is about matching confidence to correctness. A perfectly calibrated model can still be wrong twenty five… · 26

Hugging Face Daily Papers · research · 25d ago · Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases Abstract MedSP1000 introduces an interactive benchmark derived from standardized patients to evaluate clinical agents' dynamic performance across encounters, revealing limitations of current large language models in medical applications. Generated by… · 18

Hugging Face Daily Papers · research · 25d ago · PaintBench: Deterministic Evaluation of Precise Visual Editing Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While current… · 12

Hugging Face Daily Papers · research · 26d ago · Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches. Generated by… · 9

Hugging Face Daily Papers · research · 26d ago · Cosmos 3: Omnimodal World Models for Physical AI Abstract Cosmos 3 is an omnimodal world model that processes and generates multiple data types through a unified mixture-of-transformers architecture, achieving state-of-the-art performance in various understanding and generation tasks. Generated by… · 38

Hugging Face Daily Papers · research · 26d ago · AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? Abstract AutoLab benchmark evaluates long-horizon iterative optimization capabilities of frontier models across diverse domains, revealing that persistent iteration and time awareness are more critical than initial performance quality. Generated by… · 17

r/MachineLearning · community · 26d ago · Repo for implementations of various Transformer Attn mechanisms [P] Initially, I developed this so I can easily switch between different Attention mechanisms for my Small Language Model (SLM) experiments and benchmarking. However, I also realized that these implementations can be applicable in Computer Vision, modernize Vision Encoders, RL, and… · 14

Hugging Face Daily Papers · research · 26d ago · Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents are rapidly evolving from coding assistants… · 21

Hugging Face Daily Papers · research · 26d ago · OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs Abstract OVO-S-Bench presents a comprehensive benchmark for evaluating streaming spatial intelligence in multimodal language models through human-annotated questions spanning multiple abstraction levels. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal agents in robotics,… · 23

arXiv — Machine Learning · research · 26d ago · Spectral Scaling Laws of Muon arXiv:2606.04058v1 Announce Type: new Abstract: Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the… · 13

arXiv — Machine Learning · research · 26d ago · Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge arXiv:2606.04191v1 Announce Type: new Abstract: We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matching, and trajectory reconstruction across nine task pairs. The key discovery is that no… · 28

arXiv — Machine Learning · research · 26d ago · Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models arXiv:2606.04326v1 Announce Type: new Abstract: Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap benefits from interpretability, very few datasets include concept labels. This limits… · 35

arXiv — Machine Learning · research · 26d ago · DPDL: Towards Differential Privacy Preservation in Decentralized Stochastic Learning on Non-IID Data arXiv:2606.04399v1 Announce Type: new Abstract: In the paradigm of decentralized learning, a group of agents collaborate to train a global model using distributed datasets without a central server. Although the power of collaboration has been verified by many state-of-the-art… · 7

arXiv — Machine Learning · research · 26d ago · QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy arXiv:2606.04620v1 Announce Type: new Abstract: LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs, thus making them difficult to deploy on embedded systems. Toward this, state-of-the-art… · 25

arXiv — NLP / Computation & Language · research · 26d ago · When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling arXiv:2606.04389v1 Announce Type: new Abstract: Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly… · 14

arXiv — NLP / Computation & Language · research · 26d ago · MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning arXiv:2606.04442v1 Announce Type: new Abstract: AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both… · 16

arXiv — NLP / Computation & Language · research · 26d ago · GENEB: Why Genomic Models Are Hard to Compare arXiv:2606.04525v1 Announce Type: new Abstract: Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not… · 26

arXiv — NLP / Computation & Language · research · 26d ago · VCIFBench: Evaluating Complex Instruction Following for Video Understanding arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We… · 27

arXiv — NLP / Computation & Language · research · 26d ago · LifeSide: Benchmarking Agents as Lifelong Digital Companions arXiv:2606.04660v1 Announce Type: new Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and… · 16

arXiv — NLP / Computation & Language · research · 26d ago · Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success,… · 37

arXiv — NLP / Computation & Language · research · 26d ago · Caliper: Probing Lexical Anchors versus Causal Structure in LLMs arXiv:2606.04915v1 Announce Type: new Abstract: Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled… · 18

arXiv — NLP / Computation & Language · research · 26d ago · Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases arXiv:2606.05112v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and… · 32

arXiv — NLP / Computation & Language · research · 26d ago · VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when… · 7

arXiv — NLP / Computation & Language · research · 26d ago · Can Generalist Agents Automate Data Curation? arXiv:2606.04261v1 Announce Type: cross Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask… · 13

arXiv — NLP / Computation & Language · research · 26d ago · The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? arXiv:2606.04455v1 Announce Type: cross Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We… · 13

Hugging Face Daily Papers · research · 26d ago · BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution Abstract BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones, creating challenging benchmarks that maintain validity and diversity while enabling model self-improvement and enhanced training performance. Generated by… · 37

Hugging Face Daily Papers · research · 26d ago · SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation Abstract AI-generated images with realistic text and layouts pose a significant misinformation threat requiring new detection benchmarks and methods beyond surface-level credibility assessment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent generative models can now produce… · 32

The Information — AI · news-outlet · 26d ago · Benchmark Raises $1.25 Billion Fund to Back Mature Startups Benchmark has raised two new funds, one of which will invest in later-stage startups, breaking with the firm’s tradition of exclusively backing early-stage companies. The firm will invest in those more mature companies through a $1.25 billion fund and earlier ones through a $750… · 34

r/LocalLLaMA · community · 26d ago · gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint I don't really understand the gemma hype. Qwen outperforms gemma gb for gb, and kv cache is lighter. Sure gemma-4-12b-it might be a slight better coder than Qwen3.5-9b, but you could also just use omnicoder-9b (Qwen3.5-9b finetune for coding). Note: Benchmark results come from… · 19

r/LocalLLaMA · community · 26d ago · llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s I think the dust has settled(95+%) for Qwen3.6/3.5-MTP. After the initial PR, so much optimizations & fixes. Even sometime ago today, there's a MTP related PR got merged & released( b9495 ). So try this latest version & share your benchmarks t/s*. Great work by u/am17an & other… · 26

Hugging Face Daily Papers · research · 26d ago · Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning Abstract DOMINO enables domain-specific data synthesis through an inductive approach that learns domain representations from reference examples, improving code benchmark performance without requiring explicit domain descriptions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… · 35

r/LocalLLaMA · community · 26d ago · How does the new abliteration tool Apostate compare with others? - Abliterlitics Why Qwen 2.5 7B? Apostate is a new abliteration tool by heterodoxin. He asked me to benchmark it. Qwen 2.5 7B was recommended by heterodoxin as it's the most tested model for Apostate. I abliterated the model with Heretic v1.3.0 and Apostate. The models are available on… · 33

Hugging Face Daily Papers · research · 27d ago · PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training Abstract PaddleOCR-VL-1.6 enhances document parsing performance through targeted data optimization and progressive post-training techniques, achieving state-of-the-art results on OmniDocBench v1.6. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce PaddleOCR-VL-1.6, an… · 9

Hugging Face Daily Papers · research · 27d ago · AutoMedBench: Towards Medical AutoResearch with Agentic AI Models Abstract AutoMedBench presents a comprehensive benchmark for autonomous medical-AI research that evaluates agent performance across five workflow stages, revealing validation as the weakest stage and highlighting the importance of reliable pipeline execution and verification in… · 21

arXiv — Machine Learning · research · 27d ago · AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret arXiv:2606.02663v1 Announce Type: new Abstract: Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather predictors. But no model consistently dominates spatio-temporally, and relative performance… · 38

arXiv — Machine Learning · research · 27d ago · Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate arXiv:2606.02670v1 Announce Type: new Abstract: Many recent multivariate time series anomaly detection (MT-SAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this… · 30

arXiv — Machine Learning · research · 27d ago · Gate AI: LLM Security Benchmark Evaluation Methodology and Results arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation… · 27

arXiv — Machine Learning · research · 27d ago · Rethinking Molecular Text Representations for LLMs: An Empirical Study arXiv:2606.03057v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations… · 7

arXiv — NLP / Computation & Language · research · 27d ago · IdiomX A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation arXiv:2606.02584v1 Announce Type: new Abstract: Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often… · 36

arXiv — NLP / Computation & Language · research · 27d ago · Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling arXiv:2606.02837v1 Announce Type: new Abstract: Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have… · 16

arXiv — NLP / Computation & Language · research · 27d ago · Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States arXiv:2606.02907v1 Announce Type: new Abstract: Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the… · 21

arXiv — NLP / Computation & Language · research · 27d ago · EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction arXiv:2606.02971v1 Announce Type: new Abstract: Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal… · 10

arXiv — NLP / Computation & Language · research · 27d ago · SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia arXiv:2606.03027v1 Announce Type: new Abstract: Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed… · 17

arXiv — NLP / Computation & Language · research · 27d ago · WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts arXiv:2606.03220v1 Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task… · 16

arXiv — NLP / Computation & Language · research · 27d ago · Benchmarking Speech-to-Speech Translation Models arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and… · 5

arXiv — NLP / Computation & Language · research · 27d ago · SEA-NLI: Natural Language Inference as a Lens into Southeast Asian Cultural Understanding arXiv:2606.03284v1 Announce Type: new Abstract: Frontier LLMs perform well in Western contexts, but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual,… · 30

arXiv — NLP / Computation & Language · research · 27d ago · SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series arXiv:2606.03301v1 Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by… · 33

arXiv — NLP / Computation & Language · research · 27d ago · Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions… · 6

arXiv — NLP / Computation & Language · research · 27d ago · EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge arXiv:2606.03363v1 Announce Type: new Abstract: Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases,… · 15

Page 10 of 10