Tag: Benchmark · 500 articles archived under #benchmark
arXiv — NLP / Computation & Language research 15d ago MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition arXiv:2606.14459v1 Announce Type: new Abstract: Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech… 6 arXiv — NLP / Computation & Language research 15d ago SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model arXiv:2606.14574v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type… 8 arXiv — NLP / Computation & Language research 15d ago LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations arXiv:2606.14600v1 Announce Type: new Abstract: Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce… 32 arXiv — NLP / Computation & Language research 15d ago WorkBench Revisited: Workplace Agents Two Years On arXiv:2606.13715v1 Announce Type: cross Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. 
We re-visit the benchmark in June 2026 and find that the best… 34 arXiv — NLP / Computation & Language research 15d ago Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs arXiv:2606.13815v1 Announce Type: cross Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the… 37 arXiv — NLP / Computation & Language research 15d ago ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where… 4 arXiv — NLP / Computation & Language research 15d ago MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We… 23 arXiv — NLP / Computation & Language research 15d ago FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification arXiv:2508.05782v2 Announce Type: replace Abstract: Large language models are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many natural language processing applications, such as dialogue systems. 
As a… 29 arXiv — NLP / Computation & Language research 15d ago Residual Context Diffusion Language Models arXiv:2601.22954v2 Announce Type: replace Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a… 7 arXiv — NLP / Computation & Language research 15d ago C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning arXiv:2603.05167v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce… 20 r/LocalLLaMA community 15d ago Gemma 4 models benchmarked on with Triple GPU Hearing good things about Gemma 4. Ran a few models across my llama box. Kubuntu 26.04 OS. AMD Ryzen 5 3600 6-core CPU. 48 GiB of DDR4 3600 Mhz RAM. Nvidia GTX-1070 at 8GiB VRAM ( X 3 ) with 24GiB total VRAM. GPUs have power limit set to 120, 121, 122 watts using: sudo… 29 r/LocalLLaMA community 15d ago Quality evaluation of quants with limited time or tokens About a year ago, people were publishing a lot of benchmarks about various quants of models. I understand that it is not really feasible with the current (and other welcome) frequent releases of new models, but on the other side, it may be still useful to know locally whether q3… 36 r/LocalLLaMA community 15d ago Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192) First of all shout out to Aiden/Antirez & geniuses at the Nvidia community threads. I'm merely claude-vibing off of their works. 
That said, i thought i'd share recipes & learnings & benchmarks so far on running big MOE models on two dgx sparks at a reasonable speed for agent… 14 r/LocalLLaMA community 16d ago I don’t know who needs to hear this but 128GB BD-R XL M-DISC is SOTA for consumer-available archival optical storage (for backing up your models) If you’re trying to download and preserve your local LLMs in case of future availability issues due to AI-related politics, your best bet is either 128gb or 100gb Blu-Ray optical disks, more specifically BD-R XL M-DISC standard format which are archival-grade and built to last… 21 r/LocalLLaMA community 16d ago GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X. The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X What should we prioritize most? Longer context window MIT-licensed open weights No price increase Other links: GLM 5.2 announcement LLM Benchmark… 32 r/LocalLLaMA community 17d ago Diffusion Gemma is 4x faster, but makes 6x more mistakes! Benchmarked the new Gemma diffusion model against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS - every next topic less popular than the previous one. Then we… 14 NVIDIA Developer Blog official-blog 17d ago NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how... 8 Hugging Face Daily Papers research 17d ago The Cold-Start Safety Gap in LLM Agents Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe… 37 Hugging Face Daily Papers research 17d ago ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs Abstract Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension. Generated by… 27 Smol AI News news-outlet 18d ago not much happened today **Anthropic** suspended access to **Claude Fable 5** and **Mythos 5** due to **US export controls**, sparking a debate on **model sovereignty** and geopolitical risks for frontier AI vendors. **Artificial Analysis** updated its coding agent benchmark, replacing **SWE-Bench Pro**… 17 arXiv — NLP / Computation & Language research 18d ago Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants arXiv:2606.12608v1 Announce Type: new Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping… 8 arXiv — NLP / Computation & Language research 18d ago How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation arXiv:2606.12789v1 Announce Type: new Abstract: Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. 
We present… 22 arXiv — NLP / Computation & Language research 18d ago LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global… 19 arXiv — NLP / Computation & Language research 18d ago Polar: A Benchmark for Evaluating Political Bias in LLMs arXiv:2606.12922v1 Announce Type: new Abstract: Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that… 28 arXiv — NLP / Computation & Language research 18d ago LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most… 23 arXiv — NLP / Computation & Language research 18d ago MÖVE: A Holistic LLM Benchmark for the German Public Sector arXiv:2606.13111v1 Announce Type: new Abstract: We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public… 30 arXiv — NLP / Computation & Language research 18d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. 
Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to… 26 arXiv — NLP / Computation & Language research 18d ago SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation arXiv:2606.13647v1 Announce Type: new Abstract: We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual… 25 arXiv — NLP / Computation & Language research 18d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to… 30 arXiv — NLP / Computation & Language research 18d ago SupraBench: A Benchmark for Supramolecular Chemistry arXiv:2606.13477v1 Announce Type: cross Abstract: Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per… 38 Hugging Face Daily Papers research 18d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26 Hugging Face Daily Papers research 18d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments Abstract EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large… 5 Hugging Face Daily Papers research 18d ago WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Abstract WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods. Generated by… 38 Hugging Face Daily Papers research 18d ago InterleaveThinker: Reinforcing Agentic Interleaved Generation Abstract InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks. 
Generated by… 36 r/LocalLLaMA community 18d ago New models released: Nex-N2 Pro 397B and Nex-N2 Mini 35B They are FTs of Qwen3.5 and the benchmarks look pretty good https://huggingface.co/nex-agi/Nex-N2-mini https://huggingface.co/nex-agi/Nex-N2-Pro 23 r/LocalLLaMA community 18d ago DiffusionGemma under real workloads feels very different from benchmark demos okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol but one thing that stood out REALLY fast for us was how different the H100 vs A100… 29 Hugging Face Daily Papers research 18d ago τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems Abstract A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by… 6 Smol AI News news-outlet 19d ago not much happened today **Anthropic** reversed its covert degradation policy on **Claude Fable 5** after public backlash, sparking debates on governance, transparency, and access to frontier AI models. The model shows strong capabilities with mixed benchmark results, including **87.8% on WeirdML** and… 19 Smol AI News news-outlet 19d ago not much happened today **Anthropic's Fable/Mythos export-control crisis** dominates AI news, highlighting the intersection of **national security** and frontier model access. 
Technical voices like **François Chollet** criticize opaque regulatory actions and advocate for **standardized benchmarks for… 6 Hugging Face Daily Papers research 19d ago Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 16 arXiv — Machine Learning research 19d ago GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases,… 20 arXiv — NLP / Computation & Language research 19d ago GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs arXiv:2606.11562v1 Announce Type: cross Abstract: Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node… 37 arXiv — Machine Learning research 19d ago Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics arXiv:2606.11657v1 Announce Type: new Abstract: Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. 
This raises a central evaluation and interpretability question: when a foundation-style… 29 arXiv — Machine Learning research 19d ago Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks arXiv:2606.12344v1 Announce Type: new Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch,… 27 arXiv — NLP / Computation & Language research 19d ago Benchmarking Large Language Models for Safety Data Extraction arXiv:2606.11204v1 Announce Type: new Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks… 27 arXiv — NLP / Computation & Language research 19d ago BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts arXiv:2606.11208v1 Announce Type: new Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting… 29 arXiv — NLP / Computation & Language research 19d ago Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs arXiv:2606.11232v1 Announce Type: new Abstract: Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. 
Realistic judgments often require a model to combine several moral signals within the same… 14 arXiv — NLP / Computation & Language research 19d ago Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite arXiv:2606.11257v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use,… 35 arXiv — NLP / Computation & Language research 19d ago Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20 arXiv — NLP / Computation & Language research 19d ago AI Coding Agents Can Reproduce Social Science Findings arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks… 8