Tag: Benchmark. 500 articles archived under #benchmark. Hugging Face Daily Papers research 4d ago OpenBioRQ: Unsolved Biomedical Research Questions for Agents Abstract A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage. Generated by… 9 r/LocalLLaMA community 4d ago Stop waiting for Qwen3.7 Openweights. Ornith-1.0, a family of open-source LLMs specialized for agentic coding. Ornith-1.0 spans the full parameter sizes, including 9B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks. Hugging Face:… 36 GitHub Blog — AI & ML official-blog 4d ago Evaluating performance and efficiency of the GitHub Copilot agentic harness across models and tasks Explore how the GitHub Copilot agentic harness delivers strong results across multiple benchmarks and leading token efficiency, while maintaining flexibility to choose among more than 20 models. The post Evaluating performance and efficiency of the GitHub Copilot agentic harness… 19 Hugging Face Daily Papers research 4d ago Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching Abstract Lite Any Stereo V2 (LAS2) presents an efficient stereo matching approach that achieves state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Recent advances in… 9 r/LocalLLaMA community 4d ago Ornith-1.0 released on Hugging Face Including 9B Dense, 31B Dense, 35B MoE, and 397B MoE and reporting SOTA on different benchmarks (let's see if this holds).
https://huggingface.co/collections/deepreinforce-ai/ornith-10 submitted by /u/paf1138 26 r/MachineLearning community 4d ago CALHippo - Mapping neurons and glial cells in the human brain hippocampus in 3D using SOTA segmentation and density estimation models [R] Hello everyone! I'm posting our research work as you might be interested in how we used ML to map part of the brain cells of the human hippocampus :) We used various human brain slices at high resolution (1 micrometer per pixel) and developed a custom segmentation pipeline that… 32 r/MachineLearning community 4d ago I stopped trusting model benchmarks and started running my own eval set, here is what changed [D] Three things broke my faith in published benchmarks recently. One, Kimi K2.7 Code shipped with plus 21.8 percent on Kimi Code Bench v2, plus 11 percent on Program Bench, plus 31.5 percent on MLS Bench Lite. All three are Moonshot's own benchmarks. None were submitted to DeepSWE,… 23 Hugging Face Daily Papers research 4d ago Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models Abstract Autoregressive video diffusion extends diffusion distillation frameworks to real-time streaming generation through causal training paradigms, achieving state-of-the-art performance with fast convergence and interactive world modeling capabilities. Generated by… 4 Hugging Face Daily Papers research 4d ago Improved Large Language Diffusion Models Abstract Masked diffusion language models with fully bidirectional attention outperform autoregressive counterparts on various benchmarks while maintaining competitiveness with established models.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern large language models are… 18 Hugging Face Daily Papers research 4d ago ShutterMuse: Capture-Time Photography Guidance with MLLMs Abstract Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-world photography… 12 Smol AI News news-outlet 4d ago not much happened today **Z.ai's GLM-5.2** leads in coding and agent benchmarks with top scores like **1595** on Code Arena: Frontend and **34.29%** reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to **392 tok/s** using hardware and optimizations. **Ornith-1.0**, a new… 13 arXiv — Machine Learning research 5d ago MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios arXiv:2606.24950v1 Announce Type: new Abstract: Financial decision-making is contextual: forecasting prices, valuing companies, and assessing event exposure weigh price history, accounting fundamentals, macroeconomic regime, and contemporaneous text. A benchmark over these four… 25 arXiv — Machine Learning research 5d ago Are Tabular Foundation Models Robust to Realistic Query Distribution Shifts in Microbiome Data? arXiv:2606.24995v1 Announce Type: new Abstract: Tabular foundation models (TFMs) achieve strong performance on microbiome abundance data, yet their robustness under realistic distribution shift remains poorly characterized. We introduce a benchmark that evaluates the robustness… 22 arXiv — Machine Learning research 5d ago From Forecasting Leaderboards to Deployment Decisions: A Fail-Closed Certification Protocol arXiv:2606.24996v1 Announce Type: new Abstract: Forecasting leaderboards rank models by predictive quality, but their winners are often read as deployment-ready top-1 advice. 
That reading can fail when forecasts are passed through a fixed decision interface, such as an alert… 23 arXiv — NLP / Computation & Language research 5d ago Do Thinking Tokens Help with Safety? arXiv:2606.25013v1 Announce Type: cross Abstract: Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and… 37 arXiv — Machine Learning research 5d ago FDN: Interpretable Spatiotemporal Forecasting with Future Decomposition Networks arXiv:2606.25201v1 Announce Type: new Abstract: Spatiotemporal systems comprise a collection of spatially distributed yet interdependent entities each generating unique dynamic signals. Highly sophisticated methods have been proposed in recent years delivering state-of-the-art… 21 arXiv — Machine Learning research 5d ago TopoCast: A Topological Fidelity Framework for Evaluating Transformer-Based Time Series Forecasting arXiv:2606.25439v1 Announce Type: new Abstract: Deep learning-based models have achieved state-of-the-art performance in Time Series Forecasting (TSF), yet their evaluation remains dominated by pointwise error metrics such as Mean Squared Error (MSE), which quantify numerical… 37 arXiv — NLP / Computation & Language research 5d ago LLM Performance on a Real, Double-Marked GCSE Benchmark arXiv:2606.24973v1 Announce Type: new Abstract: We introduce a dataset of 32,534 double-marked real student responses to GCSE mock exams (GCSEs are the UK's national exams, taken at age ~16), spanning 328 questions across five subjects and including handwritten work. 
We test… 26 arXiv — NLP / Computation & Language research 5d ago LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges arXiv:2606.25057v1 Announce Type: new Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent… 11 arXiv — NLP / Computation & Language research 5d ago Riazi-8B: An Urdu Large Language Model for Mathematical Reasoning arXiv:2606.25568v1 Announce Type: new Abstract: Recent LLMs demonstrate strong mathematical reasoning capabilities, but existing gains rely heavily on English-centric training resources and benchmarks. As a result, reasoning performance degrades substantially in low-resource… 27 arXiv — NLP / Computation & Language research 5d ago Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability arXiv:2606.25819v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume… 26 arXiv — NLP / Computation & Language research 5d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. 
However, existing evaluations… 29 arXiv — NLP / Computation & Language research 5d ago Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models arXiv:2606.26079v1 Announce Type: new Abstract: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI… 31 arXiv — NLP / Computation & Language research 5d ago Evaluating LLMs on Real-World Software Performance Optimization arXiv:2606.25530v1 Announce Type: cross Abstract: Software performance optimization is a notoriously complex and manual task. Despite the growing use of Large Language Models (LLMs) for code refinement, we still lack benchmarks that capture how optimization actually happens in… 17 arXiv — NLP / Computation & Language research 5d ago Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet… 14 arXiv — NLP / Computation & Language research 5d ago How Robust is OCR-Reasoning? 
Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations arXiv:2606.26041v1 Announce Type: cross Abstract: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently… 29 arXiv — NLP / Computation & Language research 5d ago How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse arXiv:2510.23842v2 Announce Type: replace Abstract: Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts… 31 Hugging Face Daily Papers research 5d ago EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies Abstract EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models. Generated by… 18 Hugging Face Daily Papers research 5d ago MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery Abstract Long-term memory in LLM agents should be evaluated as an auditable post-interaction artifact by reconstructing structured user state from the agent's memory, as demonstrated by MEMPROBE, a benchmark testing memory recovery against synthetic ground truth across 50… 21 r/MachineLearning community 5d ago Find the best open-source OCR models in one place at Papers with Code [P] Hi, I've created an overview of the most important OCR benchmarks, along with the top open models, and links to their paper and code: https://paperswithcode.co/tasks/ocr . This week, new OCR models were released by Baidu and Mistral. 
Baidu released Unlimited OCR, a 3B-parameter… 27 r/MachineLearning community 5d ago I made a superhuman Generals.io agent with self-play RL [P] Hi everyone, I trained a self-play RL agent for Generals.io that reached superhuman-level and ranked #1 on the human 1v1 leaderboard. It began as my master's thesis where the goal was to beat a prior algorithm based agent. We succeeded using behavior cloning, RL fine-tuning and… 6 r/LocalLLaMA community 5d ago OpenAI and Broadcom unveil LLM-optimized inference chip https://openai.com/index/openai-broadcom-jalapeno-inference-chip/ Quoted from the start of the blog post: Early testing shows that the first-generation accelerator will deliver performance per watt substantially better than current state-of-the-art Built from the ground up for… 11 r/LocalLLaMA community 5d ago Qwen-AgentWorld-35B-A3B for Coding? Benchmark from its model card. Removed online models & Qwen-AgentWorld-397B-A17B from the table. Just Open models. Model MCP Search Term. SWE Android Web OS Overall DeepSeek-V4-Pro 63.27 27.61 51.26 59.44 55.17 50.32 63.70 52.97 GLM-5.1 67.60 22.46 47.32 52.07 59.10 51.50 59.13… 11 Hugging Face Daily Papers research 5d ago AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language… 26 r/MachineLearning community 5d ago I compiled LLM inference pricing across 7 providers — the caching numbers are surprising (spreadsheet included) [R] I've been comparing GPU/LLM providers for a side project and ended up with way too many browser tabs and spreadsheets. So I decided to pull the public pricing data into one sheet and compare it side by side. A quick disclaimer: this is not benchmark data.
I didn't run latency… 32 Hugging Face Daily Papers research 5d ago ChartWalker: Benchmarking the Cross-Chart RAG Task Abstract ChartWalker presents a novel framework for cross-chart retrieval-augmented generation with hierarchical knowledge graph construction and structure-aware sampling for challenging multi-modal analytical tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Cross-Chart… 33 Hugging Face Daily Papers research 5d ago DiffusionBench: On Holistic Evaluation of Diffusion Transformers Abstract Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers that demonstrates the need for comprehensive benchmarking beyond ImageNet class-conditional generation to assess true progress in generative modeling. Generated by… 25 Hugging Face Daily Papers research 5d ago LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis Abstract A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Mental… 36 arXiv — Machine Learning research 6d ago You Don't Need to Run Every Eval arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to… 29 arXiv — Machine Learning research 6d ago RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting arXiv:2606.24062v1 Announce Type: new Abstract: Financial time series forecasting presents structural challenges absent from standard benchmarks. 
Log-returns are non-stationary, exhibit exceptionally low signal-to-noise (SNR) ratios, and are governed by regime-dependent temporal… 8 arXiv — Machine Learning research 6d ago Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment arXiv:2606.24173v1 Announce Type: new Abstract: On-device fault detection enables real-time diagnostics without cloud dependency, but deploying machine learning models on resource-constrained hardware demands careful tradeoffs between accuracy, latency, and model size. We… 14 arXiv — NLP / Computation & Language research 6d ago BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks arXiv:2606.24162v1 Announce Type: new Abstract: Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject… 27 arXiv — Machine Learning research 6d ago Open-Vocabulary BEV Segmentation with 3D-Aware Geometric Constraints arXiv:2606.24353v1 Announce Type: cross Abstract: Bird's-eye view (BEV) perception fuses multi-camera images into a unified top-down representation for autonomous driving. Despite recent progress, state-of-the-art methods remain confined to closed-set scenarios, making them… 6 arXiv — Machine Learning research 6d ago PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models arXiv:2606.24388v1 Announce Type: cross Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). 
The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by… 38 arXiv — NLP / Computation & Language research 6d ago QuechuaTok: Morphological Boundary Accuracy as a Necessary Metric for Tokenizer Evaluation in Agglutinative Low-Resource Languages arXiv:2606.23943v1 Announce Type: new Abstract: Tokenization is a foundational step in NLP pipelines, yet standard evaluation metrics such as fertility rate fail to capture morphological correctness for agglutinative languages. We present QuechuaTok, a systematic benchmark… 32 arXiv — NLP / Computation & Language research 6d ago RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring arXiv:2606.23992v1 Announce Type: new Abstract: Clinical value sets define the standardized terminology codes used in quality measurement, phenotyping, cohort construction, and clinical decision support. The recently introduced Retrieval-Augmented Set Completion (RASC) benchmark… 32 arXiv — NLP / Computation & Language research 6d ago MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,… 38 arXiv — NLP / Computation & Language research 6d ago A Pāninian Foundation for Indic Language Processing arXiv:2606.24172v1 Announce Type: new Abstract: More than a billion people communicate in Indic languages, yet the natural language processing infrastructure serving them remains fragmented and underdeveloped.
The cause is structural: the field organizes its tools and benchmarks… 24 arXiv — NLP / Computation & Language research 6d ago A Synthetic Reliability-Aware PINN Benchmark for Offshore Wind Turbine Support-Structure Monitoring with Bayesian Inverse Identification arXiv:2606.24176v1 Announce Type: new Abstract: Reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures requires fast state estimation from sparse measurements. Repeated high fidelity finite element or aeroelastic analyses are difficult to… 8 arXiv — NLP / Computation & Language research 6d ago MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval arXiv:2606.24200v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) in clinical settings increasingly requires multilingual retrieval against predominantly English evidence corpora. Multilingual medical retrieval demands three capabilities: cross-lingual… 36