News / #benchmark
80 articles archived under #benchmark

- **Hugging Face Daily Papers** · research · 3h ago · "LLM Agents Already Know When to Call Tools -- Even Without Reasoning": The When2Tool benchmark identifies conditions under which tool calls are necessary for LLM agents, revealing that models can predict tool necessity from hidden states but fail to act on this knowledge, leading to the development of the Probe&Prefill method, which reduces…
- **Hugging Face Daily Papers** · research · 6h ago · "Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception": Urban-ImageNet presents a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, organized under a hierarchical taxonomy for scene classification, cross-modal retrieval, and instance segmentation tasks.
- **r/MachineLearning** · community · 8h ago · "Best examples of ML projects with good dataset/task code abstractions? [D]": I am working on a benchmark and need to manage several interlocking components: datasets and metadata, diverse ML tasks (varying inputs and outputs), and baseline experiments covering models, training, and evaluations. Any pointers to projects that handle these through…
- **Hugging Face Daily Papers** · research · 14h ago · "WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting": The WildRelight dataset addresses the gap between synthetic and real-world single-image relighting by providing high-resolution outdoor scenes with aligned natural illumination, enabling physics-guided domain adaptation through diffusion posterior sampling and test-time…
- **Hugging Face Daily Papers** · research · 16h ago · "One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue": A multi-turn dialogue safety monitoring system detects harmful intent accumulation through turn-level analysis and evaluates performance on a new benchmark dataset.
- **Hugging Face Daily Papers** · research · 18h ago · "Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents": A visual-native agent harness with an image-bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation that improves multimodal deep search performance across multiple benchmarks.
- **Hugging Face Daily Papers** · research · 18h ago · "From Web to Pixels: Bringing Agentic Search into Visual Perception": Researchers introduce WebEye, a benchmark for object localization requiring external knowledge resolution, and Pixel-Searcher, an agent-based approach that connects hidden target identities to visual annotations through search and reasoning.
- **Hugging Face Daily Papers** · research · 18h ago · "SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning": The SeePhys Pro benchmark reveals that current multimodal models struggle with representation-invariant reasoning when information shifts from text to visual formats, and demonstrates that blind training can improve performance through residual textual cues.
- **Hugging Face Daily Papers** · research · 18h ago · "MEME: Multi-entity & Evolving Memory Evaluation": The MEME benchmark evaluates memory systems across multiple entities and evolving conditions, revealing persistent challenges in dependency reasoning despite advanced retrieval and prompting techniques.
- **arXiv — Machine Learning** · research · 19h ago · "ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder" (arXiv:2605.11091v1): Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and a near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We…
- **arXiv — Machine Learning** · research · 19h ago · "Optimistic Dual Averaging Unifies Modern Optimizers" (arXiv:2605.11172v1): We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix, and NAdam, showing that they can all be viewed as optimistic instances…
- **arXiv — Machine Learning** · research · 19h ago · "The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains" (arXiv:2605.11205v1): Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation…
- **arXiv — Machine Learning** · research · 19h ago · "Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks" (arXiv:2605.11209v1): While existing benchmarks demonstrate the near-perfect performance of large language models (LLMs) on various tasks, this apparent saturation often obscures the need for rigorous evaluation of their reliability. In real-world…
- **arXiv — Machine Learning** · research · 19h ago · "A Proof-of-Concept Simulation-Driven Digital Twin Framework for Decision-Aware Diabetes Modeling" (arXiv:2605.11247v1): This paper presents a proof-of-concept digital twin framework for simulation-driven diabetes modeling using benchmark clinical data, synthetic temporal augmentation, and illustrative continuous glucose monitoring (CGM) analysis.
- **arXiv — Machine Learning** · research · 19h ago · "gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods" (arXiv:2605.11355v1): Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility…
- **arXiv — Machine Learning** · research · 19h ago · "CTFusion: A CTF-based Benchmark for LLM Agent Evaluation" (arXiv:2605.11504v1): Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs" (arXiv:2605.11128v1): Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV" (arXiv:2605.11143v1): Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability" (arXiv:2605.11663v1): Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "On Predicting the Post-training Potential of Pre-trained LLMs" (arXiv:2605.11978v1): The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation" (arXiv:2605.12022v1): Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "PreScam: A Benchmark for Predicting Scam Progression from Early Conversations" (arXiv:2605.12243v1): Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid-toll messages, they unfold through multi-turn conversations in…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering" (arXiv:2605.12361v1): Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues" (arXiv:2605.12493v1): Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for…
- **arXiv — NLP / Computation & Language** · research · 19h ago · "AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment" (arXiv:2605.11398v1, cross-listed): We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health…
- **Hugging Face Daily Papers** · research · 20h ago · "Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values": Autonomous agents exhibit value systems distinct from their underlying language models, requiring new benchmarking approaches to assess alignment across diverse execution environments.
- **Hugging Face Daily Papers** · research · 20h ago · "LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues": A new benchmark called LongMemEval-V2 is introduced to evaluate memory systems' ability to help agents acquire environment-specific experience in web environments, featuring a suite of memory methods including AgentRunbook-R and AgentRunbook-C that demonstrate varying…
- **Hugging Face Daily Papers** · research · 21h ago · "Teaching Language Models to Think in Code": The ThinC framework enables mathematical problem solving where code serves as the primary reasoning mechanism instead of a verification tool, demonstrating superior performance on math benchmarks.
- **r/LocalLLaMA** · community · 1d ago · "MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more": I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant-to-tensor assignment. And some architectures like Qwen3.6 27B have super weird patterns that can get genuinely…
- **r/LocalLLaMA** · community · 1d ago · "Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results": Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup — hardware: 1x H100 80GB; runtime: vLLM; dataset: SPEED-Bench qualitative; prompts: 880 total, 80 in each of 11 categories; models:…
- **r/LocalLLaMA** · community · 1d ago · "examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp": Now you can evaluate your models at home; sounds like a perfect tool to compare quants and finetunes. Datasets: AIME, AIME2025, GSM8K, GPQA. Submitted by /u/jacek2023.
- **Smol AI News** · news-outlet · 1d ago · "not much happened today": **Research-level reasoning benchmarks** are advancing with **439 new math problems** from **64 mathematicians** and expanded medical benchmarks in **Medmarks v1.0** covering **30 benchmarks** and **61 models**. **Google DeepMind's AI Co-Mathematician** achieves **48% on…
- **Latent.Space** · news-outlet · 1d ago · "[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD": Well done, Team Thinky.
- **Vercel — AI** · dev-tools · 1d ago · "AI Gateway production index": Ask which AI model is best, and the answer changes before the ink dries. That's what happens in an industry where new models are released weekly. Every benchmark measures a different race, and every race crowns its own winner, but Vercel has a unique view of the industry through…
- **Latent.Space** · news-outlet · 5d ago · "[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs": OpenAI continues deploying GPT-5 everywhere.
- **Smol AI News** · news-outlet · 6d ago · "GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs": **OpenAI** released **GPT-Realtime-2**, a voice model with **GPT-5-class reasoning**, tool use, interruption handling, and extended context windows up to **128K tokens**, achieving top scores on the **Big Bench Audio** and **Conversational Dynamics** benchmarks. They also launched a…
- **Smol AI News** · news-outlet · 9d ago · "not much happened today": **AI Twitter Recap** highlights the shift from model-centric AI to **context pipelines** and **agent orchestration** as key performance drivers. Notably, **gpt-5.2-codex** and **gpt-5.3-codex** showed significant benchmark improvements through prompt and middleware tuning. The…
- **The Algorithmic Bridge** · news-outlet · 12d ago · "Weekly Top Picks #120": Q1 earnings / Trump wants to nationalize AI / China protects its workers / ARC-AGI-3 defeats GPT-5.5 and Opus-4.7 / The "permanent underclass" / Dawkins x Claudia.
- **Smol AI News** · news-outlet · 16d ago · "not much happened today": **OpenAI** loosens its **Azure exclusivity**, allowing distribution across **Google TPU**, **AWS Trainium**, and **Bedrock**, with commitments through **2032** and revenue share through **2030**. **GPT-5.5** shows improved benchmarks but is not uniformly dominant, ranking…
- **Latent.Space** · news-outlet · 18d ago · "[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips": The prodigal Tiger returns... but is no longer the benchmarks leader.
- **Smol AI News** · news-outlet · 21d ago · "not much happened today": **Alibaba** released **Qwen3.6-27B**, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench. It supports native vision-language reasoning over…
- **OpenAI** · news · 21d ago · "Introducing OpenAI Privacy Filter": OpenAI Privacy Filter is an open-weight model for detecting and redacting personally identifiable information (PII) in text with state-of-the-art accuracy.
- **OpenAI** · news · 22d ago · "Introducing ChatGPT Images 2.0": ChatGPT Images 2.0 introduces a state-of-the-art image generation model with improved text rendering, multilingual support, and advanced visual reasoning.
- **Smol AI News** · news-outlet · 26d ago · "not much happened today": **Anthropic** launched **Claude Design**, a prototyping tool powered by **Claude Opus 4.7**, targeting design workflows and competing with **Figma** and others. Benchmarks show **Opus 4.7** leading in coding and text tasks, with improved efficiency and adaptive reasoning, though…
- **Smol AI News** · news-outlet · 27d ago · "Anthropic's Claude Opus 4.7": **Anthropic** launched **Claude Opus 4.7**, its most capable Opus model yet, featuring stronger coding and agentic performance, a new tokenizer, and improved long-context handling with a new **xhigh** reasoning tier. Benchmarks show substantial gains, including **SWE-bench Pro…
- **Vercel — AI** · dev-tools · 28d ago · "Seedance 2.0 Video Generation on AI Gateway": You can now access ByteDance's latest state-of-the-art video generation model, Seedance 2.0, via AI Gateway, with no other provider accounts required. Seedance 2.0 is available on AI Gateway in two variants, Standard and Fast; both share the same capabilities. Standard produces…
- **Interconnects** · research · 1mo ago · "Gemma 4 and what makes an open model succeed": Hint: it's not benchmark scores.
- **ThursdAI** · news-outlet · 1mo ago · "AGI is here? Jensen says yes, ARC-AGI-3 says AI scores under 1%": Listen now (100 mins) | From Weights & Biases: Daniel Han from Unsloth joins our panel to discuss Turbo Quant, Anthropic beats DoW in court, and Jensen claims AGI while ARC-AGI-3 claims "no AGI".
- **Smol AI News** · news-outlet · 1mo ago · "not much happened today": The **ARC-AGI-3** benchmark introduced by **@arcprize** and **François Chollet** resets the frontier for general agentic reasoning, with humans solving 100% of tasks versus under 1% for current models, focusing on zero-preparation generalization and human-like learning efficiency.
- **Smol AI News** · news-outlet · 1mo ago · "not much happened today": **Cursor** launched **Composer 2**, a frontier-class coding model with major cost reductions and strong benchmark scores like **61.3 on CursorBench** and **73.7 on SWE-bench Multilingual**. The model was improved via a **first continued pretraining run** feeding into…
- **Smol AI News** · news-outlet · 1mo ago · "MiniMax 2.7: GLM-5 at 1/3 cost SOTA Open Model": **MiniMax M2.7** is the headline model release, described as a "self-evolving agent" with strong performance metrics including **56.22% on SWE-Pro**, **57.0% on Terminal Bench 2**, and parity with **Sonnet 4.6**. It features recursive self-improvement in skills, memory, and…
- **Vercel — AI** · dev-tools · 1mo ago · "Use GPT 5.4 Mini and Nano on AI Gateway": GPT-5.4 Mini and GPT-5.4 Nano from OpenAI are now available on Vercel AI Gateway. Both models deliver state-of-the-art performance for their size class in coding and computer use, and are built for sub-agent workflows where multiple smaller models coordinate on parts of a…
- **Smol AI News** · news-outlet · 2mo ago · "GPT 5.4: SOTA Knowledge Work -and- Coding -and- CUA Model, OpenAI is so very back": **OpenAI** launched **GPT-5.4** and **GPT-5.4 Pro** with unified mainline and Codex models, featuring **native computer use**, up to **~1M token context**, and efficiency improvements including a new **Codex `/fast` mode**. Benchmarks showed strong results like…
- **ThursdAI** · news-outlet · 2mo ago · "📅 ThursdAI - Feb 26 - The Pentagon wants War Claude, every benchmark collapsed, and a solo founder hit $700K ARR with AI agents": From Weights & Biases: this week is the closest I've felt to the AI singularity starting, with bonkers one-man AI startups crossing $700K ARR live on the show, DoD vs Anthropic, Anthropic vs Chinese models, and more.
- **Smol AI News** · news-outlet · 2mo ago · "Nano Banana 2 aka Gemini 3.1 Flash Image Preview: the new SOTA Imagegen model": **Google and DeepMind** launched **Nano Banana 2** (aka **Gemini 3.1 Flash Image Preview**), a leading image generation and editing model integrated across multiple Google products with features like **4K upscaling**, **multi-subject consistency**, and **real-time…
- **Import AI** · news-outlet · 2mo ago · "Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy": Will AIs be jealous of one another?
- **Smol AI News** · news-outlet · 2mo ago · "not much happened today": **Gemini 3.1 Pro** demonstrates strong retrieval capabilities and cost efficiency compared to **GPT-5.2** and **Opus 4.6**, though users report tooling and UI issues. The **SWE-bench Verified** evaluation methodology is under scrutiny for consistency, with updates bringing…
- **Smol AI News** · news-outlet · 2mo ago · "Gemini 3.1 Pro: 2x 3.0 on ARC-AGI 2": **Google** released **Gemini 3.1 Pro**, a developer preview integrated across the **Gemini app**, **NotebookLM**, **Gemini API / AI Studio**, and **Vertex AI**, highlighting a significant reasoning improvement with **ARC-AGI-2 = 77.1%** and strong coding and agentic-tool…
- **NVIDIA Developer Blog** · official-blog · 2mo ago · "Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute": Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to...
- **Smol AI News** · news-outlet · 2mo ago · "Claude Sonnet 4.6: clean upgrade of 4.5, mostly better with some caveats": **Anthropic** launched **Claude Sonnet 4.6**, an upgrade over Sonnet 4.5, featuring broad improvements in **coding, long-context reasoning, agent planning, knowledge work, and design**, plus a **1M-token context window (beta)**. Benchmarks show Sonnet 4.6 leading on **GDPval-AA…
- **Import AI** · news-outlet · 2mo ago · "Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark": Will 2026 be looked back on as the pivotal year for making decisions about the singularity?
- **Smol AI News** · news-outlet · 2mo ago · "MiniMax-M2.5: SOTA coding, search, toolcalls, $1/hour": **MiniMax-M2.5** is now open source, featuring an "agent-native" reinforcement learning framework called **Forge** trained across **200k+ RL environments** for coding, tool use, and workflows. It boasts strong benchmark scores like **80.2% SWE-Bench Verified** and emphasizes…
- **ThursdAI** · news-outlet · 2mo ago · "📆 Open source just pulled up to Opus 4.6 — at 1/20th the price": Plus: Gemini 3 Deep Think hits 84% on ARC-AGI, OpenAI's new 1000 t/s coding model, and the video model that shattered reality.
- **Smol AI News** · news-outlet · 3mo ago · "new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5": **Google DeepMind** is rolling out the upgraded **Gemini 3 Deep Think V2** reasoning mode to **Google AI Ultra** subscribers and opening early access to the **Vertex AI / Gemini API** for select users. Key benchmark achievements include **ARC-AGI-2 at 84.6%**, **Humanity's Last…
- **Smol AI News** · news-outlet · 3mo ago · "Z.ai GLM-5: New SOTA Open Weights LLM": **Zhipu AI** launched **GLM-5**, an **Opus-class** model scaling from **355B to 744B parameters** with **DeepSeek Sparse Attention** integration for cost-efficient long-context serving. GLM-5 achieves **SOTA on BrowseComp** and leads on **Vending Bench 2**, focusing on office…
- **Interconnects** · research · 3mo ago · "Opus 4.6, Codex 5.3, and the post-benchmark era": On comparing models in 2026.
- **Smol AI News** · news-outlet · 3mo ago · "not much happened today": **AI News** for early February 2026 highlights a detailed comparison between **GPT-5.3-Codex** and **Claude Opus 4.6**, with users noting **Codex's** strength in detailed scoped tasks and **Opus's** ergonomic advantage for exploratory work. Benchmarks on Karpathy's **nanochat…
- **Hugging Face** · official-blog · 3mo ago · "Community Evals: Because we're done trusting black-box leaderboards over the community": Published February 4, 2026, by Ben Burtenshaw, Nathan Habib, Bertrand Chevrier, Merve, Daniel van Strien, and others.
- **Smol AI News** · news-outlet · 3mo ago · "Context Graphs: Hype or actually Trillion-dollar opportunity?": **Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding, with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline…
- **Smol AI News** · news-outlet · 3mo ago · "Moonshot Kimi K2.5 - Beats Sonnet 4.5 at half the cost, SOTA Open Model, first Native Image+Video, 100 parallel Agent Swarm manager": **MoonshotAI's Kimi K2.5** is a **32B-active, 1T-parameter open-weights model** featuring **native multimodality** with image and video understanding, built through continual pretraining on **15 trillion mixed visual and text tokens**. It introduces a new **MoonViT vision…
- **Hugging Face** · official-blog · 3mo ago · "AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality": Published January 21, 2026, by Dhaval Patel, James Rayfield, Saumya Ahuja (IBM Research), and others.
- **Ahead of AI (Sebastian Raschka)** · research · 4mo ago · "The State Of LLMs 2025: Progress, Problems, and Predictions": A 2025 review of large language models, from DeepSeek R1 and RLVR to inference-time scaling, benchmarks, architectures, and predictions for 2026.
- **Smol AI News** · news-outlet · 4mo ago · "Meta Superintelligence Labs acquires Manus AI for over $2B, at $100M ARR, 9 months after launch": **Manus** achieved a rapid growth trajectory in 2025, raising **$500M** from Benchmark and reaching **$100M ARR** before being acquired by **Meta** for an estimated **$4B**. The **vLLM** team launched a dedicated community site with new resources, while performance issues with…
- **Hugging Face** · official-blog · 4mo ago · "The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator": Published December 17, 2025, by Seph Mard, Isabel Hulseman, Besmira Nushi, Piotr Januszewski (NVIDIA), and others.
- **Google DeepMind** · official-blog · 5mo ago · "FACTS Benchmark Suite: Systematically evaluating the factuality of large language models": Systematically evaluating the factuality of large language models with the FACTS Benchmark Suite.
- **Hugging Face** · official-blog · 5mo ago · "Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks": Published November 21, 2025, by Eric Bezzam, Steven Zheng, Eustache Le Bihan, and Vaibhav Srivastav. While everyone (and their…
- **Ahead of AI (Sebastian Raschka)** · research · 7mo ago · "Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)": Multiple-choice benchmarks, verifiers, leaderboards, and LLM judges, with code examples.
- **Eugene Yan** · research · 10mo ago · "Evaluating Long-Context Question & Answer Systems": Evaluation metrics, how to build eval datasets, eval methodology, and a review of several benchmarks.
- **Nonint (James Betker)** · research · 16mo ago · "Beating ARC the hard way": ARC is a benchmark developed to test out-of-distribution reasoning and common sense in general solvers. It is specifically designed to be: easily solvable by most humans; not amenable to any kind of brute-force solver (e.g. trying every permutation of a solution); not able to be…
- **Lil'Log (Lilian Weng)** · research · 40mo ago · "Large Transformer Model Inference Optimization": [Updated on 2023-01-24: add a small section on Distillation.] Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a…
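The r/MachineLearning question above, about code abstractions for a benchmark's interlocking datasets, tasks, and baselines, can be sketched with a small protocol layer. This is a minimal illustration of one possible design, not drawn from any project in this feed; all names (`Dataset`, `Task`, `run_baseline`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Protocol


class Task(Protocol):
    """A benchmark task: turns a raw record into (input, target) and scores a prediction."""

    def prepare(self, record: Dict[str, Any]) -> tuple:
        ...

    def score(self, prediction: Any, target: Any) -> float:
        ...


@dataclass
class Dataset:
    """Raw records plus metadata, kept independent of any particular task."""
    name: str
    records: List[Dict[str, Any]]
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ClassificationTask:
    """One concrete Task: exact-match classification over a text field."""
    name: str = "classification"

    def prepare(self, record: Dict[str, Any]) -> tuple:
        return record["text"], record["label"]

    def score(self, prediction: Any, target: Any) -> float:
        return 1.0 if prediction == target else 0.0


def run_baseline(dataset: Dataset, task: Task,
                 model: Callable[[Any], Any]) -> float:
    """Evaluate one model on one (dataset, task) pair; returns the mean score."""
    scores = []
    for record in dataset.records:
        x, y = task.prepare(record)
        scores.append(task.score(model(x), y))
    return sum(scores) / len(scores)


# Toy usage: a trivial length-based "model" on a two-record dataset.
ds = Dataset("toy-sentiment", [
    {"text": "great", "label": "pos"},
    {"text": "terrible movie", "label": "neg"},
])
baseline = lambda x: "pos" if len(x) < 10 else "neg"
acc = run_baseline(ds, ClassificationTask(), baseline)
print(acc)  # 1.0
```

The point of the split is that datasets carry no task logic, tasks carry no model logic, and the experiment runner composes the two, so new tasks with different input/output shapes only need to implement `prepare` and `score`.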