Tag

Funding

500 articles archived under #funding

arXiv — Machine Learning research 2h ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 2h ago

How Far Can Sharpness and Complexity Jointly Explain Generalization?

arXiv:2606.29043v1 Announce Type: new Abstract: Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the…

13
arXiv — Machine Learning research 2h ago

Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps

arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating…

32
arXiv — Machine Learning research 2h ago

Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models

arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using…

27
arXiv — Machine Learning research 2h ago

Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation

arXiv:2606.29471v1 Announce Type: new Abstract: Strictly proper scoring rules identify the true conditional class distribution at population level, but their curvature can alter optimization and finite-sample behavior. We study three multiclass objectives: a class-aware…

23
arXiv — NLP / Computation & Language research 2h ago

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite their importance to sovereign AI. To fill this gap, we introduce…

28
arXiv — NLP / Computation & Language research 2h ago

Fine-Tuning General-Purpose Large Language Models for Agricultural Applications: A Reproducible Framework and Evaluation Protocol Based on Qwen3-8B

arXiv:2606.28992v1 Announce Type: new Abstract: General-purpose large language models (LLMs) have demonstrated strong abilities in open-domain question answering, information extraction, and text generation. Agricultural applications, however, are domain-specific,…

20
arXiv — NLP / Computation & Language research 2h ago

Understanding Evaluation Illusion in Diffusion Large Language Models

arXiv:2606.29228v1 Announce Type: new Abstract: Despite the capability of parallel decoding, diffusion large language models (dLLMs) require many denoising steps to maintain generation quality, motivating recent research on efficient decoding strategies. However, existing…

23
arXiv — NLP / Computation & Language research 2h ago

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against…

8
arXiv — NLP / Computation & Language research 2h ago

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical…

10
arXiv — NLP / Computation & Language research 2h ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it…

4
arXiv — NLP / Computation & Language research 2h ago

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is…

17
TechCrunch — AI news-outlet 17h ago

Omen AI’s plan to optimize data centers is all wet

Omen AI raised a $31 million Series A to monitor chip coolant and stop bacterial outbreaks in data centers.

8
arXiv — Machine Learning research 1d ago

Unified Zero-Shot Time Series Forecasting: A Darts Foundation

arXiv:2606.27438v1 Announce Type: new Abstract: Since its initial release in 2020, Darts has become a widely used open-source Python library for time series analysis. A series of foundation models have recently claimed accuracy improvements in zero-shot forecasting, promising a…

15
arXiv — Machine Learning research 1d ago

Productionized Fairness Measurement Under Privacy Constraints

arXiv:2606.27558v1 Announce Type: new Abstract: Fairness measurements in the form of disaggregated evaluations often rely on demographic signals that are legally constrained or culturally sensitive. Race and ethnicity signals are among the more difficult signals to curate and…

34
arXiv — Machine Learning research 1d ago

Quantum Generative Diffusion Model for Real-World Time Series

arXiv:2606.27561v1 Announce Type: new Abstract: Generative models have achieved remarkable success in data synthesis, though recent advances driven by increasing model scale have introduced challenges in computational cost and efficiency. Quantum machine learning offers a…

10
arXiv — Machine Learning research 1d ago

GNBAN: Graph Neural Basis Attention Networks for Long-Horizon Forecasting over Large Entity Sets

arXiv:2606.27863v1 Announce Type: new Abstract: Demand forecasting at the bottom of a retail hierarchy requires predicting tens of thousands of correlated long-horizon series across products, stores, and regions. Modern systems must scale across massive catalogs, capture shared…

33
arXiv — Machine Learning research 1d ago

TA-SparseMG: Trend-Aware Sparse Forecasting via Multi-Scale Gating for Long-Term Time Series

arXiv:2606.27908v1 Announce Type: new Abstract: Long-term time series forecasting finds extensive applications in domains such as power demand, traffic flow, meteorological observation, and renewable energy dispatch. Forecasting dynamically varying long-term time series poses…

21
arXiv — Machine Learning research 1d ago

Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings

arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets…

21
arXiv — Machine Learning research 1d ago

COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives

arXiv:2606.28194v1 Announce Type: new Abstract: While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on…

18
arXiv — Machine Learning research 1d ago

Democratic ICAI: Debating Our Way to Steering Principles from Preferences

arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the…

38
arXiv — NLP / Computation & Language research 1d ago

Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs

arXiv:2606.27378v1 Announce Type: new Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.…

29
arXiv — NLP / Computation & Language research 1d ago

Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

arXiv:2606.27909v1 Announce Type: new Abstract: Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever…

15
arXiv — NLP / Computation & Language research 1d ago

Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA

arXiv:2606.28050v1 Announce Type: new Abstract: LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model…

29
arXiv — NLP / Computation & Language research 1d ago

Subject-level Inference for Realistic Text Anonymization Evaluation

arXiv:2604.21211v2 Announce Type: replace Abstract: Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations,…

6
r/LocalLLaMA community 1d ago

DeepSpec - a deepseek-ai Collection

DeepSpec DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding. It contains data preparation utilities, draft model implementations, training code, and evaluation scripts. Released Checkpoints The checkpoints below are the ones used…

26
r/LocalLLaMA community 2d ago

I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data is biased toward its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0.

I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55…

35
Hugging Face Daily Papers research 2d ago

COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami

Abstract A computational origami system generates crease patterns from natural language using AI-driven optimization and aesthetic evaluation, enabling human-AI collaboration in mathematically constrained design. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While generative AI…

11
r/MachineLearning community 2d ago

Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P]

When evaluating migrating production LLM workloads off commercial cloud APIs, the conversation usually gets oversimplified into a trade-off between quality and infrastructure cost. To look past clean, isolated averages, I built a repeatable evaluation matrix using a real-world…

29
r/LocalLLaMA community 2d ago

Orthrus (diffusion head) trained Qwen 3.5/3.6 and Gemma 4 models are dropping soon

"Hi all, we are finalized with our testing and are preparing the release pipeline. We will be releasing support for the Qwen3.5, Qwen3.6, and Gemma4 very soon. Alongside the model checkpoints, we will be open-sourcing our complete end-to-end training and evaluation code. Stay…

19
arXiv — Machine Learning research 4d ago

Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations

arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's…

4
arXiv — Machine Learning research 4d ago

The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators

arXiv:2606.26294v1 Announce Type: new Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier,…

25
arXiv — Machine Learning research 4d ago

EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning

arXiv:2606.26327v1 Announce Type: new Abstract: In actor-critic reinforcement learning, network architectures are typically manually designed. Automating this design is challenging because each candidate must be trained before evaluation, and the design space is open-ended. To…

29
arXiv — NLP / Computation & Language research 4d ago

DualEval: Joint Model-Item Calibration for Unified LLM Evaluation

arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce…

24
arXiv — Machine Learning research 4d ago

Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform

arXiv:2606.26590v1 Announce Type: new Abstract: Security misconfigurations in Terraform Infrastructure-as-Code are a growing risk in cloud deployments, and large language models are increasingly used as automated repair agents. Existing evaluations often treat a repair as…

5
arXiv — Machine Learning research 4d ago

Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space

arXiv:2606.26657v1 Announce Type: new Abstract: Identifying high-utility candidates from massive discrete spaces under expensive evaluations is a recurring challenge across the sciences, with structure-based drug discovery as a prominent example. While surrogate-based…

20
arXiv — Machine Learning research 4d ago

Decision-Aligned Evaluation of Uncertainty Quantification

arXiv:2606.26990v1 Announce Type: new Abstract: Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high…

13
arXiv — NLP / Computation & Language research 4d ago

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

arXiv:2606.26101v1 Announce Type: new Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a…

21
arXiv — NLP / Computation & Language research 4d ago

From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's…

12
arXiv — NLP / Computation & Language research 4d ago

ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent

arXiv:2606.26403v1 Announce Type: new Abstract: Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or…

34
arXiv — NLP / Computation & Language research 4d ago

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

arXiv:2606.26775v1 Announce Type: new Abstract: Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and…

15
arXiv — NLP / Computation & Language research 4d ago

Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While…

36
arXiv — NLP / Computation & Language research 4d ago

Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents

arXiv:2606.26479v1 Announce Type: cross Abstract: Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model…

38
arXiv — NLP / Computation & Language research 4d ago

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against…

18
arXiv — NLP / Computation & Language research 4d ago

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores…

14
Hugging Face Daily Papers research 4d ago

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

Abstract Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a…

7
Hugging Face Daily Papers research 4d ago

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are…

23
TechCrunch — AI news-outlet 4d ago

General Intuition’s $2.3B bet that video games can train AI agents for the real world

General Intuition has raised $320 million to scale AI trained on millions of hours of gameplay, betting action data can help AI develop something closer to human intuition.

25
TechCrunch — AI news-outlet 4d ago

Netris raises $15M Series A from a16z to help AI neoclouds go live faster

Netris provides software that runs on network switches, and offers a platform that helps neocloud operators reduce the time it takes to go live.

36
Hugging Face Daily Papers research 5d ago

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." This caveman style is widely…

28
