Funding
500 articles archived under #funding

arXiv — Machine Learning research 1h ago Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived… 16

arXiv — Machine Learning research 1h ago How Far Can Sharpness and Complexity Jointly Explain Generalization? arXiv:2606.29043v1 Announce Type: new Abstract: Sharpness and complexity are two central factors in the generalization analysis of deep neural networks. Existing quantitative evaluations of generalization measures have largely focused on individual scalar measures, leaving the… 13

arXiv — Machine Learning research 1h ago Few-Step Boltzmann Generators via Scalable Likelihood Flow Maps arXiv:2606.29110v1 Announce Type: new Abstract: Recent progress in flow-based generative modeling has led to models that output high-quality samples while using only a small number of function evaluations. However, at present, there is a lack of similar advances in estimating… 32

arXiv — Machine Learning research 1h ago Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models arXiv:2606.29196v1 Announce Type: new Abstract: Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using… 27

arXiv — Machine Learning research 1h ago Structured Proper Loss Geometries for Multiclass Classification: Theory and Controlled Empirical Evaluation arXiv:2606.29471v1 Announce Type: new Abstract: Strictly proper scoring rules identify the true conditional class distribution at population level, but their curvature can alter optimization and finite-sample behavior. We study three multiclass objectives: a class-aware… 23

arXiv — NLP / Computation & Language research 1h ago SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce… 28

arXiv — NLP / Computation & Language research 1h ago Fine-Tuning General-Purpose Large Language Models for Agricultural Applications: A Reproducible Framework and Evaluation Protocol Based on Qwen3-8B arXiv:2606.28992v1 Announce Type: new Abstract: General-purpose large language models (LLMs) have demonstrated strong abilities in open-domain question answering, information extraction, and text generation. Agricultural applications, however, are domain-specific,… 20

arXiv — NLP / Computation & Language research 1h ago Understanding Evaluation Illusion in Diffusion Large Language Models arXiv:2606.29228v1 Announce Type: new Abstract: Despite the capability of parallel decoding, diffusion large language models (dLLMs) require many denoising steps to maintain generation quality, motivating recent research on efficient decoding strategies. However, existing… 23

arXiv — NLP / Computation & Language research 1h ago Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… 8

arXiv — NLP / Computation & Language research 1h ago Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency arXiv:2606.29876v1 Announce Type: new Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical… 10

arXiv — NLP / Computation & Language research 1h ago MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it… 4

arXiv — NLP / Computation & Language research 1h ago Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios? arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is… 17

TechCrunch — AI news-outlet 16h ago Omen AI’s plan to optimize data centers is all wet Omen AI raised a $31 million Series A to monitor chip coolant and stop bacterial outbreaks in data centers. 8

arXiv — Machine Learning research 1d ago Unified Zero-Shot Time Series Forecasting: A Darts Foundation arXiv:2606.27438v1 Announce Type: new Abstract: Since its initial release in 2020, Darts has become a widely used open-source Python library for time series analysis. A series of foundation models have recently claimed accuracy improvements in zero-shot forecasting, promising a… 15

arXiv — Machine Learning research 1d ago Productionized Fairness Measurement Under Privacy Constraints arXiv:2606.27558v1 Announce Type: new Abstract: Fairness measurements in the form of disaggregated evaluations often rely on demographic signals that are legally constrained or culturally sensitive. Race and ethnicity signals are among the more difficult signals to curate and… 34

arXiv — Machine Learning research 1d ago Quantum Generative Diffusion Model for Real-World Time Series arXiv:2606.27561v1 Announce Type: new Abstract: Generative models have achieved remarkable success in data synthesis, though recent advances driven by increasing model scale have introduced challenges in computational cost and efficiency. Quantum machine learning offers a… 10

arXiv — Machine Learning research 1d ago GNBAN: Graph Neural Basis Attention Networks for Long-Horizon Forecasting over Large Entity Sets arXiv:2606.27863v1 Announce Type: new Abstract: Demand forecasting at the bottom of a retail hierarchy requires predicting tens of thousands of correlated long-horizon series across products, stores, and regions. Modern systems must scale across massive catalogs, capture shared… 33

arXiv — Machine Learning research 1d ago TA-SparseMG: Trend-Aware Sparse Forecasting via Multi-Scale Gating for Long-Term Time Series arXiv:2606.27908v1 Announce Type: new Abstract: Long-term time series forecasting finds extensive applications in domains such as power demand, traffic flow, meteorological observation, and renewable energy dispatch. Forecasting dynamically varying long-term time series poses… 21

arXiv — Machine Learning research 1d ago Benchmarking on Tasks That Matter: Dataset Selection for Preserving Model Rankings arXiv:2606.27997v1 Announce Type: new Abstract: Benchmarks of machine learning models often include many datasets, making evaluation expensive. For efficiency, it is preferable to perform evaluations on small, representative datasets instead. The selection of such subsets… 21

arXiv — Machine Learning research 1d ago COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives arXiv:2606.28194v1 Announce Type: new Abstract: While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on… 18

arXiv — Machine Learning research 1d ago Democratic ICAI: Debating Our Way to Steering Principles from Preferences arXiv:2606.28294v1 Announce Type: new Abstract: Preference-based alignment often struggles to capture the reasoning that underlies human judgments. Many evaluations rely on multiple interacting criteria, yet pairwise labels reveal only the final choice rather than the… 38

arXiv — NLP / Computation & Language research 1d ago Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs arXiv:2606.27378v1 Announce Type: new Abstract: We introduce an axiomatic evaluation framework for latent thought representations in LLMs, comprising metrics that are independent of downstream benchmark scores and reveal representational failures that benchmark accuracy masks.… 29

arXiv — NLP / Computation & Language research 1d ago Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs arXiv:2606.27909v1 Announce Type: new Abstract: Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever… 15

arXiv — NLP / Computation & Language research 1d ago Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA arXiv:2606.28050v1 Announce Type: new Abstract: LLM-as-a-Judge and self-evaluation pipelines implicitly assume that evaluation is easier than generation. We test this in a controlled in-context QA setting where a context passage is the sole information source and each model… 29

arXiv — NLP / Computation & Language research 1d ago Subject-level Inference for Realistic Text Anonymization Evaluation arXiv:2604.21211v2 Announce Type: replace Abstract: Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations,… 6

r/LocalLLaMA community 1d ago DeepSpec - a deepseek-ai Collection DeepSpec DeepSpec is a full-stack codebase for training and evaluating draft models for speculative decoding. It contains data preparation utilities, draft model implementations, training code, and evaluation scripts. Released Checkpoints The checkpoints below are the ones used… 26

r/LocalLLaMA community 2d ago I had 55 LLMs blind-grade each other (22k judgments, all open). Every model family with enough data is biased toward its own siblings. Qwen judges favor Qwen by ~0.9 points. Mistral penalizes its own by ~1.0. I have been running an open evaluation setup where N models answer the same prompt, then blind-grade each other in an N x N matrix with self-judgments excluded. No single privileged judge. So far: 286 evaluations, 198 hand-written questions, 22,254 valid judgments across 55… 35

Hugging Face Daily Papers research 2d ago COrigami: An AI Pipeline for Co-Designing Flat-Foldable Visually Recognisable Origami Abstract A computational origami system generates crease patterns from natural language using AI-driven optimization and aesthetic evaluation, enabling human-AI collaboration in mathematically constrained design. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While generative AI… 11

r/MachineLearning community 2d ago Benchmarking Self-Hosted Gemma 2 9B vs. Frontier APIs: The FP8 Quantization Prefill Tax and VRAM Realities on an NVIDIA L4 [P] When evaluating migrating production LLM workloads off commercial cloud APIs, the conversation usually gets oversimplified into a trade-off between quality and infrastructure cost. To look past clean, isolated averages, I built a repeatable evaluation matrix using a real-world… 29

r/LocalLLaMA community 2d ago Orthrus (diffusion head) trained Qwen 3.5/3.6 and Gemma 4 models are dropping soon "Hi all, we are finalized with our testing and are preparing the release pipeline. We will be releasing support for the Qwen3.5, Qwen3.6, and Gemma4 very soon. Alongside the model checkpoints, we will be open-sourcing our complete end-to-end training and evaluation code. Stay… 19

arXiv — Machine Learning research 4d ago Necessary but Not Sufficient: Temperature Control and Reproducibility in LLM-as-Judge Safety Evaluations arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's… 4

arXiv — Machine Learning research 4d ago The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators arXiv:2606.26294v1 Announce Type: new Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier,… 25

arXiv — Machine Learning research 4d ago EVOM: Agentic Meta-Evolution of Actor-Critic Architectures for Reinforcement Learning arXiv:2606.26327v1 Announce Type: new Abstract: In actor-critic reinforcement learning, network architectures are typically manually designed. Automating this design is challenging because each candidate must be trained before evaluation, and the design space is open-ended. To… 29

arXiv — NLP / Computation & Language research 4d ago DualEval: Joint Model-Item Calibration for Unified LLM Evaluation arXiv:2606.26429v1 Announce Type: cross Abstract: Current LLM evaluation relies on two complementary but often disconnected signals: static benchmarks with objective correctness labels and arena-style preference data that better reflect open-ended user interactions. We introduce… 24

arXiv — Machine Learning research 4d ago Empirical Software Engineering TerraProbe: A Layered-Oracle Framework for Detecting Deceptive Fixes in LLM-Assisted Terraform arXiv:2606.26590v1 Announce Type: new Abstract: Security misconfigurations in Terraform Infrastructure-as-Code are a growing risk in cloud deployments, and large language models are increasingly used as automated repair agents. Existing evaluations often treat a repair as… 5

arXiv — Machine Learning research 4d ago Target-Aware Bandit Allocation for Scalable Surrogate Optimization in Chemical Space arXiv:2606.26657v1 Announce Type: new Abstract: Identifying high-utility candidates from massive discrete spaces under expensive evaluations is a recurring challenge across the sciences, with structure-based drug discovery as a prominent example. While surrogate-based… 20

arXiv — Machine Learning research 4d ago Decision-Aligned Evaluation of Uncertainty Quantification arXiv:2606.26990v1 Announce Type: new Abstract: Uncertainty estimates in machine learning are typically evaluated using generic metrics such as the negative log-likelihood and expected calibration error, yet good performance on such metrics does not necessarily imply high… 13

arXiv — NLP / Computation & Language research 4d ago Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models arXiv:2606.26101v1 Announce Type: new Abstract: Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a… 21

arXiv — NLP / Computation & Language research 4d ago From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's… 12

arXiv — NLP / Computation & Language research 4d ago ProfileFoundry: A Synthetic Person-Object Substrate for Privacy, Memory, and Tool-Use Evaluation in LLM Agent arXiv:2606.26403v1 Announce Type: new Abstract: Foundation-model research increasingly needs data about people: user state, personal histories, relationships, contact-like fields, documents, and longitudinal updates. Real user data is difficult to share, perturb, audit, or… 34

arXiv — NLP / Computation & Language research 4d ago Evaluation Pitfalls and Challenges in Multimedia Event Extraction arXiv:2606.26775v1 Announce Type: new Abstract: Multimedia event extraction aims to jointly identify events and their arguments across multiple modalities, such as text and images, to support more comprehensive event understanding. While recent work reports steady and… 15

arXiv — NLP / Computation & Language research 4d ago Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While… 36

arXiv — NLP / Computation & Language research 4d ago Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents arXiv:2606.26479v1 Announce Type: cross Abstract: Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model… 38

arXiv — NLP / Computation & Language research 4d ago Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against… 18

arXiv — NLP / Computation & Language research 4d ago Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement arXiv:2606.27226v1 Announce Type: cross Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores… 14

Hugging Face Daily Papers research 4d ago GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents Abstract Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a… 7

Hugging Face Daily Papers research 4d ago Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are… 23

TechCrunch — AI news-outlet 4d ago General Intuition’s $2.3B bet that video games can train AI agents for the real world General Intuition has raised $320 million to scale AI trained on millions of hours of gameplay, betting action data can help AI develop something closer to human intuition. 25

TechCrunch — AI news-outlet 4d ago Netris raises $15M Series A from a16z to help AI neoclouds go live faster Netris provides software that runs on network switches, and offers a platform that helps neocloud operators reduce the time it takes to go live. 36

Hugging Face Daily Papers research 4d ago CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression Abstract Two-channel evaluation shows output compression reduces costs while input compression increases costs and degrades accuracy across models and datasets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct "Talk short. Drop grammar. Save token." This caveman style is widely… 28