Tag: Funding · 500 articles archived under #funding
arXiv — NLP / Computation & Language research 25d ago A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation arXiv:2606.06420v1 Announce Type: new Abstract: We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs… 38 The Information — AI news-outlet 25d ago Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, a level that would make it one of the most valuable privately held data center operators, The Information reported late Thursday. Brookfield Asset Management, KKR and… 28 The Information — AI news-outlet 25d ago Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, as it seeks to capitalize on soaring demand for the infrastructure needed to support artificial intelligence, according to people with knowledge of the deal. Brookfield… 34 Hugging Face Daily Papers research 25d ago Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game Abstract Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations.
Generated by… 7 r/LocalLLaMA community 25d ago I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that. I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, hybrid retrieval, rerankers, in-memory… 23 The Information — AI news-outlet 25d ago Fusion Startup Helion Nearly Triples Valuation to $15.5 Billion in Thrive-led Round Helion Energy, a nuclear fusion startup backed by OpenAI’s Sam Altman, still has to prove it can produce electricity to serve data centers and other customers. But investors seem confident it can deliver. The Everett, Wash.–based company said it has raised $465 million in… 33 Hugging Face Daily Papers research 25d ago PaintBench: Deterministic Evaluation of Precise Visual Editing Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct While current… 12 Hugging Face Daily Papers research 26d ago Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct LLM agents are rapidly evolving from coding assistants… 21 arXiv — Machine Learning research 26d ago TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection arXiv:2606.04073v1 Announce Type: new Abstract: This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series… 33 arXiv — Machine Learning research 26d ago Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification arXiv:2606.04110v1 Announce Type: new Abstract: Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both… 18 arXiv — Machine Learning research 26d ago KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not… 8 arXiv — Machine Learning research 26d ago (Mis)generalization of Helpful-only Fine-tuning arXiv:2606.04413v1 Announce Type: new Abstract: Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle.
Little is known about the… 34 arXiv — Machine Learning research 26d ago RowNet: A Memory Transformer for Tabular Regression arXiv:2606.04445v1 Announce Type: new Abstract: Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard… 35 arXiv — Machine Learning research 26d ago Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms arXiv:2606.04767v1 Announce Type: new Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric… 15 arXiv — NLP / Computation & Language research 26d ago SANE Schema-aware Natural-language Evaluation of Biological Data arXiv:2606.04500v1 Announce Type: new Abstract: High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a… 23 arXiv — NLP / Computation & Language research 26d ago Self-Evolving Deep Research via Joint Generation and Evaluation arXiv:2606.04507v1 Announce Type: new Abstract: Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report… 32 arXiv — NLP / Computation & Language research 26d ago GENEB: Why Genomic Models Are Hard to Compare arXiv:2606.04525v1 Announce Type: new Abstract: Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. 
As a result, claims of superiority or generality across models are often not… 26 arXiv — NLP / Computation & Language research 26d ago A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs arXiv:2606.04596v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the… 34 arXiv — NLP / Computation & Language research 26d ago LifeSide: Benchmarking Agents as Lifelong Digital Companions arXiv:2606.04660v1 Announce Type: new Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and… 16 arXiv — NLP / Computation & Language research 26d ago Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success,… 37 arXiv — NLP / Computation & Language research 26d ago Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data arXiv:2606.05122v1 Announce Type: new Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training:… 22 arXiv — NLP / Computation & Language research 26d ago The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? 
arXiv:2606.04455v1 Announce Type: cross Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We… 13 Hugging Face Daily Papers research 26d ago OpenSTBench: Beyond Semantic Evaluation for Speech Translation Abstract OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings. Generated by… 15 Hugging Face Daily Papers research 26d ago WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts Abstract WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection… 38 Hugging Face Daily Papers research 26d ago M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks Abstract Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.… 35 Hugging Face Daily Papers research 26d ago Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Abstract Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep-research agents solve… 20 Hugging Face Daily Papers research 26d ago Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching Abstract Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence… 28 The Information — AI news-outlet 26d ago SpaceX Sets $135 Per Share Target Price for IPO SpaceX plans to sell 555.6 million shares at an expected price of $135 each when it goes public, the company said in a securities filing on Wednesday. SpaceX is expecting to raise $75 billion, more than twice as much as any other IPO in history, at a valuation of $1.77 trillion.… 21 Hugging Face Daily Papers research 26d ago Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling Abstract Researchers identify a perceptual judgment bias in multimodal large language models where visual evidence is overlooked for textual plausibility, and propose a training framework using a perturbed dataset and reward modeling to improve perceptual fidelity and evaluation… 18 arXiv — Machine Learning research 27d ago Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection arXiv:2606.02601v1 Announce Type: new Abstract: Within-dataset class-split evaluation is widely used as a proxy for fully unconditional out-of-distribution anomaly detection. 
We show that this protocol can become ill-posed when the held-out anomaly class overlaps the normal… 12 arXiv — Machine Learning research 27d ago Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate arXiv:2606.02670v1 Announce Type: new Abstract: Many recent multivariate time series anomaly detection (MT-SAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this… 30 arXiv — Machine Learning research 27d ago A Systematic Evaluation of Current Architectures in Wind Power Forecasting arXiv:2606.02849v1 Announce Type: new Abstract: Interval wind speed forecasting is essential for the efficient integration of wind energy into power systems, as it accounts for the inherent uncertainty of wind resources. This study presents a systematic literature review focused… 29 arXiv — Machine Learning research 27d ago Gate AI: LLM Security Benchmark Evaluation Methodology and Results arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation… 27 arXiv — Machine Learning research 27d ago Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings arXiv:2606.03365v1 Announce Type: new Abstract: Embedding models (KGEMs) constitute the main link prediction approach to complete knowledge graphs. 
Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@$K$, but usually overlook the influence of random… 26 arXiv — NLP / Computation & Language research 27d ago AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making arXiv:2606.03198v1 Announce Type: new Abstract: Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap… 17 arXiv — NLP / Computation & Language research 27d ago Sample-Size Scaling of the African Languages NLI Evaluation arXiv:2606.03219v1 Announce Type: new Abstract: African languages have very little labelled data, and it is unclear if augmenting the quantity of annotation data reliably enhances downstream performance. The study is a systematic sample-size scaling study of natural language… 35 arXiv — NLP / Computation & Language research 27d ago WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts arXiv:2606.03220v1 Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task… 16 arXiv — NLP / Computation & Language research 27d ago Benchmarking Speech-to-Speech Translation Models arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. 
We introduce COMPASS, a unified and… 5 arXiv — NLP / Computation & Language research 27d ago SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series arXiv:2606.03301v1 Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by… 33 arXiv — NLP / Computation & Language research 27d ago Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions… 6 Hugging Face Daily Papers research 27d ago NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As… 23 TechCrunch — AI news-outlet 27d ago Cyera eyes $12B valuation at 80x ARR multiple despite operating losses The cybersecurity company is nearing a $300 million round led by Evolution Equity Partners. 34 TechCrunch — AI news-outlet 27d ago New Microsoft tool lets devs spin up AI behavior tests using text descriptions Microsoft on Tuesday took the wraps off Adaptive Spec-driven Scoring for Evaluation and Regression Testing, an open source framework for spinning up AI evaluations. 
38 The Information — AI news-outlet 27d ago SpaceX Targets $1.75 Trillion Valuation Ahead of IPO Roadshow SpaceX is seeking a valuation of $1.75 trillion in its initial public offering next week, including additional shares the underwriting banks could sell if investor demand is strong, according to people familiar with the matter. The latest valuation target solidifies earlier… 19 Hugging Face Daily Papers research 27d ago Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG Abstract Retrieval-augmented generation systems exhibit source-dependent responses to identical queries, necessitating a shift from traditional correctness evaluation to analyzing inter-source relationships for multi-source NLP systems. Generated by… 17 Hugging Face Daily Papers research 27d ago τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation Abstract A unified video-action world model integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone for robotic manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Robotic manipulation requires models that generate… 22 Hugging Face Daily Papers research 28d ago ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats Abstract ChartArena presents a comprehensive bilingual benchmark for chart parsing that evaluates models across diverse chart types and visual conditions while providing a unified evaluation framework for fair comparison. AI-generated summary Charts are a primary medium for… 8 Hugging Face Daily Papers research 28d ago StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement Abstract StressDream enhances video world models by steering diffusion-based imaginations toward high-impact yet plausible outcomes through optimized noise initialization with semantic and plausibility objectives. 
AI-generated summary Video world models (WMs) have shown promise… 18 arXiv — Machine Learning research 28d ago DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction arXiv:2606.00690v1 Announce Type: new Abstract: Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and… 11 arXiv — NLP / Computation & Language research 28d ago SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding arXiv:2606.00021v1 Announce Type: new Abstract: Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation… 17