Tag

Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 25d ago

A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

arXiv:2606.06420v1 Announce Type: new Abstract: We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs…

38
The Information — AI news-outlet 25d ago

Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation

Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, a level that would make it one of the most valuable privately held data center operators, The Information reported late Thursday. Brookfield Asset Management, KKR and…

28
The Information — AI news-outlet 25d ago

Data Center Developer Switch in Talks to Raise Billions at $50 Billion-Plus Valuation

Data center developer Switch is in talks to raise billions of dollars at a valuation of at least $50 billion, as it seeks to capitalize on soaring demand for the infrastructure needed to support artificial intelligence, according to people with knowledge of the deal. Brookfield…

34
Hugging Face Daily Papers research 25d ago

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Abstract Large language models exhibit surface-level human-like risk decisions in the St. Petersburg game without consistent human-like decision-making mechanisms, highlighting the need for deeper analysis beyond outcome similarity in high-stakes evaluations. Generated by…

7
r/LocalLLaMA community 25d ago

I Built a Practical Guide to LLM Engineering: RAG, Retrieval, Rerankers, and Evaluation

If you’re building LLM apps and feel confused about when to use keyword search, embeddings, rerankers, or vector databases, this repo is for that. I built a docs-first repo on practical LLM system design patterns, covering pre-filtering, hybrid retrieval, rerankers, in-memory…

23
The Information — AI news-outlet 25d ago

Fusion Startup Helion Nearly Triples Valuation to $15.5 Billion in Thrive-led Round

Helion Energy, a nuclear fusion startup backed by OpenAI’s Sam Altman, still has to prove it can produce electricity to serve data centers and other customers. But investors seem confident it can deliver. The Everett, Wash.–based company said it has raised $465 million in…

33
Hugging Face Daily Papers research 25d ago

PaintBench: Deterministic Evaluation of Precise Visual Editing

Abstract PaintBench presents a scalable benchmark for precise visual editing tasks, revealing low performance across models and identifying key challenges in geometric transformations and structural manipulations. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) While current…

12
Hugging Face Daily Papers research 26d ago

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Abstract Production-grounded evaluation framework RAMP assesses long-horizon software engineering agents through realistic compiler construction workloads and runtime analysis. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents are rapidly evolving from coding assistants…

21
arXiv — Machine Learning research 26d ago

TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection

arXiv:2606.04073v1 Announce Type: new Abstract: This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided Anomaly Detection, TPA-AD) for axle-box bearing time-series…

33
arXiv — Machine Learning research 26d ago

Variance Reduction for Heavy-Tailed Monetization Metrics in Ranking Experiments via Post-Stratification

arXiv:2606.04110v1 Announce Type: new Abstract: Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed, with a small fraction of users dominating both…

18
arXiv — Machine Learning research 26d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…

8
arXiv — Machine Learning research 26d ago

(Mis)generalization of Helpful-only Fine-tuning

arXiv:2606.04413v1 Announce Type: new Abstract: Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the…

34
arXiv — Machine Learning research 26d ago

RowNet: A Memory Transformer for Tabular Regression

arXiv:2606.04445v1 Announce Type: new Abstract: Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effects, nonlinear interactions, and the practical logic of comparable properties. Standard…

35
arXiv — Machine Learning research 26d ago

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

arXiv:2606.04767v1 Announce Type: new Abstract: The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric…

15
arXiv — NLP / Computation & Language research 26d ago

SANE Schema-aware Natural-language Evaluation of Biological Data

arXiv:2606.04500v1 Announce Type: new Abstract: High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a…

23
arXiv — NLP / Computation & Language research 26d ago

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv:2606.04507v1 Announce Type: new Abstract: Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report…

32
arXiv — NLP / Computation & Language research 26d ago

GENEB: Why Genomic Models Are Hard to Compare

arXiv:2606.04525v1 Announce Type: new Abstract: Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not…

26
arXiv — NLP / Computation & Language research 26d ago

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

arXiv:2606.04596v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the…

34
arXiv — NLP / Computation & Language research 26d ago

LifeSide: Benchmarking Agents as Lifelong Digital Companions

arXiv:2606.04660v1 Announce Type: new Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and…

16
arXiv — NLP / Computation & Language research 26d ago

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success,…

37
arXiv — NLP / Computation & Language research 26d ago

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

arXiv:2606.05122v1 Announce Type: new Abstract: Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training:…

22
arXiv — NLP / Computation & Language research 26d ago

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

arXiv:2606.04455v1 Announce Type: cross Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We…

13
Hugging Face Daily Papers research 26d ago

OpenSTBench: Beyond Semantic Evaluation for Speech Translation

Abstract OpenSTBench presents a unified evaluation framework for speech translation systems that assesses multiple dimensions including translation quality, speech quality, and temporal consistency across different modalities and settings. Generated by…

15
Hugging Face Daily Papers research 26d ago

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Abstract WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection…

38
Hugging Face Daily Papers research 26d ago

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Abstract Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.…

35
Hugging Face Daily Papers research 26d ago

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Abstract Deep-research agents can be audited using a claim-centric framework that identifies error spans in their reasoning trajectories, improving reliability assessment beyond just final answer evaluation. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Deep-research agents solve…

20
Hugging Face Daily Papers research 26d ago

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Abstract Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence…

28
The Information — AI news-outlet 26d ago

SpaceX Sets $135 Per Share Target Price for IPO

SpaceX plans to sell 555.6 million shares at an expected price of $135 each when it goes public, the company said in a securities filing on Wednesday. SpaceX is expecting to raise $75 billion, more than twice as much as any other IPO in history, at a valuation of $1.77 trillion.…

21
Hugging Face Daily Papers research 26d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Abstract Researchers identify a perceptual judgment bias in multimodal large language models where visual evidence is overlooked for textual plausibility, and propose a training framework using a perturbed dataset and reward modeling to improve perceptual fidelity and evaluation…

18
arXiv — Machine Learning research 27d ago

Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection

arXiv:2606.02601v1 Announce Type: new Abstract: Within-dataset class-split evaluation is widely used as a proxy for fully unconditional out-of-distribution anomaly detection. We show that this protocol can become ill-posed when the held-out anomaly class overlaps the normal…

12
arXiv — Machine Learning research 27d ago

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

arXiv:2606.02670v1 Announce Type: new Abstract: Many recent multivariate time series anomaly detection (MT-SAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this…

30
arXiv — Machine Learning research 27d ago

A Systematic Evaluation of Current Architectures in Wind Power Forecasting

arXiv:2606.02849v1 Announce Type: new Abstract: Interval wind speed forecasting is essential for the efficient integration of wind energy into power systems, as it accounts for the inherent uncertainty of wind resources. This study presents a systematic literature review focused…

29
arXiv — Machine Learning research 27d ago

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation…

27
arXiv — Machine Learning research 27d ago

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings

arXiv:2606.03365v1 Announce Type: new Abstract: Embedding models (KGEMs) constitute the main link prediction approach to complete knowledge graphs. Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@K, but usually overlook the influence of random…

26
arXiv — NLP / Computation & Language research 27d ago

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

arXiv:2606.03198v1 Announce Type: new Abstract: Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap…

17
arXiv — NLP / Computation & Language research 27d ago

Sample-Size Scaling of the African Languages NLI Evaluation

arXiv:2606.03219v1 Announce Type: new Abstract: African languages have very little labelled data, and it is unclear if augmenting the quantity of annotation data reliably enhances downstream performance. The study is a systematic sample-size scaling study of natural language…

35
arXiv — NLP / Computation & Language research 27d ago

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

arXiv:2606.03220v1 Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task…

16
arXiv — NLP / Computation & Language research 27d ago

Benchmarking Speech-to-Speech Translation Models

arXiv:2606.03241v1 Announce Type: new Abstract: Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and…

5
arXiv — NLP / Computation & Language research 27d ago

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

arXiv:2606.03301v1 Announce Type: new Abstract: We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by…

33
arXiv — NLP / Computation & Language research 27d ago

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

arXiv:2606.03318v1 Announce Type: new Abstract: Despite great advances in tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated idealized user assumptions…

6
Hugging Face Daily Papers research 27d ago

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

Abstract OmniDreams, a foundation generative world model trained from the Cosmos diffusion model, enables real-time action-conditioned video generation for autonomous driving policy evaluation in complex, unseen scenarios. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) As…

23
TechCrunch — AI news-outlet 27d ago

Cyera eyes $12B valuation at 80x ARR multiple despite operating losses

The cybersecurity company is nearing a $300 million round led by Evolution Equity Partners.

34
TechCrunch — AI news-outlet 27d ago

New Microsoft tool lets devs spin up AI behavior tests using text descriptions

Microsoft on Tuesday took the wraps off Adaptive Spec-driven Scoring for Evaluation and Regression Testing, an open source framework for spinning up AI evaluations.

38
The Information — AI news-outlet 27d ago

SpaceX Targets $1.75 Trillion Valuation Ahead of IPO Roadshow

SpaceX is seeking a valuation of $1.75 trillion in its initial public offering next week, including additional shares the underwriting banks could sell if investor demand is strong, according to people familiar with the matter. The latest valuation target solidifies earlier…

19
Hugging Face Daily Papers research 27d ago

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Abstract Retrieval-augmented generation systems exhibit source-dependent responses to identical queries, necessitating a shift from traditional correctness evaluation to analyzing inter-source relationships for multi-source NLP systems. Generated by…

17
Hugging Face Daily Papers research 27d ago

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Abstract A unified video-action world model integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone for robotic manipulation tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Robotic manipulation requires models that generate…

22
Hugging Face Daily Papers research 28d ago

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

Abstract ChartArena presents a comprehensive bilingual benchmark for chart parsing that evaluates models across diverse chart types and visual conditions while providing a unified evaluation framework for fair comparison. (AI-generated summary.) Charts are a primary medium for…

8
Hugging Face Daily Papers research 28d ago

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Abstract StressDream enhances video world models by steering diffusion-based imaginations toward high-impact yet plausible outcomes through optimized noise initialization with semantic and plausibility objectives. (AI-generated summary.) Video world models (WMs) have shown promise…

18
arXiv — Machine Learning research 28d ago

DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction

arXiv:2606.00690v1 Announce Type: new Abstract: Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and…

11
arXiv — NLP / Computation & Language research 28d ago

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

arXiv:2606.00021v1 Announce Type: new Abstract: Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation…

17
