Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 1mo ago

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

arXiv:2605.29397v1 Announce Type: new Abstract: HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is…

35
arXiv — NLP / Computation & Language research 1mo ago

Comparative Evaluation of Machine Translation Systems on Images with Text

arXiv:2605.29476v1 Announce Type: new Abstract: This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study…

7
arXiv — NLP / Computation & Language research 1mo ago

PhoneWorld: Scaling Phone-Use Agent Environments

arXiv:2605.29486v1 Announce Type: new Abstract: A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but…

28
arXiv — NLP / Computation & Language research 1mo ago

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

arXiv:2605.29555v1 Announce Type: new Abstract: As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a…

31
arXiv — NLP / Computation & Language research 1mo ago

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

arXiv:2605.29585v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the…

27
arXiv — NLP / Computation & Language research 1mo ago

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

arXiv:2605.29667v1 Announce Type: new Abstract: When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries,…

9
arXiv — NLP / Computation & Language research 1mo ago

Personalized Turn-Level User Conversation Satisfaction Benchmark

arXiv:2605.29711v1 Announce Type: new Abstract: User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation…

9
arXiv — NLP / Computation & Language research 1mo ago

Metric-Dependent Annotation Saturation for Learning from Label Distributions

arXiv:2605.29797v1 Announce Type: new Abstract: When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from…

37
arXiv — NLP / Computation & Language research 1mo ago

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

arXiv:2605.29800v1 Announce Type: new Abstract: LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how…

35
r/LocalLLaMA community 1mo ago

llama.cpp B9387 Significant AMD/ROCm PP Update

https://github.com/ggml-org/llama.cpp/releases/tag/b9387 MFMA is restricted to the AMD CDNA architecture, that is, the MI100, MI200, and MI300 series datacenter cards. Post your initial results if you try it! Submitted by /u/Bulky-Priority6824

38
The Information — AI news-outlet 1mo ago

Base Power in Talks to Raise Funds at $12 Billion Valuation

Base Power, a three-year-old home-battery startup, is in talks to raise funds at a $12 billion valuation, according to a person with knowledge of the discussions. Ribbit Capital, which backed Base Power’s last funding round, has been in talks to lead the current round, according…

17
OpenAI official-blog 1mo ago

A shared playbook for trustworthy third party evaluations

OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.

22
r/MachineLearning community 1mo ago

Social Simulation with LLMs - Fidelity in Applications (CFP @ COLM'26) [R]

🌟 Announcing the 2nd Workshop on Social Simulation with LLMs (Social Sim'26) @ COLM 📣 Welcoming Submissions! Submission here:. 🗓️ Deadline: June 23, 2026 (AoE) This year's theme is "Fidelity in Applications", moving beyond compelling demos toward evaluation, robustness,…

11
The Information — AI news-outlet 1mo ago

The AI Boom’s Pricey Middle

Baseten’s talks to raise fresh funding at an $11 billion valuation are the latest sign that investors are betting the messy work of helping developers run AI models can become one of the next big businesses in AI. That boom has lifted a group of companies including Baseten,…

27
The Information — AI news-outlet 1mo ago

Anthropic Releases New Flagship AI Model

Anthropic on Thursday announced its new flagship AI model, Claude Opus 4.8, which showed improvements in standardized AI performance evaluations in coding, financial analysis and other fields. The company also said the model is more likely to flag uncertainties about its work…

22
The Information — AI news-outlet 1mo ago

Anthropic Raises $65 Billion at $900 Billion Valuation; Micron, Samsung Invest

Anthropic said Thursday it had raised $65 billion at a valuation of $900 billion before the financing, more than double the valuation in a round closed three months earlier. New investors Micron, Samsung and SK Hynix, which make a key component of AI chips, are investing in the…

5
TechCrunch — AI news-outlet 1mo ago

Anthropic raises $65 Billion, nears $1T valuation ahead of IPO

Anthropic has closed a $65 billion Series H round at a $965 billion post-money valuation, marking what could be the AI startup's final private fundraise before a highly anticipated IPO.

14
Hacker News — AI on Front Page community 1mo ago

Anthropic raises $65B in Series H funding at $965B post-money valuation

Article URL: https://www.anthropic.com/news/series-h Comments URL: https://news.ycombinator.com/item?id=48313048 Points: 273 # Comments: 278

24
r/MachineLearning community 1mo ago

Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]

Wall-OSS-0.5 is a new 4B VLA release from X Square Robot, built on a 3B VLM backbone with action experts in a Mixture-of-Transformers layout. What caught my eye is that the report evaluates the pretrained checkpoint on real robots before task-specific fine tuning, instead of…

25
r/LocalLLaMA community 1mo ago

Qwen/Qwen-Image-Bench · Hugging Face

Model Description Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality criteria organized in a 3-level hierarchy…

8
Latent.Space news-outlet 1mo ago

[AINews] Cognition raises $1B in $26B Series D

coding is an uncapped TAM market

13
Smol AI News news-outlet 1mo ago

Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows

Anthropic announced a massive $65B Series H financing at a $965B valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with run-rate revenue surpassing $47B. They launched Claude Opus 4.8, an update to Opus 4.7 featuring "sharper judgment,"…

28
arXiv — Machine Learning research 1mo ago

A Simple State Space Model Excels at Multivariate Time Series Classification

arXiv:2605.27406v1 Announce Type: new Abstract: Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at…

30
arXiv — Machine Learning research 1mo ago

Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation

arXiv:2605.27486v1 Announce Type: new Abstract: Federated learning (FL) has broadened the horizon for multivariate time series anomaly detection (MTSAD). However, benchmarking such anomaly detection methods within FL paradigm poses data-centric challenges. The existing datasets…

28
arXiv — Machine Learning research 1mo ago

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving

arXiv:2605.27763v1 Announce Type: new Abstract: Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized…

17
arXiv — Machine Learning research 1mo ago

Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection

arXiv:2605.27992v1 Announce Type: new Abstract: Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational…

11
arXiv — Machine Learning research 1mo ago

Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector

arXiv:2605.28103v1 Announce Type: new Abstract: We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors -- spanning statistical, reconstruction, association, frequency, and…

4
arXiv — Machine Learning research 1mo ago

Refining Multidimensional Video Reward Models via Disentangled Influence Functions

arXiv:2605.28203v1 Announce Type: new Abstract: As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional…

14
arXiv — NLP / Computation & Language research 1mo ago

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

arXiv:2605.27393v1 Announce Type: new Abstract: Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce…

7
arXiv — NLP / Computation & Language research 1mo ago

Disentangling Language Roles in Multilingual LLM Task Execution

arXiv:2605.27649v1 Announce Type: new Abstract: Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate…

28
arXiv — NLP / Computation & Language research 1mo ago

ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

arXiv:2605.27709v1 Announce Type: new Abstract: Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning…

37
arXiv — NLP / Computation & Language research 1mo ago

ChildEval: When large language models meet children's personalities

arXiv:2605.27805v1 Announce Type: new Abstract: While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a…

20
arXiv — NLP / Computation & Language research 1mo ago

GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

arXiv:2605.27866v1 Announce Type: new Abstract: Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for…

35
arXiv — NLP / Computation & Language research 1mo ago

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

arXiv:2605.27882v1 Announce Type: new Abstract: LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified…

38
arXiv — NLP / Computation & Language research 1mo ago

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

arXiv:2605.27914v1 Announce Type: new Abstract: Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge…

7
arXiv — NLP / Computation & Language research 1mo ago

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

arXiv:2605.27984v1 Announce Type: new Abstract: Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment…

10
arXiv — NLP / Computation & Language research 1mo ago

Auditing Stance Asymmetry in Generative Explanations

arXiv:2605.27988v1 Announce Type: new Abstract: Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations…

22
arXiv — NLP / Computation & Language research 1mo ago

KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

arXiv:2605.28013v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations:…

18
arXiv — NLP / Computation & Language research 1mo ago

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

arXiv:2605.28020v1 Announce Type: new Abstract: With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token…

29
arXiv — NLP / Computation & Language research 1mo ago

ATLAS: All-round Testing of Long-context Abilities across Scales

arXiv:2605.28079v1 Announce Type: new Abstract: Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and…

4
arXiv — NLP / Computation & Language research 1mo ago

Chinese Word Boundary Recovery through Character Alignment Projection

arXiv:2605.28128v1 Announce Type: new Abstract: Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper…

30
arXiv — NLP / Computation & Language research 1mo ago

Why We Need Speech to Evaluate Speech Translation

arXiv:2605.28227v1 Announce Type: new Abstract: Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and…

35
arXiv — NLP / Computation & Language research 1mo ago

Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

arXiv:2605.28313v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs…

38
r/MachineLearning community 1mo ago

BEAM 100K memory benchmark: CSM vs Hindsight local artifact comparison [R]

I’m looking for feedback on a local agent-memory benchmark comparison, especially from people who care about evaluation methodology. I built an open-source R&D memory system called Context Swarm Memory…

31
The Information — AI news-outlet 1mo ago

Coding Startup Cognition Raises $1 Billion at a $26 Billion Valuation

Coding startup Cognition has raised more than $1 billion in a funding round that valued the company at $26 billion including the investment, the company said in a blog post. That’s nearly double its valuation from its last fundraise, which valued the three-year-old company at…

11
Hugging Face Daily Papers research 1mo ago

FastKernels: Benchmarking GPU Kernel Generation in Production

Abstract FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment. AI-generated…

34
TechCrunch — AI news-outlet 1mo ago

AI coding startup Cognition raises $1B at $25B pre-money valuation

As Cognition reaches $492 million in annualized revenue run rate, it more than doubled its valuation in eight months, it says.

15
Hugging Face Daily Papers research 1mo ago

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Abstract A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency. AI-generated summary Social…

30
Hugging Face Daily Papers research 1mo ago

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Abstract A skill-centric agent framework enables continuous improvement of task-solving capabilities through a unified lifecycle of skill creation, memory, management, evaluation, and refinement. AI-generated summary Large language model (LLM) agents rely on reusable skills to…

21
Hugging Face Daily Papers research 1mo ago

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract Agentic CLEAR is an automatic evaluation framework that provides multi-level textual insights into agent behavior through dynamic analysis of LLM interactions across various benchmarks and settings. AI-generated summary Agentic systems are becoming more capable: agents…

19
