Tag

Funding

500 articles archived under #funding

arXiv — NLP / Computation & Language research 28d ago

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

arXiv:2606.00027v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a…

37
arXiv — NLP / Computation & Language research 28d ago

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

arXiv:2606.00093v1 Announce Type: new Abstract: Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $\kappa$, and one or more rank correlations. A survey of 24 recent LLM-as-judge…

33
arXiv — NLP / Computation & Language research 28d ago

RealityTest: How People Probe AI Identity and Whether Models Disclose It

arXiv:2606.00168v1 Announce Type: new Abstract: AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of…

24
arXiv — NLP / Computation & Language research 28d ago

Toward Responsible and Epistemically Grounded Multilingual LLMs for Computational Social Science and Humanities

arXiv:2606.00596v1 Announce Type: new Abstract: Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in…

9
arXiv — NLP / Computation & Language research 28d ago

IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

arXiv:2606.00875v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior…

25
arXiv — NLP / Computation & Language research 28d ago

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

arXiv:2606.00881v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size…

38
arXiv — NLP / Computation & Language research 28d ago

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

arXiv:2606.01016v1 Announce Type: new Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a…

19
arXiv — NLP / Computation & Language research 28d ago

Child-directed speech facilitates production, not comprehension, in BabyLMs

arXiv:2606.01045v1 Announce Type: new Abstract: Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of…

26
arXiv — NLP / Computation & Language research 28d ago

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages

arXiv:2606.01260v1 Announce Type: new Abstract: Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and…

21
arXiv — NLP / Computation & Language research 28d ago

TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

arXiv:2606.01322v1 Announce Type: new Abstract: Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven…

19
Hugging Face Daily Papers research 28d ago

Brain-IT-VQA: From Brain Signals to Answers

Abstract: Brain-IT-VQA framework decodes visual content from fMRI signals using transformer-based architecture and introduces NSD-VQA dataset for improved visual question answering evaluation. AI-generated summary: Decoding visual content from fMRI signals recorded while a person…

21
The Information — AI news-outlet 28d ago

AI Evaluators Struggle with Models That Know When They’re Being Tested

AI researchers are starting to make progress on a confounding problem: AI models are getting better at telling when they are in an evaluation. That could become a problem for AI companies that use evaluations to gauge the capabilities and behaviors of their models before…

37
arXiv — Machine Learning research 29d ago

Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

arXiv:2605.30376v1 Announce Type: new Abstract: Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore critical inter-channel dependencies, while channel-dependent models are expressive but…

15
arXiv — Machine Learning research 29d ago

A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI

arXiv:2605.30388v1 Announce Type: new Abstract: This paper introduces a new systematic framework for detecting anomalies in maritime Automatic Identification System (AIS) datasets. These anomalies include abnormal vessel behaviours related to speed, position jumps, time gaps,…

22
arXiv — Machine Learning research 29d ago

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

arXiv:2605.30393v1 Announce Type: new Abstract: Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than out-of-sample skill. We introduce NumLeak, a measurement framework that combines API-boundary…

25
arXiv — Machine Learning research 29d ago

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

arXiv:2605.30514v1 Announce Type: new Abstract: Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This…

10
arXiv — Machine Learning research 29d ago

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

arXiv:2605.30590v1 Announce Type: new Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other…

23
arXiv — Machine Learning research 29d ago

Conformal Reliability: A New Evaluation Metric for Conditional Generation

arXiv:2605.30807v1 Announce Type: new Abstract: Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is…

8
arXiv — Machine Learning research 29d ago

GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring

arXiv:2605.30865v1 Announce Type: new Abstract: Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving…

31
arXiv — NLP / Computation & Language research 29d ago

Refining Word-Based Grammatical Error Annotation for L2 Korean

arXiv:2605.30545v1 Announce Type: new Abstract: Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they…

10
arXiv — NLP / Computation & Language research 29d ago

Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge

arXiv:2605.30568v1 Announce Type: new Abstract: LLM-as-a-Judge is a scalable alternative to human evaluation, yet existing rubric-based methods rely on human-annotated data such as reference answers or expert-crafted rubrics. We propose to automatically generate fine-grained…

37
arXiv — NLP / Computation & Language research 29d ago

TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation

arXiv:2605.30673v1 Announce Type: new Abstract: Classroom videos contain observable teaching practices, but their pedagogical and visual signals are rarely organized in forms suitable for model evaluation. We present TeachObs, a human-validated benchmark for multimodal…

26
arXiv — NLP / Computation & Language research 29d ago

Pairwise Reference Alignment as a Model-Level Ordinal Observable

arXiv:2605.30758v1 Announce Type: new Abstract: Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference…

18
arXiv — NLP / Computation & Language research 29d ago

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

arXiv:2605.31351v1 Announce Type: new Abstract: AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general…

30
arXiv — NLP / Computation & Language research 29d ago

LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories

arXiv:2605.31381v1 Announce Type: new Abstract: We evaluate the consistency of automated judges in conducting a multi-dimensional safety evaluation in a reference-free setup. Our results indicate that Large Language Models are unreliable judges in identifying safety issues…

36
arXiv — NLP / Computation & Language research 29d ago

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

arXiv:2605.31483v1 Announce Type: new Abstract: Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination…

20
Hugging Face Daily Papers research 29d ago

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Abstract: Swanbench-Speech addresses the lack of comprehensive long-form speech evaluation by providing a benchmark with diverse scenarios, multi-dimensional metrics, and insights into model limitations. AI-generated summary: Recent advances in speech generation have enabled…

5
Hugging Face Daily Papers research 29d ago

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Abstract: OpenSkillEval is an automatic evaluation framework that assesses skill-augmented agent systems and skills across diverse real-world applications, revealing that skill availability doesn't guarantee effective usage and that performance benefits depend heavily on model…

31
r/LocalLLaMA community 1mo ago

PolyRange: Contamination-resistant offensive-AI benchmark for web targets (that ain't a benchmark, THAT's a benchmark)

Author here. The short version of why I built this: Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in real-world environments such as defensive…

6
r/MachineLearning community 1mo ago

Bayesian Opt. GPs vs Linear models and Neural Networks for parameter optimizations [R]

Hi, Relatively new to deep learning. I wanted some opinions on which of these approaches might be best for time series data and spectral analysis. I currently use a GP and it works pretty well, but I’m wondering what the computational tradeoffs and so forth might be. Any ideas?…

4
Hacker News — AI on Front Page community 1mo ago

OpenRouter raises $113M Series B

Article URL: https://openrouter.ai/announcements/series-b Comments URL: https://news.ycombinator.com/item?id=48338660 Points: 242 # Comments: 110

4
TechCrunch — AI news-outlet 1mo ago

The groupthink boom: what 3 top VCs really think about the AI frenzy

"If you're 22 years old in San Francisco and building something in AI, there may be a seed term sheet in your inbox — but if you're 19, oh my God, this means you're really good; you might already have a Series A [offer]," said one, half-kiddingly.

12
r/LocalLLaMA community 1mo ago

Gryphe/Pantheon-Reasoning-27B · Hugging Face

from Gryphe: An experiment in bringing reasoning capability to the Pantheon roleplay series in the form of an uncensored dense Qwen 3.6 27B. This specific model can be thought of as a successor to both the Pantheon series and the one-time Codex release since I used such a large…

15
Hacker News — AI on Front Page community 1mo ago

Danish pension fund excludes SpaceX citing governance and valuation

Article URL: https://www.reuters.com/legal/transactional/danish-pension-fund-excludes-spacex-citing-governance-valuation-2026-05-29/ Comments URL: https://news.ycombinator.com/item?id=48333820 Points: 207 # Comments: 146

23
Hugging Face Daily Papers research 1mo ago

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Abstract: A parameter-efficient vision-language model is developed for time-series anomaly detection using a novel benchmark with natural-language rationales, achieving superior performance and generalization across multiple datasets. AI-generated summary: Recent advances in…

38
TechCrunch — AI news-outlet 1mo ago

This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory

South Korean chip startup Xcena is betting that AI's real bottleneck is not compute, but memory.

20
Hugging Face Daily Papers research 1mo ago

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Abstract: PRISM evaluates automated peer review systems across multiple dimensions using argument mining and retrieval-augmented verification, revealing that while LLMs match human performance in specific areas, no system consistently equals human reviewers across all evaluation…

19
arXiv — Machine Learning research 1mo ago

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

arXiv:2605.28866v1 Announce Type: new Abstract: Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series…

20
arXiv — Machine Learning research 1mo ago

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

arXiv:2605.28867v1 Announce Type: new Abstract: Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an…

10
arXiv — Machine Learning research 1mo ago

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

arXiv:2605.29005v1 Announce Type: new Abstract: Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational…

18
arXiv — Machine Learning research 1mo ago

Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation

arXiv:2605.29108v1 Announce Type: new Abstract: Selecting efficient multi-step synthetic routes is a central challenge in organic synthesis, particularly in medicinal and process chemistry, where route choice directly impacts feasibility, cost, and development efficiency.…

28
arXiv — Machine Learning research 1mo ago

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

arXiv:2605.29156v1 Announce Type: new Abstract: Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit…

9
arXiv — Machine Learning research 1mo ago

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

arXiv:2605.29283v1 Announce Type: new Abstract: Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to…

22
arXiv — Machine Learning research 1mo ago

Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems

arXiv:2605.29373v1 Announce Type: new Abstract: Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we…

13
arXiv — Machine Learning research 1mo ago

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

arXiv:2605.29500v1 Announce Type: new Abstract: Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard…

11
arXiv — NLP / Computation & Language research 1mo ago

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

arXiv:2605.28830v1 Announce Type: new Abstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated…

19
arXiv — NLP / Computation & Language research 1mo ago

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

arXiv:2605.28848v1 Announce Type: new Abstract: Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how…

31
arXiv — NLP / Computation & Language research 1mo ago

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

arXiv:2605.28882v1 Announce Type: new Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet…

4
arXiv — NLP / Computation & Language research 1mo ago

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

arXiv:2605.29256v1 Announce Type: new Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and…

21
arXiv — NLP / Computation & Language research 1mo ago

A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities

arXiv:2605.29340v1 Announce Type: new Abstract: In this paper, we discuss question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of manual analysis of AnswerCarefully, we introduce several additional information, methods…

19
