Tag: Funding · 500 articles archived under #funding Hugging Face Daily Papers research 10d ago The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation Abstract Analysis of FID variance across different training and sampling seeds reveals significant reproducibility issues in image generation evaluation, with retraining causing larger fluctuations than resampling, and recommends updated evaluation protocols with error bars and… 21 r/MachineLearning community 10d ago Best library for releasing my research optimization algorithm? [D] Hi All! I have developed a research optimizer (QQN Quadratic Quasi-Newton) and published a paper on it where I am able to, but I would really like to make the algorithm itself easily available to the community for evaluation. I have a Rust, Java, and Javascript implementations,… 36 TechCrunch — AI news-outlet 10d ago The CEO of Allbirds’ new AI biz has a plan, but no employees Call it a startup with a sole founder and a very large seed round, but what's next is less clear. 23 Hugging Face Daily Papers research 10d ago Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages Abstract Multi-LCB addresses the limitation of LiveCodeBench by providing a multi-language benchmark for evaluating LLMs across twelve programming languages while maintaining contamination controls and evaluation protocols. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 33 arXiv — Machine Learning research 11d ago Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures arXiv:2606.19365v1 Announce Type: new Abstract: Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and a highly… 35 arXiv — Machine Learning research 11d ago MassSpecGym in the Wild: Uncovering and Correcting Evaluation Pitfalls in AI-Driven Molecule Discovery arXiv:2606.19624v1 Announce Type: new Abstract: Reliable benchmarking is critical for developing machine learning models for tandem mass spectrometry (MS/MS) based molecule discovery. Subtle issues in experimental design and model evaluation procedures can degrade the… 16 arXiv — Machine Learning research 11d ago SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models arXiv:2606.19888v1 Announce Type: new Abstract: Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent… 11 arXiv — Machine Learning research 11d ago PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection arXiv:2606.20055v1 Announce Type: new Abstract: Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. 
Current Transformer- and large-model-based detection approaches incur excessive computational… 21 arXiv — Machine Learning research 11d ago Learner-based Concept Drift Detection: Analysis and Evaluation arXiv:2606.20216v1 Announce Type: new Abstract: Machine learning algorithms deployed for evolving streaming environments must handle the non-stationary data distributions, commonly referred to as concept drift. The presence of concept drift poses a major challenge for many… 23 arXiv — NLP / Computation & Language research 11d ago Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias arXiv:2606.19544v1 Announce Type: new Abstract: LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates… 34 arXiv — NLP / Computation & Language research 11d ago IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources arXiv:2606.20089v1 Announce Type: new Abstract: Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a… 15 arXiv — NLP / Computation & Language research 11d ago The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse arXiv:2606.20255v1 Announce Type: new Abstract: We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. 
Existing benchmarks for… 34 arXiv — NLP / Computation & Language research 11d ago Gender Bias in LLM Hiring Decisions: Evidence from a Japanese Context and Evaluation of Mitigation Strategies arXiv:2606.18649v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in hiring workflows, yet most research on gender bias in LLM hiring decisions has focused on English-language, Western-format resumes. This study examines whether pro-female… 38 Hugging Face Daily Papers research 11d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27 Hugging Face Daily Papers research 11d ago Re-Centering Humans in LLM Personalization Abstract Human-centered evaluation reveals significant gaps between synthetic and real-world LLM personalization performance, with models struggling to extract user attributes and generate truly personalized responses that match human quality judgments. Generated by… 30 TechCrunch — AI news-outlet 11d ago General Intuition in talks to raise $300M at around $2B valuation General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning. 14 Hugging Face Daily Papers research 11d ago A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets Abstract A benchmark for predicting spreadsheet user actions is introduced, addressing challenges in edit history availability and complex action spaces through manual curation and online evaluation methodology. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Predictive code… 17 OpenAI official-blog 11d ago Improving health intelligence in ChatGPT Learn how GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context, clearer communication, and physician-informed evaluations. 7 arXiv — Machine Learning research 12d ago Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting arXiv:2606.18367v1 Announce Type: new Abstract: Standard benchmarks evaluate time series foundation models (TSFMs) using aggregate metrics, but these can mask severe failures in critical operating regimes. We introduce regime-stratified evaluation and apply it to three TSFMs on… 19 arXiv — Machine Learning research 12d ago RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing arXiv:2606.18774v1 Announce Type: new Abstract: We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on… 35 arXiv — Machine Learning research 12d ago Anomaly Detection for Sparse and Irregular Multivariate Time Series with Latent SDEs arXiv:2606.18898v1 Announce Type: new Abstract: Multivariate time series anomaly detection (MTSAD) is critical for a wide range of application areas, such as industrial monitoring, cybersecurity, or healthcare. Real-world data is often sparse, irregularly sampled or partially… 8 arXiv — NLP / Computation & Language research 12d ago Are LLMs Ready to Assist Physicians? 
PhysAssistBench for Interactive Doctor-Patient-EHR Assistance arXiv:2606.18613v1 Announce Type: new Abstract: The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication.… 7 arXiv — NLP / Computation & Language research 12d ago Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports arXiv:2606.18797v1 Announce Type: new Abstract: Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this… 23 arXiv — NLP / Computation & Language research 12d ago Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering arXiv:2606.18986v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series… 10 arXiv — NLP / Computation & Language research 12d ago Learning User Simulators with Turing Rewards arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by… 37 arXiv — NLP / Computation & Language research 12d ago Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment arXiv:2606.18979v1 Announce Type: cross Abstract: Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. 
Speech-based evaluation can support diagnostics and improve accessibility, but… 36 arXiv — NLP / Computation & Language research 12d ago Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation arXiv:2606.19139v1 Announce Type: cross Abstract: Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts,… 15 arXiv — NLP / Computation & Language research 12d ago ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark arXiv:2505.23851v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution… 38 Hugging Face Daily Papers research 12d ago Physics-IQ Verified Abstract A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video… 29 r/LocalLLaMA community 12d ago Lin Junyang AI Lab Closes Round at $2B Valuation A new lab from Lin Junyang can only be good news for open source / weights, I think. Excited to see what the lead responsible for the Qwen line does next. Submitted by /u/rmhubbert 38 TechCrunch — AI news-outlet 12d ago World model maker Odyssey nabs $1.45B valuation backed by Amazon and other big names World models are the next big thing in AI beyond LLMs and, with this round, Odyssey has cemented itself as one of the startups to watch. 
30 TechCrunch — AI news-outlet 12d ago Pramaana Labs raises $27M seed round from Khosla Ventures to bring formal verification to AI Pramaana will focus on highly sensitive verticals like law, drug discovery, and tax preparation — where errors can be costly and reliability is at a premium. 22 arXiv — Machine Learning research 13d ago Informative Missingness to Generate Irregular Clinical Time Series arXiv:2606.17106v1 Announce Type: new Abstract: Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology,… 8 arXiv — Machine Learning research 13d ago Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis arXiv:2606.17115v1 Announce Type: new Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based… 18 arXiv — NLP / Computation & Language research 13d ago Rift: A Conflict Signature for Deception in Language Models arXiv:2606.17229v1 Announce Type: cross Abstract: A model that lies while knowing the truth is the central case ELK cannot handle with behavioral evaluation alone. We ask whether such deception leaves an internal signature distinguishing it from honest error. 
Our key move is a… 9 arXiv — Machine Learning research 13d ago Offline Preference-Based Trajectory Evaluation arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective… 20 arXiv — Machine Learning research 13d ago Multiple cyclicity and Wavelet Decomposition with Channel Correlation for Long-term Time Series Forecasting arXiv:2606.17996v1 Announce Type: new Abstract: Cyclicity and trend are important components of time series data and many studies based on cyclicity and trend have achieved good results in long-term time series forecasting. However, we believe that current work neglects the… 37 arXiv — Machine Learning research 13d ago Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines arXiv:2606.18122v1 Announce Type: new Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents… 36 arXiv — Machine Learning research 13d ago RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports arXiv:2606.17062v1 Announce Type: cross Abstract: Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. 
We propose RadSEM (Radiology Sentence-Level Evaluation… 11 arXiv — NLP / Computation & Language research 13d ago MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation arXiv:2606.17449v1 Announce Type: new Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation… 38 arXiv — NLP / Computation & Language research 13d ago AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows arXiv:2606.17474v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential,… 17 arXiv — NLP / Computation & Language research 13d ago Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement arXiv:2606.17506v1 Announce Type: new Abstract: Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate… 4 arXiv — NLP / Computation & Language research 13d ago Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). 
We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction.… 19 arXiv — NLP / Computation & Language research 13d ago The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer arXiv:2606.17609v1 Announce Type: new Abstract: Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the… 7 arXiv — NLP / Computation & Language research 13d ago Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs arXiv:2606.17634v1 Announce Type: new Abstract: Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has… 30 arXiv — NLP / Computation & Language research 13d ago Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation arXiv:2606.17820v1 Announce Type: new Abstract: This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate this method across nine linguistically and geographically diverse language pairs, covering a range of… 28 arXiv — NLP / Computation & Language research 13d ago When Multiple Scripts Matter: Evaluating ASR in Clinical Settings arXiv:2606.17826v1 Announce Type: new Abstract: Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. 
Conventional string-matching evaluation metrics… 20 arXiv — NLP / Computation & Language research 13d ago HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice arXiv:2606.18103v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual… 8 arXiv — NLP / Computation & Language research 13d ago RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an… 28 arXiv — NLP / Computation & Language research 13d ago Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a… 8