Multimodal · 500 articles archived under #multimodal arXiv — Machine Learning research 31m ago Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models arXiv:2606.28406v1 Announce Type: new Abstract: Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing… 36 arXiv — Machine Learning research 31m ago Counterfactual Residual Data Augmentation for Regression arXiv:2606.28460v1 Announce Type: new Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel… 21 arXiv — Machine Learning research 31m ago NIVA: A Multimodal Foundation Model for Actionable Earth System Intelligence arXiv:2606.28546v1 Announce Type: new Abstract: Recent advances in AI-driven weather and climate modeling have improved forecast skill while reducing computational cost. However, existing data-driven approaches are limited in their ability to model coupled Earth system dynamics,… 9 arXiv — Machine Learning research 31m ago ML-Powered LDAP Reconnaissance Detection using Weak Supervision arXiv:2606.28917v1 Announce Type: new Abstract: Lightweight Directory Access Protocol (LDAP) is a protocol that allows users to query and modify Active Directory (AD) data. By default, all users have read access to all AD data through LDAP, making it a common initial tool for… 14 arXiv — Machine Learning research 31m ago DLR: Zero-Inference-Cost Latent Residuals for Low-Rank Pre-Training arXiv:2606.28932v1 Announce Type: new Abstract: Large language models have driven recent progress in language and multimodal AI, yet pre-training them at scale is prohibitively expensive. 
Low-rank pre-training, which factorizes each weight matrix into a rank-r product to reduce… 35 arXiv — Machine Learning research 31m ago AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification arXiv:2606.29335v1 Announce Type: new Abstract: Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker… 14 arXiv — Machine Learning research 31m ago Do Models Read What They Write? Causal Registers in Scratchpad Reasoning arXiv:2606.29522v1 Announce Type: new Abstract: A central hope behind process supervision is that models can expose intermediate variables that matter for their later behavior. For this to help with alignment, a scratchpad must be tied to the computation: when the model writes a… 29 arXiv — NLP / Computation & Language research 31m ago SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision arXiv:2606.28562v1 Announce Type: new Abstract: On-policy distillation (OPD) has a property absent in offline distillation and RL: teacher supervision quality depends on student competence. Incoherent rollouts yield noisy gradients; already-mastered tokens yield redundant ones.… 10 arXiv — NLP / Computation & Language research 31m ago EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control arXiv:2606.28938v1 Announce Type: new Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we… 26 arXiv — NLP / Computation & Language research 31m ago Can OCR-VLMs Read Devanagari? 
A Stress-Test Benchmark and Post-Correction Study arXiv:2606.29213v1 Announce Type: new Abstract: OCR systems, ranging from classical engines to specialised OCR vision-language models (OCR-VLMs) and frontier multimodal LLMs, report strong results on English and Chinese document benchmarks, yet their behaviour on Indic scripts… 30 arXiv — NLP / Computation & Language research 31m ago Hybrid Retriever Evolution for Multimodal Document Reasoning Agents arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to… 33 arXiv — NLP / Computation & Language research 31m ago Resolution Thresholds in VLM Detection of Harmful ASCII Art Across Construction Modes and Languages arXiv:2606.29649v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) are increasingly deployed as content moderation tools, yet they remain vulnerable to jailbreak attacks in which harmful text is visually encoded as ASCII art. This can allow inappropriate or… 31 arXiv — NLP / Computation & Language research 31m ago Can MLLMs Critique Like Humans? 
Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models arXiv:2606.29689v1 Announce Type: new Abstract: Open-ended aesthetic critique is a challenge for multimodal large language models (MLLMs): unlike multiple-choice aesthetic benchmarks, it has no single correct answer, and most aesthetic evaluation has measured models against… 8 arXiv — NLP / Computation & Language research 31m ago DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning arXiv:2606.30189v1 Announce Type: new Abstract: Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world… 14 arXiv — NLP / Computation & Language research 31m ago Forewarned is Forearmed: When Non-Sequential Embedding Turns Into an Anomaly Detector arXiv:2606.30196v1 Announce Type: new Abstract: This paper offers an in-depth analysis of non-sequential multimodal sentence-level embeddings, with a particular focus on the SONAR model. We demonstrate that certain embedding dimensions are sensitive to perturbations and can… 25 Hacker News — AI on Front Page community 8h ago .self: A new top-level domain designed to support self-hosting Article URL: https://hccf.onmy.cloud/2026/06/21/reclaiming-our-digital-selves-hccfs-vision-for-a-human-centered-top-level-domain/ Comments URL: https://news.ycombinator.com/item?id=48724230 Points: 246 # Comments: 154 24 r/MachineLearning community 11h ago I built a demo agricultural planning system with an AI advisor for small-scale farmers in Nicaragua using NASA data [p] (this was deleted before but i dont know if it was the filters of reddit or the moderators, if is the moderators i will not post it again after you delete it sorry.) 
(The name will probably change soon because I didn't realize "AgroVision" is already a registered trademark lol.)… 15 r/MachineLearning community 13h ago I do historical swordfighting and noticed AI struggles to track it. I’m building an open dataset to help fix this. Does my schema make sense? [P] Hi everyone, I’m a historical swordfighter (HEMA practitioner), and while I’m not a computer vision engineer or a roboticist, I’ve been reading a lot about the current bottlenecks in embodied AI, specifically around the Sim2Real gap and thin-object tracking. It occurred to me… 18 r/MachineLearning community 23h ago ECCV 2026 Final Decisions after Provisional Acceptance [D] Has anyone actually received final acceptance following their provisional acceptance email from ECCV 2026? I am very confused. Thank you so much. submitted by /u/Land_Heavy 15 arXiv — Machine Learning research 1d ago HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models arXiv:2606.27627v1 Announce Type: new Abstract: Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance… 7 arXiv — Machine Learning research 1d ago Are Time-Series Foundation Models Ready for E-Nose Data? 
An Empirical Assessment of Their Embeddings arXiv:2606.27672v1 Announce Type: new Abstract: Inspired by advances in natural language processing and computer vision, "time-series foundation models" (TSFMs) have recently been introduced with the promise of strong generalization across diverse time-series tasks, including… 5 arXiv — Machine Learning research 1d ago Dual-Learning based Penalized Multi-Align Clustering for Multi-View Incomplete and Disorderly Data arXiv:2606.27984v1 Announce Type: new Abstract: Multimodal feature fusion can effectively capture complex patterns in real-world data by integrating complementary information from different modalities. However, in many applications, such as boiler combustion monitoring,… 18 arXiv — Machine Learning research 1d ago Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection arXiv:2606.28134v1 Announce Type: new Abstract: Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges:… 12 arXiv — Machine Learning research 1d ago Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge arXiv:2606.27527v1 Announce Type: cross Abstract: Large Language Models (LLMs) possess broad conceptual knowledge acquired through large-scale text pretraining, yet their potential to supervise models in other modalities remains underexplored. In this work, we propose… 10 arXiv — Machine Learning research 1d ago Learning from Annotation Uncertainty: Entropy-Aware Curriculum for Speech Emotion Recognition arXiv:2606.27536v1 Announce Type: cross Abstract: Speech emotion recognition (SER) often relies on hard consensus labels that collapse annotator disagreement. 
We study distribution-based supervision for 9-class SER on MSP-Podcast 2.0 using a WavLM-Base multitask model for… 23 arXiv — NLP / Computation & Language research 1d ago Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection arXiv:2606.28002v1 Announce Type: new Abstract: Insurance fraud imposes substantial financial losses and operational inefficiencies, raising premiums and impacting trust among legitimate policyholders. Early detection at FNOL remains a persistent challenge. Existing approaches… 25 arXiv — NLP / Computation & Language research 1d ago Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models arXiv:2606.28273v1 Announce Type: new Abstract: Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally… 31 arXiv — NLP / Computation & Language research 1d ago DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write… 11 arXiv — NLP / Computation & Language research 1d ago Aloe-Vision: Robust Vision-Language Models for Healthcare arXiv:2606.27500v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) specialized in healthcare are emerging as a promising research direction due to their potential impact in clinical and biomedical applications. 
However, progress is constrained by the scarcity… 28 arXiv — NLP / Computation & Language research 1d ago Joint Transcription and Decryption of Images of Encrypted Handwritten Documents: A Comparison with the Traditional Pipeline arXiv:2606.27700v1 Announce Type: cross Abstract: Historical encrypted manuscripts present a challenging problem at the intersection of cryptology, linguistics, paleography, and computer vision. Current automatic decipherment approaches usually rely on a two-stage pipeline:… 7 arXiv — NLP / Computation & Language research 1d ago EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from… 34 arXiv — NLP / Computation & Language research 1d ago Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using… 4 arXiv — NLP / Computation & Language research 1d ago SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning arXiv:2606.22873v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering,… 31 TechCrunch — AI news-outlet 2d ago SoftBank’s CEO isn’t the only one with questions about Elon Musk’s orbital data center hype Not everyone is buying Elon Musk’s vision for orbital data centers. 
19 TechCrunch — AI news-outlet 2d ago Apple Vision Pro exec is reportedly leaving for OpenAI Paul Meade, the Apple vice president in charge of the Vision Pro headset, is reportedly leaving the company to join OpenAI’s hardware team. 22 r/LocalLLaMA community 2d ago Agentic Cyberdeck Dev I developed this around August '25, but never had real polished panels. So, here we are with some decent panels, and new speakers for voice AI inferencing. This has local agentic GPS, chat, voice, vision analysis. This is a fun little project that I come back around to until I… 12 r/LocalLLaMA community 2d ago New deepseek vision model incoming? Hello guys, it seems like DeepSeek added a new vision mode to their application. Does this mean, that they will release a new vision model? Edit: Guys.it is not an OCR model. I have just asked it to describe multiple images, which had no text in them.   submitted by  … 19 Hugging Face Daily Papers research 3d ago ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation Abstract ABACUS is a unified vision-language model that performs object counting and related tasks through innovative spatial grounding, boundary-aware counting policies, and self-critical learning strategies. Generated by Qwen/Qwen2.5-Coder-32B-Instruct ABACUS is a unified… 16 r/LocalLLaMA community 3d ago Can Qwen3.6-35B-A3B on an RTX 3060 Replace Google Vision for Receipt-to-JSON Extraction? I tried replacing Google Vision in my receipt pipeline with a local Qwen model. I had an old LINE message bot where I could send a receipt photo, it would go to Google Vision, get parsed into JSON, and saved in SQLite. Recently I tried again, but locally. Setup: RTX 3060 12GB… 8 r/LocalLLaMA community 3d ago Gemma 4 12b needs glasses Having a lot of fun using Gemma 4 as an assistant, but is growing frustrated with the poor default image resolution setting for image vision. 
Tasks like identifying smaller text in an image that Qwen 3.6 flies through, Gemma 4 are never able to decipher. Even larger overall… 31 arXiv — Machine Learning research 4d ago \chisao{}: A GPU-Native Parallel Optimizer for Multimodal Black-Box Functions via Convergence-Anticonvergence Oscillation arXiv:2606.26164v1 Announce Type: new Abstract: Finding all modes of a multimodal black-box function is a fundamental challenge in optimization, Bayesian inference, and scientific computing. Existing approaches -- basin-hopping, CMA-ES, multistart gradient descent -- operate… 26 arXiv — Machine Learning research 4d ago When Does Quality-Aware Multimodal Fusion Matter? A Leakage-Safe Diagnostic for Decision-Level Dependence arXiv:2606.26473v1 Announce Type: new Abstract: Many multimodal systems estimate the reliability of each modality and weight their contributions to the final prediction. However, it remains unclear whether these scores influence model decisions or merely correlate with… 20 arXiv — NLP / Computation & Language research 4d ago Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA arXiv:2606.27023v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) applied to Medical Visual Question Answering (VQA) tend to produce overconfident outputs regardless of actual correctness, and existing verbalized confidence calibration methods, developed… 15 arXiv — Machine Learning research 4d ago Automating Potential-based Reward Shaping with Vision Language Model Guidance arXiv:2606.27180v1 Announce Type: new Abstract: Sparse rewards are inherently challenging for reinforcement learning agents as they lack intermediate feedback to guide exploration and to correctly attribute the sparse success rewards to relevant parts of the trajectory. 
Naive… 36 arXiv — Machine Learning research 4d ago Beyond the Hard Budget: Sparsity Regularizers for More Interpretable Top-k Sparse Autoencoders arXiv:2606.27321v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have become a leading tool for interpreting the representations of vision foundation models, decomposing their polysemantic activations into a larger set of sparse, more monosemantic features. The Top-$k$… 22 arXiv — Machine Learning research 4d ago Dot-Flik: A Scalable Edge AI Architecture for Distributed Insect Monitoring arXiv:2606.26121v1 Announce Type: cross Abstract: Global insect population declines necessitate scalable, continuous monitoring systems, yet existing vision-based solutions remain constrained by high hardware costs, energy demands, and reliance on centralized processing or cloud… 11 arXiv — NLP / Computation & Language research 4d ago Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars arXiv:2606.26107v1 Announce Type: new Abstract: Sign language communication systems, that integrate emotional expression remain underexplored, particularly for low-resource languages. 
This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a… 37 arXiv — NLP / Computation & Language research 4d ago From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models arXiv:2606.26196v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have recently made remarkable progress in unifying vision-language understanding and reasoning, especially following the introduction of models such as OpenAI's O-series and DeepSeek's… 12 arXiv — NLP / Computation & Language research 4d ago The Inattentional Gap: Task-Conditioned Language and Vision Models Omit the Safety-Critical Signals They Can Otherwise Report arXiv:2606.26529v1 Announce Type: new Abstract: AI safety is evaluated by how reliably a model detects the hazards it is told to find, yet accidents often arise from the hazard no one specified. We show that conditioning a language or vision model on a narrow task suppresses its… 14 arXiv — NLP / Computation & Language research 4d ago SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context arXiv:2606.26654v1 Announce Type: new Abstract: Personalized language-model assistants are often evaluated through a memory lens: can a model recall preferences users have explicitly stated in dialogue? More comprehensive personalization demands a harder capability -- inferring… 13