Tag Multimodal 500 articles archived under #multimodal arXiv — Machine Learning research 19d ago Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data arXiv:2606.11794v1 Announce Type: new Abstract: Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an… 17 arXiv — NLP / Computation & Language research 19d ago ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward arXiv:2606.11209v1 Announce Type: new Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning,… 35 arXiv — NLP / Computation & Language research 19d ago T2MM: An LLM Supported Architecture For Inquiry-Based Modeling arXiv:2606.11210v1 Announce Type: new Abstract: Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated in education… 9 arXiv — NLP / Computation & Language research 19d ago Context-Aware Multimodal Claim Verification in Spoken Dialogues arXiv:2606.11420v1 Announce Type: new Abstract: Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed,… 7 arXiv — NLP / Computation & Language research 19d ago When Does Language Matter? 
Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models arXiv:2606.11906v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic… 17 arXiv — NLP / Computation & Language research 19d ago An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination arXiv:2606.11910v1 Announce Type: new Abstract: Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented… 28 arXiv — NLP / Computation & Language research 19d ago Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos arXiv:2606.11953v1 Announce Type: new Abstract: Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal… 8 r/LocalLLaMA community 19d ago nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face Model Overview Description: DiffusionGemma 26B A4B IT is an open-weights multimodal generative model developed by Google DeepMind that processes text, image, and video inputs to produce text output via discrete diffusion. Built on the Gemma 4 26B A4B Mixture-of-Experts (MoE)… 12 Hugging Face Daily Papers research 19d ago Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. 
Generated by… 35 Hugging Face Daily Papers research 19d ago InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 18 Hugging Face Daily Papers research 19d ago World Model Self-Distillation: Training World Models to Solve General Tasks Abstract A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Pretrained video… 15 Hugging Face Daily Papers research 19d ago World Pilot: Steering Vision-Language-Action Models with World-Action Priors Abstract World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 10 Hugging Face Daily Papers research 20d ago Kwai Keye-VL-2.0 Technical Report Abstract Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure. Generated by… 36 r/LocalLLaMA community 20d ago Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt? 
I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single… 31 Hugging Face Daily Papers research 20d ago Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation Abstract Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 9 Hugging Face Daily Papers research 20d ago Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories Abstract A multi-agent framework automates data journalism by generating evidence-grounded, multimodal news stories while maintaining transparency and verifiability. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Data tells stories that shape society; the data journalist's job is… 10 arXiv — Machine Learning research 20d ago SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning arXiv:2606.09853v1 Announce Type: new Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches… 38 arXiv — Machine Learning research 20d ago SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. 
However,… 28 arXiv — Machine Learning research 20d ago LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts arXiv:2606.09907v1 Announce Type: new Abstract: Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: i) modality missingness, where… 28 arXiv — Machine Learning research 20d ago MMClima: A Framework for Multimodal Climate Science Data and Evaluation arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We… 20 arXiv — Machine Learning research 20d ago Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity arXiv:2606.10198v1 Announce Type: new Abstract: Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors… 20 arXiv — NLP / Computation & Language research 20d ago Do Vision-Language Models See or Guess? 
Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than… 23 arXiv — NLP / Computation & Language research 20d ago Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use arXiv:2606.10803v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability… 38 arXiv — NLP / Computation & Language research 20d ago Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models arXiv:2606.11074v1 Announce Type: new Abstract: With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit… 37 arXiv — NLP / Computation & Language research 20d ago CANVAS: Captioning Art with Narrative Visual-Audio AI Systems arXiv:2606.09846v1 Announce Type: cross Abstract: Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated… 6 arXiv — NLP / Computation & Language research 20d ago From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs arXiv:2606.10147v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? 
Despite their growing role in research and real-world applications, the… 30 Hugging Face Daily Papers research 20d ago One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA Abstract Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks. Generated by… 28 Hugging Face Daily Papers research 20d ago Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking Abstract Struct-Searcher introduces a belief revision theory-based structural agentic workflow for multimodal information seeking that improves accuracy over existing vision-language models and deep research agents. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep research… 17 Hugging Face Daily Papers research 20d ago ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations Abstract ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This paper introduces ARM, a discrete… 35 Hugging Face Daily Papers research 20d ago VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation Abstract VoLoAgent enables physical orchestration by integrating vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Open-vocabulary long-horizon manipulation requires robots to reason… 24 Hugging Face Daily Papers research 20d ago Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher Abstract Trust functions enable effective weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Weak-to-strong… 15 Hugging Face Daily Papers research 20d ago Phase Marginalization for Patch-Grid Instability in Vision Transformers Abstract Phase Marginalization is a post-hoc method that addresses phase-dependent instability in Vision Transformers by evaluating structured patch-grid phases and aggregating outputs in the original image coordinate system. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision… 32 NVIDIA Developer Blog official-blog 20d ago Delivering Lifecycle Control for AI Infrastructure at Scale with NVIDIA DGX Spark Enterprise Manageability As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable,... 38 Hugging Face Daily Papers research 20d ago AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents Abstract AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training… 32 Hugging Face Daily Papers research 20d ago Light-WAM: Efficient World Action Models with State-Fusion Action Decoding Abstract Light-WAM is a lightweight world action model for robot manipulation that uses a compact video backbone and downsampled latent space for efficient future-video supervision, combined with a StateFusionActionExpert for direct action prediction. 
Generated by… 25 Google DeepMind official-blog 20d ago Introducing Gemma 4 12B: a unified, encoder-free multimodal model Jun 03, 2026 · Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.… 17 Hugging Face Daily Papers research 21d ago Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text Abstract Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) improves the performance of… 27 Hugging Face Daily Papers research 21d ago EMMA: Extracting Multiple physical parameters from Multimodal Data Abstract EMMA is a physics-informed multimodal framework that directly recovers dynamical parameters from raw video, audio, and image data using a Liquid Time-Constant network and physics-constrained loss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce EMMA, a… 33 Hugging Face Daily Papers research 21d ago Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather… 24 Hugging Face Daily Papers research 21d ago OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics Abstract OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 18 Hugging Face Daily Papers research 21d ago Trajectory-Refined Distillation Abstract On-policy distillation suffers from prefix failure where dense token-level supervision creates fragmented gradients; trajectory-refined distillation addresses this by correcting student rollouts at the trajectory level before distillation. Generated by… 37 arXiv — Machine Learning research 21d ago DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression arXiv:2606.07599v1 Announce Type: new Abstract: Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to… 23 arXiv — Machine Learning research 21d ago KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection arXiv:2606.07651v1 Announce Type: new Abstract: Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on… 12 arXiv — Machine Learning research 21d ago Constraint-Aware Optimization for Robust Protein Stability Prediction arXiv:2606.08100v1 Announce Type: new Abstract: Multimodal $\Delta\Delta G$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution… 33 Hugging Face Daily Papers research 21d ago SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial reasoning is a… 7 Vercel — AI dev-tools 21d ago Budgets for API keys on AI Gateway AI costs are getting harder to forecast. As teams lean more on coding agents and other token-heavy workflows, a key can burn cost faster than anyone notices: Autonomous workflows that can loop or fan out without supervision Demos and prototypes that could catch unexpected… 29 Hugging Face Daily Papers research 21d ago Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.… 22 Hugging Face Daily Papers research 21d ago A Cookbook of 3D Vision: Data, Learning Paradigms, and Application Abstract 3D vision research is organized through a taxonomy connecting geometric representations, datasets, learning frameworks, and applications across reconstruction, generation, and video modeling tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct 3D vision has rapidly… 32 Hugging Face Daily Papers research 21d ago SPACENUM: Revisiting Spatial Numerical Understanding in VLMs Abstract Vision-language models struggle to genuinely understand spatial numerical concepts, relying instead on shallow visual cues rather than developing robust coordinate-aware representations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language Models (VLMs) are… 19 Hugging Face Daily Papers research 22d ago Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors Abstract An online 3D vision-language model enables real-time spatial understanding from streaming video using autoregressive control modeling and efficient visual token compression. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite advances in 3D scene understanding,… 30