Tag

Multimodal

500 articles archived under #multimodal

arXiv — Machine Learning research 19d ago

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

arXiv:2606.11794v1 Announce Type: new Abstract: Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an…

17
arXiv — NLP / Computation & Language research 19d ago

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

arXiv:2606.11209v1 Announce Type: new Abstract: Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning,…

35
arXiv — NLP / Computation & Language research 19d ago

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

arXiv:2606.11210v1 Announce Type: new Abstract: Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated in education…

9
arXiv — NLP / Computation & Language research 19d ago

Context-Aware Multimodal Claim Verification in Spoken Dialogues

arXiv:2606.11420v1 Announce Type: new Abstract: Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed,…

7
arXiv — NLP / Computation & Language research 19d ago

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

arXiv:2606.11906v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic…

17
arXiv — NLP / Computation & Language research 19d ago

An Ontology-Guided Multi-Anchor Graph Retrieval Framework for Traffic Legal Liability Determination

arXiv:2606.11910v1 Announce Type: new Abstract: Traffic law liability determination is critical for assigning legal penalties, requiring the simultaneous identification of interdependent statutory provisions across multiple legal dimensions. However, existing retrieval-augmented…

28
arXiv — NLP / Computation & Language research 19d ago

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

arXiv:2606.11953v1 Announce Type: new Abstract: Hateful videos have become prevalent on online platforms, highlighting an urgent need for effective detection. However, existing studies primarily focus on binary classification and fail to provide contextual rationales that reveal…

8
r/LocalLLaMA community 19d ago

nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face

Model Overview Description: DiffusionGemma 26B A4B IT is an open-weights multimodal generative model developed by Google DeepMind that processes text, image, and video inputs to produce text output via discrete diffusion. Built on the Gemma 4 26B A4B Mixture-of-Experts (MoE)…

12
Hugging Face Daily Papers research 19d ago

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Abstract Embodied-R1.5 is a unified embodied foundation model that integrates embodied reasoning capabilities and achieves state-of-the-art performance on embodied vision-language benchmarks through a multi-task balanced reinforcement learning approach. Generated by…

35
Hugging Face Daily Papers research 19d ago

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Abstract InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

18
Hugging Face Daily Papers research 19d ago

World Model Self-Distillation: Training World Models to Solve General Tasks

Abstract A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Pretrained video…

15
Hugging Face Daily Papers research 19d ago

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

10
Hugging Face Daily Papers research 20d ago

Kwai Keye-VL-2.0 Technical Report

Abstract Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure. Generated by…

36
r/LocalLLaMA community 20d ago

Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt?

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the separate ASR + LLM steps into a single…

31
Hugging Face Daily Papers research 20d ago

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Abstract Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

9
Hugging Face Daily Papers research 20d ago

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

Abstract A multi-agent framework automates data journalism by generating evidence-grounded, multimodal news stories while maintaining transparency and verifiability. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Data tells stories that shape society; the data journalist's job is…

10
arXiv — Machine Learning research 20d ago

SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

arXiv:2606.09853v1 Announce Type: new Abstract: A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches…

38
arXiv — Machine Learning research 20d ago

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

arXiv:2606.09868v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However,…

28
arXiv — Machine Learning research 20d ago

LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

arXiv:2606.09907v1 Announce Type: new Abstract: Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: i) modality missingness, where…

28
arXiv — Machine Learning research 20d ago

MMClima: A Framework for Multimodal Climate Science Data and Evaluation

arXiv:2606.10194v1 Announce Type: new Abstract: Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We…

20
arXiv — Machine Learning research 20d ago

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

arXiv:2606.10198v1 Announce Type: new Abstract: Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors…

20
arXiv — NLP / Computation & Language research 20d ago

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than…

23
arXiv — NLP / Computation & Language research 20d ago

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

arXiv:2606.10803v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability…

38
arXiv — NLP / Computation & Language research 20d ago

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

arXiv:2606.11074v1 Announce Type: new Abstract: With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit…

37
arXiv — NLP / Computation & Language research 20d ago

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

arXiv:2606.09846v1 Announce Type: cross Abstract: Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated…

6
arXiv — NLP / Computation & Language research 20d ago

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

arXiv:2606.10147v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the…

30
Hugging Face Daily Papers research 20d ago

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Abstract Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks. Generated by…

28
Hugging Face Daily Papers research 20d ago

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Abstract Struct-Searcher introduces a belief revision theory-based structural agentic workflow for multimodal information seeking that improves accuracy over existing vision-language models and deep research agents. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Deep research…

17
Hugging Face Daily Papers research 20d ago

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Abstract ARM demonstrates a unified autoregressive framework for image understanding, generation, and editing through discrete semantic tokenization and reinforcement learning optimization. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) This paper introduces ARM, a discrete…

35
Hugging Face Daily Papers research 20d ago

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Abstract VoLoAgent enables physical orchestration by integrating vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Open-vocabulary long-horizon manipulation requires robots to reason…

24
Hugging Face Daily Papers research 20d ago

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Abstract Trust functions enable effective weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Weak-to-strong…

15
Hugging Face Daily Papers research 20d ago

Phase Marginalization for Patch-Grid Instability in Vision Transformers

Abstract Phase Marginalization is a post-hoc method that addresses phase-dependent instability in Vision Transformers by evaluating structured patch-grid phases and aggregating outputs in the original image coordinate system. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Vision…

32
NVIDIA Developer Blog official-blog 20d ago

Delivering Lifecycle Control for AI Infrastructure at Scale with NVIDIA DGX Spark Enterprise Manageability

As AI infrastructure scales, enterprise expectations for operational maturity are increasing. Organizations expect these systems to be provisionable,...

38
Hugging Face Daily Papers research 20d ago

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Abstract AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Training…

32
Hugging Face Daily Papers research 20d ago

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Abstract Light-WAM is a lightweight world action model for robot manipulation that uses a compact video backbone and downsampled latent space for efficient future-video supervision, combined with a StateFusionActionExpert for direct action prediction. Generated by…

25
Google DeepMind official-blog 20d ago

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Jun 03, 2026 · Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.…

17
Hugging Face Daily Papers research 21d ago

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Abstract Optical reasoning uses images as a standalone reasoning medium for language and multimodal tasks, achieving higher token efficiency than traditional text-based approaches. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Chain-of-Thought (CoT) improves the performance of…

27
Hugging Face Daily Papers research 21d ago

EMMA: Extracting Multiple physical parameters from Multimodal Data

Abstract EMMA is a physics-informed multimodal framework that directly recovers dynamical parameters from raw video, audio, and image data using a Liquid Time-Constant network and physics-constrained loss. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) We introduce EMMA, a…

33
Hugging Face Daily Papers research 21d ago

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Abstract Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather…

24
Hugging Face Daily Papers research 21d ago

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Abstract OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.)…

18
Hugging Face Daily Papers research 21d ago

Trajectory-Refined Distillation

Abstract On-policy distillation suffers from prefix failure where dense token-level supervision creates fragmented gradients; trajectory-refined distillation addresses this by correcting student rollouts at the trajectory level before distillation. Generated by…

37
arXiv — Machine Learning research 21d ago

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

arXiv:2606.07599v1 Announce Type: new Abstract: Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to…

23
arXiv — Machine Learning research 21d ago

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

arXiv:2606.07651v1 Announce Type: new Abstract: Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on…

12
arXiv — Machine Learning research 21d ago

Constraint-Aware Optimization for Robust Protein Stability Prediction

arXiv:2606.08100v1 Announce Type: new Abstract: Multimodal $\Delta\Delta G$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution…

33
Hugging Face Daily Papers research 21d ago

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Abstract SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Spatial reasoning is a…

7
Vercel — AI dev-tools 21d ago

Budgets for API keys on AI Gateway

AI costs are getting harder to forecast. As teams lean more on coding agents and other token-heavy workflows, a key can burn cost faster than anyone notices: autonomous workflows that can loop or fan out without supervision; demos and prototypes that could catch unexpected…

29
Hugging Face Daily Papers research 21d ago

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.…

22
Hugging Face Daily Papers research 21d ago

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Abstract 3D vision research is organized through a taxonomy connecting geometric representations, datasets, learning frameworks, and applications across reconstruction, generation, and video modeling tasks. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) 3D vision has rapidly…

32
Hugging Face Daily Papers research 21d ago

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Abstract Vision-language models struggle to genuinely understand spatial numerical concepts, relying instead on shallow visual cues rather than developing robust coordinate-aware representations. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Vision-Language Models (VLMs) are…

19
Hugging Face Daily Papers research 22d ago

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Abstract An online 3D vision-language model enables real-time spatial understanding from streaming video using autoregressive control modeling and efficient visual token compression. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Despite advances in 3D scene understanding,…

30
