Tag

Multimodal

500 articles archived under #multimodal

arXiv — NLP / Computation & Language research 4d ago

GAVEL: Grounded Caption Error Verification and Localization

arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy…

24
arXiv — NLP / Computation & Language research 4d ago

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

arXiv:2606.27330v1 Announce Type: new Abstract: Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and…

8
arXiv — NLP / Computation & Language research 4d ago

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to…

19
arXiv — NLP / Computation & Language research 4d ago

Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models

arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against…

18
arXiv — NLP / Computation & Language research 4d ago

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

arXiv:2606.27187v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in…

25
arXiv — NLP / Computation & Language research 4d ago

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

arXiv:2506.15681v4 Announce Type: replace Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios,…

16
Hugging Face Daily Papers research 4d ago

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

Abstract ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A unified representation…

26
Hugging Face Daily Papers research 4d ago

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Abstract On-policy skill distillation framework extracts dense hindsight supervision from completed trajectories to improve language agent training efficiency and performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Outcome-based reinforcement learning provides a stable…

20
Hugging Face Daily Papers research 4d ago

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are…

23
Hugging Face Daily Papers research 4d ago

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

Abstract A novel-view video synthesis method that enhances motion-aware diffusion models through multi-view point tracking supervision to improve geometric consistency and motion fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Synthesizing a novel-view video from a…

37
Hugging Face Daily Papers research 4d ago

ShutterMuse: Capture-Time Photography Guidance with MLLMs

Abstract Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-world photography…

12
arXiv — Machine Learning research 5d ago

Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models

arXiv:2606.24898v1 Announce Type: new Abstract: Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. This creates a basic supervision question: which state variables does cross-entropy actually…

37
arXiv — Machine Learning research 5d ago

When Multi-Sensor Fusion Fails to Generalize: Cattle Posture Classification Under Animal-Level and Temporal Distribution Shift

arXiv:2606.24986v1 Announce Type: new Abstract: Automated cattle posture-classification systems frequently report near-perfect accuracy, yet their robustness under realistic deployment conditions remains largely unknown. In particular, it is unclear whether multimodal sensor…

25
arXiv — Machine Learning research 5d ago

Geo-Strat-RL: Learning Geological Event Reasoning from Verifiable Tasks

arXiv:2606.25000v1 Announce Type: new Abstract: To evaluate whether vision-language models can reason about geological histories, it is necessary to construct observations for which the underlying process history is known. Furthermore, reasoning over geological histories is not…

6
arXiv — Machine Learning research 5d ago

An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture, leaf area index, and plant height from Sentinel-1 and Sentinel-2 time series

arXiv:2606.25174v1 Announce Type: new Abstract: Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and…

24
arXiv — NLP / Computation & Language research 5d ago

Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making

arXiv:2606.25421v1 Announce Type: new Abstract: Recent studies on world modeling for Large Language Model (LLM) agents typically formulate the learning objective as next-observation prediction. However, this objective ties supervision to what a transition happens to reveal,…

32
arXiv — NLP / Computation & Language research 5d ago

PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models

arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often…

29
arXiv — NLP / Computation & Language research 5d ago

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations…

29
arXiv — NLP / Computation & Language research 5d ago

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

arXiv:2606.26079v1 Announce Type: new Abstract: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI…

31
arXiv — NLP / Computation & Language research 5d ago

Multilingual Hematology Visual Question Answering Dataset

arXiv:2606.25246v1 Announce Type: cross Abstract: Vision Language Models (VLMs) have shown promising capabilities in medical image analysis by jointly understanding visual and textual information for tasks such as Visual Question Answering. However, existing hematology…

5
arXiv — NLP / Computation & Language research 5d ago

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet…

14
arXiv — NLP / Computation & Language research 5d ago

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

arXiv:2606.26041v1 Announce Type: cross Abstract: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently…

29
Hugging Face Daily Papers research 5d ago

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract Multimodal Chain-of-Thought reasoning shows selective effectiveness across different tasks, with limitations in maintaining visual introspection during reasoning processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) has become a standard method…

17
Hugging Face Daily Papers research 5d ago

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

Abstract This survey explores multimodal code intelligence systems that generate and reason with code based on visual inputs, categorizing approaches across GUI, scientific visualization, structured graphics, and emerging frameworks while identifying verification-centered…

25
Hugging Face Daily Papers research 5d ago

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Abstract Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Unified multi-modal large language models (MLLMs)…

7
Hugging Face Daily Papers research 5d ago

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Abstract Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

20
Hugging Face Daily Papers research 5d ago

InSight: Self-Guided Skill Acquisition via Steerable VLAs

Abstract InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language-action (VLA) models can learn manipulation…

19
r/MachineLearning community 5d ago

MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D]

Hi everyone, For the past couple of weeks I have been working on a simulator project considering the shortcomings of MuJoCo. There are things that people like and also don't like about MuJoCo, like the CPU dependency on MuJoCo which makes the simulation not parallelizable beyond…

31
Hugging Face Daily Papers research 5d ago

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract EventVLA addresses long-horizon robotic manipulation challenges by introducing a sparse visual evidence memory framework with visual anchors and dynamic Keyframe Evidence Memory module for improved task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory…

23
Hugging Face Daily Papers research 5d ago

FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation…

34
arXiv — Machine Learning research 6d ago

3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy

arXiv:2606.23964v1 Announce Type: new Abstract: Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. MAE-3D)…

34
arXiv — Machine Learning research 6d ago

Verifiable Foundation Models for Robot Safety

arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable…

4
arXiv — Machine Learning research 6d ago

Prediction of Viscoelastic Droplet Impact Dynamics Using a Vision Transformer-Based Approach

arXiv:2606.23940v1 Announce Type: cross Abstract: Droplet impact on solid surfaces is a complex fluid dynamics problem with applications in spray cooling, inkjet printing, and pharmaceutical processing. Although numerical simulations are widely used to investigate these…

24
arXiv — Machine Learning research 6d ago

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

arXiv:2606.24388v1 Announce Type: cross Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by…

38
arXiv — NLP / Computation & Language research 6d ago

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,…

38
arXiv — NLP / Computation & Language research 6d ago

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To…

15
arXiv — NLP / Computation & Language research 6d ago

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing…

17
arXiv — NLP / Computation & Language research 6d ago

PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

arXiv:2606.24346v1 Announce Type: cross Abstract: Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale…

28
Hugging Face Daily Papers research 6d ago

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

Abstract FlowR2A addresses the tension in multimodal driving planning by combining dense reward supervision with dynamic proposal generation through a flow-matching decoder that learns reward-conditioned action distributions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 6d ago

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by…

29
Hugging Face Daily Papers research 6d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Abstract A novel framework called VeriEvol is introduced that addresses the challenge of scaling reinforcement learning for visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach that separates prompt difficulty from answer reliability,…

17
Hugging Face Daily Papers research 6d ago

Libretto: Giving LLM Agents a Sense of Musical Structure

Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from…

18
Hugging Face Daily Papers research 6d ago

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Abstract BioMatrix is a novel multimodal foundation model that integrates molecular sequences, structures, and natural language into a unified decoder-only architecture for diverse biological tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present BioMatrix, the first…

37
r/MachineLearning community 7d ago

Just landed a Computer Vision internship, here's the preparation list I used [D]

Hey everyone, I recently landed a Computer Vision internship after prepping with this checklist I put together. It starts with core math and ML fundamentals, then moves into the specialized CV topics that actually come up in interviews. I compressed it into just 7 days due to…

25
Hugging Face Daily Papers research 7d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured…

19
Hugging Face Daily Papers research 7d ago

UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Abstract UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision Transformers (ViT) dominate computer…

6
Hugging Face Daily Papers research 7d ago

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Abstract PolicyTrim is a reinforcement learning-based framework that enhances VLA model efficiency by extending reliable action chunk lengths and reducing redundant physical steps through dynamic exploration and redundancy-aware rewards. Generated by…

25
Hugging Face Daily Papers research 7d ago

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Abstract Multimodal large language models exhibit social bias driven by specific visual attributes, with fashion style and socioeconomic cues having the greatest impact on model judgments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) are…

37
Hugging Face Daily Papers research 7d ago

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Abstract GeneralVLA-2 addresses limitations in vision-language-action systems by introducing GeoFuse-MV3D for improved 3D reconstruction and an enhanced KnowledgeBank for better memory management in robotic manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

32
r/LocalLLaMA community 8d ago

Best local model for vision - 2nd benchmark update - 21 Jun 2026

I previously posted the first results of my VLM benchmark . There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it…

9
