Tag: Multimodal. 500 articles archived under #multimodal. arXiv — NLP / Computation & Language research 4d ago GAVEL: Grounded Caption Error Verification and Localization arXiv:2606.26923v1 Announce Type: new Abstract: Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy… 24 arXiv — NLP / Computation & Language research 4d ago Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning arXiv:2606.27330v1 Announce Type: new Abstract: Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and… 8 arXiv — NLP / Computation & Language research 4d ago Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs arXiv:2606.26387v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text.
Despite inheriting strong reasoning capabilities from LLMs, they remain prone to… 19 arXiv — NLP / Computation & Language research 4d ago Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models arXiv:2606.26566v1 Announce Type: cross Abstract: Adversarial evaluation of AI systems has matured along four largely disconnected tracks: diffusion-based attacks on text and large language models (LLMs), diffusion-based attacks on image classifiers, jailbreak pipelines against… 18 arXiv — NLP / Computation & Language research 4d ago HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models arXiv:2606.27187v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in… 25 arXiv — NLP / Computation & Language research 4d ago GenRecal: Generation after Recalibration from Large to Small Vision-Language Models arXiv:2506.15681v4 Announce Type: replace Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios,… 16 Hugging Face Daily Papers research 4d ago ViQ: Text-Aligned Visual Quantized Representations at Any Resolution Abstract ViQ presents a visual quantization framework that balances semantic richness and detail preservation in discrete representations, enabling efficient multimodal training with native-resolution inputs. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct A unified representation… 26 Hugging Face Daily Papers research 4d ago OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning Abstract On-policy skill distillation framework extracts dense hindsight supervision from completed trajectories to improve language agent training efficiency and performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Outcome-based reinforcement learning provides a stable… 20 Hugging Face Daily Papers research 4d ago Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation Abstract A vision-language model-based hierarchical question graph framework evaluates video generation models' adherence to physical laws with granular violation detection and human correlation validation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Video generation models are… 23 Hugging Face Daily Papers research 4d ago MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation Abstract A novel-view video synthesis method that enhances motion-aware diffusion models through multi-view point tracking supervision to improve geometric consistency and motion fidelity. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Synthesizing a novel-view video from a… 37 Hugging Face Daily Papers research 4d ago ShutterMuse: Capture-Time Photography Guidance with MLLMs Abstract Researchers developed a new benchmark and dataset for photography assistance, along with a unified multimodal model that provides both composition guidance and pose recommendations during image capture. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-world photography… 12 arXiv — Machine Learning research 5d ago Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models arXiv:2606.24898v1 Announce Type: new Abstract: Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. 
This creates a basic supervision question: which state variables does cross-entropy actually… 37 arXiv — Machine Learning research 5d ago When Multi-Sensor Fusion Fails to Generalize: Cattle Posture Classification Under Animal-Level and Temporal Distribution Shift arXiv:2606.24986v1 Announce Type: new Abstract: Automated cattle posture-classification systems frequently report near-perfect accuracy, yet their robustness under realistic deployment conditions remains largely unknown. In particular, it is unclear whether multimodal sensor… 25 arXiv — Machine Learning research 5d ago Geo-Strat-RL: Learning Geological Event Reasoning from Verifiable Tasks arXiv:2606.25000v1 Announce Type: new Abstract: To evaluate whether vision-language models can reason about geological histories, it is necessary to construct observations for which the underlying process history is known. Furthermore, reasoning over geological histories is not… 6 arXiv — Machine Learning research 5d ago An iterative energy-based multimodal transformer for joint retrieval of wheat soil moisture, leaf area index, and plant height from Sentinel-1 and Sentinel-2 time series arXiv:2606.25174v1 Announce Type: new Abstract: Field-scale retrieval of surface soil moisture (SM), leaf area index (LAI), and plant height (PH) is essential for precision agriculture, yet it remains an ill-posed inverse problem. Concurrent variations in soil moisture and… 24 arXiv — NLP / Computation & Language research 5d ago Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making arXiv:2606.25421v1 Announce Type: new Abstract: Recent studies on world modeling for Large Language Model (LLM) agents typically formulate the learning objective as next-observation prediction. 
However, this objective ties supervision to what a transition happens to reveal,… 32 arXiv — NLP / Computation & Language research 5d ago PolicyAlign: Direct Policy-Based Safety Alignment for Large Language Models arXiv:2606.25442v1 Announce Type: new Abstract: Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often… 29 arXiv — NLP / Computation & Language research 5d ago SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models arXiv:2606.25990v1 Announce Type: new Abstract: As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations… 29 arXiv — NLP / Computation & Language research 5d ago Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models arXiv:2606.26079v1 Announce Type: new Abstract: Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI… 31 arXiv — NLP / Computation & Language research 5d ago Multilingual Hematology Visual Question Answering Dataset arXiv:2606.25246v1 Announce Type: cross Abstract: Vision Language Models (VLMs) have shown promising capabilities in medical image analysis by jointly understanding visual and textual information for tasks such as Visual Question Answering. 
However, existing hematology… 5 arXiv — NLP / Computation & Language research 5d ago Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet… 14 arXiv — NLP / Computation & Language research 5d ago How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations arXiv:2606.26041v1 Announce Type: cross Abstract: Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently… 29 Hugging Face Daily Papers research 5d ago Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do Abstract Multimodal Chain-of-Thought reasoning shows selective effectiveness across different tasks, with limitations in maintaining visual introspection during reasoning processes. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Chain-of-Thought (CoT) has become a standard method… 17 Hugging Face Daily Papers research 5d ago Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence Abstract This survey explores multimodal code intelligence systems that generate and reason with code based on visual inputs, categorizing approaches across GUI, scientific visualization, structured graphics, and emerging frameworks while identifying verification-centered… 25 Hugging Face Daily Papers research 5d ago IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation Abstract Implicit Visual Chain-of-Thought decomposes visual conditioning into structural and semantic cascades for improved structure-aware image generation with sketch supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Unified multi-modal large language models (MLLMs)… 7 Hugging Face Daily Papers research 5d ago Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models Abstract Wan-Streamer is a unified, end-to-end multimodal model that enables real-time audio-visual interaction through causal attention mechanisms and integrated processing of visual, audio, and text modalities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present… 20 Hugging Face Daily Papers research 5d ago InSight: Self-Guided Skill Acquisition via Steerable VLAs Abstract InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language-action (VLA) models can learn manipulation… 19 r/MachineLearning community 5d ago MuJoCo derived Simulator for High Fidelity Vision RL training natively on GPU [D] Hi everyone, For the past couple of weeks I have been working on a simulator project considering the shortcomings of MuJoCo. 
There are things that people like and also don't like about MuJoCo, like the CPU dependency on MuJoCo which makes the simulation not parallelizable beyond… 31 Hugging Face Daily Papers research 5d ago EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies Abstract EventVLA addresses long-horizon robotic manipulation challenges by introducing a sparse visual evidence memory framework with visual anchors and dynamic Keyframe Evidence Memory module for improved task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory… 23 Hugging Face Daily Papers research 5d ago FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation Abstract FLUX3D addresses limitations in image-to-3D Gaussian Splatting generation by improving representation learning and cross-modal alignment through specialized architectures and attention mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Sparse voxel representation… 34 arXiv — Machine Learning research 6d ago 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy arXiv:2606.23964v1 Announce Type: new Abstract: Self-supervised learning in fluorescence microscopy often relies on 2D projections, despite the inherently three-dimensional nature of cells. We present a systematic comparison of 2D and 3D masked autoencoders (MAE-2D vs. 
MAE-3D)… 34 arXiv — Machine Learning research 6d ago Verifiable Foundation Models for Robot Safety arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable… 4 arXiv — Machine Learning research 6d ago Prediction of Viscoelastic Droplet Impact Dynamics Using a Vision Transformer-Based Approach arXiv:2606.23940v1 Announce Type: cross Abstract: Droplet impact on solid surfaces is a complex fluid dynamics problem with applications in spray cooling, inkjet printing, and pharmaceutical processing. Although numerical simulations are widely used to investigate these… 24 arXiv — Machine Learning research 6d ago PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models arXiv:2606.24388v1 Announce Type: cross Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by… 38 arXiv — NLP / Computation & Language research 6d ago MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models arXiv:2606.24155v1 Announce Type: new Abstract: Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. 
We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language,… 38 arXiv — NLP / Computation & Language research 6d ago AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To… 15 arXiv — NLP / Computation & Language research 6d ago Mind the Heads: Topological Representation Alignment for Multimodal LLMs arXiv:2606.23885v1 Announce Type: cross Abstract: Representation alignment has emerged as an effective approach to improve Multimodal Large Language Models (MLLMs) by regularizing their internal representations toward those of an external vision encoder. However, existing… 17 arXiv — NLP / Computation & Language research 6d ago PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation arXiv:2606.24346v1 Announce Type: cross Abstract: Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale… 28 Hugging Face Daily Papers research 6d ago FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning Abstract FlowR2A addresses the tension in multimodal driving planning by combining dense reward supervision with dynamic proposal generation through a flow-matching decoder that learns reward-conditioned action distributions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 35 Hugging Face Daily Papers research 6d ago ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by… 29 Hugging Face Daily Papers research 6d ago VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct Abstract A novel framework called VeriEvol is introduced that addresses the challenge of scaling reinforcement learning for visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach that separates prompt difficulty from answer reliability,… 17 Hugging Face Daily Papers research 6d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18 Hugging Face Daily Papers research 6d ago BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language Abstract BioMatrix is a novel multimodal foundation model that integrates molecular sequences, structures, and natural language into a unified decoder-only architecture for diverse biological tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present BioMatrix, the first… 37 r/MachineLearning community 7d ago Just landed a Computer Vision internship, here's the preparation list I used [D] Hey everyone, I recently landed a Computer Vision internship after prepping with this checklist I put together. 
It starts with core math and ML fundamentals, then moves into the specialized CV topics that actually come up in interviews. I compressed it into just 7 days due to… 25 Hugging Face Daily Papers research 7d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19 Hugging Face Daily Papers research 7d ago UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation Abstract UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision Transformers (ViT) dominate computer… 6 Hugging Face Daily Papers research 7d ago PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models Abstract PolicyTrim is a reinforcement learning-based framework that enhances VLA model efficiency by extending reliable action chunk lengths and reducing redundant physical steps through dynamic exploration and redundancy-aware rewards. Generated by… 25 Hugging Face Daily Papers research 7d ago StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs Abstract Multimodal large language models exhibit social bias driven by specific visual attributes, with fashion style and socioeconomic cues having the greatest impact on model judgments. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) are… 37 Hugging Face Daily Papers research 7d ago GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning Abstract GeneralVLA-2 addresses limitations in vision-language-action systems by introducing GeoFuse-MV3D for improved 3D reconstruction and an enhanced KnowledgeBank for better memory management in robotic manipulation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 32 r/LocalLLaMA community 8d ago Best local model for vision - 2nd benchmark update - 21 Jun 2026 I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark: I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it… 9