Tag

Multimodal

500 articles archived under #multimodal · RSS

Hugging Face Daily Papers research 26d ago

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Abstract Research reveals significant disparities between text and image generation capabilities in multimodal models, with effective textual knowledge editing not transferring reliably to visual output, necessitating modality-aware editing approaches. Generated by…

9
r/MachineLearning community 26d ago

Repo for implementations of various Transformer Attn mechanisms [P]

Initially, I developed this so I can easily switch between different Attention mechanisms for my Small Language Model (SLM) experiments and benchmarking. However, I also realized that these implementations can be applicable in Computer Vision, modernize Vision Encoders, RL, and…

14
Hugging Face Daily Papers research 26d ago

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Abstract OVO-S-Bench presents a comprehensive benchmark for evaluating streaming spatial intelligence in multimodal language models through human-annotated questions spanning multiple abstraction levels. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal agents in robotics,…

23
arXiv — Machine Learning research 26d ago

Self-Distilled Policy Gradient

arXiv:2606.04036v1 Announce Type: new Abstract: On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be…

33
arXiv — Machine Learning research 26d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not…

8
arXiv — Machine Learning research 26d ago

A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support

arXiv:2606.04209v1 Announce Type: new Abstract: Counterfactual explanations seek small, semantically meaningful changes to an input that alter a model's prediction, and are widely used to interpret and audit machine learning systems. In modern vision, language, and multimodal…

13
arXiv — NLP / Computation & Language research 26d ago

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

arXiv:2606.04231v1 Announce Type: new Abstract: Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often…

24
arXiv — NLP / Computation & Language research 26d ago

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We…

27
arXiv — NLP / Computation & Language research 26d ago

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

arXiv:2606.04596v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the…

34
arXiv — NLP / Computation & Language research 26d ago

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

arXiv:2606.04719v1 Announce Type: new Abstract: The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this…

7
arXiv — NLP / Computation & Language research 26d ago

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

arXiv:2606.04046v1 Announce Type: cross Abstract: In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at…

31
arXiv — NLP / Computation & Language research 26d ago

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

arXiv:2606.04240v1 Announce Type: cross Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The…

7
arXiv — NLP / Computation & Language research 26d ago

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv:2606.04244v1 Announce Type: cross Abstract: Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when…

7
arXiv — NLP / Computation & Language research 26d ago

Video2LoRA: Parametric Video Internalization for Vision-Language Models

arXiv:2606.04351v1 Announce Type: cross Abstract: Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video…

13
arXiv — NLP / Computation & Language research 26d ago

Stateful Visual Encoders for Vision-Language Models

arXiv:2606.04433v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language…

32
Hugging Face Daily Papers research 26d ago

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Abstract MapAgent is an industrial-grade agentic architecture that combines vision-language processing with constraint-aware reasoning to produce specification-compliant lane maps, achieving high automation rates in large-scale urban mapping. Generated by…

21
Hugging Face Daily Papers research 26d ago

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Abstract Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence…

28
Hugging Face Daily Papers research 26d ago

WALL-WM: Carving World Action Modeling at the Event Joints

Abstract WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference. Generated by Qwen/Qwen2.5-Coder-32B-Instruct WALL-WM is a World Action…

32
r/LocalLLaMA community 26d ago

How to use audio and vision modalities in llama.cpp?

How to use audio and vision modalities in llama.cpp with Gemma4 12B it? I’m on release b9494, but when I run llama-cli it shows “modalities: text” only, and crashes if I try to add an image.   submitted by   /u/No-Leave-4512 [link]   [comments]

20
llama.cpp releases dev-tools 26d ago

b9494

mtmd: enable non-causal vision for gemma 4 unified ( #24082 ) macOS/iOS: macOS Apple Silicon (arm64) macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED macOS Intel (x64) iOS XCFramework Linux: Ubuntu x64 (CPU) Ubuntu arm64 (CPU) Ubuntu s390x (CPU) Ubuntu x64 (Vulkan) Ubuntu…

22
r/LocalLLaMA community 26d ago

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

  submitted by   /u/johnnyApplePRNG [link]   [comments]

4
Hugging Face Daily Papers research 26d ago

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Abstract YOLO26 addresses real-time vision challenges through a unified model family with NMS-free inference, improved training strategies, and multi-task capabilities spanning detection, segmentation, and pose estimation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Real-time…

28
Hacker News — AI on Front Page community 26d ago

Gemma 4 12B: A unified, encoder-free multimodal model

Article URL: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/ Comments URL: https://news.ycombinator.com/item?id=48385906 Points: 263 # Comments: 95

28
Maarten Grootendorst research 26d ago

A Visual Guide to Gemma 4 12B

An in-depth explainer to Gemma 4 12B; a unified, encoder-free multimodal model!

38
r/LocalLLaMA community 26d ago

google/gemma-4-12B · Hugging Face

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned…

29
Hugging Face Daily Papers research 26d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Abstract Researchers identify a perceptual judgment bias in multimodal large language models where visual evidence is overlooked for textual plausibility, and propose a training framework using a perturbed dataset and reward modeling to improve perceptual fidelity and evaluation…

18
Stratechery (Ben Thompson) community 27d ago

The Nvidia AI PC, Project Solara, Microsoft AI

The Nvidia AI PC feels like a relic of another AI era; Microsoft's vision for devices at Build was much more compelling.

5
r/LocalLLaMA community 27d ago

Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes)

from Hcompany (which seems to be a French company): Holo3.1: Fast & Local Computer Use Agents Model Description Holo3.1 is our latest family of Vision-Language Models (VLMs) for computer use agents. Building on Holo3, it expands support beyond browser and desktop automation to…

25
Hugging Face Daily Papers research 27d ago

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Abstract Controlled concrete reasoning combines visual simulation with abstract reasoning through a training method that uses privileged future information to improve prediction accuracy and robustness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct World models and multimodal…

19
arXiv — Machine Learning research 27d ago

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

arXiv:2606.02605v1 Announce Type: new Abstract: Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Although coronary (X-ray) angiograms remain the standard for stenosis diagnosis, they are invasive,…

24
arXiv — Machine Learning research 27d ago

CL-DMDF:Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

arXiv:2606.02659v1 Announce Type: new Abstract: Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complementary patterns, thereby enhancing data processing and decision-making. While existing methods…

5
arXiv — Machine Learning research 27d ago

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

arXiv:2606.02679v1 Announce Type: new Abstract: Multimodal systems often benefit from combining information across language, sound, and visual streams, but this benefit is not guaranteed. A modality that is useful for one input may become distracting for another, and local…

24
arXiv — NLP / Computation & Language research 27d ago

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

arXiv:2606.02684v1 Announce Type: cross Abstract: On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which…

33
arXiv — Machine Learning research 27d ago

QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models

arXiv:2606.02785v1 Announce Type: new Abstract: Large machine learning models benefit substantially from multimodal inputs that provide a complementary view of the same example. We introduce QUIVER (QUantum-Informed Views for Enhanced Representations, a paradigm that enriches…

21
arXiv — Machine Learning research 27d ago

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

arXiv:2606.02842v1 Announce Type: new Abstract: Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address…

6
arXiv — Machine Learning research 27d ago

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

arXiv:2606.02947v1 Announce Type: new Abstract: Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing…

17
arXiv — Machine Learning research 27d ago

Constitutional On-Policy Safe Distillation

arXiv:2606.03089v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in…

15
arXiv — Machine Learning research 27d ago

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

arXiv:2606.03118v1 Announce Type: new Abstract: Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion…

38
arXiv — NLP / Computation & Language research 27d ago

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

arXiv:2606.02971v1 Announce Type: new Abstract: Extracting reporting obligations from EU legislation is critical for assessing and reducing regulatory reporting burden. However, distinguishing reporting requirements from structurally similar provisions requires specialised legal…

10
arXiv — NLP / Computation & Language research 27d ago

Coherence Maximization Improves Pluralistic Alignment

arXiv:2606.03110v1 Announce Type: new Abstract: Aligning AI systems with diverse human values requires value specifications grounded in concrete examples, but generating such examples without extensive human supervision remains an open challenge. We investigate what makes these…

16
arXiv — NLP / Computation & Language research 27d ago

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

arXiv:2606.03371v1 Announce Type: new Abstract: Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See--Infer--Intervene (SII) framework,…

17
arXiv — NLP / Computation & Language research 27d ago

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

arXiv:2606.03604v1 Announce Type: new Abstract: When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal…

37
arXiv — NLP / Computation & Language research 27d ago

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

arXiv:2606.03693v1 Announce Type: new Abstract: Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an…

10
arXiv — NLP / Computation & Language research 27d ago

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

arXiv:2606.03793v1 Announce Type: new Abstract: Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric…

19
arXiv — NLP / Computation & Language research 27d ago

VESTA: Visual Exploration with Statistical Tool Agents

arXiv:2606.00384v1 Announce Type: cross Abstract: Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and…

10
arXiv — NLP / Computation & Language research 27d ago

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

arXiv:2606.02812v1 Announce Type: cross Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but…

38
MIT News — AI research 27d ago

MIT researchers teach AI models to interpret charts

The new ChartNet training dataset could improve the accuracy of vision-language models that help analyze business trends or interpret scientific figures.

29
Hugging Face Daily Papers research 27d ago

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

Abstract A training-free framework for embodied navigation that uses a vision-only approach to create semantic maps and ground language goals through blind matching without paired vision-language data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Embodied visual navigation,…

12
Hugging Face Daily Papers research 27d ago

Benchmarking Visual State Tracking in Multimodal Video Understanding

Abstract Current multimodal large language models struggle with visual state tracking in videos, performing poorly even when human-level capabilities are required, and existing agentic approaches do not effectively address these limitations. Generated by…

6
Hugging Face Daily Papers research 27d ago

Trust Region On-Policy Distillation

Abstract Trust Region On-Policy Distillation (TrOPD) improves reliable token-level supervision in large language model distillation by using trust regions, outlier estimation, and off-policy guidance to address instability issues under distribution mismatch. Generated by…

9

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Repo for implementations of various Transformer Attn mechanisms [P]

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Self-Distilled Policy Gradient

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support

MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Stateful Visual Encoders for Vision-Language Models

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

WALL-WM: Carving World Action Modeling at the Event Joints

How to use audio and vision modalities in llama.cpp?

b9494

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Gemma 4 12B: A unified, encoder-free multimodal model

A Visual Guide to Gemma 4 12B

google/gemma-4-12B · Hugging Face

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

The Nvidia AI PC, Project Solara, Microsoft AI

Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes)

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

CL-DMDF:Dynamic Multimodal Data Fusion Model Based on Contrastive Learning

Before Fusion, Ask What to Keep: Contextual Calibration of Multimodal Signals

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

QUIVER: Quantum-Informed Views for Enhanced Representations in Large ML Models

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

Constitutional On-Policy Safe Distillation

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction

Coherence Maximization Improves Pluralistic Alignment

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

VESTA: Visual Exploration with Statistical Tool Agents

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

MIT researchers teach AI models to interpret charts

PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps

Benchmarking Visual State Tracking in Multimodal Video Understanding

Trust Region On-Policy Distillation