Tag

Multimodal

500 articles archived under #multimodal

arXiv — Machine Learning research 22d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,…

37
arXiv — Machine Learning research 22d ago

Federated Foundation Models over Vehicular Networks

arXiv:2606.06786v1 Announce Type: new Abstract: This paper presents a forward-looking vision for integrating the emerging multi-modal multi-task federated foundation models (M3T FedFMs) into vehicular networks, with the goal of unifying the expressive power of multi-modal…

34
arXiv — Machine Learning research 22d ago

From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing

arXiv:2606.06924v1 Announce Type: new Abstract: Existing LLM routing methods typically treat a model's single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a…

9
arXiv — Machine Learning research 22d ago

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

arXiv:2606.06967v1 Announce Type: new Abstract: Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing…

24
arXiv — Machine Learning research 22d ago

A robust PPG foundation model using multimodal physiological supervision

arXiv:2606.07365v1 Announce Type: new Abstract: Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings. Recent PPG foundation models either use open-source ICU datasets with pretraining…

8
arXiv — NLP / Computation & Language research 22d ago

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic…

19
arXiv — NLP / Computation & Language research 22d ago

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

arXiv:2606.07479v1 Announce Type: new Abstract: Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially…

4
arXiv — NLP / Computation & Language research 22d ago

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main…

21
arXiv — NLP / Computation & Language research 22d ago

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations…

18
arXiv — NLP / Computation & Language research 22d ago

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.…

6
arXiv — NLP / Computation & Language research 22d ago

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple…

14
arXiv — NLP / Computation & Language research 22d ago

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

arXiv:2510.26615v4 Announce Type: replace Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer…

27
arXiv — NLP / Computation & Language research 22d ago

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

arXiv:2601.06600v4 Announce Type: replace Abstract: Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive…

24
Hugging Face Daily Papers research 22d ago

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Abstract Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations. Generated by…

22
Hugging Face Daily Papers research 22d ago

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Abstract Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning. Generated by…

8
Hugging Face Daily Papers research 22d ago

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world…

11
Hugging Face Daily Papers research 22d ago

OpenSkill: Open-World Self-Evolution for LLM Agents

Abstract OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Self-evolving…

30
OpenAI official-blog 22d ago

Built to benefit everyone: our plan

A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone.

6
r/LocalLLaMA community 23d ago

Gemma4 12B - Experiences?

Anyone check out the new Gemma4 12B that dropped 3 days ago? Integrated vision and audio recognition, no mmpro needed plus tool use. Q4 quant is like 8gb RAM. Crazy fast and great quality for its size. No, it's not as good as a 27B or 31B. But it's damn close. Curious what…

24
Hacker News — AI on Front Page community 24d ago

OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision

Article URL: https://opencv.org/opencv-5/ · Comments URL: https://news.ycombinator.com/item?id=48421858 · Points: 227 · Comments: 38

33
Hugging Face Daily Papers research 24d ago

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

Abstract BRepCLIP enables multimodal representation learning for CAD models by aligning boundary representation geometry with language and image embeddings through contrastive pretraining, achieving superior retrieval and classification performance compared to point-based…

7
Hugging Face Daily Papers research 24d ago

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

Abstract KITScenes Multimodal dataset provides high-fidelity European driving data with comprehensive 3D maps and diverse urban environments for embodied AI research. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing autonomous driving datasets have enabled major progress,…

36
Hugging Face Daily Papers research 24d ago

MAOAM: Unified Object and Material Selection with Vision-Language Models

Abstract A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Selection is a core operation in…

11
r/LocalLLaMA community 24d ago

model: Granite4 Vision by gabe-l-hart · Pull Request #23545 · ggml-org/llama.cpp

Model Summary: Granite Vision 4.1 4B is a vision-language model (VLM) that delivers frontier-level performance on structured document extraction tasks — chart extraction, table extraction, and semantic key-value pair extraction — in a compact 4B parameter footprint, providing a…

33
Hugging Face Daily Papers research 24d ago

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Abstract GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal…

30
Hugging Face Daily Papers research 24d ago

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Abstract AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models. Generated by…

4
Hugging Face Daily Papers research 24d ago

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

Abstract LLM-based stance simulation exhibits context sensitivity when subjected to counterfactual revisions, with both text-only and multimodal approaches showing robust stance transitions across different polarization mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

31
Hugging Face Daily Papers research 25d ago

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Abstract Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework. Generated by…

29
r/MachineLearning community 25d ago

Are We Underestimating Small Edge AI Models? [D]

A lot of recent discussion around Edge AI focuses on running increasingly larger local LLMs. Meanwhile modern smartphones already have enough compute for many practical computer vision tasks that don't require massive models at all. I recently built and released an Android…

7
Hugging Face Daily Papers research 25d ago

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

Abstract Mechanical engineering drawing understanding is improved through a specialized dataset and domain-specific model that outperforms existing baselines by leveraging multi-stage training and high-density visual question answering annotations. Generated by…

9
Hugging Face Daily Papers research 25d ago

Multimodal Music Recommendation System using LLMs

Abstract A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Music recommendation systems typically treat…

16
Hugging Face Daily Papers research 25d ago

RobotValues: Evaluating Household Robots When Human Values Conflict

Abstract RobotValues benchmark evaluates household robot planners in value-conflict scenarios, revealing that vision-language models exhibit default value preferences and struggle to override them when instructed to prioritize conflicting values. Generated by…

8
arXiv — Machine Learning research 25d ago

Differentiable Efficient Operator Search

arXiv:2606.05232v1 Announce Type: new Abstract: Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be…

33
arXiv — Machine Learning research 25d ago

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

arXiv:2606.05497v1 Announce Type: new Abstract: Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for…

23
arXiv — Machine Learning research 25d ago

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

arXiv:2606.05597v1 Announce Type: new Abstract: Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present…

16
arXiv — Machine Learning research 25d ago

On the training of physics-informed neural operators for solving parametric partial differential equations

arXiv:2606.06164v1 Announce Type: new Abstract: Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By…

6
arXiv — NLP / Computation & Language research 25d ago

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four…

5
arXiv — NLP / Computation & Language research 25d ago

When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

arXiv:2606.05414v1 Announce Type: new Abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level…

21
arXiv — NLP / Computation & Language research 25d ago

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

arXiv:2606.05744v1 Announce Type: new Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their…

5
arXiv — NLP / Computation & Language research 25d ago

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

arXiv:2606.05749v1 Announce Type: new Abstract: Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and…

10
arXiv — NLP / Computation & Language research 25d ago

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

arXiv:2606.05843v1 Announce Type: new Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In…

36
arXiv — NLP / Computation & Language research 25d ago

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios…

23
arXiv — NLP / Computation & Language research 25d ago

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.…

21
arXiv — NLP / Computation & Language research 25d ago

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

arXiv:2606.06271v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision:…

25
Hugging Face Daily Papers research 25d ago

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing…

33
Hugging Face Daily Papers research 25d ago

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Abstract Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Processing…

7
Hugging Face official-blog 25d ago

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

Published June 4, 2026. Authors: Varun Singh, Isabel Hulseman, Anuj Doshi, Shyamala Prayaga (NVIDIA)…

6
Ollama releases dev-tools 25d ago

v0.30.5-rc0: llama.cpp version update (#16511)

Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows. Fixes #16479 Fixes #16489 Fixes #16491 Fixes #16492 Fixes #16495

11
Hugging Face Daily Papers research 25d ago

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Abstract Stable-Layers uses reinforcement learning with vision-language model feedback to improve layer decomposition without paired data, employing Flow-GRPO and LoRA adaptation for optimized policy training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present…

38
Hugging Face Daily Papers research 26d ago

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Abstract Vision-language models demonstrate strong performance on isolated spatial reasoning tasks but fail to maintain coherent spatial understanding and reliable actions during multi-turn interactive feedback in 3D environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

15
