Tag: Multimodal · 500 articles archived under #multimodal arXiv — Machine Learning research 22d ago MacArena: Benchmarking Computer Use Agents on an Online macOS Environment arXiv:2606.06560v1 Announce Type: new Abstract: Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have advanced rapidly, driven in part by standardized online evaluation benchmarks such as OSWorld,… 37 arXiv — Machine Learning research 22d ago Federated Foundation Models over Vehicular Networks arXiv:2606.06786v1 Announce Type: new Abstract: This paper presents a forward-looking vision for integrating the emerging multi-modal multi-task federated foundation models (M3T FedFMs) into vehicular networks, with the goal of unifying the expressive power of multi-modal… 34 arXiv — Machine Learning research 22d ago From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing arXiv:2606.06924v1 Announce Type: new Abstract: Existing LLM routing methods typically treat a model's single response to a query as its capability label for training routers. However, because LLM generation is inherently stochastic, such single-shot supervision provides only a… 9 arXiv — Machine Learning research 22d ago GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios arXiv:2606.06967v1 Announce Type: new Abstract: Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing… 24 arXiv — Machine Learning research 22d ago A robust PPG foundation model using multimodal physiological supervision arXiv:2606.07365v1 Announce Type: new Abstract: Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical settings.
Recent PPG foundation models either use open-source ICU datasets with pretraining… 8 arXiv — NLP / Computation & Language research 22d ago M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions arXiv:2606.07402v1 Announce Type: new Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic… 19 arXiv — NLP / Computation & Language research 22d ago Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification arXiv:2606.07479v1 Announce Type: new Abstract: Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially… 4 arXiv — NLP / Computation & Language research 22d ago HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec arXiv:2606.06743v1 Announce Type: cross Abstract: The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main… 21 arXiv — NLP / Computation & Language research 22d ago Textual Supervision Enhances Geospatial Representations in Vision-Language Models arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. 
In this work, we analyze the geospatial representations… 18 arXiv — NLP / Computation & Language research 22d ago TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance.… 6 arXiv — NLP / Computation & Language research 22d ago MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple… 14 arXiv — NLP / Computation & Language research 22d ago SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding arXiv:2510.26615v4 Announce Type: replace Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer… 27 arXiv — NLP / Computation & Language research 22d ago Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation arXiv:2601.06600v4 Announce Type: replace Abstract: Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. 
While Multimodal Large Language Models (MLLMs) have demonstrated impressive… 24 Hugging Face Daily Papers research 22d ago Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators Abstract Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations. Generated by… 22 Hugging Face Daily Papers research 22d ago Watch, Remember, Reason: Human-View Video Understanding with MLLMs Abstract Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning. Generated by… 8 Hugging Face Daily Papers research 22d ago WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark Abstract WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct In real-world… 11 Hugging Face Daily Papers research 22d ago OpenSkill: Open-World Self-Evolution for LLM Agents Abstract OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Self-evolving… 30 OpenAI official-blog 22d ago Built to benefit everyone: our plan A vision for the future of AI, focusing on access, safety, and shared prosperity as OpenAI works to ensure AGI benefits everyone. 6 r/LocalLLaMA community 23d ago Gemma4 12B - Experiences? Anyone check out the new Gemma4 12B that dropped 3 days ago? 
Integrated vision and audio recognition, no mmpro needed plus tool use. Q4 quant is like 8gb RAM. Crazy fast and great quality for its size. No, it's not as good as a 27B or 31B. But it's damn close. Curious what… 24 Hacker News — AI on Front Page community 24d ago OpenCV 5 Is Here: The Biggest Leap in Years for Computer Vision Article URL: https://opencv.org/opencv-5/ Comments URL: https://news.ycombinator.com/item?id=48421858 Points: 227 # Comments: 38 33 Hugging Face Daily Papers research 24d ago BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding Abstract BRepCLIP enables multimodal representation learning for CAD models by aligning boundary representation geometry with language and image embeddings through contrastive pretraining, achieving superior retrieval and classification performance compared to point-based… 7 Hugging Face Daily Papers research 24d ago The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset Abstract KITScenes Multimodal dataset provides high-fidelity European driving data with comprehensive 3D maps and diverse urban environments for embodied AI research. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing autonomous driving datasets have enabled major progress,… 36 Hugging Face Daily Papers research 24d ago MAOAM: Unified Object and Material Selection with Vision-Language Models Abstract A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Selection is a core operation in… 11 r/LocalLLaMA community 24d ago model: Granite4 Vision by gabe-l-hart · Pull Request #23545 · ggml-org/llama.cpp Model Summary: Granite Vision 4.1 4B is a vision-language model (VLM) that delivers frontier-level performance on structured document extraction tasks — chart extraction, table extraction, and semantic key-value pair extraction — in a compact 4B parameter footprint, providing a… 33 Hugging Face Daily Papers research 24d ago Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models Abstract GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal… 30 Hugging Face Daily Papers research 24d ago AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding Abstract AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models. Generated by… 4 Hugging Face Daily Papers research 24d ago Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions Abstract LLM-based stance simulation exhibits context sensitivity when subjected to counterfactual revisions, with both text-only and multimodal approaches showing robust stance transitions across different polarization mechanisms. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 31 Hugging Face Daily Papers research 25d ago Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning Abstract Discrete-WAM introduces a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework. Generated by… 29 r/MachineLearning community 25d ago Are We Underestimating Small Edge AI Models?[D] A lot of recent discussion around Edge AI focuses on running increasingly larger local LLMs. Meanwhile modern smartphones already have enough compute for many practical computer vision tasks that don't require massive models at all. I recently built and released an Android… 7 Hugging Face Daily Papers research 25d ago MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding Abstract Mechanical engineering drawing understanding is improved through a specialized dataset and domain-specific model that outperforms existing baselines by leveraging multi-stage training and high-density visual question answering annotations. Generated by… 9 Hugging Face Daily Papers research 25d ago Multimodal Music Recommendation System using LLMs Abstract A multimodal framework for session-based music recommendation integrates audio, lyric, and semantic signals with LLM-based sequential reasoning to improve recommendation accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Music recommendation systems typically treat… 16 Hugging Face Daily Papers research 25d ago RobotValues: Evaluating Household Robots When Human Values Conflict Abstract RobotValues benchmark evaluates household robot planners in value-conflict scenarios, revealing that vision-language models exhibit default value preferences and struggle to override them when instructed to prioritize conflicting values. 
Generated by… 8 arXiv — Machine Learning research 25d ago Differentiable Efficient Operator Search arXiv:2606.05232v1 Announce Type: new Abstract: Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be… 33 arXiv — Machine Learning research 25d ago LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?") arXiv:2606.05497v1 Announce Type: new Abstract: Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for… 23 arXiv — Machine Learning research 25d ago AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents arXiv:2606.05597v1 Announce Type: new Abstract: Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present… 16 arXiv — Machine Learning research 25d ago On the training of physics-informed neural operators for solving parametric partial differential equations arXiv:2606.06164v1 Announce Type: new Abstract: Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By… 6 arXiv — NLP / Computation & Language research 25d ago MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models arXiv:2606.05177v1 Announce Type: new Abstract: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. 
We introduce MCBench, a benchmark with 1196 scenarios spanning four… 5 arXiv — NLP / Computation & Language research 25d ago When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories arXiv:2606.05414v1 Announce Type: new Abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level… 21 arXiv — NLP / Computation & Language research 25d ago PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models arXiv:2606.05744v1 Announce Type: new Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their… 5 arXiv — NLP / Computation & Language research 25d ago MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA arXiv:2606.05749v1 Announce Type: new Abstract: Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and… 10 arXiv — NLP / Computation & Language research 25d ago Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads arXiv:2606.05843v1 Announce Type: new Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. 
In… 36 arXiv — NLP / Computation & Language research 25d ago Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models arXiv:2606.05874v1 Announce Type: new Abstract: Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios… 23 arXiv — NLP / Computation & Language research 25d ago To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both.… 21 arXiv — NLP / Computation & Language research 25d ago FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays arXiv:2606.06271v1 Announce Type: new Abstract: While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision:… 25 Hugging Face Daily Papers research 25d ago LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing Abstract LoomVideo presents an efficient 5B-parameter unified architecture for video generation and editing that reduces computational overhead through novel conditioning mechanisms and multi-modal alignment techniques. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Developing… 33 Hugging Face Daily Papers research 25d ago Video2LoRA: Parametric Video Internalization for Vision-Language Models Abstract Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Processing… 7 Hugging Face official-blog 25d ago Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI Back to Articles Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI Enterprise + Article Published June 4, 2026 Upvote - Varun Singh varunsingh nvidia Isabel Hulseman ihulseman0220 nvidia Anuj Doshi andoshi nvidia Shyamala Prayaga sprayaga25… 6 Ollama releases dev-tools 25d ago v0.30.5-rc0: llama.cpp version update (#16511) Bump llama.cpp to b9509, which includes the upstream Gemma 4 12B multimodal projector fixes for the n_head=0 divide-by-zero crash seen on x86/CUDA/Linux/Windows. Fixes #16479 Fixes #16489 Fixes #16491 Fixes #16492 Fixes #16495 11 Hugging Face Daily Papers research 25d ago Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning Abstract Stable-Layers uses reinforcement learning with vision-language model feedback to improve layer decomposition without paired data, employing Flow-GRPO and LoRA adaptation for optimized policy training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We present… 38 Hugging Face Daily Papers research 26d ago SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes Abstract Vision-language models demonstrate strong performance on isolated spatial reasoning tasks but fail to maintain coherent spatial understanding and reliable actions during multi-turn interactive feedback in 3D environments. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 15