Tag: Multimodal · 500 articles archived under #multimodal arXiv — NLP / Computation & Language research 14d ago XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models arXiv:2606.16137v1 Announce Type: new Abstract: Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution,… 31 arXiv — NLP / Computation & Language research 14d ago PaperJury: Due-Process Review for Bounded LaTeX Revision arXiv:2606.16322v1 Announce Type: new Abstract: Pre-submission hardening of human-authored LaTeX computer science papers differs from drafting assistance because it requires adversarial whole-paper review, explicit no-fix outcomes, and bounded artifact-safe revision. Existing… 15 arXiv — NLP / Computation & Language research 14d ago TMASC: Transmasculine Attitude and Speech Corpus arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the… 25 arXiv — NLP / Computation & Language research 14d ago Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering arXiv:2606.16494v1 Announce Type: new Abstract: Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. 
In… 30 Hugging Face Daily Papers research 14d ago Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time Abstract Retrieval-augmented vision-language-action policies eliminate per-task fine-tuning costs by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 26 Hugging Face Daily Papers research 14d ago UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer Abstract UniDDT addresses key challenges in unified multimodal models by leveraging a Noisy ViT encoder and LLM for semantic encoding while using separate diffusion decoders to balance visual understanding and generation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 12 Hugging Face Daily Papers research 14d ago VisualClaw: A Real-Time, Personalized Agent for the Physical World Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as… 32 r/MachineLearning community 14d ago Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D] I'm trying to understand where people doing sensor based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real world data in the first… 6 NVIDIA Developer Blog official-blog 14d ago Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models Quick glossary for readers new to VLA/WAM terminology VLA Vision-Language-Action model: a robot policy that starts from a pretrained VLM backbone and adapts it... 
22 arXiv — Machine Learning research 15d ago SpikF-GO: Spiking Fourier Graph Operators for Multivariate Time Series Forecasting arXiv:2606.13901v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) have emerged as an energy-efficient alternative to conventional neural networks, demonstrating strong performance in computer vision and robotics. More recently, SNNs have been applied to time series… 30 arXiv — Machine Learning research 15d ago Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative… 13 arXiv — Machine Learning research 15d ago LapidaryEngine: Fully Conversational Crystal Generation arXiv:2606.14215v1 Announce Type: new Abstract: The emergence of Large Language Models (LLMs) has inspired the vision of generating bespoke crystal materials directly from natural-language instructions, enabling users to design materials through intuitive, conversational… 35 arXiv — NLP / Computation & Language research 15d ago CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving… 34 arXiv — NLP / Computation & Language research 15d ago Multimodal Speaker Identification in Classroom Environments arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. 
This study evaluates a multimodal speaker identification framework… 24 arXiv — NLP / Computation & Language research 15d ago Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI arXiv:2606.14072v1 Announce Type: cross Abstract: Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage… 19 arXiv — NLP / Computation & Language research 15d ago ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where… 4 arXiv — NLP / Computation & Language research 15d ago Gaze Heads: How VLMs Look at What They Describe arXiv:2606.14703v1 Announce Type: cross Abstract: How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone,… 18 arXiv — NLP / Computation & Language research 15d ago MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. 
We… 23 r/LocalLLaMA community 17d ago when fable gets banned but it's ok because you've about to download qwen3.7_67b_21a_mythos_father_fable_mother_distilled_ablated_ablitereted_uncensored_agi_sparse_attention_MTP_SuperHOT_q6_maybe_q7_AGI_FINAL.gguf from huggingface submitted by /u/visionsmemories 9 r/LocalLLaMA community 17d ago Vista 9B/4B from inclusionAI VISTA-9B VISTA-9B are GUI-grounding vision-language models trained from Qwen3.5 9B backbones with VISTA: View-Consistent Self-Verified Training for GUI Grounding . Model Description VISTA-9B is a GUI-grounding model that maps a screenshot and a natural-language instruction to a… 30 NVIDIA Developer Blog official-blog 17d ago Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and... 25 r/MachineLearning community 17d ago Just thinking, what about conducting a 1 day virtual session on fundamentals of computer vision ??? [D] Hi all, A real story from my current experience: I'm associated with an internship where the primary work revolves around autonomous UAVs. What has shocked me the most is that almost everyone is so heavily focused on coding agents and AI tools that they're building things… 17 Hugging Face Daily Papers research 17d ago ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages Abstract ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models (MLLMs)… 30 Hugging Face Daily Papers research 17d ago Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating… 22 Hugging Face Daily Papers research 18d ago From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion Abstract A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal image fusion… 33 Hugging Face Daily Papers research 18d ago HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers Abstract HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression. Generated by… 32 Hugging Face Daily Papers research 18d ago VideoMDM: Towards 3D Human Motion Generation From 2D Supervision Abstract VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce… 5 Hugging Face Daily Papers research 18d ago LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories Abstract LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning. Generated by… 18 arXiv — NLP / Computation & Language research 18d ago Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review arXiv:2606.12716v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of… 8 arXiv — NLP / Computation & Language research 18d ago No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions arXiv:2606.13044v1 Announce Type: new Abstract: As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more… 32 arXiv — NLP / Computation & Language research 18d ago Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization arXiv:2606.13216v1 Announce Type: new Abstract: Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. 
We… 10 arXiv — NLP / Computation & Language research 18d ago ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages arXiv:2606.13572v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource… 20 arXiv — NLP / Computation & Language research 18d ago LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories arXiv:2606.13578v1 Announce Type: new Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols,… 22 arXiv — NLP / Computation & Language research 18d ago ProPlay: Procedural World Models for Self-Evolving LLM Agents arXiv:2606.12780v1 Announce Type: cross Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and… 33 arXiv — NLP / Computation & Language research 18d ago Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension arXiv:2606.12898v1 Announce Type: cross Abstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. 
Yet existing VTC… 25 arXiv — NLP / Computation & Language research 18d ago Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality arXiv:2606.13288v1 Announce Type: cross Abstract: Contrastively trained vision-language models like CLIP, have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words"… 38 Hugging Face Daily Papers research 18d ago Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding? Abstract Robust-U1 enhances multimodal large language models' robustness against visual corruptions through self-recovery capabilities that improve both visual quality and reasoning performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models… 4 Hugging Face Daily Papers research 18d ago SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning Abstract SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks. Generated by… 36 Vercel — AI dev-tools 18d ago Kimi K2.7 Code now available on AI Gateway Kimi K2.7 Code from Moonshot AI is now available on AI Gateway . K2.7 Code is a coding model built for long-horizon programming tasks, generalizing across scenarios including frontend development, DevOps, and performance optimization. The model has a native multimodal… 12 Hugging Face Daily Papers research 18d ago ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use… 10 r/LocalLLaMA community 18d ago Where are we with computer-control harnesses? Seems like local vision language models are getting smart enough so that it would be useful to hand them the cursor in a secure sandbox. What harnesses are available that can do this? edit: oh my fucking God something about this post triggered all of the bots to come out… 27 Hugging Face Daily Papers research 18d ago DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models Abstract DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks. Generated by… 12 Hugging Face Daily Papers research 18d ago Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models Abstract Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages. Generated by… 16 Hugging Face Daily Papers research 18d ago Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training Abstract ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs. Generated by… 8 Hugging Face Daily Papers research 19d ago Distilling LLM Feedback for Lean Theorem Proving Abstract Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO. 
Generated by… 38 arXiv — Machine Learning research 19d ago Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the… 34 arXiv — Machine Learning research 19d ago GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases,… 20 arXiv — Machine Learning research 19d ago Information-Theoretic Decomposition for Multimodal Interaction Learning arXiv:2606.11614v1 Announce Type: new Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit… 13 arXiv — Machine Learning research 19d ago IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. 
While existing works have explored various reward designs to improve agentic… 29 arXiv — Machine Learning research 19d ago RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation arXiv:2606.11709v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution.… 4