Tag

Multimodal

500 articles archived under #multimodal

arXiv — NLP / Computation & Language research 14d ago

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

arXiv:2606.16137v1 Announce Type: new Abstract: Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution,…

31
arXiv — NLP / Computation & Language research 14d ago

PaperJury: Due-Process Review for Bounded LaTeX Revision

arXiv:2606.16322v1 Announce Type: new Abstract: Pre-submission hardening of human-authored LaTeX computer science papers differs from drafting assistance because it requires adversarial whole-paper review, explicit no-fix outcomes, and bounded artifact-safe revision. Existing…

15
arXiv — NLP / Computation & Language research 14d ago

TMASC: Transmasculine Attitude and Speech Corpus

arXiv:2606.16351v1 Announce Type: new Abstract: We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the…

25
arXiv — NLP / Computation & Language research 14d ago

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

arXiv:2606.16494v1 Announce Type: new Abstract: Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In…

30
Hugging Face Daily Papers research 14d ago

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Abstract Retrieval-augmented vision-language-action policies eliminate per-task fine-tuning costs by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

26
Hugging Face Daily Papers research 14d ago

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Abstract UniDDT addresses key challenges in unified multimodal models by leveraging a Noisy ViT encoder and LLM for semantic encoding while using separate diffusion decoders to balance visual understanding and generation tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

12
Hugging Face Daily Papers research 14d ago

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Abstract VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision language models are serving as…

32
r/MachineLearning community 14d ago

Embedded/edge ML folks: what actually eats the most time, getting data, or cleaning/labeling it (time series sensor data, not computer vision/audio)? [D]

I'm trying to understand where people doing sensor-based ML on microcontrollers (IMU, accelerometer, vibration, that kind of time-series data) actually lose the most time. When you've built something like this, what was the bottleneck: Getting enough real-world data in the first…

6
NVIDIA Developer Blog official-blog 14d ago

Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models

Quick glossary for readers new to VLA/WAM terminology VLA Vision-Language-Action model: a robot policy that starts from a pretrained VLM backbone and adapts it...

22
arXiv — Machine Learning research 15d ago

SpikF-GO: Spiking Fourier Graph Operators for Multivariate Time Series Forecasting

arXiv:2606.13901v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) have emerged as an energy-efficient alternative to conventional neural networks, demonstrating strong performance in computer vision and robotics. More recently, SNNs have been applied to time series…

30
arXiv — Machine Learning research 15d ago

Context-aware Modality-Topology Co-Alignment for Multimodal Attributed Graphs

arXiv:2606.14172v1 Announce Type: new Abstract: Multimodal Attributed Graphs (MAGs) model real-world entities by coupling graph topology with heterogeneous attributes such as text and images. They support graph-centric tasks requiring structural and class-discriminative…

13
arXiv — Machine Learning research 15d ago

LapidaryEngine: Fully Conversational Crystal Generation

arXiv:2606.14215v1 Announce Type: new Abstract: The emergence of Large Language Models (LLMs) has inspired the vision of generating bespoke crystal materials directly from natural-language instructions, enabling users to design materials through intuitive, conversational…

35
arXiv — NLP / Computation & Language research 15d ago

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

arXiv:2606.14691v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving…

34
arXiv — NLP / Computation & Language research 15d ago

Multimodal Speaker Identification in Classroom Environments

arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework…

24
arXiv — NLP / Computation & Language research 15d ago

Diffusion-Refined Segmentation and Vision-Language Interpretation for Pediatric Brain Tumor MRI

arXiv:2606.14072v1 Announce Type: cross Abstract: Accurate pediatric brain tumor segmentation remains challenging due to limited annotated data, heterogeneous imaging phenotypes, diffuse tumor boundaries, and class imbalance across tumor subregions. Here, we present a two-stage…

19
arXiv — NLP / Computation & Language research 15d ago

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

arXiv:2606.14697v1 Announce Type: cross Abstract: Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where…

4
arXiv — NLP / Computation & Language research 15d ago

Gaze Heads: How VLMs Look at What They Describe

arXiv:2606.14703v1 Announce Type: cross Abstract: How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone,…

18
arXiv — NLP / Computation & Language research 15d ago

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

arXiv:2502.10886v3 Announce Type: replace Abstract: Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We…

23
r/LocalLLaMA community 17d ago

when fable gets banned but it's ok because you're about to download qwen3.7_67b_21a_mythos_father_fable_mother_distilled_ablated_ablitereted_uncensored_agi_sparse_attention_MTP_SuperHOT_q6_maybe_q7_AGI_FINAL.gguf from huggingface

submitted by /u/visionsmemories

9
r/LocalLLaMA community 17d ago

Vista 9B/4B from inclusionAI

VISTA-9B VISTA-9B are GUI-grounding vision-language models trained from Qwen3.5 9B backbones with VISTA: View-Consistent Self-Verified Training for GUI Grounding. Model Description VISTA-9B is a GUI-grounding model that maps a screenshot and a natural-language instruction to a…

30
NVIDIA Developer Blog official-blog 17d ago

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

25
r/MachineLearning community 17d ago

Just thinking, what about conducting a 1 day virtual session on fundamentals of computer vision ??? [D]

Hi all, A real story from my current experience: I'm associated with an internship where the primary work revolves around autonomous UAVs. What has shocked me the most is that almost everyone is so heavily focused on coding agents and AI tools that they're building things…

17
Hugging Face Daily Papers research 17d ago

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Abstract ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models (MLLMs)…

30
Hugging Face Daily Papers research 17d ago

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Abstract Structured Defect Grounding (SDG) addresses limitations in text-to-image model diagnosis by modeling defects as structured sets and using vision-language models for detection and reward-based alignment. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Despite generating…

22
Hugging Face Daily Papers research 18d ago

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Abstract A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal image fusion…

33
Hugging Face Daily Papers research 18d ago

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Abstract HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, addressing spatiotemporal reconstruction and semantic awareness through causal temporal attention and hierarchical compression. Generated by…

32
Hugging Face Daily Papers research 18d ago

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Abstract VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce…

5
Hugging Face Daily Papers research 18d ago

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Abstract LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning. Generated by…

18
arXiv — NLP / Computation & Language research 18d ago

Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

arXiv:2606.12716v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of…

8
arXiv — NLP / Computation & Language research 18d ago

No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

arXiv:2606.13044v1 Announce Type: new Abstract: As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more…

32
arXiv — NLP / Computation & Language research 18d ago

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

arXiv:2606.13216v1 Announce Type: new Abstract: Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We…

10
arXiv — NLP / Computation & Language research 18d ago

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

arXiv:2606.13572v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource…

20
arXiv — NLP / Computation & Language research 18d ago

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

arXiv:2606.13578v1 Announce Type: new Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols,…

22
arXiv — NLP / Computation & Language research 18d ago

ProPlay: Procedural World Models for Self-Evolving LLM Agents

arXiv:2606.12780v1 Announce Type: cross Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and…

33
arXiv — NLP / Computation & Language research 18d ago

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

arXiv:2606.12898v1 Announce Type: cross Abstract: Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC…

25
arXiv — NLP / Computation & Language research 18d ago

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

arXiv:2606.13288v1 Announce Type: cross Abstract: Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words"…

38
Hugging Face Daily Papers research 18d ago

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Abstract Robust-U1 enhances multimodal large language models' robustness against visual corruptions through self-recovery capabilities that improve both visual quality and reasoning performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models…

4
Hugging Face Daily Papers research 18d ago

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Abstract SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks. Generated by…

36
Vercel — AI dev-tools 18d ago

Kimi K2.7 Code now available on AI Gateway

Kimi K2.7 Code from Moonshot AI is now available on AI Gateway. K2.7 Code is a coding model built for long-horizon programming tasks, generalizing across scenarios including frontend development, DevOps, and performance optimization. The model has a native multimodal…

12
Hugging Face Daily Papers research 18d ago

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use…

10
r/LocalLLaMA community 18d ago

Where are we with computer-control harnesses?

Seems like local vision language models are getting smart enough so that it would be useful to hand them the cursor in a secure sandbox. What harnesses are available that can do this? edit: oh my fucking God something about this post triggered all of the bots to come out…

27
Hugging Face Daily Papers research 18d ago

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Abstract DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks. Generated by…

12
Hugging Face Daily Papers research 18d ago

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Abstract Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages. Generated by…

16
Hugging Face Daily Papers research 18d ago

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Abstract ART enables parameter-efficient fine-tuning of frozen multimodal language models by optimizing raw visual input through gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs. Generated by…

8
Hugging Face Daily Papers research 19d ago

Distilling LLM Feedback for Lean Theorem Proving

Abstract Feedback Distillation improves post-training of reasoning models by using self-distillation with token-level supervision and privileged feedback from language models, offering better diversity and complementary benefits when combined with GRPO. Generated by…

38
arXiv — Machine Learning research 19d ago

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the…

34
arXiv — Machine Learning research 19d ago

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

arXiv:2606.11382v1 Announce Type: new Abstract: Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases,…

20
arXiv — Machine Learning research 19d ago

Information-Theoretic Decomposition for Multimodal Interaction Learning

arXiv:2606.11614v1 Announce Type: new Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit…

13
arXiv — Machine Learning research 19d ago

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic…

29
arXiv — Machine Learning research 19d ago

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

arXiv:2606.11709v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution.…

4
