Tag

Multimodal

500 articles archived under #multimodal

arXiv — NLP / Computation & Language research 12d ago

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

arXiv:2606.18780v1 Announce Type: cross Abstract: Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains…

4
arXiv — NLP / Computation & Language research 12d ago

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

arXiv:2606.19327v1 Announce Type: cross Abstract: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain…

35
arXiv — NLP / Computation & Language research 12d ago

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on…

35
Hugging Face Daily Papers research 12d ago

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Abstract Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Graphical user interface…

38
Hugging Face Daily Papers research 12d ago

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical…

24
Hugging Face Daily Papers research 12d ago

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory…

23
Hugging Face official-blog 12d ago

Is it agentic enough? Benchmarking open models on your own tooling

Published June 18, 2026. Authors: Lysandre (lysandre), Nathan Habib (SaylorTwift), Pedro Cuenca (pcuenq). Benchmarking transformers revisions across different metrics. This is a…

26
r/MachineLearning community 12d ago

How do you analyze the relative "strength" of probes? [R]

This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. I found this old post on trying…

21
Hugging Face Daily Papers research 12d ago

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Abstract SAGA framework uses multimodal large language models to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Vision encoders for…

21
r/LocalLLaMA community 12d ago

Multilingual-Multimodal-NLP/LoopCoder-V2 · Hugging Face

GitHub : https://github.com/CSJianYang/LoopCoder arXiv : https://arxiv.org/abs/2606.18023 Full Paper PDF : https://arxiv.org/pdf/2606.18023 LoopCoder-V2 LoopCoder-v2 is a 7B instruction-tuned code model based on the Parallel Loop Transformer (PLT). The model studies test-time…

37
Hugging Face Daily Papers research 12d ago

Self-Evolving Visual Questioner

Abstract A vision-language model autonomously improves its question-generation capabilities through self-evolution, enhancing both question quality and answerer performance without external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Vision-language models (VLMs)…

10
Hugging Face Daily Papers research 12d ago

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Multimodal large language models (MLLMs) have demonstrated…

25
Hugging Face Daily Papers research 13d ago

Text-Vision Co-Instructed Image Editing

Abstract A unified text-visual image editing framework is presented that combines semantic intent from textual instructions with spatial guidance from visual prompts to achieve more precise and faithful image manipulation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Existing…

16
Hugging Face Daily Papers research 13d ago

Learning from the Self-future: On-policy Self-distillation for dLLMs

Abstract d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

29
arXiv — NLP / Computation & Language research 13d ago

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

arXiv:2606.17057v1 Announce Type: cross Abstract: Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue:…

30
arXiv — Machine Learning research 13d ago

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

arXiv:2606.17115v1 Announce Type: new Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based…

18
arXiv — Machine Learning research 13d ago

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

arXiv:2606.17118v1 Announce Type: new Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among PTQ methods, expert-level mixed-precision quantization has…

24
arXiv — NLP / Computation & Language research 13d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

arXiv:2606.17162v1 Announce Type: new Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn…

25
arXiv — NLP / Computation & Language research 13d ago

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

arXiv:2606.17213v1 Announce Type: new Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain,…

14
arXiv — NLP / Computation & Language research 13d ago

Are you speaking my languages? On spoken language adherence in multimodal LLMs

arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To…

9
arXiv — NLP / Computation & Language research 13d ago

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

arXiv:2606.17449v1 Announce Type: new Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation…

38
arXiv — NLP / Computation & Language research 13d ago

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction.…

19
arXiv — NLP / Computation & Language research 13d ago

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using…

24
arXiv — NLP / Computation & Language research 13d ago

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal…

15
arXiv — NLP / Computation & Language research 13d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

arXiv:2606.17389v1 Announce Type: cross Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that…

24
arXiv — NLP / Computation & Language research 13d ago

Vision-language models for chest radiography do not always need the image

arXiv:2606.17710v1 Announce Type: cross Abstract: Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that…

16
arXiv — NLP / Computation & Language research 13d ago

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning…

38
arXiv — NLP / Computation & Language research 13d ago

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

arXiv:2602.10384v4 Announce Type: replace Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix…

6
Hugging Face Daily Papers research 13d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local…

21
Hugging Face Daily Papers research 13d ago

MotionVLA: Vision-Language-Action Model for Humanoid Motion

Abstract A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches. Generated by…

11
Hugging Face Daily Papers research 13d ago

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Abstract A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks. Generated by…

6
Hugging Face Daily Papers research 13d ago

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and…

19
Hugging Face Daily Papers research 13d ago

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Abstract Temporal Difference in Vision (TDV) presents a novel self-supervised learning approach for video data that eliminates traditional inductive biases by leveraging causal relationships between past and future frames. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Progress in…

30
Hugging Face Daily Papers research 13d ago

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Abstract LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Vision-Language-Action models (VLAs)…

33
TechCrunch — AI news-outlet 13d ago

SpaceX to acquire Cursor for $60B in stock, days after blockbuster IPO

The deal is supposed to help SpaceX's struggling AI division. The company told IPO investors it sees a $26 trillion addressable market in AI.

21
Hugging Face Daily Papers research 14d ago

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Abstract A vision-language model operates continuously in real-time, making autonomous decisions about when to respond or delegate, enabling interactive systems that perceive and act upon environmental changes without user prompting. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

17
arXiv — Machine Learning research 14d ago

LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

arXiv:2606.15306v1 Announce Type: new Abstract: We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future…

19
arXiv — Machine Learning research 14d ago

DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising

arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment…

26
arXiv — Machine Learning research 14d ago

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

arXiv:2606.15427v1 Announce Type: new Abstract: Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight,…

23
arXiv — Machine Learning research 14d ago

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

arXiv:2606.15695v1 Announce Type: new Abstract: Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts.…

26
arXiv — Machine Learning research 14d ago

Unsupervised Learning for Missing Modalities in Multimodal Learning

arXiv:2606.15743v1 Announce Type: new Abstract: This paper addresses the missing-modality challenge in multi-modal learning by introducing Unsupervised Learning for Missing Modalities in Multi-Modal Learning (UL4M4), a flexible framework that imputes missing feature embeddings…

35
arXiv — NLP / Computation & Language research 14d ago

Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal…

22
arXiv — NLP / Computation & Language research 14d ago

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal…

10
arXiv — NLP / Computation & Language research 14d ago

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

arXiv:2606.15307v1 Announce Type: new Abstract: Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced…

21
arXiv — NLP / Computation & Language research 14d ago

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

arXiv:2606.15714v1 Announce Type: new Abstract: Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with…

12
arXiv — NLP / Computation & Language research 14d ago

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

arXiv:2606.15821v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link…

37
arXiv — NLP / Computation & Language research 14d ago

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial…

27
arXiv — NLP / Computation & Language research 14d ago

Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

arXiv:2606.15910v1 Announce Type: new Abstract: A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer…

38
arXiv — NLP / Computation & Language research 14d ago

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

arXiv:2606.15932v1 Announce Type: new Abstract: While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks…

7
arXiv — NLP / Computation & Language research 14d ago

Scaling Human and G2P Supervision for Robust Phonetic Transcription

arXiv:2606.16019v1 Announce Type: new Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We…

20
