Tag Multimodal 500 articles archived under #multimodal arXiv — NLP / Computation & Language research 12d ago SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction arXiv:2606.18780v1 Announce Type: cross Abstract: Multimodal Information Extraction (MIE)-covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE)-is essential for understanding multimedia content but remains… 4 arXiv — NLP / Computation & Language research 12d ago Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation arXiv:2606.19327v1 Announce Type: cross Abstract: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain… 35 arXiv — NLP / Computation & Language research 12d ago FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs arXiv:2601.13836v2 Announce Type: replace Abstract: Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on… 35 Hugging Face Daily Papers research 12d ago Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding Abstract Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Graphical user interface… 38 Hugging Face Daily Papers research 12d ago IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products Abstract IndustryBench-MIPU is introduced as the first large-scale benchmark for multi-image industrial product understanding, focusing on structured attribute extraction from heterogeneous product images to evaluate multimodal models' ability to recover dense technical… 24 Hugging Face Daily Papers research 12d ago Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games Abstract A new benchmark suite called RNG-Bench is introduced to evaluate multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions, featuring two games with controlled difficulty parameters and a memory… 23 Hugging Face official-blog 12d ago Is it agentic enough? Benchmarking open models on your own tooling Published June 18, 2026 · Lysandre (lysandre), Nathan Habib (SaylorTwift), Pedro Cuenca (pcuenq) · Benchmarking transformers revisions across different metrics This is a… 26 r/MachineLearning community 12d ago How do you analyze the relative "strength" of probes? [R] This question is related to topics like language+ models (including multimodal) and things like "circuit" analyses. I think something related might come up in my work (factuality guarantees for model outputs) and I'm trying to orient to the SoTA. 
I found this old post on trying… 21 Hugging Face Daily Papers research 12d ago Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings Abstract SAGA framework uses multimodal large language models to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision encoders for… 21 r/LocalLLaMA community 12d ago Multilingual-Multimodal-NLP/LoopCoder-V2 · Hugging Face GitHub : https://github.com/CSJianYang/LoopCoder arXiv : https://arxiv.org/abs/2606.18023 Full Paper PDF : https://arxiv.org/pdf/2606.18023 LoopCoder-V2 LoopCoder-v2 is a 7B instruction-tuned code model based on the Parallel Loop Transformer (PLT). The model studies test-time… 37 Hugging Face Daily Papers research 12d ago Self-Evolving Visual Questioner Abstract A vision-language model autonomously improves its question-generation capabilities through self-evolution, enhancing both question quality and answerer performance without external supervision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-language models (VLMs)… 10 Hugging Face Daily Papers research 12d ago Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning Abstract Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) have demonstrated… 25 Hugging Face Daily Papers research 13d ago Text-Vision Co-Instructed Image Editing Abstract A unified text-visual image editing framework is presented that combines semantic intent from textual instructions with spatial guidance from visual prompts to achieve more precise and faithful image manipulation. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Existing… 16 Hugging Face Daily Papers research 13d ago Learning from the Self-future: On-policy Self-distillation for dLLMs Abstract d-OPSD introduces a novel on-policy self-distillation framework for diffusion language models by adapting self-teacher construction and supervision mechanisms to match the non-autoregressive nature of diffusion models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 29 arXiv — NLP / Computation & Language research 13d ago Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs arXiv:2606.17057v1 Announce Type: cross Abstract: Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue:… 30 arXiv — Machine Learning research 13d ago Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis arXiv:2606.17115v1 Announce Type: new Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based… 18 arXiv — Machine Learning research 13d ago MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs arXiv:2606.17118v1 Announce Type: new Abstract: Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. 
Among PTQ methods, expert-level mixed-precision quantization has… 24 arXiv — NLP / Computation & Language research 13d ago MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision arXiv:2606.17162v1 Announce Type: new Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn… 25 arXiv — NLP / Computation & Language research 13d ago Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors arXiv:2606.17213v1 Announce Type: new Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain,… 14 arXiv — NLP / Computation & Language research 13d ago Are you speaking my languages? On spoken language adherence in multimodal LLMs arXiv:2606.17281v1 Announce Type: new Abstract: While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To… 9 arXiv — NLP / Computation & Language research 13d ago MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation arXiv:2606.17449v1 Announce Type: new Abstract: While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. 
Furthermore, existing mitigation… 38 arXiv — NLP / Computation & Language research 13d ago Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction.… 19 arXiv — NLP / Computation & Language research 13d ago The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports arXiv:2606.17791v1 Announce Type: new Abstract: AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using… 24 arXiv — NLP / Computation & Language research 13d ago Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation arXiv:2606.17188v1 Announce Type: cross Abstract: Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal… 15 arXiv — NLP / Computation & Language research 13d ago Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models arXiv:2606.17389v1 Announce Type: cross Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. 
A common intuition, which we call the Attention-Confidence Assumption, holds that… 24 arXiv — NLP / Computation & Language research 13d ago Vision-language models for chest radiography do not always need the image arXiv:2606.17710v1 Announce Type: cross Abstract: Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that… 16 arXiv — NLP / Computation & Language research 13d ago EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning arXiv:2511.01650v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning… 38 arXiv — NLP / Computation & Language research 13d ago When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents arXiv:2602.10384v4 Announce Type: replace Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. 
This gap is especially critical in finance, where documents mix… 6 Hugging Face Daily Papers research 13d ago MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local… 21 Hugging Face Daily Papers research 13d ago MotionVLA: Vision-Language-Action Model for Humanoid Motion Abstract A dual-stream frequency tokenizer and autoregressive model are proposed to improve humanoid motion generation by separately encoding pose and physical dynamics, achieving better diversity and consistency compared to single-codebook approaches. Generated by… 11 Hugging Face Daily Papers research 13d ago ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining Abstract A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks. 
Generated by… 6 Hugging Face Daily Papers research 13d ago Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification Abstract UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing through multi-level feature fusion, bitwise quantization, and… 19 Hugging Face Daily Papers research 13d ago You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences Abstract Temporal Difference in Vision (TDV) presents a novel self-supervised learning approach for video data that eliminates traditional inductive biases by leveraging causal relationships between past and future frames. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Progress in… 30 Hugging Face Daily Papers research 13d ago LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies Abstract LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision-Language-Action models (VLAs)… 33 TechCrunch — AI news-outlet 13d ago SpaceX to acquire Cursor for $60B in stock, days after blockbuster IPO The deal is supposed to help SpaceX's struggling AI division. The company told IPO investors it sees a $26 trillion addressable market in AI. 21 Hugging Face Daily Papers research 14d ago JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence Abstract A vision-language model operates continuously in real-time, making autonomous decisions about when to respond or delegate, enabling interactive systems that perceive and act upon environmental changes without user prompting. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 17 arXiv — Machine Learning research 14d ago LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure arXiv:2606.15306v1 Announce Type: new Abstract: We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future… 19 arXiv — Machine Learning research 14d ago DiRecT: Safe Diffusion-Based Planning via Receding-Horizon Denoising arXiv:2606.15359v1 Announce Type: new Abstract: Diffusion models have emerged as powerful tools for planning and control by learning multimodal distributions over actions and trajectories. Yet reliable inference-time safety enforcement remains a key barrier to their deployment… 26 arXiv — Machine Learning research 14d ago Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection arXiv:2606.15427v1 Announce Type: new Abstract: Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. 
While supervised models can be integrated pre-flight,… 23 arXiv — Machine Learning research 14d ago When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning arXiv:2606.15695v1 Announce Type: new Abstract: Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts.… 26 arXiv — Machine Learning research 14d ago Unsupervised Learning for Missing Modalities in Multimodal Learning arXiv:2606.15743v1 Announce Type: new Abstract: This paper addresses the missing-modality challenge in multi-modal learning by introducing Unsupervised Learning for Missing Modalities in Multi-Modal Learning (UL4M4), a flexible framework that imputes missing feature embeddings… 35 arXiv — NLP / Computation & Language research 14d ago Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals arXiv:2606.15026v1 Announce Type: new Abstract: Physiological stress and emotion recognition are important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of deep learning models such as Long Short-Term Memory (LSTM), Temporal… 22 arXiv — NLP / Computation & Language research 14d ago Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation arXiv:2606.15152v1 Announce Type: new Abstract: Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. 
Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal… 10 arXiv — NLP / Computation & Language research 14d ago Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes arXiv:2606.15307v1 Announce Type: new Abstract: Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced… 21 arXiv — NLP / Computation & Language research 14d ago Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models arXiv:2606.15714v1 Announce Type: new Abstract: Vision-Language-Action models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with… 12 arXiv — NLP / Computation & Language research 14d ago The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages arXiv:2606.15821v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link… 37 arXiv — NLP / Computation & Language research 14d ago SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. 
A closer look at model behavior reveals substantial… 27 arXiv — NLP / Computation & Language research 14d ago Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models arXiv:2606.15910v1 Announce Type: new Abstract: A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer… 38 arXiv — NLP / Computation & Language research 14d ago Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence arXiv:2606.15932v1 Announce Type: new Abstract: While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks… 7 arXiv — NLP / Computation & Language research 14d ago Scaling Human and G2P Supervision for Robust Phonetic Transcription arXiv:2606.16019v1 Announce Type: new Abstract: Expert phonetic annotation is costly, especially for non-standard dialects and atypical speech. A common alternative is using Grapheme-to-Phoneme (G2P) models to auto-generate phonetic labels from text transcripts at scale. We… 20 Page 4 of 10 · 500 articles