Tag: Multimodal · 500 articles archived under #multimodal
Hugging Face Daily Papers research 27d ago MindZero: Learning Online Mental Reasoning With Zero Annotations Abstract MindZero presents a self-supervised reinforcement learning framework that enables multimodal large language models to perform efficient and robust online mental reasoning without requiring explicit mental state annotations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 35 Hugging Face Daily Papers research 27d ago TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation Abstract A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research Agents have shown strong… 4 r/LocalLLaMA community 27d ago I have become George Jetson: my job is now Yes/No supervision for a machine I don’t fully understand. submitted by /u/Helpful_Today7449 16 Hugging Face Daily Papers research 27d ago Agent Skills Should Go Beyond Text: The Case for Visual Skills Abstract Multimodal skills that combine textual logic with visual support outperform text-only approaches in visual-centric tasks by incorporating spatial layout, visual grounding, and state-aware interactions. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reusable skills are a… 11 Hugging Face Daily Papers research 27d ago Review Arcade: On the Human Alignment and Gameability of LLM Reviews Abstract Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM… 25 Hugging Face Daily Papers research 28d ago SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models Abstract Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while… 23 Hugging Face Daily Papers research 28d ago MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft Abstract MineExplorer benchmark evaluates multimodal large language models' open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models… 33 Hugging Face Daily Papers research 28d ago PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding Abstract PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large Vision-Language Models (LVLMs) map visual… 5 MIT Technology Review — AI news-outlet 28d ago Rehumanizing global health care with agentic AI The global health care sector is under increasing strain.  Decades of chronic underinvestment and constraints in recruitment have coincided with a surge in demand for services for aging populations. 
Gaps in provision are already taking a toll, with fragmented access to care… 19 Hugging Face Daily Papers research 28d ago Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems Abstract Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. AI-generated summary Physical AI systems increasingly map multimodal… 12 Hugging Face Daily Papers research 28d ago EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers Abstract EVA01 enables native 3D mesh integration in multimodal language models through a Mixture-of-Transformers architecture that aligns semantic and geometric manifolds for improved generation and editing capabilities. AI-generated summary This paper addresses the challenge… 11 Hugging Face Daily Papers research 28d ago OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents Abstract OpenWebRL presents a framework for training visual web agents using online reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. AI-generated summary Building capable visual web agents requires long-horizon… 10 Hugging Face Daily Papers research 28d ago 3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code Abstract Vision-language models are evaluated for procedural 3D modeling tasks through a benchmark and ranking platform that assess their ability to translate text and images into executable 3D code. 
AI-generated summary Procedural 3D modeling through code is emerging as a… 34 Hugging Face Daily Papers research 28d ago Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models Abstract Pretrained vision-language models can reconstruct 3D scenes from single images as editable Blender programs through progressive refinement, demonstrating improved fidelity through staged reconstruction approaches. AI-generated summary Inverse graphics is a longstanding… 38 arXiv — Machine Learning research 28d ago Hoeffding Concept Bottleneck Models with Applications to Overhead Images arXiv:2606.00082v1 Announce Type: new Abstract: Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate… 13 arXiv — Machine Learning research 28d ago Geometric Erasure by Contrastive Velocity Matching in Rectified Flows arXiv:2606.00140v1 Announce Type: new Abstract: While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure… 21 arXiv — Machine Learning research 28d ago Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults arXiv:2606.00345v1 Announce Type: new Abstract: Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real-world settings. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when… 38 arXiv — Machine Learning research 28d ago GLENS: Global Search via Learning from Solver Iterates with Diffusion Models arXiv:2606.00366v1 Announce Type: new Abstract: We consider the problem of generating a large collection of initial guesses for local minima of multimodal non-convex continuous optimization problems. 
The goal is for these initial guesses to be high-quality (i.e., a numerical… 17 arXiv — Machine Learning research 28d ago EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing arXiv:2606.00437v1 Announce Type: new Abstract: Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations… 19 arXiv — Machine Learning research 28d ago DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation arXiv:2606.00535v1 Announce Type: new Abstract: Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored.… 11 arXiv — Machine Learning research 28d ago LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models arXiv:2606.00573v1 Announce Type: new Abstract: Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has… 33 arXiv — Machine Learning research 28d ago Score $\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation arXiv:2606.00739v1 Announce Type: new Abstract: Large language models hallucinate even when the answer lies within their parameters. 
While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward… 30 arXiv — NLP / Computation & Language research 28d ago DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset arXiv:2606.00012v1 Announce Type: new Abstract: Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to textual modality or two-party dialogue, failing to meet… 29 arXiv — NLP / Computation & Language research 28d ago lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation arXiv:2606.00022v1 Announce Type: new Abstract: Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator… 7 arXiv — NLP / Computation & Language research 28d ago DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models arXiv:2606.00091v1 Announce Type: new Abstract: Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the… 38 arXiv — NLP / Computation & Language research 28d ago Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance arXiv:2606.00305v1 Announce Type: new Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal… 9 arXiv — NLP / Computation & Language research 28d ago Do Text Edits Generalize to Visual Generation? 
Benchmarking Cross-Modal Knowledge Editing in UMMs arXiv:2606.00477v1 Announce Type: new Abstract: Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While… 26 arXiv — NLP / Computation & Language research 28d ago Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents arXiv:2606.00547v1 Announce Type: new Abstract: Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past… 35 arXiv — NLP / Computation & Language research 28d ago Sandboxed Coding Agents are Competitive Omni-modal Task Solvers arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed… 37 arXiv — NLP / Computation & Language research 28d ago Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs arXiv:2606.00898v1 Announce Type: new Abstract: Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at… 20 arXiv — NLP / Computation & Language research 28d ago MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models arXiv:2606.00909v1 Announce Type: new Abstract: This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). 
Our system evaluates the linearity, intrinsic dimension, and anisotropy of… 37 arXiv — NLP / Computation & Language research 28d ago Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models arXiv:2606.01026v1 Announce Type: new Abstract: Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or… 31 arXiv — NLP / Computation & Language research 28d ago PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining arXiv:2606.01049v1 Announce Type: new Abstract: Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure… 26 arXiv — NLP / Computation & Language research 28d ago On the Generalization Gap in Self-Evolving Language Model Reasoning arXiv:2606.01075v1 Announce Type: new Abstract: Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. 
In this work, we ask: under a strict closed-loop setup, where the… 35 arXiv — NLP / Computation & Language research 28d ago Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue arXiv:2606.01223v1 Announce Type: new Abstract: Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into… 10 Hugging Face Daily Papers research 28d ago VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization Abstract Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance. AI-generated summary The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs)… 36 Hugging Face Daily Papers research 28d ago RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models Abstract RoboSemanticBench identifies a disconnect between semantic understanding and action prediction in vision-language-action models, where robots can grasp objects but fail to select semantically correct targets. AI-generated summary Vision-language-action (VLA) models are… 15 Hugging Face Daily Papers research 28d ago HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers Abstract Researchers created HakushoBench, a Japanese chart and table visual question answering benchmark derived from governmental documents, to evaluate vision-language models' ability to understand complex visual data beyond English-language datasets. 
AI-generated summary… 14 Hugging Face Daily Papers research 28d ago RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes Abstract RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions. AI-generated summary Vision-Language Models… 4 Hugging Face Daily Papers research 28d ago NITP: Next Implicit Token Prediction for LLM Pre-training Abstract Next Implicit Token Prediction enhances language model training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead. AI-generated summary Standard next-token… 34 Hugging Face Daily Papers research 28d ago Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models Abstract A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in… 21 Hacker News — AI on Front Page community 28d ago Should you normalize RGB values by 255 or 256? Article URL: https://30fps.net/pages/255-vs-256-division/ Comments URL: https://news.ycombinator.com/item?id=48360054 Points: 201 # Comments: 85 21 llama.cpp releases dev-tools 28d ago b9453 model: Add EXAONE 4.5 implementations ( #21733 ) Add EXAONE 4.5 and Add GQA for MMproj mtmd: EXAONE 4.5 vision markers and projector path EXAONE 4.5 uses and for image boundaries; Qwen keeps <|vision_start|> and <|vision_end|>. 
Route EXAONE 4.5 through the Qwen2.5-VL-style… 32 Hugging Face Daily Papers research 29d ago VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies Abstract VisualThinking-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches. AI-generated summary Recent work has begun to equip vision-language-action (VLA)… 37 Hugging Face Daily Papers research 29d ago How can embedding models bind concepts? Abstract Vision-language models like CLIP struggle with concept binding despite recognizing individual concepts, but controlled transformer models can learn low-complexity binding functions that generalize better through multiplicative interactions. AI-generated summary Humans… 11 Hugging Face Daily Papers research 29d ago Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly Abstract Large Vision-Language Models demonstrate significant limitations in fine-grained spatio-temporal reasoning and tracking abilities when evaluated on a new furniture assembly benchmark. AI-generated summary The emergence of Large Vision-Language Models (LVLMs) has… 5 Hugging Face Daily Papers research 29d ago iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning Abstract A reinforcement learning framework called iVGR is introduced to transfer visual localization capabilities into textual reasoning, improving fine-grained perception in multimodal language models without requiring explicit visual grounding during inference. 
AI-generated… 13 Hugging Face Daily Papers research 29d ago Benchmarking Composed Image Retrieval for Applied Earth Observation Abstract Remote sensing composed image retrieval methods are evaluated across vision-language backbones and a new change-centric dataset, demonstrating their effectiveness for Earth observation applications while highlighting distinct challenges compared to traditional… 27 Vercel — AI dev-tools 29d ago Qwen 3.7 Plus now available on AI Gateway Qwen 3.7 Plus from Alibaba is now available on Vercel AI Gateway. The model unifies vision and language into a single agent foundation, with capabilities spanning GUI and CLI operation, coding and productivity workflows with full-modality input, and visual agent tasks including… 26 Hugging Face Daily Papers research 29d ago Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)? Abstract Vision-language models exhibit overconfidence in spatial reasoning tasks and struggle to identify when additional observations are needed to resolve uncertainty. AI-generated summary Spatial reasoning is a fundamental capability for vision-language models (VLMs)… 20
Page 9 of 10 · 500 articles