Tag

Multimodal

500 articles archived under #multimodal · RSS

Hugging Face Daily Papers research 27d ago

MindZero: Learning Online Mental Reasoning With Zero Annotations

Abstract MindZero presents a self-supervised reinforcement learning framework that enables multimodal large language models to perform efficient and robust online mental reasoning without requiring explicit mental state annotations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

35
Hugging Face Daily Papers research 27d ago

TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation

Abstract A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep Research Agents have shown strong…

4
r/LocalLLaMA community 27d ago

I have become George Jetson: my job is now Yes/No supervision for a machine I don’t fully understand.

submitted by /u/Helpful_Today7449

16
Hugging Face Daily Papers research 27d ago

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Abstract Multimodal skills that combine textual logic with visual support outperform text-only approaches in visual-centric tasks by incorporating spatial layout, visual grounding, and state-aware interactions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Reusable skills are a…

11
Hugging Face Daily Papers research 27d ago

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM…

25
Hugging Face Daily Papers research 28d ago

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Abstract Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while…

23
Hugging Face Daily Papers research 28d ago

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Abstract MineExplorer benchmark evaluates multimodal large language models' open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models…

33
Hugging Face Daily Papers research 28d ago

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Abstract PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large Vision-Language Models (LVLMs) map visual…

5
MIT Technology Review — AI news-outlet 28d ago

Rehumanizing global health care with agentic AI

The global health care sector is under increasing strain. Decades of chronic underinvestment and constraints in recruitment have coincided with a surge in demand for services for aging populations. Gaps in provision are already taking a toll, with fragmented access to care…

19
Hugging Face Daily Papers research 28d ago

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Abstract Physical AI systems face safety challenges where black-box models can execute harmful actions without detection, necessitating comprehensive runtime guardrail mechanisms for safe operation. AI-generated summary Physical AI systems increasingly map multimodal…

12
Hugging Face Daily Papers research 28d ago

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Abstract EVA01 enables native 3D mesh integration in multimodal language models through a Mixture-of-Transformers architecture that aligns semantic and geometric manifolds for improved generation and editing capabilities. AI-generated summary This paper addresses the challenge…

11
Hugging Face Daily Papers research 28d ago

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Abstract OpenWebRL presents a framework for training visual web agents using online reinforcement learning on real websites, achieving state-of-the-art performance with minimal initial supervision. AI-generated summary Building capable visual web agents requires long-horizon…

10
Hugging Face Daily Papers research 28d ago

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Abstract Vision-language models are evaluated for procedural 3D modeling tasks through a benchmark and ranking platform that assess their ability to translate text and images into executable 3D code. AI-generated summary Procedural 3D modeling through code is emerging as a…

34
Hugging Face Daily Papers research 28d ago

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Abstract Pretrained vision-language models can reconstruct 3D scenes from single images as editable Blender programs through progressive refinement, demonstrating improved fidelity through staged reconstruction approaches. AI-generated summary Inverse graphics is a longstanding…

38
arXiv — Machine Learning research 28d ago

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

arXiv:2606.00082v1 Announce Type: new Abstract: Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate…

13
arXiv — Machine Learning research 28d ago

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

arXiv:2606.00140v1 Announce Type: new Abstract: While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure…

21
arXiv — Machine Learning research 28d ago

Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults

arXiv:2606.00345v1 Announce Type: new Abstract: Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real-world settings. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when…

38
arXiv — Machine Learning research 28d ago

GLENS: Global Search via Learning from Solver Iterates with Diffusion Models

arXiv:2606.00366v1 Announce Type: new Abstract: We consider the problem of generating a large collection of initial guesses for local minima of multimodal non-convex continuous optimization problems. The goal is for these initial guesses to be high-quality (i.e., a numerical…

17
arXiv — Machine Learning research 28d ago

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

arXiv:2606.00437v1 Announce Type: new Abstract: Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations…

19
arXiv — Machine Learning research 28d ago

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

arXiv:2606.00535v1 Announce Type: new Abstract: Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored.…

11
arXiv — Machine Learning research 28d ago

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

arXiv:2606.00573v1 Announce Type: new Abstract: Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has…

33
arXiv — Machine Learning research 28d ago

Score $\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation

arXiv:2606.00739v1 Announce Type: new Abstract: Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward…

30
arXiv — NLP / Computation & Language research 28d ago

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset

arXiv:2606.00012v1 Announce Type: new Abstract: Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to textual modality or two-party dialogue, failing to meet…

29
arXiv — NLP / Computation & Language research 28d ago

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

arXiv:2606.00022v1 Announce Type: new Abstract: Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator…

7
arXiv — NLP / Computation & Language research 28d ago

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

arXiv:2606.00091v1 Announce Type: new Abstract: Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the…

38
arXiv — NLP / Computation & Language research 28d ago

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

arXiv:2606.00305v1 Announce Type: new Abstract: On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal…

9
arXiv — NLP / Computation & Language research 28d ago

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

arXiv:2606.00477v1 Announce Type: new Abstract: Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While…

26
arXiv — NLP / Computation & Language research 28d ago

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

arXiv:2606.00547v1 Announce Type: new Abstract: Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past…

35
arXiv — NLP / Computation & Language research 28d ago

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed…

37
arXiv — NLP / Computation & Language research 28d ago

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

arXiv:2606.00898v1 Announce Type: new Abstract: Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at…

20
arXiv — NLP / Computation & Language research 28d ago

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

arXiv:2606.00909v1 Announce Type: new Abstract: This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of…

37
arXiv — NLP / Computation & Language research 28d ago

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models

arXiv:2606.01026v1 Announce Type: new Abstract: Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or…

31
arXiv — NLP / Computation & Language research 28d ago

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

arXiv:2606.01049v1 Announce Type: new Abstract: Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure…

26
arXiv — NLP / Computation & Language research 28d ago

On the Generalization Gap in Self-Evolving Language Model Reasoning

arXiv:2606.01075v1 Announce Type: new Abstract: Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the…

35
arXiv — NLP / Computation & Language research 28d ago

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

arXiv:2606.01223v1 Announce Type: new Abstract: Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into…

10
Hugging Face Daily Papers research 28d ago

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Abstract Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance. AI-generated summary The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs)…

36
Hugging Face Daily Papers research 28d ago

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Abstract RoboSemanticBench identifies a disconnect between semantic understanding and action prediction in vision-language-action models, where robots can grasp objects but fail to select semantically correct targets. AI-generated summary Vision-language-action (VLA) models are…

15
Hugging Face Daily Papers research 28d ago

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Abstract Researchers created HakushoBench, a Japanese chart and table visual question answering benchmark derived from governmental documents, to evaluate vision-language models' ability to understand complex visual data beyond English-language datasets. AI-generated summary…

14
Hugging Face Daily Papers research 28d ago

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Abstract RoboStressBench presents a principled benchmark for evaluating vision-language model robustness to physical visual stress in embodied AI, decomposing visual stress into material, viewpoint, lighting, and geometry dimensions. AI-generated summary Vision-Language Models…

4
Hugging Face Daily Papers research 28d ago

NITP: Next Implicit Token Prediction for LLM Pre-training

Abstract Next Implicit Token Prediction enhances language model training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead. AI-generated summary Standard next-token…

34
Hugging Face Daily Papers research 28d ago

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Abstract A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in…

21
Hacker News — AI on Front Page community 28d ago

Should you normalize RGB values by 255 or 256?

Article URL: https://30fps.net/pages/255-vs-256-division/ · Comments URL: https://news.ycombinator.com/item?id=48360054 · Points: 201 · Comments: 85

21
llama.cpp releases dev-tools 28d ago

b9453

model: Add EXAONE 4.5 implementations ( #21733 ) Add EXAONE 4.5 and Add GQA for MMproj mtmd: EXAONE 4.5 vision markers and projector path EXAONE 4.5 uses and for image boundaries; Qwen keeps <|vision_start|> and <|vision_end|>. Route EXAONE 4.5 through the Qwen2.5-VL-style…

32
Hugging Face Daily Papers research 29d ago

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Abstract VisualThink-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches. AI-generated summary Recent work has begun to equip vision-language-action (VLA)…

37
Hugging Face Daily Papers research 29d ago

How can embedding models bind concepts?

Abstract Vision-language models like CLIP struggle with concept binding despite recognizing individual concepts, but controlled transformer models can learn low-complexity binding functions that generalize better through multiplicative interactions. AI-generated summary Humans…

11
Hugging Face Daily Papers research 29d ago

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Abstract Large Vision-Language Models demonstrate significant limitations in fine-grained spatio-temporal reasoning and tracking abilities when evaluated on a new furniture assembly benchmark. AI-generated summary The emergence of Large Vision-Language Models (LVLMs) has…

5
Hugging Face Daily Papers research 29d ago

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Abstract A reinforcement learning framework called iVGR is introduced to transfer visual localization capabilities into textual reasoning, improving fine-grained perception in multimodal language models without requiring explicit visual grounding during inference. AI-generated…

13
Hugging Face Daily Papers research 29d ago

Benchmarking Composed Image Retrieval for Applied Earth Observation

Abstract Remote sensing composed image retrieval methods are evaluated across vision-language backbones and a new change-centric dataset, demonstrating their effectiveness for Earth observation applications while highlighting distinct challenges compared to traditional…

27
Vercel — AI dev-tools 29d ago

Qwen 3.7 Plus now available on AI Gateway

Qwen 3.7 Plus from Alibaba is now available on Vercel AI Gateway. The model unifies vision and language into a single agent foundation, with capabilities spanning GUI and CLI operation, coding and productivity workflows with full-modality input, and visual agent tasks including…

26
Hugging Face Daily Papers research 29d ago

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Abstract Vision-language models exhibit overconfidence in spatial reasoning tasks and struggle to identify when additional observations are needed to resolve uncertainty. AI-generated summary Spatial reasoning is a fundamental capability for vision-language models (VLMs)…

20
