Tag

Multimodal

500 articles archived under #multimodal

r/LocalLLaMA community 9d ago

AllenAI releases MolmoMotion vision models for predicting future motion based on short frame history

AllenAI just released two models in the MolmoMotion family: https://huggingface.co/allenai/MolmoMotion-4B-H3-F30 https://huggingface.co/allenai/MolmoMotion-4B-H1-F32 MolmoMotion is a 4B vision-language model that forecasts 3D point trajectories under natural-language action…

30
r/LocalLLaMA community 9d ago

[NEW MODEL] SupraLabs started the Any2Any model family!

SupraLabs Supra-A2A-Nano-Exp - ~30M Any-to-Any Multimodal Transformer Status: Experimental / Educational Prototype 🚀 Overview Supra-A2A-Nano-Exp is a ~30M parameter autoregressive Transformer that unifies text, image, and video into a single token stream. There are: - No…

7
r/LocalLLaMA community 9d ago

Best image vision model runnable on RTX 6000 Pro

I'm looking at running OCR and classification on old historical scanned documents (some dating back to the 1950s). What are the current best open-source, vision-enabled models runnable on an RTX 6000 Pro? Note: I've used Gemma 4 31B and have had good success with it. It's…

20
Hacker News — AI on Front Page community 9d ago

UHF X11: X11 Built for VisionOS and Apple Vision Pro

Article URL: https://www.lispm.net/apps/uhf-x11/ Comments URL: https://news.ycombinator.com/item?id=48610853 Points: 210 # Comments: 44

8
Hugging Face Daily Papers research 10d ago

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Abstract PerceptionDLM enables efficient parallel region perception in multimodal diffusion language models through structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

12
Hugging Face Daily Papers research 10d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
arXiv — Machine Learning research 11d ago

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

arXiv:2606.19413v1 Announce Type: new Abstract: Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing…

8
arXiv — Machine Learning research 11d ago

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

arXiv:2606.19489v1 Announce Type: new Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need…

8
arXiv — Machine Learning research 11d ago

Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET

arXiv:2606.20037v1 Announce Type: new Abstract: Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis plays an important part especially at the Mild Cognitive Impairment stage, where timely intervention can…

31
arXiv — NLP / Computation & Language research 11d ago

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

arXiv:2606.20075v1 Announce Type: cross Abstract: Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome…

36
arXiv — Machine Learning research 11d ago

Effective Dimension Governs Generalization in Quantum Kernel Vision Models

arXiv:2606.20183v1 Announce Type: new Abstract: Recent quantum vision models (quantum vision transformers and quantum convolutional networks) report two striking but unexplained empirical phenomena: (i) ansatze with more, or more uniformly distributed, entanglement generalize…

18
arXiv — Machine Learning research 11d ago

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

arXiv:2606.20291v1 Announce Type: new Abstract: Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management.…

20
arXiv — Machine Learning research 11d ago

Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act

arXiv:2606.20359v1 Announce Type: new Abstract: Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act,…

6
arXiv — NLP / Computation & Language research 11d ago

LaViSA: A Language and Vision Structural Ambiguity Benchmark

arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving…

22
arXiv — NLP / Computation & Language research 11d ago

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

arXiv:2606.20093v1 Announce Type: new Abstract: Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also…

33
arXiv — NLP / Computation & Language research 11d ago

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

arXiv:2606.20164v1 Announce Type: new Abstract: Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and…

29
arXiv — NLP / Computation & Language research 11d ago

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

arXiv:2606.20527v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often…

26
arXiv — NLP / Computation & Language research 11d ago

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

arXiv:2606.19501v1 Announce Type: cross Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing…

14
arXiv — NLP / Computation & Language research 11d ago

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

arXiv:2606.19534v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that…

24
arXiv — NLP / Computation & Language research 11d ago

NEST: Narrative Event Structures in Time for Long Video Understanding

arXiv:2606.19706v1 Announce Type: cross Abstract: Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long…

24
arXiv — NLP / Computation & Language research 11d ago

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

arXiv:2606.20477v1 Announce Type: cross Abstract: We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs…

9
arXiv — NLP / Computation & Language research 11d ago

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

arXiv:2504.02885v2 Announce Type: replace Abstract: Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their…

29
arXiv — NLP / Computation & Language research 11d ago

Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual…

7
r/LocalLLaMA community 11d ago

[NEW MODEL] SupraLabs just released SupraVL-Nano-900k, a Vision-Language Model built entirely from scratch!

Hey r/LocalLLaMA ! We just released SupraVL-Nano-900k , our first VLM. It has ~900k parameters, was trained from scratch on Flickr8k, and the entire architecture fits in a single Jupyter notebook. This is not a production model, it's a fully transparent, readable blueprint for…

27
Hugging Face Daily Papers research 11d ago

Thinking with Visual Grounding

Abstract Visually grounded thinking integrates natural-language reasoning with explicit visual evidence grounding in vision-language models, improving reasoning accuracy through scalable synthesis and reinforcement learning techniques. Generated by…

34
Hugging Face Daily Papers research 11d ago

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems. Generated by…

23
Hugging Face Daily Papers research 11d ago

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Abstract RL4IL enables robust robotic manipulation under sensor dropout by using reinforcement learning to retrieve relevant demonstrations and cross-attention fusion to impute missing modalities without retraining. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Robotic systems…

23
Hugging Face Daily Papers research 11d ago

When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Abstract Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems. Generated by…

35
Hugging Face Daily Papers research 11d ago

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision Transformers…

27
r/MachineLearning community 11d ago

Any idea if AAAI will be harsh on computer vision paper as last year? [R]

Hello everyone, I have a computer vision paper ready for submission, and a coauthor has suggested submitting it to AAAI. However, last year computer vision papers had a very low acceptance rate at AAAI, with reviewers receiving emails specifically telling them that the…

17
Hugging Face Daily Papers research 11d ago

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Abstract Xcientist enables transparent and accountable AI-driven scientific research by creating persistent artifacts that track the complete research process from problem formulation to mechanism validation and revision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI systems…

11
Hugging Face Daily Papers research 12d ago

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Abstract SciOrch is a framework that uses a lightweight orchestrator model to coordinate multiple frontier LLMs for scientific reasoning, achieving superior performance through MCTS-based training and GRPO-style optimization while reducing API costs. Generated by…

31
Hacker News — AI on Front Page community 12d ago

DeepSeek Introduces Vision

Article URL: https://chat.deepseek.com/ Comments URL: https://news.ycombinator.com/item?id=48581458 Points: 229 # Comments: 94

29
Smol AI News news-outlet 12d ago

not much happened today

**GLM-5.2** from **Zhipu** emerged as a leading open-weight model with innovative **IndexShare** sparse-attention enabling efficient **1M-token inference**, praised as comparable to **GPT-5.5** and **Opus 4.8** but lacking vision support. Other notable open models include…

18
r/MachineLearning community 12d ago

What does provisional paper acceptance mean in ECCV? Is that the default message everyone gets? [D]

What does provisional paper acceptance mean in ECCV? Is that the default message everyone gets? Submitted by /u/NotGondor.

37
Hugging Face Daily Papers research 12d ago

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Abstract A unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning through reinforcement learning, enabling robust spatial reasoning across diverse tasks and domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial…

9
Hugging Face Daily Papers research 12d ago

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Abstract ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts. Generated by…

8
arXiv — Machine Learning research 12d ago

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

arXiv:2606.18287v1 Announce Type: new Abstract: Multimodal neuroimaging, integrating functional connectivity from fMRI and structural connectivity from DTI, enables non-invasive analysis of brain networks using graph neural networks. However, demographic factors such as age and…

11
arXiv — Machine Learning research 12d ago

MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

arXiv:2606.18390v1 Announce Type: new Abstract: Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean…

7
arXiv — Machine Learning research 12d ago

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

arXiv:2606.18732v1 Announce Type: new Abstract: This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS)…

19
arXiv — Machine Learning research 12d ago

Reinforcement Learning Foundation Models Should Already Be A Thing

arXiv:2606.18812v1 Announce Type: new Abstract: Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data,…

17
arXiv — Machine Learning research 12d ago

Semantic Robustness Certification for Vision-Language Models

arXiv:2606.18839v1 Announce Type: new Abstract: Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification…

35
arXiv — NLP / Computation & Language research 12d ago

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

arXiv:2606.18910v1 Announce Type: cross Abstract: Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a…

27
arXiv — Machine Learning research 12d ago

A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors

arXiv:2606.19026v1 Announce Type: new Abstract: Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured…

32
arXiv — Machine Learning research 12d ago

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection…

16
arXiv — Machine Learning research 12d ago

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

arXiv:2606.19120v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to…

31
arXiv — Machine Learning research 12d ago

ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis

arXiv:2606.19140v1 Announce Type: new Abstract: Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data. While deep survival…

32
arXiv — NLP / Computation & Language research 12d ago

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the…

19
arXiv — NLP / Computation & Language research 12d ago

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic…

11
arXiv — NLP / Computation & Language research 12d ago

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

arXiv:2606.18893v1 Announce Type: new Abstract: Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently.…

14
