Tag: Multimodal
500 articles archived under #multimodal

r/LocalLLaMA community 9d ago AllenAI releases MolmoMotion vision models for predicting future motion based on short frame history AllenAI just released two models in the MolmoMotion family: https://huggingface.co/allenai/MolmoMotion-4B-H3-F30 https://huggingface.co/allenai/MolmoMotion-4B-H1-F32 MolmoMotion is a 4B vision-language model that forecasts 3D point trajectories under natural-language action… 30

r/LocalLLaMA community 9d ago [NEW MODEL] SupraLabs started the Any2Any model family! SupraLabs Supra-A2A-Nano-Exp - ~30M Any-to-Any Multimodal Transformer Status: Experimental / Educational Prototype 🚀 Overview Supra-A2A-Nano-Exp is a ~30M parameter autoregressive Transformer that unifies text, image, and video into a single token stream. There are: - No… 7

r/LocalLLaMA community 9d ago Best image vision model runnable on RTX 6000 Pro I'm looking at running OCR and classification on old historical scanned documents. (Some dating back to 1950s) What's the current best vision enabled models thats open sourced and runnable on an RTX 6000 Pro? Note: I've used Gemma 4 31B and have had good success with it. It's… 20

Hacker News — AI on Front Page community 9d ago UHF X11: X11 Built for VisionOS and Apple Vision Pro Article URL: https://www.lispm.net/apps/uhf-x11/ Comments URL: https://news.ycombinator.com/item?id=48610853 Points: 210 # Comments: 44 8

Hugging Face Daily Papers research 10d ago PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models Abstract PerceptionDLM enables efficient parallel region perception in multimodal diffusion language models through structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 12

Hugging Face Daily Papers research 10d ago Context-Aware RL for Agentic and Multimodal LLMs Abstract ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by… 21

arXiv — Machine Learning research 11d ago Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting arXiv:2606.19413v1 Announce Type: new Abstract: Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing… 8

arXiv — Machine Learning research 11d ago Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks arXiv:2606.19489v1 Announce Type: new Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need… 8

arXiv — Machine Learning research 11d ago Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET arXiv:2606.20037v1 Announce Type: new Abstract: Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis plays an important part especially at the Mild Cognitive Impairment stage, where timely intervention can… 31

arXiv — NLP / Computation & Language research 11d ago What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis arXiv:2606.20075v1 Announce Type: cross Abstract: Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces.
However, robust latent reasoning remains difficult because outcome… 36

arXiv — Machine Learning research 11d ago Effective Dimension Governs Generalization in Quantum Kernel Vision Models arXiv:2606.20183v1 Announce Type: new Abstract: Recent quantum vision models-quantum vision transformers and quantum convolutional networks-report two striking but unexplained empirical phenomena: (i) ansatze with more, or more uniformly distributed, entanglement generalize… 18

arXiv — Machine Learning research 11d ago Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision arXiv:2606.20291v1 Announce Type: new Abstract: Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management.… 20

arXiv — Machine Learning research 11d ago Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act arXiv:2606.20359v1 Announce Type: new Abstract: Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act,… 6

arXiv — NLP / Computation & Language research 11d ago LaViSA: A Language and Vision Structural Ambiguity Benchmark arXiv:2606.19552v1 Announce Type: new Abstract: Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding.
Visual scenes serve as useful cues for resolving… 22

arXiv — NLP / Computation & Language research 11d ago Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship arXiv:2606.20093v1 Announce Type: new Abstract: Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also… 33

arXiv — NLP / Computation & Language research 11d ago MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization arXiv:2606.20164v1 Announce Type: new Abstract: Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and… 29

arXiv — NLP / Computation & Language research 11d ago StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs arXiv:2606.20527v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often… 26

arXiv — NLP / Computation & Language research 11d ago DeXposure-Claw: An Agentic System for DeFi Risk Supervision arXiv:2606.19501v1 Announce Type: cross Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks.
General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing… 14

arXiv — NLP / Computation & Language research 11d ago PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models arXiv:2606.19534v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that… 24

arXiv — NLP / Computation & Language research 11d ago NEST: Narrative Event Structures in Time for Long Video Understanding arXiv:2606.19706v1 Announce Type: cross Abstract: Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long… 24

arXiv — NLP / Computation & Language research 11d ago Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology arXiv:2606.20477v1 Announce Type: cross Abstract: We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs… 9

arXiv — NLP / Computation & Language research 11d ago Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation arXiv:2504.02885v2 Announce Type: replace Abstract: Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support.
Large vision-language models (LVLMs) hold great promise for automated MRG due to their… 29

arXiv — NLP / Computation & Language research 11d ago Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech arXiv:2603.16606v3 Announce Type: replace Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual… 7

r/LocalLLaMA community 11d ago [NEW MODEL] SupraLabs just released SupraVL-Nano-900k, a Vision-Language Model built entirely from scratch! Hey r/LocalLLaMA! We just released SupraVL-Nano-900k, our first VLM. It has ~900k parameters, was trained from scratch on Flickr8k, and the entire architecture fits in a single Jupyter notebook. This is not a production model, it's a fully transparent, readable blueprint for… 27

Hugging Face Daily Papers research 11d ago Thinking with Visual Grounding Abstract Visually grounded thinking integrates natural-language reasoning with explicit visual evidence grounding in vision-language models, improving reasoning accuracy through scalable synthesis and reinforcement learning techniques. Generated by… 34

Hugging Face Daily Papers research 11d ago REVES: REvision and VErification--Augmented Training for Test-Time Scaling Abstract A two-stage iterative framework alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems.
Generated by… 23

Hugging Face Daily Papers research 11d ago Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities Abstract RL4IL enables robust robotic manipulation under sensor dropout by using reinforcement learning to retrieve relevant demonstrations and cross-attention fusion to impute missing modalities without retraining. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Robotic systems… 23

Hugging Face Daily Papers research 11d ago When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning? Abstract Offline reinforcement learning with trajectory-level outcome supervision presents statistical challenges that can be addressed through pessimistic actor-critic methods, though fundamental barriers exist for certain generalized outcome-based problems. Generated by… 35

Hugging Face Daily Papers research 11d ago ViT-Up: Faithful Feature Upsampling for Vision Transformers Abstract ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Vision Transformers… 27

r/MachineLearning community 11d ago Any idea if AAAI will be harsh on computer vision paper as last year? [R] Hello everyone, I have a computer vision paper ready for submission, a coauthor have suggested submitting it to AAAI.
However last year computer vision papers have gotten a very small acceptance rate at AAAI, with reviewers receiving emails to specifically tell them that the… 17

Hugging Face Daily Papers research 11d ago Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness Abstract Xcientist enables transparent and accountable AI-driven scientific research by creating persistent artifacts that track the complete research process from problem formulation to mechanism validation and revision. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI systems… 11

Hugging Face Daily Papers research 12d ago SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks Abstract SciOrch is a framework that uses a lightweight orchestrator model to coordinate multiple frontier LLMs for scientific reasoning, achieving superior performance through MCTS-based training and GRPO-style optimization while reducing API costs. Generated by… 31

Hacker News — AI on Front Page community 12d ago DeepSeek Introduces Vision Article URL: https://chat.deepseek.com/ Comments URL: https://news.ycombinator.com/item?id=48581458 Points: 229 # Comments: 94 29

Smol AI News news-outlet 12d ago not much happened today **GLM-5.2** from **Zhipu** emerged as a leading open-weight model with innovative **IndexShare** sparse-attention enabling efficient **1M-token inference**, praised as comparable to **GPT-5.5** and **Opus 4.8** but lacking vision support. Other notable open models include… 18

r/MachineLearning community 12d ago What does provisional paper acceptance mean in ECCV? Is that the default message everyone gets? [D]
37

Hugging Face Daily Papers research 12d ago Reinforcing Dual-Path Reasoning in Spatial Vision Language Models Abstract A unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning through reinforcement learning, enabling robust spatial reasoning across diverse tasks and domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Spatial… 9

Hugging Face Daily Papers research 12d ago Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation Abstract ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts. Generated by… 8

arXiv — Machine Learning research 12d ago Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS arXiv:2606.18287v1 Announce Type: new Abstract: Multimodal neuroimaging, integrating functional connectivity from fMRI and structural connectivity from DTI, enables non-invasive analysis of brain networks using graph neural networks.
However, demographic factors such as age and… 11

arXiv — Machine Learning research 12d ago MOLAR: Learning Multimodal Molecular Representations from Noisy Labels arXiv:2606.18390v1 Announce Type: new Abstract: Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean… 7

arXiv — Machine Learning research 12d ago Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs arXiv:2606.18732v1 Announce Type: new Abstract: This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS)… 19

arXiv — Machine Learning research 12d ago Reinforcement Learning Foundation Models Should Already Be A Thing arXiv:2606.18812v1 Announce Type: new Abstract: Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data,… 17

arXiv — Machine Learning research 12d ago Semantic Robustness Certification for Vision-Language Models arXiv:2606.18839v1 Announce Type: new Abstract: Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification… 35

arXiv — NLP / Computation & Language research 12d ago REVES: REvision and VErification--Augmented Training for Test-Time Scaling arXiv:2606.18910v1 Announce Type: cross Abstract: Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning.
However, standard post-training methods primarily optimize single-shot objectives, creating a… 27

arXiv — Machine Learning research 12d ago A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors arXiv:2606.19026v1 Announce Type: new Abstract: Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured… 32

arXiv — Machine Learning research 12d ago Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts arXiv:2606.19036v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection… 16

arXiv — Machine Learning research 12d ago Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation arXiv:2606.19120v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to… 31

arXiv — Machine Learning research 12d ago ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis arXiv:2606.19140v1 Announce Type: new Abstract: Accurate survival prediction is essential for personalized treatment planning in head and neck cancer, yet remains challenging due to the heterogeneous and high-dimensional nature of multimodal clinical data.
While deep survival… 32

arXiv — NLP / Computation & Language research 12d ago VISUALSKILL: Multimodal Skills for Computer-Use Agents arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the… 19

arXiv — NLP / Computation & Language research 12d ago Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text arXiv:2606.18471v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic… 11

arXiv — NLP / Computation & Language research 12d ago Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction arXiv:2606.18893v1 Announce Type: new Abstract: Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently.… 14

Page 3 of 10