News / #multimodal
77 articles archived under #multimodal

- Hugging Face Daily Papers (research, 2h ago): "Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation". INSET is a unified multimodal model that embeds images as native vocabulary within textual instructions, enabling better handling of complex interleaved inputs through transformer-based contextual locality and supporting both image generation and editing tasks. …
- Hugging Face Daily Papers (research, 2h ago): "Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training". Training efficiency is improved by strategically allocating scarce labeled data through staged reinforcement learning and dense supervision, using sparse rewards for teacher model discovery and dense rewards for student model compression. …
- Hugging Face Daily Papers (research, 6h ago): "Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception". Urban-ImageNet presents a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, organized under a hierarchical taxonomy for scene classification, cross-modal retrieval, and instance segmentation tasks. …
- r/LocalLLaMA (community, 7h ago): "sensenova/SenseNova-U1-A3B-MoT · Hugging Face". SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture. 🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture.
It marks a fundamental …
- Hugging Face Daily Papers (research, 10h ago): "UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning". Unified multimodal models (UMMs) can improve performance by adaptively selecting coordination paths rather than using fixed patterns, enabling diverse reasoning strategies for different inputs. …
- r/LocalLLaMA (community, 11h ago): "AIDC-AI/Ovis2.6-80B-A3B · Hugging Face". We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal …
- r/MachineLearning (community, 11h ago): "Elastic Attention Cores for Scalable Vision Transformers [R]". Wanted to share our latest paper on an alternative building block for Vision Transformers. [Figure: illustration of the model's accuracy and dense features.] Traditional ViTs utilize dense (N²) self-attention, which can become pretty costly at higher resolutions. In this work, we …
- Hugging Face Daily Papers (research, 17h ago): "A Causal Language Modeling Detour Improves Encoder Continued Pretraining". Switching from Masked Language Modeling to Causal Language Modeling during encoder adaptation improves downstream performance on biomedical texts through dense supervision effects in lower transformer layers. …
- Hugging Face Daily Papers (research, 18h ago): "World Action Models: The Next Frontier in Embodied AI". World Action Models unify predictive state modeling with action generation for embodied policy learning, forming a cohesive framework for understanding environment dynamics and action prediction.
Vision-Language-Action (VLA) models have achieved …
- Hugging Face Daily Papers (research, 18h ago): "Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents". A visual-native agent harness with an image bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation that improves multimodal deep search performance across multiple benchmarks. …
- Hugging Face Daily Papers (research, 18h ago): "SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning". The SeePhys Pro benchmark reveals that current multimodal models struggle with representation-invariant reasoning when information shifts from text to visual formats, and demonstrates that blind training can improve performance through residual textual cues. …
- Hugging Face Daily Papers (research, 19h ago): "AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward". AlphaGRPO enhances multimodal generation by applying Group Relative Policy Optimization to AR-Diffusion Unified Multimodal Models through self-reflective refinement and decompositional verifiable reward mechanisms. …
- arXiv — Machine Learning (research, 19h ago): "Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies" (arXiv:2605.11387v1, new). We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions.
Existing methods for RL fine-tuning of generative policies …
- arXiv — Machine Learning (research, 19h ago): "20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone" (arXiv:2605.11405v1, new). Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take …
- arXiv — NLP / Computation & Language (research, 19h ago): "ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction" (arXiv:2605.11212v1, new). Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, …
- arXiv — NLP / Computation & Language (research, 19h ago): "Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation" (arXiv:2605.11533v1, new). Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology.
Such heterogeneous evidence is difficult for …
- arXiv — NLP / Computation & Language (research, 19h ago): "OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models" (arXiv:2605.11629v1, new). Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource …
- arXiv — NLP / Computation & Language (research, 19h ago): "Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability" (arXiv:2605.11663v1, new). Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed …
- arXiv — NLP / Computation & Language (research, 19h ago): "Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation" (arXiv:2605.11739v1, new). On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level …
- arXiv — NLP / Computation & Language (research, 19h ago): "Towards Visually-Guided Movie Subtitle Translation for Indic Languages" (arXiv:2605.11993v1, new). Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, …
- arXiv — NLP / Computation & Language (research, 19h ago): "LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?"
(arXiv:2605.11301v1, cross). Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than …
- arXiv — NLP / Computation & Language (research, 19h ago): "PresentAgent-2: Towards Generalist Multimodal Presentation Agents" (arXiv:2605.11363v1, cross). Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework …
- Hugging Face Daily Papers (research, 20h ago): "SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture". Unified vision-language models treat understanding and generation as integrated processes rather than separate tasks, demonstrating strong performance across multiple multimodal capabilities including image synthesis and action reasoning. …
- Hugging Face Daily Papers (research, 21h ago): "Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization". DRoRAE enhances visual representation by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality. …
- Hugging Face Daily Papers (research, 21h ago): "LychSim: A Controllable and Interactive Simulation Framework for Vision Research". A simulation framework called LychSim is introduced, featuring a Python API, procedural data pipeline, and MCP integration to enable controllable and interactive environments for vision system development and evaluation.
While self-supervised …
- Ollama releases (dev-tools, 1d ago): v0.23.4-rc0. launch/opencode: add image modalities for vision models (#15922)
- Ollama releases (dev-tools, 1d ago): v0.23.4. launch/opencode: add image modalities for vision models (#15922)
- llama.cpp releases (dev-tools, 1d ago): b9122. ggml-webgpu: address precision issues for multimodal (#22808). fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32; fix(unary): correct the gelu, gelu quick and gelu erf functions; fix(flash-attn-tile): fix the hardcode v type …
- Ars Technica — AI (news-outlet, 1d ago): "Google's Android-powered laptops are called Googlebooks, and they're coming this year". Google has revealed its vision for the AI laptop of tomorrow.
- llama.cpp releases (dev-tools, 1d ago): b9116. mtmd: add MiMo v2.5 vision (#22883). mimo-v2.5: vision support; use fused qkv for vision; fix f16 vision overflow; comment cleanups; Flash doesn't have mmproj; more cleanup; remember to use filter_tensors; fix trailing whitespace …
- Smol AI News (news-outlet, 2d ago): "not much happened today". **Thinking Machines** previewed their new **native interaction models** designed for **full-duplex multimodal interaction**, enabling real-time concurrent listening, speaking, watching, thinking, searching, and reacting, marking a shift beyond turn-based AI. This approach …
- NVIDIA Developer Blog (official-blog, 8d ago): "How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car". The automotive cockpit is undergoing a fundamental shift from rule-based interfaces to agentic, multimodal AI systems capable of reasoning, planning, and …
- Stratechery (Ben Thompson) (community, 12d ago): "2026.18: Long-term, Peripheral & Myopic Visions". The best Stratechery content from the week of April 27, 2026, including Amazon and AI, the future of AR devices, and Beijing's myopia.
- MIT News — AI (research, 14d ago): "Solving the 'Whac-a-mole dilemma': A smarter way to debias AI vision models". A new debiasing technique called WRING avoids creating or amplifying biases that can occur with existing debiasing approaches.
- Vercel — AI (dev-tools, 14d ago): "Vercel now supports Pro plan in Stripe Projects". You can now sign up for or upgrade to a Vercel Pro plan directly from Stripe Projects using shared payment tokens (SPTs). Agents and developers can manage plan changes programmatically from the Stripe CLI, without leaving their workflow. What's new: provision or upgrade to Vercel …
- NVIDIA Developer Blog (official-blog, 15d ago): "NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model". Agentic systems often reason across screens, documents, audio, video, and text within a single perception-to-action loop. However, they still rely on …
- Smol AI News (news-outlet, 21d ago): "not much happened today". **Alibaba** released **Qwen3.6-27B**, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench. It supports native vision-language reasoning over …
- Smol AI News (news-outlet, 23d ago): "not much happened today". **Moonshot's Kimi K2.6** is a major open-weight **1T-parameter MoE** model featuring **32B active parameters**, **384 experts**, **MLA attention**, a **256K context window**, native multimodality, and **INT4 quantization**. It supports day-0 integration with platforms like …
- NVIDIA Developer Blog (official-blog, 27d ago): "How to Build Vision AI Pipelines Using NVIDIA DeepStream Coding Agents". Developing real-time vision AI applications presents a significant challenge for developers, often demanding intricate data pipelines, countless lines of code, …
- Smol AI News (news-outlet, 1mo ago): "not much happened today". **Meta Superintelligence Labs** launched **Muse Spark**, a natively multimodal reasoning model featuring tool use, visual chain of thought, and multi-agent orchestration. It is live on **meta.ai** and the Meta AI app with a private API preview and plans for open-sourcing future …
- MIT News — AI (research, 1mo ago): "Working to advance the nuclear renaissance". Dean Price, assistant professor in the Department of Nuclear Science and Engineering, sees a bright future for nuclear power, and believes AI can help us realize that vision.
- Smol AI News (news-outlet, 1mo ago): "not much happened today". **Gemma 4** was launched by **Google** under an **Apache 2.0 license**, marking a significant open-model release focused on **reasoning, agentic workflows, multimodality, and on-device use**. It outperforms models 10x larger and has immediate ecosystem support including …
- NVIDIA Developer Blog (official-blog, 1mo ago): "Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight". In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU …
- NVIDIA Developer Blog (official-blog, 1mo ago): "Bringing AI Closer to the Edge and On-Device with Gemma 4". The Gemmaverse expands with the launch of the latest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from …
- Vercel — AI (dev-tools, 1mo ago): "Qwen 3.6 Plus on AI Gateway". Qwen 3.6 Plus from Alibaba is now available on Vercel AI Gateway. Compared to Qwen 3.5 Plus, this model adds stronger agentic coding capabilities, from frontend development to repository-level problem solving, along with improved multimodal perception and reasoning. It features …
- Vercel — AI (dev-tools, 1mo ago): "Zero-configuration Go backend support". Go API backends can now be deployed on Vercel with zero-configuration deployment.
Vercel now recognizes Go servers as first-class backends and automatically provisions the right resources and configures your application without redirects in vercel.json or the /api folder …
- Smol AI News (news-outlet, 1mo ago): "Gemma 4". **Google DeepMind** released **Gemma 4**, a family of open-weight, multimodal models with long-context support up to **256K tokens** under an **Apache 2.0 license**, marking a major capability and licensing shift. The lineup includes **31B dense**, **26B MoE (A4B)**, and two …
- Hugging Face (official-blog, 1mo ago): "Welcome Gemma 4: Frontier multimodal intelligence on device". Published April 2, 2026, by merve, Pedro Cuenca, Sergio Paniego, ben burtenshaw, Steven Zheng, Alvaro Bartolome, Nathan …
- Vercel — AI (dev-tools, 1mo ago): "GLM 5V Turbo on AI Gateway". GLM 5V Turbo from Z.ai is now available on Vercel AI Gateway. GLM 5V Turbo is a multimodal coding model that turns screenshots and designs into code, debugs visually, and operates GUIs autonomously. It's strong at design-to-code generation, visual code generation, and …
- Smol AI News (news-outlet, 1mo ago): "not much happened today". **Arcee's Trinity-Large-Thinking** was released with **open weights under Apache 2.0**, featuring a **400B total / 13B active** model size and strong agentic performance, ranking **#2 on PinchBench**.
**Z.ai's GLM-5V-Turbo** is a **vision coding model** with **native multimodal …
- Hugging Face (official-blog, 1mo ago): "Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents". Published March 31, 2026, by Madison Lee, Rogerio Feris, Eli Schwartz, Dhiraj Joshi … (ibm-granite).
- MIT News — AI (research, 1mo ago): "Augmenting citizen science with computer vision for fish monitoring". MIT Sea Grant works with the Woodwell Climate Research Center and other collaborators to demonstrate a deep learning-based system for fish monitoring.
- NVIDIA Developer Blog (official-blog, 1mo ago): "Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety". Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, …
- Smol AI News (news-outlet, 1mo ago): "not much happened today". **Google** launched **Gemini 3.1 Flash Live**, a realtime voice and vision agent model with **2x longer conversation memory**, supporting **70 languages** and **128k context**. **Mistral AI** released **Voxtral TTS**, a low-latency, open-weight text-to-speech model supporting …
- Vercel — AI (dev-tools, 1mo ago): "new.website joins forces with v0". v0 and new.website have joined forces to accelerate our vision of helping anyone ship complete, production-ready software with AI. new.website was founded to make it effortless to create beautiful websites with all the tools included, from built-in forms to SEO. They're joining …
- MIT News — AI (research, 1mo ago): "Generative AI improves a wireless vision system that sees through obstructions". With this new technique, a robot could more accurately detect hidden objects or understand an indoor scene using reflected Wi-Fi signals.
- Smol AI News (news-outlet, 1mo ago): "not much happened today". **OpenAI** released **GPT-5.4 mini** and **GPT-5.4 nano**, their most capable small models optimized for coding, multimodal understanding, and subagents, featuring a **400k context window** and over **2x speed** compared to GPT-5 mini. The mini model approaches larger GPT-5.4 …
- Import AI (news-outlet, 1mo ago): "ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text". Will AI cause a political interregnum?
- Smol AI News (news-outlet, 2mo ago): "not much happened today". **Alibaba** released the **Qwen 3.5** series with models ranging from **0.8B to 9B** parameters, featuring **native multimodality**, **scaled reinforcement learning**, and targeting **edge and lightweight agent** deployments. The models support very long context windows up to …
- NVIDIA Developer Blog (official-blog, 2mo ago): "Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints". Alibaba has introduced the new open source Qwen3.5 series built for native multimodal agents. The first model in this series is a ~400B parameter native …
- NVIDIA Developer Blog (official-blog, 2mo ago): "Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities". Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, …
- Smol AI News (news-outlet, 2mo ago): "Qwen3.5-397B-A17B: the smallest Open-Opus class, very efficient model". **Alibaba** released **Qwen3.5-397B-A17B**, an open-weight model featuring **native multimodality**, **spatial intelligence**, and a **hybrid linear attention + sparse MoE** architecture supporting **201 languages** and **long context windows** up to **256K tokens**.
The model …
- NVIDIA Developer Blog (official-blog, 3mo ago): "R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab". Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often …
- Smol AI News (news-outlet, 3mo ago): "Context Graphs: Hype or actually Trillion-dollar opportunity?". **Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline …
- Smol AI News (news-outlet, 3mo ago): "Moonshot Kimi K2.5 - Beats Sonnet 4.5 at half the cost, SOTA Open Model, first Native Image+Video, 100 parallel Agent Swarm manager". **MoonshotAI's Kimi K2.5** is a **32B active-1T parameter open-weights model** featuring **native multimodality** with image and video understanding, built through continual pretraining on **15 trillion mixed visual and text tokens**. It introduces a new **MoonViT vision …
- Google DeepMind (official-blog, 5mo ago): "Introducing Nano Banana Pro" (Nov 20, 2025, Naina Raisinghani, Product Manager, Google DeepMind). Turn your visions into studio-quality designs with unprecedented control, improved text rendering and enhanced world knowledge. …
- Google DeepMind (official-blog, 6mo ago): "MedGemma: Our most capable open models for health AI development". We're announcing new multimodal models in the MedGemma collection, our most capable open models for health AI development.
- Google DeepMind (official-blog, 6mo ago): "Gemini 2.5 Flash-Lite is now ready for scaled production use". Gemini 2.5 Flash-Lite, previously in preview, is now stable and generally available.
This cost-efficient model provides high quality in a small size, and includes 2.5 family features like a 1 million-token context window and multimodality.
- Zed Editor (dev-tools, 8mo ago): "Sequoia Backs Zed's Vision for Collaborative Coding". This investment lets us pursue our vision for bringing a new kind of collaboration directly into the IDE.
- Google DeepMind (official-blog, 11mo ago): "Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI". Gemma 3n is a cutting-edge open model designed for fast, multimodal AI on devices, featuring optimized performance, unique flexibility with a 2-in-1 model, and expanded multimodal understanding with audio, empowering developers to build live, interactive applications and …
- Google DeepMind (official-blog, 11mo ago): "Our vision for building a universal AI assistant". We're extending Gemini to become a world model that can make plans and imagine new experiences by simulating aspects of the world.
- Zed Editor (dev-tools, 27mo ago): "We Have to Start Over: From Atom to Zed". Thorsten interviews co-founders Nathan, Max, and Antonio about the vision and the technological choices behind Zed, and how they went from Atom and Electron to Rust and GPUs with Zed.
- Chip Huyen (research, 31mo ago): "Multimodality and Large Multimodal Models (LMMs)". For a long time, each ML model operated in one data mode: text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and …
- Lil'Log (Lilian Weng) (research, 47mo ago): "Generalized Visual Language Models". Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder.
Given a …
- Eugene Yan (research, 51mo ago): "Mailbag: How to Define a Data Team's Vision and Roadmap". I'm heading into a team lead role and would like to define the vision and roadmap.
- Eugene Yan (research, 58mo ago): "Bootstrapping Labels via ___ Supervision & Human-In-The-Loop". How to generate labels from scratch with semi-, active, and weakly supervised learning.
- Lil'Log (Lilian Weng) (research, 103mo ago): "Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS". I've never worked in the field of computer vision and have no idea how the magic could work when an autonomous car is configured to tell apart a stop sign from a pedestrian in a red hat. To motivate myself to look into the maths behind object recognition and detection …