Multimodal
63 articles archived under #multimodal · RSS

r/LocalLLaMA · community · 3h ago
sensenova/SenseNova-U1-A3B-MoT · Hugging Face
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture 🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental…

r/LocalLLaMA · community · 6h ago
AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal…

r/MachineLearning · community · 7h ago
Elastic Attention Cores for Scalable Vision Transformers [R]
Wanted to share our latest paper on an alternative building block for Vision Transformers. [Figure: illustration of the model's accuracy and dense features] Traditional ViTs utilize dense (N²) self-attention, which can become pretty costly at higher resolutions (a quick sketch of this quadratic cost appears after the article list below). In this work, we…

arXiv — Machine Learning · research · 15h ago
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
arXiv:2605.11387v1 · Announce Type: new · Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

arXiv — Machine Learning · research · 15h ago
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
arXiv:2605.11405v1 · Announce Type: new · Abstract: Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take…

arXiv — NLP / Computation & Language · research · 15h ago
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
arXiv:2605.11212v1 · Announce Type: new · Abstract: Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly,…

arXiv — NLP / Computation & Language · research · 15h ago
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
arXiv:2605.11533v1 · Announce Type: new · Abstract: Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for…
arXiv — NLP / Computation & Language · research · 15h ago
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
arXiv:2605.11629v1 · Announce Type: new · Abstract: Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource…

arXiv — NLP / Computation & Language · research · 15h ago
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
arXiv:2605.11663v1 · Announce Type: new · Abstract: Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed…

arXiv — NLP / Computation & Language · research · 15h ago
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
arXiv:2605.11739v1 · Announce Type: new · Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level…

arXiv — NLP / Computation & Language · research · 15h ago
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
arXiv:2605.11993v1 · Announce Type: new · Abstract: Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu,…

arXiv — NLP / Computation & Language · research · 15h ago
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
arXiv:2605.11301v1 · Announce Type: cross · Abstract: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than…

arXiv — NLP / Computation & Language · research · 15h ago
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
arXiv:2605.11363v1 · Announce Type: cross · Abstract: Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework…

Ollama releases · dev-tools · 23h ago
v0.23.4-rc0
launch/opencode: add image modalities for vision models (#15922)

llama.cpp releases · dev-tools · 1d ago
b9122
ggml-webgpu: address precision issues for multimodal (#22808). fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32; fix(unary): correct the gelu, gelu quick and gelu erf functions; fix(flash-attn-tile): fix the hardcoded v type… (a small float16 sketch after the article list below illustrates this class of precision issue)

Ars Technica — AI · news-outlet · 1d ago
Google's Android-powered laptops are called Googlebooks, and they're coming this year
Google has revealed its vision for the AI laptop of tomorrow.
llama.cpp releases · dev-tools · 1d ago
b9116
mtmd: add MiMo v2.5 vision (#22883). mimo-v2.5: vision support; mimo-v2.5: use fused qkv for vision; mimo-v2.5: fix f16 vision overflow; mimo-v2.5: comment cleanups; mimo-v2.5: Flash doesn't have mmproj; more cleanup; remember to use filter_tensors; mimo-v2.5: fix trailing whitespace…

Smol AI News · news-outlet · 2d ago
not much happened today
**Thinking Machines** previewed their new **native interaction models** designed for **full-duplex multimodal interaction**, enabling real-time concurrent listening, speaking, watching, thinking, searching, and reacting, marking a shift beyond turn-based AI. This approach…

NVIDIA Developer Blog · official-blog · 8d ago
How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car
The automotive cockpit is undergoing a fundamental shift from rule-based interfaces to agentic, multimodal AI systems capable of reasoning, planning, and...

MIT News — AI · research · 13d ago
Solving the “Whac-a-mole dilemma”: A smarter way to debias AI vision models
A new debiasing technique called WRING avoids creating or amplifying biases that can occur with existing debiasing approaches.

Vercel — AI · dev-tools · 14d ago
Vercel now supports Pro plan in Stripe Projects
You can now sign up for or upgrade to a Vercel Pro plan directly from Stripe Projects using shared payment tokens (SPTs). Agents and developers can manage plan changes programmatically from the Stripe CLI, without leaving their workflow. What’s new: Provision or upgrade to Vercel…

NVIDIA Developer Blog · official-blog · 15d ago
NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model
Agentic systems often reason across screens, documents, audio, video, and text within a single perception-to-action loop. However, they still rely on...

Smol AI News · news-outlet · 21d ago
not much happened today
**Alibaba** released **Qwen3.6-27B**, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench. It supports native vision-language reasoning over…

Smol AI News · news-outlet · 23d ago
not much happened today
**Moonshot's Kimi K2.6** is a major open-weight **1T-parameter MoE** model featuring **32B active parameters**, **384 experts**, **MLA attention**, a **256K context window**, native multimodality, and **INT4 quantization**. It supports day-0 integration with platforms like…

NVIDIA Developer Blog · official-blog · 27d ago
How to Build Vision AI Pipelines Using NVIDIA DeepStream Coding Agents
Developing real-time vision AI applications presents a significant challenge for developers, often demanding intricate data pipelines, countless lines of code,...

Smol AI News · news-outlet · 1mo ago
not much happened today
**Meta Superintelligence Labs** launched **Muse Spark**, a natively multimodal reasoning model featuring tool use, visual chain of thought, and multi-agent orchestration. It is live on **meta.ai** and the Meta AI app with a private API preview and plans for open-sourcing future…

MIT News — AI · research · 1mo ago
Working to advance the nuclear renaissance
Dean Price, assistant professor in the Department of Nuclear Science and Engineering, sees a bright future for nuclear power, and believes AI can help us realize that vision.
Smol AI News · news-outlet · 1mo ago
not much happened today
**Gemma 4** was launched by **Google** under an **Apache 2.0 license**, marking a significant open-model release focused on **reasoning, agentic workflows, multimodality, and on-device use**. It outperforms models 10x larger and has immediate ecosystem support including…

NVIDIA Developer Blog · official-blog · 1mo ago
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

NVIDIA Developer Blog · official-blog · 1mo ago
Bringing AI Closer to the Edge and On-Device with Gemma 4
The Gemmaverse expands with the launch of the latest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from...

Vercel — AI · dev-tools · 1mo ago
Qwen 3.6 Plus on AI Gateway
Qwen 3.6 Plus from Alibaba is now available on Vercel AI Gateway. Compared to Qwen 3.5 Plus, this model adds stronger agentic coding capabilities, from frontend development to repository-level problem solving, along with improved multimodal perception and reasoning. It features…

Vercel — AI · dev-tools · 1mo ago
Zero-configuration Go backend support
Go API backends can now be deployed on Vercel with zero configuration. Vercel now recognizes Go servers as first-class backends and automatically provisions the right resources and configures your application without redirects in vercel.json or the /api folder…

Smol AI News · news-outlet · 1mo ago
Gemma 4
**Google DeepMind** released **Gemma 4**, a family of open-weight, multimodal models with long-context support up to **256K tokens** under an **Apache 2.0 license**, marking a major capability and licensing shift. The lineup includes **31B dense**, **26B MoE (A4B)**, and two…

Hugging Face · official-blog · 1mo ago
Welcome Gemma 4: Frontier multimodal intelligence on device
Published April 2, 2026

Vercel — AI · dev-tools · 1mo ago
GLM 5V Turbo on AI Gateway
GLM 5V Turbo from Z.ai is now available on Vercel AI Gateway. GLM 5V Turbo is a multimodal coding model that turns screenshots and designs into code, debugs visually, and operates GUIs autonomously. It's strong at design-to-code generation, visual code generation, and…

Smol AI News · news-outlet · 1mo ago
not much happened today
**Arcee’s Trinity-Large-Thinking** was released with **open weights under Apache 2.0**, featuring a **400B total / 13B active** model size and strong agentic performance, ranking **#2 on PinchBench**. **Z.ai’s GLM-5V-Turbo** is a **vision coding model** with **native multimodal…
Hugging Face · official-blog · 1mo ago
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Enterprise · Published March 31, 2026

MIT News — AI · research · 1mo ago
Augmenting citizen science with computer vision for fish monitoring
MIT Sea Grant works with the Woodwell Climate Research Center and other collaborators to demonstrate a deep learning-based system for fish monitoring.

NVIDIA Developer Blog · official-blog · 1mo ago
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,...

Smol AI News · news-outlet · 1mo ago
not much happened today
**Google** launched **Gemini 3.1 Flash Live**, a realtime voice and vision agent model with **2x longer conversation memory**, supporting **70 languages** and **128k context**. **Mistral AI** released **Voxtral TTS**, a low-latency, open-weight text-to-speech model supporting…

Vercel — AI · dev-tools · 1mo ago
new.website joins forces with v0
v0 and new.website have joined forces to accelerate our vision of helping anyone ship complete, production-ready software with AI. new.website was founded to make it effortless to create beautiful websites with all the tools included, from built-in forms to SEO. They’re joining…

MIT News — AI · research · 1mo ago
Generative AI improves a wireless vision system that sees through obstructions
With this new technique, a robot could more accurately detect hidden objects or understand an indoor scene using reflected Wi-Fi signals.

Smol AI News · news-outlet · 1mo ago
not much happened today
**OpenAI** released **GPT-5.4 mini** and **GPT-5.4 nano**, their most capable small models optimized for coding, multimodal understanding, and subagents, featuring a **400k context window** and over **2x speed** compared to GPT-5 mini. The mini model approaches larger GPT-5.4…

Import AI · news-outlet · 1mo ago
ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text
Will AI cause a political interregnum?

Smol AI News · news-outlet · 2mo ago
not much happened today
**Alibaba** released the **Qwen 3.5** series with models ranging from **0.8B to 9B** parameters, featuring **native multimodality**, **scaled reinforcement learning**, and targeting **edge and lightweight agent** deployments. The models support very long context windows up to…

NVIDIA Developer Blog · official-blog · 2mo ago
Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
Alibaba has introduced the new open source Qwen3.5 series built for native multimodal agents. The first model in this series is a ~400B parameter native...

NVIDIA Developer Blog · official-blog · 2mo ago
Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities
Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms,...
Smol AI News · news-outlet · 2mo ago
Qwen3.5-397B-A17B: the smallest Open-Opus-class, very efficient model
**Alibaba** released **Qwen3.5-397B-A17B**, an open-weight model featuring **native multimodality**, **spatial intelligence**, and a **hybrid linear attention + sparse MoE** architecture supporting **201 languages** and **long context windows** up to **256K tokens**. The model…

NVIDIA Developer Blog · official-blog · 3mo ago
R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab
Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often...

Smol AI News · news-outlet · 3mo ago
Context Graphs: Hype or actually Trillion-dollar opportunity?
**Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline…

Smol AI News · news-outlet · 3mo ago
Moonshot Kimi K2.5: Beats Sonnet 4.5 at half the cost, SOTA Open Model, first Native Image+Video, 100 parallel Agent Swarm manager
**MoonshotAI's Kimi K2.5** is a **32B-active, 1T-parameter open-weights model** featuring **native multimodality** with image and video understanding, built through continual pretraining on **15 trillion mixed visual and text tokens**. It introduces a new **MoonViT vision…

Google DeepMind · official-blog · 5mo ago
Introducing Nano Banana Pro
Nov 20, 2025 · Naina Raisinghani, Product Manager, Google DeepMind
Turn your visions into studio-quality designs with unprecedented control, improved text rendering and enhanced world knowledge.

Google DeepMind · official-blog · 6mo ago
MedGemma: Our most capable open models for health AI development
We’re announcing new multimodal models in the MedGemma collection, our most capable open models for health AI development.

Google DeepMind · official-blog · 6mo ago
Gemini 2.5 Flash-Lite is now ready for scaled production use
Gemini 2.5 Flash-Lite, previously in preview, is now stable and generally available. This cost-efficient model provides high quality in a small size, and includes 2.5 family features like a 1 million-token context window and multimodality.

Zed Editor · dev-tools · 8mo ago
Sequoia Backs Zed's Vision for Collaborative Coding
This investment lets us pursue our vision for bringing a new kind of collaboration directly into the IDE.

Google DeepMind · official-blog · 11mo ago
Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI
Gemma 3n is a cutting-edge open model designed for fast, multimodal AI on devices, featuring optimized performance, unique flexibility with a 2-in-1 model, and expanded multimodal understanding with audio, empowering developers to build live, interactive applications and…

Google DeepMind · official-blog · 11mo ago
Our vision for building a universal AI assistant
We’re extending Gemini to become a world model that can make plans and imagine new experiences by simulating aspects of the world.

Zed Editor · dev-tools · 27mo ago
We Have to Start Over: From Atom to Zed
Thorsten interviews co-founders Nathan, Max, and Antonio about the vision and the technological choices behind Zed, and how they went from Atom and Electron to Rust and GPUs.
Chip Huyen · research · 31mo ago
Multimodality and Large Multimodal Models (LMMs)
For a long time, each ML model operated in one data mode: text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and…

Lil'Log (Lilian Weng) · research · 47mo ago
Generalized Visual Language Models
Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally, such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given a…

Eugene Yan · research · 51mo ago
Mailbag: How to Define a Data Team's Vision and Roadmap
I'm heading into a team lead role and would like to define the vision and roadmap.

Eugene Yan · research · 58mo ago
Bootstrapping Labels via ___ Supervision & Human-In-The-Loop
How to generate labels from scratch with semi-, active, and weakly supervised learning.

Lil'Log (Lilian Weng) · research · 103mo ago
Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
I've never worked in the field of computer vision and have no idea how the magic could work when an autonomous car is configured to tell apart a stop sign from a pedestrian in a red hat. To motivate myself to look into the maths behind object recognition and detection…
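A quick sketch of the quadratic attention cost flagged in the Elastic Attention Cores post above. This is a minimal back-of-the-envelope calculation for a plain ViT, assuming square inputs and a 16-pixel patch size (illustrative defaults, not figures from that paper):

```python
# Sketch: why dense ViT self-attention gets expensive at higher resolutions.
# A plain ViT splits an H x W image into non-overlapping P x P patches,
# giving N = (H // P) * (W // P) tokens; dense self-attention then forms an
# N x N score matrix, so the cost grows as N**2.

def dense_attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (token count N, pairwise score count N**2) for one attention layer."""
    n = (height // patch) * (width // patch)
    return n, n * n

for side in (224, 448, 896):  # doubling the resolution at each step
    n, scores = dense_attention_cost(side, side)
    print(f"{side}x{side}: N = {n:4d} tokens, N^2 = {scores:,} scores")

# 224x224: N =  196 tokens, N^2 = 38,416 scores
# 448x448: N =  784 tokens, N^2 = 614,656 scores
# 896x896: N = 3136 tokens, N^2 = 9,834,496 scores
# Doubling the image side quadruples N and grows the score matrix 16x.
```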
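And a minimal illustration of the float16 precision issues the two llama.cpp entries above are patching (accumulating in f32, fixing an f16 overflow). This assumes nothing about the actual WebGPU shaders in those commits; it only shows the generic failure mode of a naive float16 accumulator, which stalls once the running sum's representable spacing exceeds the addend:

```python
import numpy as np

# float16 carries ~11 significant bits, so above 256.0 its spacing is 0.25
# and adding 0.1 rounds to zero. A naive f16 accumulator therefore stalls;
# the standard fix is to accumulate in float32 and cast once at the end.
vals = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for v in vals:                         # naive f16 accumulation
    acc16 = np.float16(acc16 + v)

acc32 = vals.astype(np.float32).sum()  # f32 accumulation of the same values

print(acc16)  # 256.0 -- stuck: 0.1 is below half an ulp of float16 at 256
print(acc32)  # 409.5 -- the exact sum of 4096 copies of float16(0.1)
```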