Multimodal
63 articles archived under #multimodal · RSS

r/LocalLLaMA · community · 3h ago
sensenova/SenseNova-U1-A3B-MoT · Hugging Face
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture 🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a monolithic architecture. It marks a fundamental…

r/LocalLLaMA · community · 6h ago
AIDC-AI/Ovis2.6-80B-A3B · Hugging Face
We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture, delivering superior multimodal…

r/MachineLearning · community · 7h ago
Elastic Attention Cores for Scalable Vision Transformers [R]
Wanted to share our latest paper on an alternative building block for Vision Transformers. [Figure: illustration of the model's accuracy and dense features] Traditional ViTs utilize dense (N²) self-attention, which can become pretty costly at higher resolutions (a quick sketch of this quadratic cost appears after the article list below). In this work, we…

arXiv — Machine Learning · research · 15h ago
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
arXiv:2605.11387v1 · Announce Type: new · Abstract: We address the problem of fine-tuning pre-trained generative policies with reinforcement learning (RL) while preserving the multimodality of their action distributions. Existing methods for RL fine-tuning of generative policies…

arXiv — Machine Learning · research · 15h ago
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
arXiv:2605.11405v1 · Announce Type: new · Abstract: Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take…

arXiv — NLP / Computation & Language · research · 15h ago
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
arXiv:2605.11212v1 · Announce Type: new · Abstract: Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly,…

arXiv — NLP / Computation & Language · research · 15h ago
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
arXiv:2605.11533v1 · Announce Type: new · Abstract: Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for…
arXiv — NLP / Computation & Language · research · 15h ago
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
arXiv:2605.11629v1 · Announce Type: new · Abstract: Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource…

arXiv — NLP / Computation & Language · research · 15h ago
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
arXiv:2605.11663v1 · Announce Type: new · Abstract: Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed…

arXiv — NLP / Computation & Language · research · 15h ago
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
arXiv:2605.11739v1 · Announce Type: new · Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level…

arXiv — NLP / Computation & Language · research · 15h ago
Towards Visually-Guided Movie Subtitle Translation for Indic Languages
arXiv:2605.11993v1 · Announce Type: new · Abstract: Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu,…

arXiv — NLP / Computation & Language · research · 15h ago
LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
arXiv:2605.11301v1 · Announce Type: cross · Abstract: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than…

arXiv — NLP / Computation & Language · research · 15h ago
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
arXiv:2605.11363v1 · Announce Type: cross · Abstract: Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework…

Ollama releases · dev-tools · 23h ago
v0.23.4-rc0
launch/opencode: add image modalities for vision models (#15922)

llama.cpp releases · dev-tools · 1d ago
b9122
ggml-webgpu: address precision issues for multimodal (#22808). fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32; fix(unary): correct the gelu, gelu quick and gelu erf functions; fix(flash-attn-tile): fix the hardcoded v type… (a small float16 sketch after the article list below illustrates this class of precision issue)

Ars Technica — AI · news-outlet · 1d ago
Google's Android-powered laptops are called Googlebooks, and they're coming this year
Google has revealed its vision for the AI laptop of tomorrow.
llama.cpp releases · dev-tools · 1d ago
b9116
mtmd: add MiMo v2.5 vision (#22883). mimo-v2.5: vision support; mimo-v2.5: use fused qkv for vision; mimo-v2.5: fix f16 vision overflow; mimo-v2.5: comment cleanups; mimo-v2.5: Flash doesn't have mmproj; more cleanup; remember to use filter_tensors; mimo-v2.5: fix trailing whitespace…

Smol AI News · news-outlet · 2d ago
not much happened today
**Thinking Machines** previewed their new **native interaction models** designed for **full-duplex multimodal interaction**, enabling real-time concurrent listening, speaking, watching, thinking, searching, and reacting, marking a shift beyond turn-based AI. This approach…

NVIDIA Developer Blog · official-blog · 8d ago
How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car
The automotive cockpit is undergoing a fundamental shift from rule-based interfaces to agentic, multimodal AI systems capable of reasoning, planning, and...

MIT News — AI · research · 13d ago
Solving the “Whac-a-mole dilemma”: A smarter way to debias AI vision models
A new debiasing technique called WRING avoids creating or amplifying biases that can occur with existing debiasing approaches.

Vercel — AI · dev-tools · 14d ago
Vercel now supports Pro plan in Stripe Projects
You can now sign up for or upgrade to a Vercel Pro plan directly from Stripe Projects using shared payment tokens (SPTs). Agents and developers can manage plan changes programmatically from the Stripe CLI, without leaving their workflow. What’s new: Provision or upgrade to Vercel…

NVIDIA Developer Blog · official-blog · 15d ago
NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model
Agentic systems often reason across screens, documents, audio, video, and text within a single perception-to-action loop. However, they still rely on...

Smol AI News · news-outlet · 21d ago
not much happened today
**Alibaba** released **Qwen3.6-27B**, a dense, Apache 2.0 open coding model with thinking and non-thinking modes, outperforming the larger Qwen3.5-397B-A17B on multiple coding benchmarks including SWE-bench and Terminal-Bench. It supports native vision-language reasoning over…

Smol AI News · news-outlet · 23d ago
not much happened today
**Moonshot's Kimi K2.6** is a major open-weight **1T-parameter MoE** model featuring **32B active parameters**, **384 experts**, **MLA attention**, a **256K context window**, native multimodality, and **INT4 quantization**. It supports day-0 integration with platforms like…

NVIDIA Developer Blog · official-blog · 27d ago
How to Build Vision AI Pipelines Using NVIDIA DeepStream Coding Agents
Developing real-time vision AI applications presents a significant challenge for developers, often demanding intricate data pipelines, countless lines of code,...

Smol AI News · news-outlet · 1mo ago
not much happened today
**Meta Superintelligence Labs** launched **Muse Spark**, a natively multimodal reasoning model featuring tool use, visual chain of thought, and multi-agent orchestration. It is live on **meta.ai** and the Meta AI app with a private API preview and plans for open-sourcing future…

MIT News — AI · research · 1mo ago
Working to advance the nuclear renaissance
Dean Price, assistant professor in the Department of Nuclear Science and Engineering, sees a bright future for nuclear power, and believes AI can help us realize that vision.
Smol AI News · news-outlet · 1mo ago
not much happened today
**Gemma 4** was launched by **Google** under an **Apache 2.0 license**, marking a significant open-model release focused on **reasoning, agentic workflows, multimodality, and on-device use**. It outperforms models 10x larger and has immediate ecosystem support including…

NVIDIA Developer Blog · official-blog · 1mo ago
Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight
In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU...

NVIDIA Developer Blog · official-blog · 1mo ago
Bringing AI Closer to the Edge and On-Device with Gemma 4
The Gemmaverse expands with the launch of the latest Gemma 4 multimodal and multilingual models, designed to scale across the full spectrum of deployments, from...

Vercel — AI · dev-tools · 1mo ago
Qwen 3.6 Plus on AI Gateway
Qwen 3.6 Plus from Alibaba is now available on Vercel AI Gateway. Compared to Qwen 3.5 Plus, this model adds stronger agentic coding capabilities, from frontend development to repository-level problem solving, along with improved multimodal perception and reasoning. It features…

Vercel — AI · dev-tools · 1mo ago
Zero-configuration Go backend support
Go API backends can now be deployed on Vercel with zero configuration. Vercel now recognizes Go servers as first-class backends and automatically provisions the right resources and configures your application without redirects in vercel.json or the /api folder…

Smol AI News · news-outlet · 1mo ago
Gemma 4
**Google DeepMind** released **Gemma 4**, a family of open-weight, multimodal models with long-context support up to **256K tokens** under an **Apache 2.0 license**, marking a major capability and licensing shift. The lineup includes **31B dense**, **26B MoE (A4B)**, and two…

Hugging Face · official-blog · 1mo ago
Welcome Gemma 4: Frontier multimodal intelligence on device
Published April 2, 2026

Vercel — AI · dev-tools · 1mo ago
GLM 5V Turbo on AI Gateway
GLM 5V Turbo from Z.ai is now available on Vercel AI Gateway. GLM 5V Turbo is a multimodal coding model that turns screenshots and designs into code, debugs visually, and operates GUIs autonomously. It's strong at design-to-code generation, visual code generation, and…

Smol AI News · news-outlet · 1mo ago
not much happened today
**Arcee’s Trinity-Large-Thinking** was released with **open weights under Apache 2.0**, featuring a **400B total / 13B active** model size and strong agentic performance, ranking **#2 on PinchBench**. **Z.ai’s GLM-5V-Turbo** is a **vision coding model** with **native multimodal…
Hugging Face · official-blog · 1mo ago
Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents
Enterprise · Published March 31, 2026

MIT News — AI · research · 1mo ago
Augmenting citizen science with computer vision for fish monitoring
MIT Sea Grant works with the Woodwell Climate Research Center and other collaborators to demonstrate a deep learning-based system for fish monitoring.

NVIDIA Developer Blog · official-blog · 1mo ago
Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety
Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale,...

Smol AI News · news-outlet · 1mo ago
not much happened today
**Google** launched **Gemini 3.1 Flash Live**, a realtime voice and vision agent model with **2x longer conversation memory**, supporting **70 languages** and **128k context**. **Mistral AI** released **Voxtral TTS**, a low-latency, open-weight text-to-speech model supporting…

Vercel — AI · dev-tools · 1mo ago
new.website joins forces with v0
v0 and new.website have joined forces to accelerate our vision of helping anyone ship complete, production-ready software with AI. new.website was founded to make it effortless to create beautiful websites with all the tools included, from built-in forms to SEO. They’re joining…

MIT News — AI · research · 1mo ago
Generative AI improves a wireless vision system that sees through obstructions
With this new technique, a robot could more accurately detect hidden objects or understand an indoor scene using reflected Wi-Fi signals.

Smol AI News · news-outlet · 1mo ago
not much happened today
**OpenAI** released **GPT-5.4 mini** and **GPT-5.4 nano**, their most capable small models optimized for coding, multimodal understanding, and subagents, featuring a **400k context window** and over **2x speed** compared to GPT-5 mini. The mini model approaches larger GPT-5.4…

Import AI · news-outlet · 1mo ago
ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text
Will AI cause a political interregnum?

Smol AI News · news-outlet · 2mo ago
not much happened today
**Alibaba** released the **Qwen 3.5** series with models ranging from **0.8B to 9B** parameters, featuring **native multimodality**, **scaled reinforcement learning**, and targeting **edge and lightweight agent** deployments. The models support very long context windows up to…

NVIDIA Developer Blog · official-blog · 2mo ago
Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
Alibaba has introduced the new open source Qwen3.5 series built for native multimodal agents. The first model in this series is a ~400B parameter native...

NVIDIA Developer Blog · official-blog · 2mo ago
Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities
Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms,...
Smol AI News · news-outlet · 2mo ago
Qwen3.5-397B-A17B: the smallest Open-Opus-class, very efficient model
**Alibaba** released **Qwen3.5-397B-A17B**, an open-weight model featuring **native multimodality**, **spatial intelligence**, and a **hybrid linear attention + sparse MoE** architecture supporting **201 languages** and **long context windows** up to **256K tokens**. The model…

NVIDIA Developer Blog · official-blog · 3mo ago
R²D²: Scaling Multimodal Robot Learning with NVIDIA Isaac Lab
Building robust, intelligent robots requires testing them in complex environments. However, gathering data in the physical world is expensive, slow, and often...

Smol AI News · news-outlet · 3mo ago
Context Graphs: Hype or actually Trillion-dollar opportunity?
**Zhipu AI** launched **GLM-OCR**, a lightweight **0.9B** multimodal OCR model excelling in complex document understanding with top benchmark scores and day-0 deployment support from **lmsys**, **vllm**, and **novita labs**. **Ollama** enabled local-first usage with easy offline…

Smol AI News · news-outlet · 3mo ago
Moonshot Kimi K2.5: Beats Sonnet 4.5 at half the cost, SOTA Open Model, first Native Image+Video, 100 parallel Agent Swarm manager
**MoonshotAI's Kimi K2.5** is a **32B-active, 1T-parameter open-weights model** featuring **native multimodality** with image and video understanding, built through continual pretraining on **15 trillion mixed visual and text tokens**. It introduces a new **MoonViT vision…

Google DeepMind · official-blog · 5mo ago
Introducing Nano Banana Pro
Nov 20, 2025 · Naina Raisinghani, Product Manager, Google DeepMind
Turn your visions into studio-quality designs with unprecedented control, improved text rendering and enhanced world knowledge.

Google DeepMind · official-blog · 6mo ago
MedGemma: Our most capable open models for health AI development
We’re announcing new multimodal models in the MedGemma collection, our most capable open models for health AI development.

Google DeepMind · official-blog · 6mo ago
Gemini 2.5 Flash-Lite is now ready for scaled production use
Gemini 2.5 Flash-Lite, previously in preview, is now stable and generally available. This cost-efficient model provides high quality in a small size, and includes 2.5 family features like a 1 million-token context window and multimodality.

Zed Editor · dev-tools · 8mo ago
Sequoia Backs Zed's Vision for Collaborative Coding
This investment lets us pursue our vision for bringing a new kind of collaboration directly into the IDE.

Google DeepMind · official-blog · 11mo ago
Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI
Gemma 3n is a cutting-edge open model designed for fast, multimodal AI on devices, featuring optimized performance, unique flexibility with a 2-in-1 model, and expanded multimodal understanding with audio, empowering developers to build live, interactive applications and…

Google DeepMind · official-blog · 11mo ago
Our vision for building a universal AI assistant
We’re extending Gemini to become a world model that can make plans and imagine new experiences by simulating aspects of the world.

Zed Editor · dev-tools · 27mo ago
We Have to Start Over: From Atom to Zed
Thorsten interviews co-founders Nathan, Max, and Antonio about the vision and the technological choices behind Zed, and how they went from Atom and Electron to Rust and GPUs.
Chip Huyen · research · 31mo ago
Multimodality and Large Multimodal Models (LMMs)
For a long time, each ML model operated in one data mode: text (translation, language modeling), image (object detection, image classification), or audio (speech recognition). However, natural intelligence is not limited to just a single modality. Humans can read, talk, and…

Lil'Log (Lilian Weng) · research · 47mo ago
Generalized Visual Language Models
Processing images to generate text, such as image captioning and visual question-answering, has been studied for years. Traditionally, such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given a…

Eugene Yan · research · 51mo ago
Mailbag: How to Define a Data Team's Vision and Roadmap
I'm heading into a team lead role and would like to define the vision and roadmap.

Eugene Yan · research · 58mo ago
Bootstrapping Labels via ___ Supervision & Human-In-The-Loop
How to generate labels from scratch with semi-, active, and weakly supervised learning.

Lil'Log (Lilian Weng) · research · 103mo ago
Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
I've never worked in the field of computer vision and have no idea how the magic could work when an autonomous car is configured to tell apart a stop sign from a pedestrian in a red hat. To motivate myself to look into the maths behind object recognition and detection…
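A quick sketch of the quadratic attention cost flagged in the Elastic Attention Cores post above. This is a minimal back-of-the-envelope calculation for a plain ViT, assuming square inputs and a 16-pixel patch size (illustrative defaults, not figures from that paper):

```python
# Sketch: why dense ViT self-attention gets expensive at higher resolutions.
# A plain ViT splits an H x W image into non-overlapping P x P patches,
# giving N = (H // P) * (W // P) tokens; dense self-attention then forms an
# N x N score matrix, so the cost grows as N**2.

def dense_attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (token count N, pairwise score count N**2) for one attention layer."""
    n = (height // patch) * (width // patch)
    return n, n * n

for side in (224, 448, 896):  # doubling the resolution at each step
    n, scores = dense_attention_cost(side, side)
    print(f"{side}x{side}: N = {n:4d} tokens, N^2 = {scores:,} scores")

# 224x224: N =  196 tokens, N^2 = 38,416 scores
# 448x448: N =  784 tokens, N^2 = 614,656 scores
# 896x896: N = 3136 tokens, N^2 = 9,834,496 scores
# Doubling the image side quadruples N and grows the score matrix 16x.
```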
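And a minimal illustration of the float16 precision issues the two llama.cpp entries above are patching (accumulating in f32, fixing an f16 overflow). This assumes nothing about the actual WebGPU shaders in those commits; it only shows the generic failure mode of a naive float16 accumulator, which stalls once the running sum's representable spacing exceeds the addend:

```python
import numpy as np

# float16 carries ~11 significant bits, so above 256.0 its spacing is 0.25
# and adding 0.1 rounds to zero. A naive f16 accumulator therefore stalls;
# the standard fix is to accumulate in float32 and cast once at the end.
vals = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for v in vals:                         # naive f16 accumulation
    acc16 = np.float16(acc16 + v)

acc32 = vals.astype(np.float32).sum()  # f32 accumulation of the same values

print(acc16)  # 256.0 -- stuck: 0.1 is below half an ulp of float16 at 256
print(acc32)  # 409.5 -- the exact sum of 4096 copies of float16(0.1)
```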