Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
AI-Generated Summary
- MiniMax M3, a 428B parameter Mixture-of-Experts model with 1M-token context and native multimodality, leverages NVIDIA Blackwell infrastructure to unify text, vision, and code tasks, supporting agentic workflows and extended creative applications within a single architecture.
- The core MiniMax Sparse Attention mechanism replaces standard quadratic attention with a pre-filtering stage, enabling more than 4x faster contiguous KV cache access, 1/20th per-token compute cost at 1M context, and significant speedups in prefill and decoding, with no loss in precision or compression of key-values.
- Deployment and customization leverage the NVIDIA ecosystem, including open source inference on TensorRT LLM, SGLang, and vLLM, large-scale serving with NVIDIA Dynamo for disaggregated inference, and advanced fine-tuning or RL via the NVIDIA NeMo Framework with full N-D parallelism and context parallelism up to 128k tokens.
AI-generated content may summarize information incompletely. Verify important information. Learn more
As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and code—leading to added complexity, higher costs, and slower iteration.
MiniMax M3—available on NVIDIA accelerated infrastructure including NVIDIA Blackwell—changes this by enabling a single multimodal system capable of long-context reasoning, agentic workflows, and creative tasks.
The 428B parameter MoE supports up to 1M tokens and native multimodal input. Developers can build applications like long video understanding, extended coding sessions (8+ hours), and high-quality design workflows—all with a unified model and production-ready deployment paths on NVIDIA platforms.
| Name | MiniMax M3 |
| Input modalities | Video, image, text |
| Total parameters | 428B |
| Visual encoder parameters | 600M |
| Active parameters | 22B |
| Context length | 1M |
| Experts | Total 128, 4 experts activated per token |
| Precision format | BF16, MXFP8 |
MiniMax M3’s core architectural innovation is MiniMax Sparse Attention (MSA), which replaces standard quadratic attention with a pre-filtering stage that identifies relevant context blocks and attends only to those. At the operator level, each KV cache block is read once with contiguous memory access—more than 4x faster than existing sparse attention implementations. This yields 1/20th the per-token compute of M2 at 1M-token context, with 9x faster prefill and 15x faster decoding, all without compressing key-values or sacrificing precision. The model also trains text, images, and video natively from step 0 across ~100 trillion interleaved tokens, rather than adding multimodality post-training.
Open source inference
Developers can use accelerated computing with their open source inference engine of choice, such as NVIDIA TensorRT LLM (text-only), SGLang or vLLM.
Deploying with NVIDIA TensorRT LLM
The optimizations are available on the NVIDIA TensorRT LLM GitHub repository. Follow the quick start guide to stand up a high-performance server—it covers downloading model checkpoints from Hugging Face, a ready-to-run Docker container, and configuration options for both low-latency and max-throughput serving. NVIDIA also collaborated on the developer experience through the Transformers library.
Deploying with SGLang
Users deploying models with the SGLang serving framework can use the following instructions. See the SGLang documentation for more information and configuration options.
# 8 GPUs node case
$ python -m sglang.launch_server \
--model-path MiniMaxAI/MiniMax-M3 \
--dtype bfloat16 \
--tp-size 8 \
--ep-size 8 \
--trust-remote-code \
--mem-fraction-static 0.8 \
--enable-multimodal \
--quantization mxfp8 \
--attention-backend flashinfer \
--mm-attention-backend flashinfer_cudnn \
--moe-runner-backend deep_gemm \
--chunked-prefill-size 8192 \
--reasoning-parser minimax-m3 \
--tool-call-parser minimax-m3-nom
--tr
Deploying with vLLM
When deploying models with the vLLM serving framework, use the following instructions. For more information, see the vLLM Recipe.
vllm serve MiniMaxAI/MiniMax-M3 \ --tensor-parallel-size 8 \ --enable-expert-parallel \ --block-size 128 \ --mm-encoder-attn-backend FLASHINFER \ --mm-processor-cache-type shm \ --tool-call-parser minimax_m3 \ --enable-auto-tool-choice \ --reasoning-parser minimax_m3 \ --trust-remote-code
Scaling with NVIDIA Dynamo
Dynamo is an open source distributed inference serving platform for developers to deploy frontier models like MiniMax M3 for large-scale applications. Deploying MiniMax M3 using Dynamo with TensorRT LLM improves performance for long input sequence lengths without sacrificing throughput or increasing GPU budget. At 32k ISL, Dynamo delivers a 4x improvement in interactivity on NVIDIA Blackwell through disaggregated serving—a technique that separates the prefill and decode phases of inference across distinct GPUs to increase system efficiency.
Dynamo integrates with all major inference engines and frameworks, including PyTorch, SGLang, TensorRT LLM, and vLLM, and offers LLM-aware routing, elastic autoscaling, and low-latency data transfer. Developers can follow the deployment guide to run MiniMax M3 with Dynamo.
Customize with NVIDIA NeMo Framework
MiniMax M3 can be customized and fine-tuned with the open source NVIDIA NeMo Framework. Users can:
- Use NVIDIA NeMo AutoModel for out-of-the-box fine-tuning (both SFT and LoRA) over Hugging Face checkpoints without any conversion, with high-throughput acceleration from full N-D parallelism. Specifically, context parallel support is available for sequence lengths up to 128k.
- Use NVIDIA NeMo RL to conduct reinforcement learning on top of Minimax M3, referencing the following sample accuracy curves.
These libraries provide developers with a suite of lightweight tools for rapid experimentation on the latest frontier models.
Get started today
Developers can prototype and evaluate MiniMax M3 by using the GPU-accelerated API on build.nvidia.com or by downloading the weights from Hugging Face.
Tags
About the Authors
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.
Comments
More from NVIDIA Developer Blog
-
NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
Jun 12
-
One-Click Multi-Tenant Security with NVIDIA Quantum InfiniBand
Jun 11
-
Run DiffusionGemma on NVIDIA for Developer-Ready, High-Throughput Text Generation
Jun 10
-
Designing Production-Ready Battery Energy Storage Systems for AI Factories
Jun 10
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.