NVIDIA Developer Blog · May 29, 2026 · 4 min read

Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI

#multimodal #gpu

Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.

May 28, 2026

By Anu Srivastava

AI-Generated Summary

Step 3.7 Flash from StepFun is a 198-billion-parameter Mixture-of-Experts vision-language model optimized for enterprise-scale workflows that combine perception, search, and multi-step reasoning, with native image and video input and a 256k context window.
The model supports deployment via open-source frameworks like SGLang, NVIDIA TensorRT-LLM, and vLLM, leveraging NVIDIA-accelerated infrastructure and GPU-accelerated endpoints for prototyping and evaluation.
NVIDIA NIM enables production-ready deployment of Step 3.7 Flash as containerized inference microservices with standardized APIs, supporting on-premises, cloud, and hybrid environments, alongside Day 0 fine-tuning capabilities using the NVIDIA NeMo framework for domain-specific customization.

AI-generated content may summarize information incompletely. Verify important information.

AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time—turning fragmented information into actionable insights.

Step 3.7 Flash, the latest model from StepFun, brings these capabilities to enterprise-scale production and is available on NVIDIA-accelerated infrastructure. It is a 198B-parameter Mixture-of-Experts vision-language model that activates approximately 11B parameters per forward pass and is optimized for agentic workflows that combine perception, search, and multi-step reasoning.

With native image and video input, three configurable reasoning levels (low, medium, and high), and a 256k context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal workloads. Developers can use StepFun’s NVFP4-quantized checkpoint, available through Hugging Face, for faster inference: the 4-bit weights reduce memory bandwidth and storage requirements.

Model	Step 3.7 Flash
Total parameters	198B
Visual encoder parameters	1.8B
Active parameters	11B
Context length	256K
Experts	288 (8 active)

Table 1. Overview of the key Step 3.7 Flash specs, such as parameter counts, context length, and MoE configuration
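The figures in Table 1 imply how sparse the model is at inference time. The sketch below derives a few ratios from them. The 2-FLOPs-per-active-parameter rule is a common back-of-the-envelope approximation, not a number from StepFun or NVIDIA, and it ignores attention over long contexts and the vision encoder.

```python
# Values taken from Table 1.
TOTAL_PARAMS = 198e9
ACTIVE_PARAMS = 11e9
NUM_EXPERTS = 288
ACTIVE_EXPERTS = 8


def moe_sparsity(total_params, active_params, num_experts, active_experts):
    """Return simple ratios describing how sparse a Mixture-of-Experts model is."""
    return {
        "active_param_fraction": active_params / total_params,
        "active_expert_fraction": active_experts / num_experts,
        # Assumption: roughly one multiply and one add per active weight per token.
        "approx_forward_gflops_per_token": 2 * active_params / 1e9,
    }


stats = moe_sparsity(TOTAL_PARAMS, ACTIVE_PARAMS, NUM_EXPERTS, ACTIVE_EXPERTS)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, each token touches a bit under 6% of the weights and under 3% of the experts, which is why a 198B-parameter model can serve requests at a per-token cost closer to that of an 11B dense model.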

Figure 1. A high-level diagram of the Step 3.7 Flash components for text and vision processing: text and images pass through the vision encoder and the core language model to produce text output

Step 3.7 Flash can be deployed with open source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM to utilize kernels optimized for NVIDIA hardware.

Build with NVIDIA endpoints

Developers can use GPU-accelerated endpoints available through build.nvidia.com for prototyping and evaluating Step 3.7 Flash. Test this out in the demo notebook, which uses Step 3.7 Flash and NVIDIA Nemotron Parse. The multi-step document intelligence pipeline extracts structured insights, with bounding boxes, from large, complex documents such as financial reports, slide decks, and scientific papers, including PDFs, and organizes the output.
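To make "organizes the output" concrete, here is a minimal sketch of one way extracted elements with bounding boxes could be grouped by page and put into reading order. The `Element` schema, the `organize_by_page` helper, and the sample values are all hypothetical and made up for illustration. They are not the notebook's actual data model or the output format of Nemotron Parse.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One extracted item (hypothetical schema for illustration)."""
    page: int
    kind: str     # e.g. "title", "table", "text"
    text: str
    bbox: tuple   # (x0, y0, x1, y1), with y growing downward


def organize_by_page(elements):
    """Group elements by page, then sort each page top-to-bottom, left-to-right."""
    pages = defaultdict(list)
    for el in elements:
        pages[el.page].append(el)
    return {
        page: sorted(items, key=lambda el: (el.bbox[1], el.bbox[0]))
        for page, items in sorted(pages.items())
    }


# Made-up sample elements, deliberately out of order.
sample = [
    Element(2, "text", "Body paragraph", (50, 300, 550, 340)),
    Element(1, "table", "Q1 | Q2 | Q3", (50, 200, 550, 400)),
    Element(1, "title", "Annual Report", (50, 40, 550, 90)),
]
organized = organize_by_page(sample)
for page, items in organized.items():
    print(page, [el.kind for el in items])
```

Keeping the bounding box on every element lets a downstream step trace each extracted value back to its location in the source document.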

Video 1. See how document intelligence pipelines extract usable data, then follow the workflow in a JupyterLab notebook

Production-ready deployment with NVIDIA NIM

NVIDIA NIM makes it easy to take Step 3.7 Flash from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it on-premises, in the cloud, or across hybrid environments. NIM exposes a standard OpenAI-compatible API for sending inference requests to the NIM server.

Download the NIM container from the NVIDIA container registry (enterprise license required).
Start the server and connect to it with the OpenAI client.
Send either text or image input to the endpoint.

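from openai import OpenAI

# Point the OpenAI client at the local NIM server; the key is a placeholder.
client = OpenAI(
  base_url="http://0.0.0.0:8000/v1",
  api_key="no-key-required"
)

# Request a streamed chat completion from Step 3.7 Flash.
completion = client.chat.completions.create(
  model="stepfun/step-3.7-flash",
  messages=[{"role": "user", "content": "Explain particle physics?"}],
  temperature=0.5,
  top_p=1,
  max_tokens=1024,
  stream=True
)

# Print text as it arrives, skipping chunks that carry no content.
for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")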
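For the image case, a minimal sketch of the request body follows. It uses the OpenAI chat format's content parts, with the image inlined as a base64 data URL. That this NIM accepts `image_url` parts in this form is an assumption, so check the model card for the supported input format. The `build_image_message` helper and the placeholder bytes are hypothetical.

```python
import base64
import json


def build_image_message(prompt, image_bytes, mime_type="image/png"):
    """Build one user message with a text part and an inline base64 image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
message = build_image_message("Describe this chart.", b"\x89PNG placeholder")
payload = {
    "model": "stepfun/step-3.7-flash",  # model name from the text example
    "messages": [message],
    "max_tokens": 512,
}
print(json.dumps(payload, indent=2)[:200])
```

With the client from the text example, the same payload could be sent as `client.chat.completions.create(**payload)`.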

Day 0 fine-tuning with NVIDIA NeMo Framework

Step 3.7 Flash can be customized with domain-specific data using open libraries from the NVIDIA NeMo framework. The NVIDIA NeMo Automodel library combines native PyTorch n-D parallelisms with optimized performance and supports Day 0 fine-tuning directly from Hugging Face model checkpoints, without checkpoint conversion. The Automodel fine-tuning recipe for Step 3.7 Flash supports techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA at 600 tokens/sec on Hopper GPUs.
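LoRA is memory-efficient because it freezes each original weight matrix and trains only a low-rank update, the product of two thin matrices. The sketch below counts trainable parameters for a single layer. The layer dimensions and rank are illustrative values, not Step 3.7 Flash's actual configuration or the Automodel recipe's defaults.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable parameters for one weight matrix: full fine-tuning vs. LoRA."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora, lora / full


# Hypothetical layer size and rank, chosen only for illustration.
full, lora, ratio = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

For this example layer, LoRA trains under 1% of the parameters that full fine-tuning would, which shrinks gradient and optimizer-state memory accordingly.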

For advanced large-scale training, teams can also use the NeMo Megatron-Bridge fine-tuning recipe, which provides additional performance optimizations.

From data center deployments on NVIDIA Blackwell to deskside systems such as NVIDIA DGX Station, and from managed NIM microservices to Day 0 fine-tuning workflows, NVIDIA provides a range of options for integrating Step 3.7 Flash across different stages of development and deployment. With 748 GB of coherent memory, DGX Station is well suited to running Step 3.7 Flash, with headroom for the full 256k context length and faster local developer iteration.
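A rough calculation shows where that headroom comes from. The sketch below estimates weight storage for 198B parameters at two precisions and subtracts it from 748 GB. The bits-per-weight values are assumptions: 16 bits is a common unquantized baseline, and about 4.5 bits approximates NVFP4 as 4-bit values plus one 8-bit scale shared by each 16-value block. The estimate also assumes every weight is stored at that precision, which real checkpoints may not do, and it ignores the KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 198e9        # total parameters, from Table 1
COHERENT_MEMORY_GB = 748    # DGX Station figure quoted in the text

# Assumed average bits per weight (not stated in the post).
FORMATS = {"16-bit": 16.0, "NVFP4 (approx.)": 4.5}

headroom_gb = {}
for label, bits in FORMATS.items():
    weights = weight_footprint_gb(TOTAL_PARAMS, bits)
    headroom_gb[label] = COHERENT_MEMORY_GB - weights
    print(f"{label}: weights ~{weights:.1f} GB, headroom ~{headroom_gb[label]:.1f} GB")
```

Whatever remains after the weights is what the KV cache for long contexts and concurrent requests can draw on.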

NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open source licenses. NVIDIA is committed to open models such as Step 3.7 Flash that promote AI transparency and enable users to share their AI safety and resilience work.

To get started, check out Step 3.7 Flash on Hugging Face, test it with your own data on build.nvidia.com, or run it locally on DGX Station using the vLLM Playbook.

About the Authors

About Anu Srivastava
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.
