Run Step 3.7 Flash on NVIDIA GPUs with Enterprise-Ready Multimodal AI
Mirrored from NVIDIA Developer Blog for archival readability. Support the source by reading on the original site.
AI-Generated Summary
- Step 3.7 Flash from StepFun is a 198 billion-parameter Mixture-of-Experts vision-language model optimized for enterprise-scale workflows combining perception, search, and multi-step reasoning with native image and video input and a 256k context window.
- The model supports deployment via open-source frameworks like SGLang, NVIDIA TensorRT-LLM, and vLLM, leveraging NVIDIA-accelerated infrastructure and GPU-accelerated endpoints for prototyping and evaluation.
- NVIDIA NIM enables production-ready deployment of Step 3.7 Flash as containerized inference microservices with standardized APIs, supporting on-premises, cloud, and hybrid environments, alongside Day 0 fine-tuning capabilities using the NVIDIA NeMo framework for domain-specific customization.
AI applications are moving beyond text generation to multimodal systems that can perceive, search, and reason across images, documents, video, and language in real time—turning fragmented information into actionable insights.
Step 3.7 Flash, the latest model from StepFun, brings these capabilities to enterprise-scale production on NVIDIA-accelerated infrastructure. It is a 198B-parameter Mixture-of-Experts vision-language model with approximately 11B activated parameters per forward pass, optimized for agentic workflows that combine perception, search, and multi-step reasoning.
With native image and video input, three configurable reasoning levels (low, medium, and high), and a 256k context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal workloads. Developers can use StepFun’s NVFP4-quantized checkpoint, available through Hugging Face, for faster inference thanks to reduced memory bandwidth and storage requirements.
| Model | Step 3.7 Flash |
| --- | --- |
| Total parameters | 198B |
| Visual encoder parameters | 1.8B |
| Active parameters | 11B |
| Context length | 256K |
| Experts | 288 (8 active) |
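
To make the table's numbers concrete, here is a rough sizing sketch. The parameter and expert counts come from the table. The bits-per-weight figures are assumptions for illustration: BF16 at 16 bits, and NVFP4 at about 4.5 bits (4-bit values plus one 8-bit scale per 16-value block). It also assumes every weight is quantized, which real checkpoints typically do not do, so treat the results as a lower-bound estimate and not as measured checkpoint sizes.

```python
# Back-of-the-envelope sizing from the spec table above.
TOTAL_PARAMS = 198e9      # total parameters (from the table)
ACTIVE_PARAMS = 11e9      # parameters activated per forward pass (from the table)
TOTAL_EXPERTS = 288       # experts (from the table)
ACTIVE_EXPERTS = 8        # active experts (from the table)

# ASSUMPTION: bits per weight. NVFP4 is modeled as 4-bit values plus one
# 8-bit scale per 16-value block, applied to every weight in the model.
BITS_PER_WEIGHT = {"bf16": 16.0, "nvfp4": 4.0 + 8.0 / 16.0}


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS
sizes = {name: weight_gb(TOTAL_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}

print(f"active parameter fraction: {active_fraction:.1%}")
print(f"active expert fraction:    {expert_fraction:.1%}")
for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB of weights")
```

Under these assumptions, roughly 5.6% of the parameters are active per forward pass, and the weights take about 396 GB in BF16 versus about 111 GB in NVFP4. Activations and the KV cache for long contexts need additional memory on top of that.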
Step 3.7 Flash can be deployed with open-source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM to use kernels optimized for NVIDIA hardware.
Build with NVIDIA endpoints
Developers can use GPU-accelerated endpoints available through build.nvidia.com for prototyping and evaluating Step 3.7 Flash. Test this out in the demo notebook, which uses Step 3.7 Flash and NVIDIA Nemotron Parse. Its multi-step document intelligence pipeline extracts structured insights, with bounding boxes, from large, complex documents such as financial reports, slide decks, and scientific papers (including PDFs), and organizes the output.
Production-ready deployment with NVIDIA NIM
NVIDIA NIM makes it easy to take Step 3.7 Flash from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it on-premises, in the cloud, or across hybrid environments. NIM exposes a standard OpenAI-compatible API for sending inference requests to the NIM server.
- Download the NIM container from the NVIDIA container registry (enterprise license required).
- Start the NIM server, then connect to it with the OpenAI client.
- Send either text or image input to the endpoint.
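```python
from openai import OpenAI

# Point the OpenAI client at the local NIM server; no real API key is needed.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required",
)

# Request a streamed chat completion from Step 3.7 Flash.
completion = client.chat.completions.create(
    model="stepfun/step-3.7-flash",
    messages=[{"role": "user", "content": "Explain particle physics?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True,
)

# Print each streamed piece of text as it arrives, skipping empty deltas.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```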
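
The request above sends text only. For the image case mentioned in the steps, a minimal sketch of the message payload might look like the following. It assumes the NIM endpoint accepts the OpenAI-style `image_url` content part with a base64 data URI, which the source does not confirm. The helper name `build_image_message`, the file name, and the prompt are hypothetical.

```python
import base64
import mimetypes


def build_image_message(image_bytes: bytes, filename: str, prompt: str) -> dict:
    """Build one OpenAI-style user message holding a text part and an image part."""
    # Guess the MIME type from the file name; fall back to PNG if unknown.
    mime_type = mimetypes.guess_type(filename)[0] or "image/png"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # ASSUMPTION: the server accepts inline base64 data URIs here.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes stand in for a real image read with open(path, "rb").
message = build_image_message(b"\x89PNG placeholder", "chart.png", "Summarize this chart.")
print(message["content"][1]["image_url"]["url"][:22])
```

To use it, pass `messages=[message]` to `client.chat.completions.create(...)` in place of the text-only message, keeping the other arguments unchanged.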
Day 0 fine-tuning with NVIDIA NeMo Framework
Step 3.7 Flash can be customized with domain-specific data using open libraries from the NVIDIA NeMo framework. The NVIDIA NeMo Automodel library combines native PyTorch n-D parallelism with optimized performance and supports Day 0 fine-tuning directly from Hugging Face model checkpoints, with no checkpoint conversion. The Automodel fine-tuning recipe for Step 3.7 Flash supports techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA at 600 tokens/sec on Hopper GPUs.
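
To illustrate why LoRA is described as memory-efficient, the sketch below counts trainable parameters for a single linear layer. LoRA freezes the dense weight matrix and trains two low-rank factors in its place, so the trainable count scales with the rank and not with the full matrix size. The layer dimensions and rank here are hypothetical values chosen for illustration; they are not taken from Step 3.7 Flash or from the Automodel recipe.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix W (d_out x d_in) that LoRA keeps frozen."""
    return d_in * d_out


# HYPOTHETICAL layer shape and rank, chosen only for illustration.
D_IN, D_OUT, RANK = 4096, 4096, 16

lora = lora_trainable_params(D_IN, D_OUT, RANK)
dense = full_params(D_IN, D_OUT)
print(f"dense: {dense:,}  lora: {lora:,}  ratio: {lora / dense:.2%}")
```

For this hypothetical layer, the LoRA factors hold under 1% as many parameters as the dense matrix, which is why gradients and optimizer state for LoRA need far less memory than full fine-tuning.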
For advanced large-scale training, teams can also use the NeMo Megatron-Bridge fine-tuning recipe, which provides additional performance optimizations.
From data center deployments on NVIDIA Blackwell to deskside with NVIDIA DGX Station to managed NIM microservices and Day 0 fine-tuning workflows, NVIDIA provides a range of options for integrating Step 3.7 Flash across different stages of development and deployment. With 748 GB of coherent memory, DGX Station is ideal for running Step 3.7 Flash with increased headroom for the full 256k context length, and faster local developer iteration.
NVIDIA is an active contributor to the open-source ecosystem and has released several hundred projects under open source licenses. NVIDIA is committed to open models such as Step 3.7 Flash that promote AI transparency and enable users to share their AI safety and resilience work.
To get started, check out Step 3.7 Flash on Hugging Face, test it with your own data on build.nvidia.com, or run it locally on DGX Station using the vLLM Playbook.
About the Authors
Anu Srivastava is a senior technical marketing manager who focuses on NVIDIA’s lighthouse AI model collaborations. She works with key partners and foundations to enable NVIDIA accelerated platform support for the open source developer ecosystem. Prior to NVIDIA, she worked at Google for over a decade in various engineering and management roles and holds a degree in computer science from the University of Texas at Austin.