r/LocalLLaMA · · 2 min read

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Model Summary

Total Parameters 550B (55B active)
Architecture LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)
Context Length Up to 1M tokens
Minimum GPU Requirement 8x GB200/B200/GB300/B300, 16x H100, 8x H200
Supported Languages English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese
Best For Frontier reasoning, complex agentic workflows, long-context analysis, tool use, multilingual reasoning, high-stakes RAG
Reasoning Mode Configurable on/off via chat template (enable_thinking=True/False)
License OpenMDW License Agreement, version 1.1
Release Date June 4, 2026

What is Nemotron?

NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents.

Description

Nemotron-3-Ultra-550B-A55B-BF16 is a frontier-scale large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for the most demanding workloads, including complex multi-step agents, long-context analysis, and high-accuracy reasoning over code, math, and science. Like other models in the family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template.

The model employs a hybrid Latent Mixture-of-Experts (LatentMoE) architecture, utilizing interleaved Mamba-2 and MoE layers, along with select Attention layers. Like the Super model, the Ultra model incorporates Multi-Token Prediction (MTP) layers for faster text generation and improved quality, and it is trained using an NVFP4 pre-training recipe to maximize compute efficiency. The model has 55B active parameters and 550B parameters in total.

The supported languages include: English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Korean, Brazilian Portuguese, and Chinese.

This model is ready for commercial and non-commercial use.

Too big to run locally on my setup, 8xH200 anyone?

submitted by /u/jacek2023
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA