nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is the quantized version of Alibaba's Qwen3.6-35B-A3B model, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is quantized with Model Optimizer.

Post Training Quantization

This model was obtained by quantizing the weights of Qwen3.6-35B-A3B to the NVFP4 data type, ready for inference with vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 3.06x.

Evaluation

The accuracy benchmark results are presented in the table below:
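
A note on the 3.06x figure from the quantization paragraph: going from 16 to 4 bits would naively suggest 4x, so the smaller number is worth a quick sanity check. The sketch below is a rough estimate and not NVIDIA's own accounting. It assumes (this is not stated in the post) that NVFP4 stores one 8-bit scale per block of 16 values, and that everything outside the quantized linear layers stays at 16 bits. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the "approximately 3.06x" size reduction.
# Assumptions NOT taken from the post: one 8-bit scale per block of 16
# NVFP4 values, unquantized tensors kept at 16 bits, and the per-tensor
# scale ignored as negligible.

BASELINE_BITS = 16.0   # bits per parameter before quantization (from the post)
FP4_VALUE_BITS = 4.0   # bits per quantized value (from the post)
SCALE_BITS = 8.0       # assumed: one 8-bit scale per block
BLOCK_SIZE = 16        # assumed: number of values sharing one scale


def effective_nvfp4_bits() -> float:
    """Average bits per quantized parameter, including block-scale overhead."""
    return FP4_VALUE_BITS + SCALE_BITS / BLOCK_SIZE


def compression_ratio(quantized_fraction: float) -> float:
    """Overall size reduction when only a fraction of parameters is quantized."""
    avg_bits = (quantized_fraction * effective_nvfp4_bits()
                + (1.0 - quantized_fraction) * BASELINE_BITS)
    return BASELINE_BITS / avg_bits


def quantized_fraction_for(ratio: float) -> float:
    """Invert compression_ratio: fraction quantized implied by a given ratio."""
    target_avg_bits = BASELINE_BITS / ratio
    return (BASELINE_BITS - target_avg_bits) / (BASELINE_BITS - effective_nvfp4_bits())


if __name__ == "__main__":
    print(f"effective bits per NVFP4 parameter: {effective_nvfp4_bits():.2f}")
    print(f"ratio if every parameter were quantized: {compression_ratio(1.0):.2f}x")
    print(f"fraction quantized implied by 3.06x: {quantized_fraction_for(3.06):.3f}")
```

Under these assumptions, a 3.06x reduction corresponds to roughly 94% of the parameters being stored in NVFP4, which is consistent with the statement that only the linear operators in the MoE blocks are quantized. The real breakdown may differ.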