r/LocalLLaMA

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is the quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The model was quantized with Model Optimizer.

Post Training Quantization

This model was obtained by quantizing the weights of Qwen3.6-35B-A3B to the NVFP4 data type and is ready for inference with vLLM. Only the weights and activations of the linear operators within the transformer blocks in the MoE are quantized. This reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 3.06x.
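Going from 16 to 4 bits would give 4x if every parameter were quantized with no overhead, so the reported 3.06x implies that some parameters stay in 16-bit. The sketch below is a back-of-envelope estimate, not a figure from the model card. It assumes NVFP4 stores one 8-bit scale per block of 16 four-bit values (about 4.5 bits per quantized parameter) and ignores per-tensor scales and non-weight files. The function names are hypothetical.

```python
def effective_bits(value_bits: int = 4, scale_bits: int = 8, block_size: int = 16) -> float:
    """Average bits per quantized parameter, counting the shared block scale.

    Assumption (not stated in the post): one 8-bit scale per 16-value block.
    """
    return value_bits + scale_bits / block_size


def quantized_fraction(ratio: float, base_bits: float = 16.0, quant_bits: float = 4.5) -> float:
    """Fraction of parameters that must be quantized to reach a given size ratio.

    Solves base_bits / (f * quant_bits + (1 - f) * base_bits) = ratio for f.
    """
    avg_bits = base_bits / ratio  # average bits per parameter after quantization
    return (base_bits - avg_bits) / (base_bits - quant_bits)


q_bits = effective_bits()
frac = quantized_fraction(3.06, quant_bits=q_bits)
print(f"effective bits per quantized parameter: {q_bits}")
print(f"implied fraction of parameters quantized: {frac:.3f}")
```

Under these assumptions, roughly 94% of the parameters would be stored in NVFP4, which is consistent with only the MoE linear operators being quantized while the remaining layers stay in 16-bit.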

Evaluation

The accuracy benchmark results are presented in the table below:

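| Precision | MMLU Pro | GPQA Diamond | τ²-Bench Telecom | SciCode | AIME 2025 | AA-LCR | IFBench | MMMU PRO |
|-----------|----------|--------------|------------------|---------|-----------|--------|---------|----------|
| BF16      | 85.6     | 84.9         | 95.5             | 40.8    | 89.2      | 62.0   | 62.3    | 74.1     |
| NVFP4     | 85.0     | 84.8         | 94.7             | 40.6    | 88.8      | 62.0   | 62.8    | 74.5     |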
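To read the table at a glance, the per-benchmark change from BF16 to NVFP4 can be computed directly from the scores above. This is a small sketch using only the numbers in the table. The variable names are illustrative.

```python
BENCHMARKS = [
    "MMLU Pro", "GPQA Diamond", "τ²-Bench Telecom", "SciCode",
    "AIME 2025", "AA-LCR", "IFBench", "MMMU PRO",
]
BF16 = [85.6, 84.9, 95.5, 40.8, 89.2, 62.0, 62.3, 74.1]
NVFP4 = [85.0, 84.8, 94.7, 40.6, 88.8, 62.0, 62.8, 74.5]

# Score change per benchmark (NVFP4 minus BF16), rounded to the table's precision.
deltas = {
    name: round(quant - base, 1)
    for name, base, quant in zip(BENCHMARKS, BF16, NVFP4)
}

mean_delta = round(sum(deltas.values()) / len(deltas), 2)
largest_drop = min(deltas, key=deltas.get)  # benchmark with the most negative change

for name, delta in deltas.items():
    print(f"{name:<18} {delta:+.1f}")
print(f"mean change: {mean_delta:+.2f}")
print(f"largest drop: {largest_drop} ({deltas[largest_drop]:+.1f})")
```

On these numbers the average change is -0.15 points, the largest drop is 0.8 points on τ²-Bench Telecom, and NVFP4 scores slightly higher than BF16 on IFBench and MMMU PRO.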
submitted by /u/pmttyji