r/LocalLLaMA · May 30, 2026 · 1 min read

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is a quantized version of Alibaba's Qwen3.6-35B-A3B, an auto-regressive language model that uses an optimized transformer architecture. It was quantized with Model Optimizer.

Post Training Quantization

This model was obtained by quantizing Qwen3.6-35B-A3B to the NVFP4 data type and is ready for inference with vLLM. Only the weights and activations of the linear operators within transformer blocks in MoE are quantized. This reduces the number of bits per quantized parameter from 16 to 4 and cuts disk size and GPU memory requirements by approximately 3.06x. The reduction falls short of the nominal 4x, presumably because the remaining layers stay at 16-bit precision and NVFP4 stores per-block scaling factors alongside the 4-bit values.
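To put the 3.06x figure in concrete terms, here is a rough back-of-the-envelope sketch. It assumes a total of 35 billion parameters (read off the "35B" in the model name, not confirmed by the model card) and a 16-bit baseline; `footprint_gb` is a hypothetical helper written for this illustration, and real checkpoint sizes will differ somewhat.

```python
# Back-of-the-envelope footprint estimate.
# TOTAL_PARAMS is an assumption taken from the "35B" in the model name;
# REDUCTION is the approximate figure quoted in the model card.

TOTAL_PARAMS = 35e9   # assumed total parameter count
BF16_BITS = 16        # bits per parameter before quantization
REDUCTION = 3.06      # quoted disk / GPU memory reduction


def footprint_gb(params: float, bits_per_param: float) -> float:
    """Return storage in gigabytes (10**9 bytes) for a given bit width."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = footprint_gb(TOTAL_PARAMS, BF16_BITS)
nvfp4_gb = bf16_gb / REDUCTION
effective_bits = BF16_BITS / REDUCTION  # average bits per parameter overall

print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")
print(f"Effective bits per parameter: ~{effective_bits:.2f}")
```

Under these assumptions the weights shrink from roughly 70 GB to roughly 23 GB, and the average width works out to a little over 5 bits per parameter rather than exactly 4, which is consistent with only part of the network being quantized.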

Evaluation

The accuracy benchmark results are presented in the table below:

Precision	MMLU Pro	GPQA Diamond	τ²-Bench Telecom	SciCode	AIME 2025	AA-LCR	IFBench	MMMU Pro
BF16	85.6	84.9	95.5	40.8	89.2	62.0	62.3	74.1
NVFP4	85.0	84.8	94.7	40.6	88.8	62.0	62.8	74.5
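The gap between the two rows is easier to read as per-benchmark differences. The sketch below copies the scores from the table and computes NVFP4 minus BF16 for each column; the `SCORES` dictionary and the `deltas` helper are illustrative names, not part of any released tooling.

```python
# Scores copied from the table above: benchmark -> (BF16, NVFP4).
SCORES = {
    "MMLU Pro": (85.6, 85.0),
    "GPQA Diamond": (84.9, 84.8),
    "tau2-Bench Telecom": (95.5, 94.7),  # written as τ²-Bench Telecom in the table
    "SciCode": (40.8, 40.6),
    "AIME 2025": (89.2, 88.8),
    "AA-LCR": (62.0, 62.0),
    "IFBench": (62.3, 62.8),
    "MMMU Pro": (74.1, 74.5),
}


def deltas(scores: dict) -> dict:
    """Return NVFP4 minus BF16 per benchmark, rounded to one decimal place."""
    return {name: round(nvfp4 - bf16, 1) for name, (bf16, nvfp4) in scores.items()}


d = deltas(SCORES)
for name, delta in d.items():
    print(f"{name:<20} {delta:+.1f}")

mean_delta = round(sum(d.values()) / len(d), 2)
largest_drop = min(d, key=d.get)
print(f"Mean delta: {mean_delta:+.2f}")
print(f"Largest drop: {largest_drop} ({d[largest_drop]:+.1f})")
```

On these numbers no benchmark moves by more than one point in either direction, and two of the eight (IFBench and MMMU Pro) score slightly higher after quantization. The table gives no run-to-run variance, so differences this small should be read with caution.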

submitted by /u/pmttyji