r/LocalLLaMA · May 31, 2026 · 2 min read

mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released!

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Description of the model:

I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory) — enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically $20-100 per quant.
If APEX quants are useful to you, your support directly funds those bigger runs.

Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — APEX-MTP GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with the MTP (multi-token prediction) head bundled for in-the-box self-speculative decoding.

What's different from the plain APEX repo?

These GGUFs bundle the model's MTP (multi-token prediction) head alongside the trunk in a single file, courtesy of llama.cpp PR #22673. With a recent llama.cpp (>= commit 255582687) you can enable self-speculative decoding using just this one file — no separate draft model needed:

llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp

The non-MTP version is still available at mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF — slightly smaller, but no self-spec.
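For readers new to self-speculative decoding, the draft-and-verify idea can be sketched roughly as follows. This is a toy greedy version, not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the trunk and the MTP head, and a real server verifies all drafted tokens in one batched forward pass rather than one call per token.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(target_next: NextFn, draft_next: NextFn,
                     context: List[Token], n_draft: int) -> List[Token]:
    """One greedy draft-and-verify step; returns the tokens accepted this step."""
    # 1. Draft: the cheap head proposes n_draft tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(n_draft):
        proposed.append(draft_next(context + proposed))

    # 2. Verify: keep proposals while they match what the target would emit.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Hypothetical toy "models": the target counts upward mod 10,
# the draft agrees except right after a 3.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [1], 3))  # [2, 3, 4]
```

The output always matches what the target alone would generate; a better draft only increases how many tokens are accepted per step, which is why draft accuracy matters for throughput.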

File sizes

Each quant is ~2.5% larger than its non-MTP counterpart (one extra transformer-block worth of weights, no embedding duplication since MTP shares the trunk's embed_tokens).
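As a back-of-envelope check on the ~2.5% figure: if all blocks were the same size and stored at the same precision, one extra block on top of 40 trunk layers would add 1/40 of the block weights. The sketch below assumes exactly that and ignores embeddings, the output head, and the fact that the MTP block is stored at a different precision than the trunk, so treat it as an approximation only.

```python
def extra_block_overhead(n_trunk_layers: int) -> float:
    """Relative size increase from adding one block, assuming equal-sized blocks."""
    return 1.0 / n_trunk_layers


# 40 trunk layers, per the architecture listed in this post.
print(f"{extra_block_overhead(40):.1%}")  # 2.5%
```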

MTP draft head precision

The bundled MTP head (blk.40.* including the nextn.* projection + norms) is quantized to Q8_0 (near-lossless) on every tier except I-Nano. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K — see the explainer below.

This keeps draft accuracy high (important for spec-decode acceptance rate) at a modest ~1 GB cost per file vs. trunk-tier precision.
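To give a feel for why Q8_0 is described as near-lossless, here is a minimal pure-Python sketch of Q8_0-style block quantization: blocks of 32 weights share one scale and each weight is stored as a signed 8-bit integer. It is a simplification (the real format stores the scale as fp16, for 34 bytes per 32 weights, i.e. 8.5 bits per weight), and the function name is made up for illustration.

```python
import random

BLOCK = 32  # Q8_0 groups weights into blocks of 32 that share one scale


def q8_0_roundtrip(weights):
    """Quantize to int8 per block and dequantize again (scale kept as a float here)."""
    out = []
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        amax = max(abs(w) for w in block)
        scale = amax / 127.0 if amax > 0 else 1.0
        quants = [round(w / scale) for w in block]  # integers in [-127, 127]
        out.extend(q * scale for q in quants)
    return out


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]
restored = q8_0_roundtrip(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst absolute round-trip error: {worst:.2e}")
```

The per-weight error is bounded by half a quantization step, i.e. the block's largest magnitude divided by 254, which is small relative to typical weight values.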

Why the MTP head doesn't use imatrix

llama-imatrix runs normal forward passes that only activate the trunk (blk.0..blk.39). The MTP head only fires during --draft-mtp spec decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quant / Q8_0 types, which don't require an imatrix.

(A patch to llama-imatrix that records MTP activations during collection is in progress at mudler/llama.cpp#mtp-imatrix. Once upstreamed, this will let us push the drafter to lower bit-widths cleanly.)
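The fallback described in this section can be pictured as a simple lookup: tensors with calibration data get an imatrix-based type, everything else gets a static one. The sketch below is illustrative only. The tensor names, the `"imatrix-quant"` label, and both helper functions are hypothetical and do not reflect the actual llama-imatrix or llama-quantize code.

```python
N_TRUNK_LAYERS = 40  # blk.0 .. blk.39, per the post


def collect_imatrix_names(n_trunk_layers):
    """Tensor names a normal forward pass would touch (trunk blocks only)."""
    return {f"blk.{i}.ffn_down.weight" for i in range(n_trunk_layers)}


def pick_quant_type(tensor_name, imatrix_names,
                    imatrix_type="imatrix-quant", static_type="Q8_0"):
    """Use an imatrix-dependent type only when calibration data exists for the tensor."""
    return imatrix_type if tensor_name in imatrix_names else static_type


names = collect_imatrix_names(N_TRUNK_LAYERS)
print(pick_quant_type("blk.39.ffn_down.weight", names))       # imatrix-quant
print(pick_quant_type("blk.40.nextn.eh_proj.weight", names))  # Q8_0
```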

What is APEX?

APEX is a MoE-aware mixed-precision quantization strategy. It applies a per-tensor-role gradient: routed experts are compressed hardest, shared experts are kept at high precision (they are always active), and attention/Mamba tensors get uniform precision. On top of that, a 5+5 symmetric edge gradient runs across the 40 trunk layers, with MTP layer 40 kept at edge precision. I-variants use diverse imatrix calibration data (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).
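One plausible reading of the "5+5 symmetric edge gradient" is that the first five and last five trunk layers get progressively more precision toward the ends of the stack. The sketch below encodes that reading as an integer level per layer. The level scale, the function name, and the choice to give the MTP layer the outermost level are all assumptions, not the actual APEX recipe.

```python
N_TRUNK_LAYERS = 40  # per the post
EDGE_WIDTH = 5       # "5+5": five layers at each end (assumed interpretation)


def edge_level(layer, n_trunk=N_TRUNK_LAYERS, width=EDGE_WIDTH):
    """0 for interior layers; 1..width in the edge zones, highest at the outermost layer."""
    if layer >= n_trunk:
        return width  # MTP layer treated as an outermost edge layer (assumption)
    depth = min(layer, n_trunk - 1 - layer)  # distance from the nearest end of the trunk
    return max(0, width - depth)


print([edge_level(i) for i in range(8)])       # [5, 4, 3, 2, 1, 0, 0, 0]
print([edge_level(i) for i in range(33, 41)])  # [0, 0, 1, 2, 3, 4, 5, 5]
```

A real recipe would map each level to a concrete quant type per tensor role; that mapping is not given in the post and is left out here.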

Architecture

Base: Qwen 3.6 35B-A3B family (Qwen3_5MoeForCausalLM)
Layers: 40 trunk + 1 MTP (bundled)
Experts: 256 routed + 1 shared (8 active per token)
Hidden size: 2048
Calibration: v1.3 diverse dataset
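The expert counts above explain why routed experts tolerate the hardest compression: any single token touches only 8 of the 256 routed experts, while the shared expert runs every time. A toy top-k selection, with made-up router scores and a hypothetical helper name, looks roughly like this (real routers typically apply a softmax and weight the expert outputs, which is omitted):

```python
import heapq

N_ROUTED = 256  # routed experts, per the post
TOP_K = 8       # active routed experts per token, per the post


def active_experts(router_scores, k=TOP_K):
    """Indices of the k highest-scoring routed experts for one token."""
    top = heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)
    return sorted(top)


scores = [float(i) for i in range(N_ROUTED)]  # toy scores: expert i scores i
print(active_experts(scores))                 # [248, 249, 250, 251, 252, 253, 254, 255]
print(f"routed experts used per token: {TOP_K / N_ROUTED:.1%}")  # 3.1%
```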

submitted by /u/PhotographerUSA