Adaptive Mixture of Experts Gate (AMG) [R]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

[Project] Post-hoc Adaptive MoE Gating on Qwen3.6-35B — empirical benchmarking of an open research gap

Adaptive MoE routing — selecting a variable number of experts per token based on routing confidence — has been studied in papers (XMoE 2024, DynMoE ICLR 2025, TopP routing Huang et al. 2024). All successful implementations train from scratch. Nobody has published empirical results for post-hoc application to a pretrained fixed-k model at production scale. This is that experiment.

What we built

An inference-time patch to llama.cpp for Qwen3.6-35B-A3B (256 experts/layer, k=8 fixed) that applies cumulative probability thresholding to expert routing weights after normalisation. The GGML static graph constraint prevents truly dynamic k — the workaround is zero-gating: all k FFNs compute, but low-confidence experts are zeroed and renormalised out of the output. Threshold, min_k, and max_k cap are runtime-configurable via env vars.
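For readers who want to see the zero-gating idea in isolation, here is a minimal pure-Python sketch. It is not the llama.cpp patch itself (that runs in C inside the GGML graph); the function name `zero_gate`, the argument names, and the choice of `>=` for the threshold comparison are all assumptions made for illustration. It takes the already-normalised top-k weights for one token, keeps the highest-weight experts until their cumulative weight reaches the threshold, zeroes the rest, and renormalises the survivors. The output has the same length as the input, which mirrors the static-graph constraint: every expert slot still exists, some just carry weight 0.

```python
def zero_gate(weights, threshold=0.75, min_k=1, max_k=None):
    """Cumulative-probability gating over normalised top-k routing weights.

    Illustrative sketch only (names and defaults are hypothetical).
    Returns (gated_weights, n_active). Dropped experts keep their slot
    but get weight 0.0; kept experts are renormalised to sum to 1.
    """
    k = len(weights)
    min_k = max(1, min_k)                      # always keep at least one expert
    max_k = k if max_k is None else min(max_k, k)

    # Visit experts from highest to lowest routing weight.
    order = sorted(range(k), key=lambda i: weights[i], reverse=True)

    keep, cum = [], 0.0
    for i in order:
        if len(keep) >= max_k:
            break                              # hard cap reached
        if len(keep) >= min_k and cum >= threshold:
            break                              # enough probability mass kept
        keep.append(i)
        cum += weights[i]

    total = sum(weights[i] for i in keep)
    gated = [0.0] * k
    for i in keep:
        gated[i] = weights[i] / total          # renormalise the survivors
    return gated, len(keep)


if __name__ == "__main__":
    gated, n_active = zero_gate([0.5, 0.3, 0.1, 0.1], threshold=0.75)
    print(gated, n_active)
```

Because the zeroed experts are still computed in the real graph, a scheme like this changes the output mix but does not by itself save FFN compute.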

Results (PPL on PTB, 192 chunks, ctx=512)

| Config | PPL ±σ | Avg experts active |
|---|---|---|
| k8 baseline | 11.3277 ± 0.143 | 8.00/8 |
| k8 + threshold 0.75 | 12.1226 ± 0.155 | 5.42/8 |
| k12 no gating | 11.3379 ± 0.144 | 12.00/12 |
| k12 + threshold 0.90 | 11.2925 ± 0.143 | 10.31/12 |

Key empirical finding

Post-hoc threshold gating on a fixed-k trained model cannot produce meaningful per-token variability without quality cost. The router's distributions after norm_w are flat by construction — training with fixed k=8 produces distributions like [0.16, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11, 0.11]. The threshold has nothing peaked to bite into. Cutting from 8 to 5.4 experts removes experts contributing 11-13% of the output each — that's real signal loss, not noise.
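To make the flatness point concrete, the sketch below counts how many experts are needed to reach a cumulative threshold. The `flat` list is the example distribution quoted above; the `peaked` list is a made-up distribution (an assumption, not measured from any model) of the kind an adaptively trained router might produce. The helper name `experts_needed` is hypothetical.

```python
def experts_needed(weights, threshold):
    """Number of top experts whose cumulative weight first reaches threshold."""
    cum = 0.0
    for n, w in enumerate(sorted(weights, reverse=True), start=1):
        cum += w
        if cum >= threshold:
            return n
    return len(weights)


# Example distribution quoted in the post (fixed-k trained router).
flat = [0.16, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11, 0.11]
# Hypothetical peaked distribution, for contrast only.
peaked = [0.55, 0.25, 0.08, 0.04, 0.03, 0.02, 0.02, 0.01]

if __name__ == "__main__":
    for name, dist in (("flat", flat), ("peaked", peaked)):
        for t in (0.75, 0.90):
            print(name, t, experts_needed(dist, t))
```

On the flat example, a 0.75 threshold still keeps 6 of 8 experts and a 0.90 threshold keeps all 8, while the hypothetical peaked distribution drops to 2 and 4. Each expert dropped from the flat case carries about 11% of the output, whereas those dropped from the peaked case carry 8% or less.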

The k12 + 0.90 result (PPL 11.2925, marginally below the k8 baseline of 11.3277) is interesting precisely because it activates 12 experts per token, 4 more than the k=8 the model was trained with. AMG at 0.90 removes the weakest 1-2 of those extras (10.31 of 12 remain active on average), leaving a slightly cleaner signal. Whether this is a real effect or noise is ambiguous: the 0.035 PPL gap is well inside the ±0.143 error, though the gated run comes out lower than both the k8 baseline and ungated k12 (11.3379).
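A quick back-of-envelope check on how far apart the two numbers are, relative to the reported uncertainty. This sketch assumes the ± values can be treated as independent standard errors, which is a simplification: both runs use the same PTB chunks, so the errors are correlated and a paired comparison on the raw per-chunk logs would be the better test. The function name `z_score` is just for illustration.

```python
import math


def z_score(ppl_a, se_a, ppl_b, se_b):
    """Difference in PPL divided by the combined error.

    Assumes se_a and se_b are independent standard errors (a simplification).
    """
    return (ppl_a - ppl_b) / math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    # k8 baseline vs k12 + threshold 0.90, values from the results table.
    z = z_score(11.3277, 0.143, 11.2925, 0.143)
    print(f"diff = {11.3277 - 11.2925:.4f}, z = {z:.3f}")
```

Under that assumption the gap is roughly 0.17 combined standard errors, so it cannot be separated from noise from these numbers alone.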

What's genuinely new

No published work describes a working ggml_map_custom1 callback for adaptive gating in a production inference engine. The zero-gating workaround for static GGML graphs is a practical contribution. The empirical quantification of why post-hoc AMG is limited on fixed-k models fills a gap the papers don't address — they all train from scratch and don't measure the degradation curve of applying adaptive gating to a pre-existing flat-distribution router.

Open problem

The path to genuine per-token variability is router fine-tuning with entropy regularization (L = L_LM + λ_entropy H(router) + λ_balance KL(usage, uniform)), targeting only the 21M gate weight parameters with all expert FFN weights frozen. A training pipeline for this is included. Hardware requirement is ~20GB VRAM — currently blocked on 16GB A5000. If anyone wants to run it, the script is ready and I'd be interested in the results.
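As a rough sketch of the two regularisation terms in that loss, here is a framework-free version in plain Python. It is not the training pipeline from the repo. The function names, the λ defaults, averaging the entropy over tokens, and normalising `usage` to a distribution before taking the KL are all assumptions for illustration. With a positive `lam_entropy`, minimising the loss pushes each token's routing distribution towards lower entropy (more peaked), while the KL term penalises uneven expert usage across the batch.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of one token's routing distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_to_uniform(usage):
    """KL(usage || uniform) after normalising usage counts to sum to 1."""
    n = len(usage)
    total = sum(usage)
    kl = 0.0
    for u in usage:
        q = u / total
        if q > 0.0:
            kl += q * math.log(q * n)   # log(q / (1/n))
    return kl


def router_loss(lm_loss, router_probs, usage,
                lam_entropy=0.01, lam_balance=0.01):
    """L = L_LM + lam_entropy * H(router) + lam_balance * KL(usage, uniform).

    router_probs: list of per-token routing distributions.
    usage: per-expert usage counts (or fractions) over the batch.
    The lambda defaults are placeholders, not tuned values.
    """
    mean_h = sum(entropy(p) for p in router_probs) / len(router_probs)
    return lm_loss + lam_entropy * mean_h + lam_balance * kl_to_uniform(usage)


if __name__ == "__main__":
    probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
    print(router_loss(2.0, probs, usage=[3, 1, 2, 2]))
```

In an actual fine-tuning run these terms would be computed on tensors with gradients flowing only into the gate weights, with the expert FFN weights frozen as described above.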

GitHub: https://github.com/cjhudlin/Adaptive-MoE-Gate-AMG-for-Qwen3.6-35B

Full methodology, raw perplexity logs, patch script, and router training pipeline included.

submitted by /u/cjhudlin