Adaptive Mixture of Experts Gate (AMG) [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
[Project] Post-hoc Adaptive MoE Gating on Qwen3.6-35B — empirical benchmarking of an open research gap
Adaptive MoE routing, i.e. selecting a variable number of experts per token based on routing confidence, has been studied in several papers (XMoE 2024, DynMoE ICLR 2025, TopP routing Huang et al. 2024). All the successful implementations I'm aware of train from scratch. As far as I can tell, nobody has published empirical results for post-hoc application to a pretrained fixed-k model at production scale. This is that experiment.
What we built
An inference-time patch to llama.cpp for Qwen3.6-35B-A3B (256 experts/layer, k=8 fixed) that applies cumulative probability thresholding to the expert routing weights after normalisation. The GGML static graph constraint prevents truly dynamic k, so the workaround is zero-gating: all k FFNs still compute, but the weights of low-confidence experts are set to zero and the surviving weights are renormalised. The patch therefore changes which experts contribute to the output, not how much compute is spent. Threshold, min_k, and the max_k cap are runtime-configurable via env vars.
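For readers who want the gating rule spelled out, here is a minimal pure-Python sketch of cumulative-probability thresholding with zero-gating. It is not the llama.cpp patch itself: the function name `zero_gate` and its signature are made up for illustration, and it assumes the input weights are already normalised to sum to 1.

```python
def zero_gate(weights, threshold, min_k=1, max_k=None):
    """Zero out low-confidence experts and renormalise the survivors.

    weights: normalised routing weights of the k selected experts (sum to 1).
    Returns (gated_weights, n_active). The output always has len(weights)
    entries, mirroring a static graph in which every expert FFN still runs.
    Hypothetical helper, not taken from the actual patch.
    """
    k = len(weights)
    max_k = k if max_k is None else min(max_k, k)
    min_k = max(1, min(min_k, max_k))

    # Visit experts from most to least confident.
    order = sorted(range(k), key=lambda i: weights[i], reverse=True)

    keep, cum = [], 0.0
    for i in order:
        if len(keep) >= max_k:
            break  # hard cap on active experts
        if len(keep) >= min_k and cum >= threshold:
            break  # enough probability mass already covered
        keep.append(i)
        cum += weights[i]

    # Dropped experts get weight 0; kept experts are rescaled to sum to 1.
    gated = [0.0] * k
    for i in keep:
        gated[i] = weights[i] / cum
    return gated, len(keep)


flat = [0.16, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11, 0.11]
gated, n_active = zero_gate(flat, threshold=0.75)
print(n_active, [round(w, 3) for w in gated])
```

Returning a full-length vector with zeros, rather than a shorter list, is the point of the workaround: the graph shape never changes, only the mixing weights do.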
Results (PPL on PTB, 192 chunks, ctx=512)
| Config | PPL | ±σ | Avg experts active |
|---|---|---|---|
| k8 baseline | 11.3277 | ±0.143 | 8.00/8 |
| k8 + threshold 0.75 | 12.1226 | ±0.155 | 5.42/8 |
| k12 no gating | 11.3379 | ±0.144 | 12.00/12 |
| k12 + threshold 0.90 | 11.2925 | ±0.143 | 10.31/12 |
Key empirical finding
Post-hoc threshold gating on a fixed-k trained model cannot produce meaningful per-token variability without quality cost. The router's distributions after norm_w are flat by construction — training with fixed k=8 produces distributions like [0.16, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11, 0.11]. The threshold has nothing peaked to bite into. Cutting from 8 to 5.4 experts removes experts contributing 11-13% of the output each — that's real signal loss, not noise.
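To make the "nothing peaked to bite into" point concrete, the sketch below counts how many top experts are needed to reach a cumulative threshold. The flat vector is the example quoted above; the peaked vector is invented purely for contrast, and `experts_to_reach` is an illustrative helper, not part of the patch.

```python
def experts_to_reach(weights, threshold):
    """Count how many top experts are needed for cumulative weight >= threshold."""
    cum = 0.0
    for n, w in enumerate(sorted(weights, reverse=True), start=1):
        cum += w
        if cum >= threshold:
            return n
    return len(weights)


flat = [0.16, 0.14, 0.13, 0.12, 0.12, 0.11, 0.11, 0.11]    # example from the post
peaked = [0.60, 0.20, 0.08, 0.04, 0.03, 0.02, 0.02, 0.01]  # hypothetical, for contrast

for t in (0.75, 0.90):
    print(f"threshold={t}: flat needs {experts_to_reach(flat, t)}, "
          f"peaked needs {experts_to_reach(peaked, t)}")
```

On the flat vector the threshold keeps 6 of 8 experts at 0.75 and all 8 at 0.90, while the invented peaked vector needs only 2 and 4. With a flat router the threshold mostly acts as a slightly smaller fixed k rather than a per-token decision.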
The k12 + 0.90 result (PPL 11.2925, marginally below baseline) is interesting precisely because it routes each token to 4 more experts than the k=8 the model was trained with. AMG at 0.90 removes the weakest 1-2 of those extras, leaving a slightly cleaner signal. The gap to baseline is only about 0.035 PPL against an error of ±0.143, so this cannot be distinguished from noise, but the direction is consistent.
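A quick back-of-envelope check on how far that result sits from baseline, using only the numbers in the table. This assumes the ± column is the standard error of each PPL estimate and treats the two runs as independent, which is a simplification: both runs score the same text, so a paired comparison over the per-chunk losses in the raw logs would be the more informative test.

```python
import math

baseline_ppl, baseline_err = 11.3277, 0.143      # k8 baseline
k12_gated_ppl, k12_gated_err = 11.2925, 0.143    # k12 + threshold 0.90


def gap_in_sigmas(ppl_a, err_a, ppl_b, err_b):
    """Return (gap, gap / combined error), treating the two errors as independent."""
    gap = ppl_a - ppl_b
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return gap, gap / combined


gap, z = gap_in_sigmas(baseline_ppl, baseline_err, k12_gated_ppl, k12_gated_err)
print(f"gap = {gap:.4f} PPL, about {z:.2f} combined standard errors")
```

Under those assumptions the improvement is well under one combined standard error, which is why it is reported here as ambiguous rather than as a win.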
What's genuinely new
As far as I know, no published work describes a working ggml_map_custom1 callback for adaptive gating in a production inference engine. The zero-gating workaround for static GGML graphs is a practical contribution. The empirical quantification of why post-hoc AMG is limited on fixed-k models fills a gap the papers don't address: they all train from scratch and don't measure the degradation curve of applying adaptive gating to a pre-existing flat-distribution router.
Open problem
The path to genuine per-token variability is router fine-tuning with entropy regularization (L = L_LM + λ_entropy H(router) + λ_balance KL(usage, uniform)), targeting only the 21M gate weight parameters with all expert FFN weights frozen. A training pipeline for this is included. The hardware requirement is ~20GB VRAM, and I'm currently blocked on a 16GB A5000. If anyone wants to run it, the script is ready and I'd be interested in the results.
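As a reading aid for the loss above, here is a small pure-Python sketch of the three terms on toy data. It is not the training pipeline from the repo. Two things are assumptions on my part: "usage" is taken to be the mean routing probability per expert over the batch, and a positive λ_entropy is read as penalising router entropy, which pushes the distributions toward the peaked shape the threshold needs.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of one routing distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_to_uniform(usage):
    """KL(usage || uniform) over experts; zero when load is perfectly balanced."""
    n = len(usage)
    return sum(u * math.log(u * n) for u in usage if u > 0.0)


def router_loss(lm_loss, router_probs, lam_entropy, lam_balance):
    """L = L_LM + lam_entropy * H(router) + lam_balance * KL(usage, uniform).

    router_probs: one probability vector per token over all experts.
    Usage is assumed to be the mean routing probability per expert.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    mean_entropy = sum(entropy(p) for p in router_probs) / n_tokens
    usage = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return lm_loss + lam_entropy * mean_entropy + lam_balance * kl_to_uniform(usage)


# Toy batch: two tokens, four experts (values are illustrative only).
probs = [[0.7, 0.1, 0.1, 0.1],
         [0.1, 0.7, 0.1, 0.1]]
print(router_loss(lm_loss=2.4, router_probs=probs, lam_entropy=0.01, lam_balance=0.01))
```

The two regularisers pull in different directions by design: the entropy term sharpens each token's distribution, while the balance term keeps the batch as a whole from collapsing onto a few experts.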
GitHub: https://github.com/cjhudlin/Adaptive-MoE-Gate-AMG-for-Qwen3.6-35B
Full methodology, raw perplexity logs, patch script, and router training pipeline included.