Command A+ (218B MoE) running on Apple Silicon — MLX port, PR open
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Cohere dropped Command A+ on the 20th (218B total / 25B active, 128 experts top-8, Apache 2.0). Wrote a cohere2_moe implementation for mlx-lm to get it running on Apple Silicon.

Architecture notes for anyone digging into this model:

- Single shared expert with a larger intermediate (16384 = 4096×4), combined with the routed output via (routed + shared)/2
- Sigmoid routing (not softmax), normalized top-8
- Sliding window 3:1 (3 sliding + 1 full), interleaved RoPE on sliding layers only
- Parallel attn+MLP block off the same LayerNorm
- Gotcha that cost me a few iterations: the biases in the W4A4 checkpoint are NVFP4 quantization artifacts; the BF16 model is entirely bias-free. sanitize() handles both formats.

I couldn't validate locally (W4A4 needs ~132GB, my M3 Max is 128GB). https://github.com/vlbosch ran it on a bigger box: BF16→Q8 conversion + clean generation, tool calling, multi-turn with KV-cache continuation, 22.9 tok/s gen / 57.6 tok/s prompt, 241GB peak.

PR is open on ml-explore/mlx-lm (in review). Happy to take feedback or fixes, and if someone with 192GB+ wants to test the W4A4 path directly, would love the error output.
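For anyone who wants the routing notes in runnable form, here is a minimal pure-Python sketch of sigmoid routing with normalized top-k, the (routed + shared)/2 combine, and the 3:1 layer pattern. This is not the code from the PR. All function names are hypothetical, "normalized" is assumed to mean dividing the selected sigmoid scores by their sum, and the full-attention layer is assumed to be the last one in each group of four; the actual implementation may differ on both points.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_top_k(router_logits, k=8):
    """Pick k experts for one token using sigmoid scores (not softmax).

    Assumption: "normalized top-8" means the k selected scores are
    rescaled so they sum to 1.
    """
    scores = [sigmoid(z) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    weights = [scores[i] / total for i in top]
    return top, weights


def combine(routed_out, shared_out):
    """Average the routed and shared expert outputs: (routed + shared) / 2."""
    return [(r + s) / 2.0 for r, s in zip(routed_out, shared_out)]


def moe_token(x, router_logits, experts, shared_expert, k=8):
    """Toy MoE forward for a single token vector x (a list of floats)."""
    idx, weights = route_top_k(router_logits, k)
    routed = [0.0] * len(x)
    for i, w in zip(idx, weights):
        y = experts[i](x)
        routed = [acc + w * yj for acc, yj in zip(routed, y)]
    return combine(routed, shared_expert(x))


def layer_kinds(num_layers, group=4):
    """3 sliding + 1 full per group.

    Assumption: the full-attention layer closes each group.
    """
    return ["full" if (i + 1) % group == 0 else "sliding" for i in range(num_layers)]
```

A toy call with 128 stand-in experts, just to show the shapes involved:

```python
experts = [lambda x, s=i: [s * v for v in x] for i in range(128)]
shared = lambda x: [v + 1.0 for v in x]
logits = [i / 128 for i in range(128)]
out = moe_token([1.0, 2.0], logits, experts, shared)
```

One practical difference from softmax routing: sigmoid scores each expert independently, so the renormalization over the selected k is what makes the weights sum to 1.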