Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
CPU is just a secondhand 10900X. Using 128k context with an unquantized KV cache. Model is at Q8_0 to mitigate some weird behavior I was seeing at lower quants.
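For a rough sense of why a setup like this leans on system RAM: Q8_0 stores weights in blocks of 32 int8 values plus one fp16 scale, which works out to 34 bytes per 32 weights, or 8.5 bits per weight. The sketch below turns that into a size estimate. The parameter count is a placeholder assumption, not a figure from the post, and real GGUF files keep some tensors in other types, so treat the result as approximate.

```python
# Q8_0 block layout: 32 int8 weights + one fp16 scale = 34 bytes per block.
Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 32 + 2


def q8_0_size_gb(n_params: float) -> float:
    """Approximate size in GB (1e9 bytes) of n_params weights stored as Q8_0."""
    return n_params / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES / 1e9


# ASSUMPTION: placeholder total parameter count for illustration only;
# the post does not state the model's size.
ASSUMED_PARAMS = 230e9

print(f"~{q8_0_size_gb(ASSUMED_PARAMS):.0f} GB of weights at Q8_0")
```

Swap in the real parameter count (or just check the GGUF file size) to see how much has to live in DDR4 versus the 48GB of VRAM across the two 3090s.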
Speed is very slow, around 50 t/s prompt processing (pp) and 10 t/s token generation (tg), but usable for coding agent workflows.
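To put those numbers in wall-clock terms, here is a small arithmetic sketch. The two throughput figures are from the post; treating 128k as 131072 tokens and the 500-token reply length are assumptions for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process or generate `tokens` at a steady throughput."""
    return tokens / tokens_per_second


PP_TPS = 50.0  # prompt processing speed reported in the post
TG_TPS = 10.0  # token generation speed reported in the post

FULL_CONTEXT = 128 * 1024  # ASSUMPTION: "128k" read as 131072 tokens
REPLY_TOKENS = 500         # ASSUMPTION: hypothetical reply length

cold_prefill_min = seconds_for(FULL_CONTEXT, PP_TPS) / 60
reply_s = seconds_for(REPLY_TOKENS, TG_TPS)

print(f"cold prefill of a full context: ~{cold_prefill_min:.0f} min")
print(f"{REPLY_TOKENS}-token reply: ~{reply_s:.0f} s")
```

The full-context figure is a worst case for a cold start. If the server reuses the cached prompt prefix between turns, an agent only pays prefill for the newly added tokens on each step.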
Anybody else running MoE models in this size class on relatively low-end hardware? For my purposes, speed is less important than accuracy, as long as a task doesn't literally take all day. Any other models you'd recommend I try, or additional optimization tips that could help within my constraints? I wish they'd released the draft model for MTP on this model, but it looks like they declined to do so for 2.7.
My ik_llama flags (sorry for the funny formatting, this is pasted out of my vibe-coded NixOS config):
"${ik-llama-cuda}/bin/llama-server"
  + " -m ${modelPath}"
  + " --host 0.0.0.0"
  + " --port ${toString cfg.port}"
  + " -c ${toString cfg.contextLength}"
  + " -ngl 999"
  + " --cpu-moe"
  + " -sm graph"
  + " -fa on"
  + " -t 16"
  + " -tb 16"
  + " -b 4096"
  + " -ub 4096"
  + " -np 1"
  + " -muge"
  + " -ger"
  + " --jinja"
  + " --metrics"
  + " --temp 1.0"
  + " --top-p 0.95"
  + " --top-k 40"
  + " --min-p 0.01"
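For anyone who wants to try the same flags outside of Nix, here is a minimal sketch that assembles the equivalent argument list in Python. The flags are copied verbatim from the post; the binary path, model path, port, and context length in the example call are hypothetical placeholders standing in for the Nix variables, not values from the original config.

```python
import shlex


def build_args(binary: str, model_path: str, port: int, context_length: int) -> list[str]:
    """Assemble the llama-server command line from the post as an argv list."""
    return [
        binary,
        "-m", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
        "-c", str(context_length),
        "-ngl", "999",       # request all layers on GPU...
        "--cpu-moe",         # ...while keeping MoE expert weights on the CPU
        "-sm", "graph",      # copied verbatim from the post
        "-fa", "on",
        "-t", "16",
        "-tb", "16",
        "-b", "4096",
        "-ub", "4096",
        "-np", "1",
        "-muge",             # copied verbatim from the post
        "-ger",              # copied verbatim from the post
        "--jinja",
        "--metrics",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--top-k", "40",
        "--min-p", "0.01",
    ]


# ASSUMPTION: every value below is a placeholder, not taken from the post.
args = build_args("llama-server", "/models/model-q8_0.gguf", 8080, 131072)
print(shlex.join(args))
```

Building an argv list rather than one concatenated string avoids quoting problems if the model path contains spaces, and the list can be passed straight to `subprocess.run(args)`.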