TL;DR Speeds are not too ugly for this old 2018 hardware but imo, not very usable for agentic coding (if you compare with qwen3.6 27B on 8 MI50 @ 50 tps TG 800 tps PP). More concerning is that the reasoning output is very very long and still didn’t check about the quality of code output…
As said before, I think there’s still room to have higher speeds (by updating the software & hardware stacks, eg. use of pcie switch with lower latency, more optimized mtp without overhead for rocm/gfx906, fp16 dequant, etc)
Inference engine used (vllm fork v0.23.1 with rocm7.2.1): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main
Huggingface Quants used:
cyankiwi/MiniMax-M3-AWQ-INT4
bullerwins/MiniMax-M3-4bit-W4A16-v0
Main commands to run:
sudo docker run -it --name vllm-gfx906-mobydick -v /home:/home --network host --device=/dev/kfd --device=/dev/dri \ --group-add video --group-add $(getent group render | cut -d: -f3) \ --cap-add=SYS_ADMIN --volume /sys:/sys:ro --pid=host --privileged \ --ipc=host aiinfos/vllm-gfx906-mobydick:v0.23.1rc0.x-rocm7.2.1-pytorch2.11.0
Cmd for 8 MI50 bullerwins/MiniMax-M3-4bit-W4A16-v0:
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \ /home/llm/models/MiniMax-M3-4bit-W4A16-v0 \ --served-model-name MiniMax-M3-4bit-W4A16-v0 \ --enable-auto-tool-choice \ --tool-call-parser minimax_m3 \ --reasoning-parser minimax_m3 \ --max-model-len auto \ --max-num-seqs 8 \ --gpu-memory-utilization 0.975 \ --enable-log-requests \ --enable-log-outputs \ --log-error-stack \ --speculative-config '{"method": "eagle3", "model": "/home/rig9/llm/models/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "TRITON_ATTN"}' \ --dtype float32 \ --kv-cache-dtype float16 \ --attention-config.indexer_kv_dtype float16 \ --block-size 128 \ --skip-mm-profiling \ --limit-mm-per-prompt '{"image":1,"video":{"count":1,"num_frames":32}}' \ --tensor-parallel-size 8 --port 8000 2>&1 | tee log.txt
>>> 11.9 tok/s TG & 326 tok/s PP (no MTP) (16k tok prompt) (36,597 tokens ctx MAX)
>>> 19.2 tok/s TG & 1005 tok/s PP (MTP 3) (1k tok prompt) (7,680 tokens ctx MAX)
>>> TP16 : garbage output / not supported
Cmd for 16 MI50 cyankiwi/MiniMax-M3-AWQ-INT4:
VLLM_TRITON_ATTN_NUM_PAR_SOFTMAX_SEGMENTS=64 FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \ /home/rig9/llm/models/MiniMax-M3-AWQ-INT4 \ --served-model-name MiniMax-M3-AWQ-INT4 \ --enable-auto-tool-choice \ --tool-call-parser minimax_m3 \ --reasoning-parser minimax_m3 \ --max-model-len auto \ --max-num-seqs 4 \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.92 \ --enable-log-requests \ --enable-log-outputs \ --log-error-stack \ --speculative-config '{"method": "eagle3", "model": "/home/rig9/llm/models/MiniMax-M3-EAGLE3", "num_speculative_tokens": 5, "attention_backend": "TRITON_ATTN", "use_local_argmax_reduction":true}' \ --dtype float32 \ --kv-cache-dtype float16 \ --attention-config.indexer_kv_dtype float16 \ --block-size 128 \ --skip-mm-profiling \ --limit-mm-per-prompt '{"image":1,"video":{"count":1,"num_frames":32}}' \ --tensor-parallel-size 16 --port 8000 2>&1 | tee log.txt
>>> 6.6 tok/s TG & 296 tok/s PP (no MTP) (16k tok prompt) (220,416tokens ctx MAX with 0.95 --gmu)
>>> 18.2 tok/s TG & 135 tok/s PP (MTP 5) (16k tok prompt) (143,488 tokens ctx MAX)
>>> TP8 : OOM / not supported
VLLM_LOGGING_LEVEL=DEBUG vllm bench serve \ --dataset-name random \ --random-input-len 10000 \ --random-output-len 1000 \ --num-prompts 2 \ --seed 1 \ --temperature 1 --top-p 0.95 --top-k 40 \ --request-rate inf \ --max-concurrency 1 \ --ignore-eos 2>&1 | tee logb.txt
============ Serving Benchmark Result ============ Successful requests: 2 Failed requests: 0 Maximum request concurrency: 1 Benchmark duration (s): 279.80 Total input tokens: 20000 Total generated tokens: 2000 Request throughput (req/s): 0.01 Output token throughput (tok/s): 7.15 Peak output token throughput (tok/s): 5.00 Peak concurrent requests: 2.00 Total token throughput (tok/s): 78.63 ---------------Time to First Token---------------- Mean TTFT (ms): 73626.88 Median TTFT (ms): 73626.88 P99 TTFT (ms): 73681.87 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 66.34 Median TPOT (ms): 66.34 P99 TPOT (ms): 89.21 ---------------Inter-token Latency---------------- Mean ITL (ms): 232.54 Median ITL (ms): 231.55 P99 ITL (ms): 237.26 ---------------Speculative Decoding--------------- Acceptance rate (%): 50.28 Acceptance length: 3.51 Drafts: 570 Draft tokens: 2850 Accepted tokens: 1433 Per-position acceptance (%): Position 0: 69.82 Position 1: 53.68 Position 2: 46.32 Position 3: 41.93 Position 4: 39.65 ==================================================
submitted by
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.