Running Mimo 2.5 Q4_K_M on a single RTX 5090, need recommendations
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Getting 10.3 tps with this command:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" ./build-mimo-5090-3090/bin/llama-server -m "$MIMO" -ngl 999 --n-cpu-moe 43 --no-mmap -c 100000 -ctk q8_0 -ctv q8_0 -fa on --main-gpu 0 -t 8 --prio 3 --host 0.0.0.0 --port 8083
CPU: 9950X3D (using the iGPU for display)
RAM: 256 GB, 5600 MHz
GPU: single RTX 5090
OS: Linux Mint 22.xx
Is 10.3 tps on token generation the absolute limit? I guess turbo quant is the only way to move forward from here. Or is there anything else I can do to squeeze out 1-2 more tps?
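For context, here is a rough back-of-envelope sketch of where the RAM bandwidth ceiling might sit. It assumes the "5600mhz" above means DDR5-5600 (5600 MT/s) in dual-channel mode with a 64-bit bus per channel, and that generation is limited by streaming the CPU-side expert weights from RAM. Both are assumptions, not measurements, and the function names are just illustrative.

```python
def peak_bandwidth_gbs(mt_per_s, channels=2, bus_bytes=8):
    """Theoretical peak RAM bandwidth in GB/s.

    Assumes dual-channel memory with an 8-byte (64-bit) bus per channel.
    """
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def implied_gb_per_token(bandwidth_gbs, tps):
    """GB that could be read from RAM per token at a given tps,
    if generation were purely RAM-bandwidth-bound."""
    return bandwidth_gbs / tps


def tps_ceiling(bandwidth_gbs, gb_per_token):
    """Upper bound on tps if each token must stream gb_per_token from RAM."""
    return bandwidth_gbs / gb_per_token


bw = peak_bandwidth_gbs(5600)             # assumed DDR5-5600, dual channel
per_tok = implied_gb_per_token(bw, 10.3)  # 10.3 tps is the measured figure
print(f"theoretical peak bandwidth: {bw:.1f} GB/s")
print(f"RAM read budget per token at 10.3 tps: {per_tok:.2f} GB")
```

The second number is only an upper bound on what is read from RAM per token, since real-world bandwidth is below the theoretical peak and GPU-side time is ignored. If the CPU-side experts actually touched per token come close to that budget, RAM bandwidth is probably the wall; if they are much smaller, something else may be the bottleneck.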