Another shout-out to llama.cpp build b9455 on 2x3090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
As you guys know, the next highest quant is Unsloth's Qwen3.6-27B-UD-Q8_K_XL.gguf. With llama.cpp before, I was getting 30-50 tk/s. For months, vLLM was kicking llama.cpp's ass at 70+ tk/s, with its tensor splits speeding up the 2x3090s. But I couldn't find good quants for vLLM and settled for some unknown qwen3.6-mtp-8.0... It was also making minor coding mistakes here and there. Now that I can run Unsloth's UD-Q8_K_XL at 70+ tk/s, the code output is so clean it's like a different beast altogether.

Finally got around to testing llama.cpp build b9455 with tensor-split, and holy f. Results below. No more watching paint dry:
Example coding run below:

ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 2.7K · pp 2.0K/1.5s 1294t/s · out 691/9.7s 71t/s
ctx 13K · pp 7.2K/5.0s 1421t/s · out 964/13.0s 73t/s · 5.5K cached
ctx 46K · pp 27K/19.8s 1370t/s · out 694/10.2s 67t/s · 19K cached
ctx 52K · pp 2.4K/2.6s 919t/s · out 464/6.9s 66t/s · 50K cached
ctx 58K · pp 6.5K/6.3s 1036t/s · out 101/1.5s 69t/s · 52K cached
ctx 60K · pp 2.1K/2.3s 889t/s · out 163/2.2s 74t/s · 58K cached
ctx 2.1K · pp 2.1K/2.3s 880t/s · out 1.9K/32.7s 57t/s
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached · queue 1
ctx 7.3K · pp cached · out 4.5K/82.5s 54t/s · 7.3K cached
ctx 64K · pp 7.8K/5.6s 1402t/s · out 453/5.8s 78t/s · 57K cached
ctx 65K · pp 2.3K/2.8s 823t/s · out 99/1.4s 71t/s · 63K cached
ctx 65K · pp 120/0.4s · out 93/1.3s 70t/s · 65K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold

ctx 27K takes 18.8s to fill cold. ctx 100K will take ~60+s. Imagine waiting a minute, or 5 minutes, on every turn for pp to fill..
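For anyone who wants to sanity-check the cold-fill estimate, here is a rough Python sketch of the arithmetic. It assumes a constant prompt-processing rate, which the log shows is only approximately true (pp ranges from about 820 to 1420 t/s), and it borrows the 1247 t/s figure from the cold 68K run. The function name `prefill_seconds` and the 94K-cached scenario are made up for illustration; they are not from the original run.

```python
def prefill_seconds(prompt_tokens: int, cached_tokens: int, pp_rate: float) -> float:
    """Estimate prompt-processing time in seconds.

    Only tokens that are not already in the prompt cache have to be processed.
    Assumes a constant pp rate, which is a simplification.
    """
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / pp_rate


# pp rate taken from the cold 68K run in the log (68K in 54.2s, reported as 1247 t/s).
PP_RATE = 1247.0

# Cold fill of a 100K context: nothing cached.
cold_100k = prefill_seconds(100_000, 0, PP_RATE)

# Hypothetical follow-up turn: 100K context with 94K already cached.
warm_turn = prefill_seconds(100_000, 94_000, PP_RATE)

print(f"cold 100K fill: {cold_100k:.1f}s")
print(f"100K with 94K cached: {warm_turn:.1f}s")
```

Under these assumptions a cold 100K fill comes out to roughly 80 s, the same ballpark as the "~60+s" guess, while a turn that hits the cache only pays for the new tokens. That matches the pattern in the log, where cached turns take a few seconds of pp and only the cold runs take 19 to 54 s.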