I have (even faster) DeepSeek V4 Pro at home
A few days ago I posted about my DeepSeek V4 Pro at home; now it's time for an update. Yesterday I finally managed to run this model in ktransformers (sglang + kt-kernel). I followed the tutorial for DeepSeek V4 Flash and tweaked some options (NUMA, cores) for my hardware (Epyc 9374F + RTX PRO 6000 Max-Q); a sketch of the launch command is further down the post. Then I ran llama-benchy with increasing context depth to check the performance. Results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|---------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |

During the 64k test (which took over 20 minutes) llama-benchy never reported a result, even though sglang finished processing the request, so I aborted it. I don't know for sure; maybe some kind of timeout is kicking in.
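One thing worth spelling out about the pp rows: the reported t/s appears to count the whole prompt (depth + 512 tokens) against ttfr, which is why it goes up with depth even though ttfr balloons. A quick Python check against the numbers above (this is my reading of llama-benchy's columns, not documented behavior I verified):

```python
# Sanity check: does reported pp t/s equal (depth + 512) prompt tokens / ttfr?
# ttfr values (ms) copied from the tables above.
runs = {
    0: 12878.44,
    2048: 56726.85,
    4096: 100729.28,
    8192: 189354.94,
    16384: 365997.22,
    32768: 720687.13,
}
for depth, ttfr_ms in runs.items():
    tps = (depth + 512) / (ttfr_ms / 1000.0)
    print(f"d{depth}: {tps:.2f} t/s")
# prints ~39.76, 45.13, 45.75, 45.97, 46.16, 46.18 -- matching the pp512 rows
```

In other words, effective prefill throughput holds at roughly 46 t/s all the way out to 32k context; the huge ttfr numbers are just the cost of processing an ever-longer prompt.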
All of this runs on the original model files; no conversion is needed.
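For anyone wanting to reproduce the setup: below is a sketch of the kind of launch command this uses. I'm going from memory of the ktransformers/sglang (kt-kernel) integration, so treat the `--kt-*` flag names and values as assumptions and defer to the DeepSeek V4 Flash tutorial for whatever your kt-kernel version actually expects:

```bash
# Sketch only -- the --kt-* flags are assumptions from memory of the
# ktransformers sglang integration; check the tutorial for your version.
# --kt-cpuinfer:         CPU inference threads (tune to your cores; 9374F = 32c/64t)
# --kt-threadpool-count: one thread pool per NUMA node (the "NUMA" tweak)
# --kt-num-gpu-experts:  how many MoE experts stay resident in VRAM
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Pro \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 200 \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --host 0.0.0.0 --port 30000
```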
- GPU VRAM usage: 90815MiB / 97887MiB
- GPU power usage: ~100W during PP, ~150W during TG
- RAM usage: 907.5GB / 1152GB
- CPU+MB power usage: ~400W
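From those power numbers, a rough back-of-the-envelope on energy per token (a sketch using only the figures above; PSU losses, drives, and fans aren't counted):

```python
# Rough energy-per-token estimate from the wall-power readings above.
tg_watts = 150 + 400   # ~150 W GPU + ~400 W CPU/MB during token generation
pp_watts = 100 + 400   # ~100 W GPU + ~400 W CPU/MB during prefill
tg_tps = 7.3           # decode speed, roughly flat across depths
pp_tps = 46.0          # effective prefill speed at longer depths
print(f"decode:  {tg_watts / tg_tps:.0f} J/token")   # ~75 J/token
print(f"prefill: {pp_watts / pp_tps:.0f} J/token")   # ~11 J/token
```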