Deepseek V4 flash performance on DGX Spark
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hello Reddit
I have been trying to get Deepseek V4 on the DGX Spark for the past week. Yesterday I was finally able to get it to work thanks to the hard work from the folks at local-inference-lab.
The variants I have are the ASUS GX10. Two GX10s are Hooked up to their connect X-7 port running in docker with a very janky setup.
The max context I can safely fit is around 1M tokens in the KV cache. I typically run it at 256k max for concurrency. It's running the original MXFP8 x MXFP4 model for Deepseek v4 flash. There's some NVFP4 variants out there but I haven't tested them. Once software support is more mature I suspect the NVFP4 variants will provide much better performance at high concurrency on the spark.
The throughput is pretty good. I'm the only user for the spark concurrency so isn't important for me. At most I'll run 3-4 request in parallel for a batch job but typically its about 1-2. The spark handles that just fine but TTFT will naturally take a little longer.
I don't use llama.cpp or any of the variants like LM Studio since performance is more important for me than compatibility. Therefore I am currently using vLLM.
Here is my performance at the following context windows for concurrency = 1.
| Context | Prefill T/S | Decode T/S (MTP =2 ) |
|---|---|---|
| 4K | 2050 | 49.4 |
| 16k | 2150 | 43.0 |
| 32k | 2130 | 37.9 |
| 128k | 1920 | 42.5 |
| 256K | 1680 | 39.8 |
As you can see performance is very consistent and degradation is pretty low. The performance anomaly at 32k probably has to with a cold kernel at that shape. At C=4 128k I typically see around 40-42 tokens aggregate so 10 t/s for each request. Not very fast but also not a very common event for me.
The model is also insanely smart, on a private benchmark for high context retrieval and reasoning V4 flash easily beats M2.7 and Stepfun 3.7 at high reasoning (not max). It lacks the world knowledge of denser models like V4 Pro or Kimi K2.6 but in terms of raw intelligence it's very good. It's probably the best model I've ever used.
I'm pretty happy with my DGX Spark. Deepseek clearly has done an excellent job and alot of the new technology Deepseek made with V4 will be used elsewhere.
Generally I'm very impressed with the spark. It's not very good at running dense models like Gemma4 27B or Qwen3.6 26B but on MOE models the performance is spectacular. Especially if the active weights are below 15B. Power consumption is very low. Sitting at 280~ watts total at max load and can run very stably at high load for extended periods.
If you want to run DSV4-flash on the DGX Spark here's a docker compose.
It was built on top of:
local-inference-lab/vllm at dev/unholy-fusion
mostly to fix a few issues with prefix caching and crashing issues I had. These will be pushed to my own fork at
aidendle94 (Aiden Le).
# DeepSeek-V4-Flash on DGX Spark GB10 (arm64 / sm_121a) — TP=2 over RoCE. # # IMPORTANT NOTES BEFORE YOU RUN: # 1. arm64 ONLY (GB10/Spark). Will NOT run on x86. # 2. The MODEL WEIGHTS are NOT in the image. They live in the mounted HF cache # (~148 GB). Download deepseek-ai/DeepSeek-V4-Flash into ${HF_CACHE} first. # 3. This is a 2-NODE setup. docker compose is single-host, so you run this SAME # file on EACH node with different env (NODE_RANK / HEADLESS). Start the WORKER # (rank 1) first, then the HEAD (rank 0). # 4. The NCCL_* values (NCCL_IB_HCA, NCCL_SOCKET_IFNAME) and MASTER_ADDR are # SITE-SPECIFIC — edit them to match YOUR NICs and head-node IP. # 5. For a SINGLE GPU / single node: set TP=1 and delete the --nnodes/--node-rank/ # --master-addr lines + the multi-node env, and drop /dev/infiniband + NCCL_IB_*. # # Per-node launch (set via a .env file or inline): # HEAD (node 0): NODE_RANK=0 HEADLESS= MASTER_ADDR=<head-ip> docker compose up # WORKER(node 1): NODE_RANK=1 HEADLESS=1 MASTER_ADDR=<head-ip> docker compose up # start this FIRST services: vllm: image: aidendle94/sparkrun-vllm-ds4-gb10:production-ready network_mode: host # NCCL bootstrap + RoCE need the host network ipc: host shm_size: "10gb" gpus: all # all local GPUs (1 per GB10 node). If your compose # is older and rejects this, use the deploy: block below instead. devices: - /dev/infiniband:/dev/infiniband # RoCE / IB verbs (omit on single-node) volumes: - ${HF_CACHE:-${HOME}/.cache/huggingface}:/cache/huggingface # model + JIT caches - /etc/passwd:/etc/passwd:ro - /etc/group:/etc/group:ro environment: # --- model / cache / vLLM --- HF_HOME: /cache/huggingface HF_HUB_OFFLINE: "1" VLLM_CACHE_ROOT: /cache/huggingface/vllm-cache VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1" VLLM_USE_B12X_MOE: "1" VLLM_SPARSE_INDEXER_MAX_LOGITS_MB: "256" VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/site-packages/nvidia/nccl/lib/libnccl.so.2 # --- GB10 arch --- TORCH_CUDA_ARCH_LIST: "12.1a" FLASHINFER_CUDA_ARCH_LIST: "12.1a" # --- NCCL / RoCE (SITE-SPECIFIC: edit for your NICs) --- NCCL_NET: IB NCCL_IB_DISABLE: "0" NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0" NCCL_SOCKET_IFNAME: "enP7s7,enp1s0f0np0,enP2p1s0f0np0" NCCL_IB_GID_INDEX: "3" NCCL_CROSS_NIC: "1" NCCL_CUMEM_ENABLE: "0" NCCL_IGNORE_CPU_AFFINITY: "1" NCCL_DEBUG: WARN # --- per-node (CHANGE PER HOST) --- NODE_RANK: "${NODE_RANK:?set 0 on head, 1 on worker}" HEADLESS: "${HEADLESS:-}" # empty on head, "1" on worker MASTER_ADDR: "${MASTER_ADDR:?head-node IP}" # The image bakes /usr/local/bin/dsv4-vllm-entrypoint. We wrap in bash so the # ${HEADLESS:+--headless} flag is only added on the worker. command: - bash - -lc - > exec /usr/local/bin/dsv4-vllm-entrypoint serve deepseek-ai/DeepSeek-V4-Flash --served-model-name ChatGPTN --host 0.0.0.0 --port 8000 --trust-remote-code --tensor-parallel-size 2 --pipeline-parallel-size 1 --kv-cache-dtype fp8 --block-size 256 --max-model-len 262144 --max-num-seqs 4 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.8 --enable-prefix-caching --speculative-config '{"method":"mtp","num_speculative_tokens":2}' --tokenizer-mode deepseek_v4 --distributed-executor-backend mp --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4 --default-chat-template-kwargs.thinking=true --default-chat-template-kwargs.reasoning_effort=high --enable-flashinfer-autotune --nnodes 2 --node-rank ${NODE_RANK} --master-addr ${MASTER_ADDR} --master-port 25000 ${HEADLESS:+--headless} # If `gpus: all` isn't supported by your compose version, remove it and use: # deploy: # resources: # reservations: # devices: # - driver: nvidia # count: all # capabilities: [gpu] [link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.