r/LocalLLaMA · May 20, 2026 · 1 min read

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -e VLLM_ATTENTION_BACKEND=FLASHINFER \ -e FLASHINFER_DISABLE_VERSION_CHECK=1 \ -e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \ vllm/vllm-openai:cu130-nightly \ --model Qwen/Qwen3.6-35B-A3B-FP8 \ --served-model-name qwen36 \ --host 0.0.0.0 \ --port 8000 \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.75 \ --dtype auto \ --kv-cache-dtype fp8 \ --max-model-len 262144 \ --max-num-batched-tokens 32768 \ --max-num-seqs 4 \ --attention-backend flashinfer \ --enable-prefix-caching \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --trust-remote-code \ --reasoning-parser qwen3 \ --performance-mode throughput \ --default-chat-template-kwargs '{"preserve_thinking":true}' \ --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \ --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

submitted by /u/povedaaqui
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA