Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:
docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -e VLLM_ATTENTION_BACKEND=FLASHINFER \ -e FLASHINFER_DISABLE_VERSION_CHECK=1 \ -e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \ vllm/vllm-openai:cu130-nightly \ --model Qwen/Qwen3.6-35B-A3B-FP8 \ --served-model-name qwen36 \ --host 0.0.0.0 \ --port 8000 \ --tensor-parallel-size 1 \ --gpu-memory-utilization 0.75 \ --dtype auto \ --kv-cache-dtype fp8 \ --max-model-len 262144 \ --max-num-batched-tokens 32768 \ --max-num-seqs 4 \ --attention-backend flashinfer \ --enable-prefix-caching \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --trust-remote-code \ --reasoning-parser qwen3 \ --performance-mode throughput \ --default-chat-template-kwargs '{"preserve_thinking":true}' \ --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \ --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.
Any feedback or suggestions are welcome.
[link] [comments]
More from r/LocalLLaMA
-
AMD Powers Next-Generation Agent Computers with New Ryzen AI Halo Developer Platform and Ryzen AI Max PRO 400 Series Processors
May 21
-
Qwen3.6 27B and llama.cpp appreciation post
May 21
-
Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B
May 21
-
Training a vision model from scratch on iPod touch 4 images
May 21
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.