r/LocalLLaMA · 1 min read

qwen3.6 just stops


https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b

Sometimes qwen3.6 just stops in the middle of a task. Is there a way to avoid it?

This is with the qwen-code CLI, but it also happens with opencode.
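
One way to narrow this down is to take the CLI out of the loop and hit the vLLM OpenAI-compatible endpoint directly, then look at finish_reason: "length" points at a token budget (for example the CLI's default max_tokens), "stop" means the model emitted an end token on its own, and "tool_calls" means the tool-call parser ended the turn. A minimal sketch, assuming the compose file below is up on localhost with the default PORT=8080 mapping and the served model name "qwen"; the prompt is just a placeholder:

import requests

# Hypothetical diagnostic: bypass qwen-code/opencode and query vLLM's
# OpenAI-compatible API directly, then inspect finish_reason.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # assumes the default PORT=8080 mapping
    json={
        "model": "qwen",                            # matches --served-model-name
        "messages": [{"role": "user", "content": "Write a detailed, step-by-step refactoring plan."}],
        "max_tokens": 4096,                         # set explicitly; CLI defaults can be low
    },
    timeout=600,
)
choice = resp.json()["choices"][0]
print(choice["finish_reason"])                      # "length" = token budget hit, "stop" = model ended the turn
print(choice["message"]["content"][-500:])          # tail of the output, to see where it cut off

If the direct request finishes cleanly but the CLI still truncates, the problem is more likely on the client side (streaming handling or a low default max_tokens) than in the vLLM config.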

Running vLLM with Docker Compose:

services:
  vllm-qwen36-27b-dual-dflash-noviz:
    image: vllm/vllm-openai:nightly-1acd67a795ebccdf9b9db7697ae9082058301657
    container_name: vllm-qwen36-27b-dual-dflash-noviz
    restart: on-failure
    ports:
      - "${BIND_HOST:-0.0.0.0}:${PORT:-8080}:8000"
    volumes:
      - ${MODEL_DIR:-/home/ai/models/vllm}:/root/.cache/huggingface
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/torch_compile:/root/.cache/vllm/torch_compile_cache
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/cache/triton:/root/.triton/cache
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro
      - /home/ai/club-3090/models/qwen3.6-27b/vllm/patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN:-}
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - NCCL_CUMEM_ENABLE=0
      - NCCL_P2P_DISABLE=1
      - VLLM_NO_USAGE_STATS=1
      - VLLM_USE_FLASHINFER_SAMPLER=1
      - OMP_NUM_THREADS=1
      - PYTORCH_CUDA_ALLOC_CONF=${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True,max_split_size_mb:512}
    shm_size: "16gb"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "2"]
              capabilities: [gpu]
    entrypoint:
      - /bin/bash
      - -c
      - |
        exec vllm serve ${VLLM_ENFORCE_EAGER:+--enforce-eager} "$@"
      - --
    command:
      - --model
      - /root/.cache/huggingface/qwen3.6-27b-autoround-int4
      - --served-model-name
      - qwen
      - --quantization
      - auto_round
      - --dtype
      - bfloat16
      - --tensor-parallel-size
      - "2"
      - --disable-custom-all-reduce
      - --max-model-len
      - "${MAX_MODEL_LEN:-185000}"
      - --gpu-memory-utilization
      - "${GPU_MEMORY_UTILIZATION:-0.95}"
      - --max-num-seqs
      - "${MAX_NUM_SEQS:-2}"
      - --max-num-batched-tokens
      - "8192"
      - --language-model-only
      - --trust-remote-code
      - --reasoning-parser
      - qwen3
      - --default-chat-template-kwargs
      - '{"enable_thinking": true}'
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --speculative-config
      - '{"method":"dflash","model":"/root/.cache/huggingface/qwen3.6-27b-dflash","num_speculative_tokens":5}'
      - --host
      - 0.0.0.0
      - --port
      - "8000"

Based on https://github.com/noonghunna/club-3090
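
Since both qwen-code and opencode stream responses, it is also worth checking whether the stream itself ends early rather than the model stopping. A minimal streaming sketch with the openai client; the base_url, port, and model name again assume the setup above, and the prompt is only illustrative:

from openai import OpenAI

# Hypothetical streaming check: if the stream ends without any chunk carrying a
# finish_reason, the connection was cut mid-stream rather than the model stopping.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

chunks = 0
last_finish = None
stream = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Explain this repo file by file."}],
    max_tokens=4096,
    stream=True,
)
for chunk in stream:
    choice = chunk.choices[0]
    if choice.delta.content:
        chunks += 1
    if choice.finish_reason is not None:
        last_finish = choice.finish_reason

print(chunks, "content chunks, finish_reason =", last_finish)   # None here suggests the stream was cut off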

Any ideas on how to improve this?

submitted by /u/robertpro01

