Last active: February 22, 2026 22:03
Gist: armand1m/8f354797ed39f14e14cea0ed5c52c770
qwen3-coder-next - vllm 0.15.1 - transformers 5 - optimized for dgx spark
#!/bin/bash
docker run -d \
  --name vllm \
  --restart unless-stopped \
  --gpus all \
  --ipc host \
  --shm-size 64gb \
  --memory 110g \
  --memory-swap 120g \
  --pids-limit 4096 \
  -p 0.0.0.0:18080:8000 \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e VLLM_LOGGING_LEVEL="INFO" \
  -e NVIDIA_TF32_OVERRIDE="1" \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE="1" \
  -e VLLM_TORCH_COMPILE="1" \
  -e VLLM_FLOAT32_MATMUL_PRECISION="high" \
  -e VLLM_LOG_STATS_INTERVAL="10" \
  -e VLLM_ATTENTION_BACKEND="FLASHINFER" \
  -e VLLM_FLASHINFER_FORCE_TENSOR_CORES="1" \
  -e VLLM_FLASHINFER_MOE_BACKEND="throughput" \
  -e CUDA_VISIBLE_DEVICES="0" \
  -e PYTHONHASHSEED="0" \
  -e VLLM_USE_V2_MODEL_RUNNER="0" \
  -e VLLM_ENABLE_PREFIX_CACHING="1" \
  -e TORCH_CUDA_ARCH_LIST="12.1f" \
  -v $HOME/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.15.1-t5 \
  vllm serve Qwen/Qwen3-Coder-Next-FP8 \
    --served-model-name qwen3-coder-next \
    --load-format fastsafetensors \
    --attention-backend flashinfer \
    --port 8000 \
    --max-model-len 262144 \
    --block-size 128 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 131072 \
    --gpu-memory-utilization 0.80 \
    --kv-cache-dtype auto \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --disable-uvicorn-access-log \
    --kv-cache-metrics \
    --cudagraph-metrics \
    --enable-mfu-metrics \
    -cc.max_cudagraph_capture_size 512 \
    --tensor-parallel-size 1
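Once the container is up, a quick smoke test against the OpenAI-compatible API can confirm the model is being served (a usage sketch against the host port 18080 mapped above; requires the container to be running, and assumes jq is installed):

```shell
# List served models; the output should include "qwen3-coder-next"
curl -s http://localhost:18080/v1/models | jq '.data[].id'

# Minimal chat completion request
curl -s http://localhost:18080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder-next",
        "messages": [{"role": "user", "content": "Write hello world in bash"}],
        "max_tokens": 64
      }' | jq '.choices[0].message.content'
```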
Author
--gpu-memory-utilization 0.90 is too high; my Spark went OOM after one hour of coding.
With 0.80, after one day of coding I am at 170,000 tokens, with RAM at 117G/120G and 3.47GB of swap used, but it is still working.
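To catch this kind of memory creep before it OOMs, the container and GPU memory can be polled on the host (a sketch; assumes docker and nvidia-smi are available, and refers to the `vllm` container name and 110g `--memory` limit from the script above):

```shell
# Container RSS vs. the 110g --memory limit
docker stats vllm --no-stream --format '{{.MemUsage}}'

# Memory usage as reported by the GPU
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```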
Author
@capitangiaco indeed, 0.80 is safer. I reduced it as well
Author
also, better to use 0.16.0-t5 at this stage most likely
I will try it
I had to use --max-num-batched-tokens 65536;
with 131072 the system begins to swap at about 130-140K tokens.
I stopped the container at 114GB; I will retry with -e VLLM_TORCH_COMPILE="0".
The next steps are --max-num-batched-tokens 32768 and --gpu-memory-utilization 0.75.
I'm starting to think that with 128GB of RAM, the context to use should be 128K.
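The 128K-context intuition can be sanity-checked with back-of-the-envelope KV-cache arithmetic. A minimal sketch; the layer count, KV-head count, and head dimension below are illustrative placeholders, not the real Qwen3-Coder-Next config (check the model's config.json), and a 1-byte-per-element (FP8) KV cache is assumed:

```python
def kv_cache_gib(context_len, num_layers, num_kv_heads, head_dim,
                 bytes_per_elem=1, num_seqs=1):
    """Approximate KV-cache size: 2 (K and V) * layers * KV heads * head dim
    * bytes per element * tokens, summed over concurrent sequences."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * num_seqs / 2**30

# Illustrative placeholder dims (NOT the real model config):
# 48 layers, 8 KV heads (GQA), head_dim 128, FP8 KV cache.
for ctx in (131072, 262144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx, 48, 8, 128):.1f} GiB per sequence")
```

Under these assumed dims, halving --max-model-len from 256K to 128K roughly halves the per-sequence KV budget, which is consistent with the observation that 128K context leaves more headroom on a 128GB box.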
GLM4.7 analysed this instance's logs after I sent it a fairly long code-analysis request using OpenCode in plan mode.
Screenshots at the end.