@louity
Last active May 7, 2026 11:27

Deploy Qwen3.6-35B-A3B on DGX Spark

Use it with the Continue VS Code extension (once deployed)

Update the config.yaml file (usually located at ~/.continue/config.yaml):

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen3.6 35B (vLLM)
    provider: openai
    model: Qwen/Qwen3.6-35B-A3B
    apiBase: http://LOCAL_SERVER_IP:PORT/v1
    apiKey: YOUR_API_KEY
    roles:
      - chat
      - edit
      - apply
    requestOptions:
      verifySsl: false

Hardware

Machine    NVIDIA DGX Spark
GPU        GB10 (Blackwell, compute capability 12.1)
Memory     128 GB unified LPDDR5X (CPU + GPU share the same pool)
CPU        aarch64 (Cortex-X925 + Cortex-A725)
CUDA       13.0
Local IP   192.168.68.83

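Model weights are ~70 GB in BF16. The MoE design activates only ~3 B parameters per token, which keeps per-token compute and memory traffic low. KV cache size is set by the attention layout (layers, KV heads, head dimension) rather than by the active-parameter count; with the 128 GB unified pool, a 128 K context fits comfortably in this setup.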


1 — Install uv-managed Python

uv python install 3.12

Installs a self-contained CPython 3.12 with headers included (no system python3.12-dev needed).


2 — Create virtual environment

cd /home/user/llms
uv venv .venv --python cpython-3.12.13-linux-aarch64-gnu
source .venv/bin/activate

3 — Install vLLM

uv pip install "vllm>=0.19.0" --torch-backend=auto

--torch-backend=auto picks the PyTorch wheel that matches the local CUDA 13 / Blackwell setup automatically. On this machine it installed vLLM 0.20.0 with torch==2.11.0+cu130.


4 — Start the server

./start.sh
# or equivalently:
VLLM_API_KEY="changeme" ./start.sh

start.sh runs:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --quantization fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --api-key "changeme"

--quantization fp8 quantizes weights on the fly to FP8 (~35 GB vs ~70 GB in BF16), freeing ~35 GB of unified memory for a larger KV cache. The quality impact is usually small, and Blackwell (GB10) accelerates FP8 natively.
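The weight figures above follow from simple bytes-per-parameter arithmetic. The sketch below reproduces them, assuming a rounded total of 35 B parameters (taken from the model name) and counting GB as 10^9 bytes. It ignores the KV cache, activations, and runtime overhead, so the "left over" number is an upper bound rather than what vLLM will actually hand to the KV cache.

```python
# Back-of-the-envelope weight footprint for the numbers quoted above.
TOTAL_PARAMS = 35e9        # assumption: "35B" from the model name, rounded
UNIFIED_MEMORY_GB = 128    # from the Hardware section


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight size in GB (10^9 bytes) for a given storage width."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_gb(TOTAL_PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"FP8 weights:   ~{fp8_gb:.0f} GB")
print(f"Freed by FP8:  ~{bf16_gb - fp8_gb:.0f} GB")
print(f"Upper bound left for everything else: ~{UNIFIED_MEMORY_GB - fp8_gb:.0f} GB")
```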

The first run downloads the model from Hugging Face (~70 GB). Set HF_HOME before starting the server to control where it lands:

export HF_HOME=/home/user/llms/hf-cache

API endpoint: http://192.168.68.83:8000/v1 (reachable from any device on the LAN).


5 — Smoke test

From the DGX Spark itself:

curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

From another machine on the local network:

curl http://192.168.68.83:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
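The same smoke test can be run without curl. The sketch below uses only the Python standard library and mirrors the request above; the helper names `build_chat_request` and `smoke_test` are made up for this example, and the base URL and key are the placeholder values used in this guide.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen3.6-35B-A3B"


def build_chat_request(base_url, api_key, prompt, max_tokens=64):
    """Build the same POST request as the curl command, without sending it."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def smoke_test(base_url, api_key, prompt="Hello"):
    """Send one chat request and return the assistant message as a dict."""
    request = build_chat_request(base_url, api_key, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]
```

Calling `smoke_test("http://localhost:8000/v1", "changeme")` on the DGX Spark, or with the LAN address from another machine, should return a message dict. With thinking enabled and only 64 tokens, the visible `content` may well be empty because the budget is spent on reasoning, so a larger `max_tokens` is probably worth trying if that happens.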

Without thinking (replace IP_ADDRESS and the bearer token with your own values):

curl http://IP_ADDRESS:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mysecret" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello, are you working?"}],
    "max_tokens": 1000,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

6 — Connect from another machine

export OPENAI_BASE_URL="http://192.168.68.83:8000/v1"
export OPENAI_API_KEY="changeme"

Then use the OpenAI SDK or any OpenAI-compatible client normally.


Recommended sampling parameters

Mode                       temperature  top_p  top_k  presence_penalty
Thinking (general)         1.0          0.95   20     1.5
Thinking (coding/WebDev)   0.6          0.95   20     0.0
Non-thinking (instruct)    0.7          0.80   20     1.5

Suggested max_tokens: 32 768 for most queries; 81 920 for hard math/coding competitions.
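One way to keep these settings consistent across clients is to store them as presets and merge them into the request body. The sketch below does that for a raw JSON body, such as the one sent by the curl commands in this guide. The preset names and the `chat_body` helper are invented for illustration, and it assumes the server accepts `top_k` and `chat_template_kwargs` as top-level fields, which are vLLM extensions to the OpenAI schema.

```python
import json

MODEL = "Qwen/Qwen3.6-35B-A3B"

# Values copied from the table above; the preset names are hypothetical.
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def chat_body(prompt, preset, max_tokens=32768):
    """Return a chat/completions JSON body with the chosen preset applied."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    body.update(SAMPLING_PRESETS[preset])
    if preset == "non_thinking":
        # Same switch as in the "Without thinking" curl example.
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


print(json.dumps(chat_body("Write a binary search in Python.", "thinking_coding"), indent=2))
```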

To disable thinking (non-thinking mode) from the OpenAI Python SDK, pass this argument to chat.completions.create:

extra_body={"chat_template_kwargs": {"enable_thinking": False}}
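For context, here is a minimal sketch of where that argument goes in a full call. It assumes the `openai` Python package is installed and that OPENAI_BASE_URL and OPENAI_API_KEY are exported as in step 6; `thinking_extra_body` and `ask` are hypothetical helper names. Fields the SDK does not model, such as `top_k`, can ride along in the same `extra_body` dict.

```python
def thinking_extra_body(enable_thinking, top_k=None):
    """Collect the non-OpenAI fields that have to travel in extra_body."""
    extra = {"chat_template_kwargs": {"enable_thinking": enable_thinking}}
    if top_k is not None:
        extra["top_k"] = top_k
    return extra


def ask(prompt, enable_thinking=False, max_tokens=1000):
    """Send one prompt through the OpenAI SDK and return the reply text."""
    from openai import OpenAI  # third-party: pip install openai

    # With no arguments, the client reads OPENAI_BASE_URL and OPENAI_API_KEY.
    client = OpenAI()
    response = client.chat.completions.create(
        model="Qwen/Qwen3.6-35B-A3B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7,
        top_p=0.80,
        presence_penalty=1.5,
        extra_body=thinking_extra_body(enable_thinking, top_k=20),
    )
    return response.choices[0].message.content
```

With the server running, `ask("Hello, are you working?")` should return a direct answer without a reasoning phase.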

Context length tuning

start.sh sets the context length to 131 072 tokens (128 K). It can be raised to 262 144 natively; anything beyond that requires YaRN RoPE scaling:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.6-35B-A3B \
  --max-model-len 1010000 \
  --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11,11,10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
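The numbers in that command are related: YaRN stretches the native window by roughly the scaling factor, so 262 144 × 4.0 gives about 1 M positions, and the requested 1 010 000 sits just under that. The sketch below checks the arithmetic and rebuilds the `--hf-overrides` value from a dict, which is easier to edit than the inline JSON. The override keys are copied from the command above; the helper name `yarn_limit` is made up.

```python
import json
import shlex

NATIVE_CONTEXT = 262_144       # original_max_position_embeddings above
YARN_FACTOR = 4.0              # "factor" above
REQUESTED_MAX_LEN = 1_010_000  # --max-model-len above


def yarn_limit(native_context, factor):
    """Approximate context ceiling after YaRN scaling."""
    return int(native_context * factor)


overrides = {
    "text_config": {
        "rope_parameters": {
            "mrope_interleaved": True,
            "mrope_section": [11, 11, 10],
            "rope_type": "yarn",
            "rope_theta": 10000000,
            "partial_rotary_factor": 0.25,
            "factor": YARN_FACTOR,
            "original_max_position_embeddings": NATIVE_CONTEXT,
        }
    }
}

limit = yarn_limit(NATIVE_CONTEXT, YARN_FACTOR)
print(f"YaRN ceiling: {limit} tokens, requested: {REQUESTED_MAX_LEN}")
if REQUESTED_MAX_LEN > limit:
    print("Requested length exceeds native context x factor; raise the factor.")

# Shell-quoted value, ready to paste after --hf-overrides.
print("--hf-overrides", shlex.quote(json.dumps(overrides)))
```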

Run as a background service (optional)

nohup ./start.sh > /home/user/llms/vllm.log 2>&1 &
echo $! > /home/user/llms/vllm.pid

Stop it:

kill $(cat /home/user/llms/vllm.pid)
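Before stopping or restarting, it can help to confirm that the PID in the file still belongs to a live process. The sketch below is one way to check, assuming a POSIX system and the vllm.pid file written above; `server_pid` is a hypothetical helper. It only proves that some process with that PID exists, so a reused PID after a reboot would give a false positive.

```python
import os
from pathlib import Path


def server_pid(pid_file):
    """Return the PID stored in pid_file if that process is alive, else None."""
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or it does not contain a number
    if pid <= 0:
        return None  # 0 and negative values address process groups, not a PID
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending a signal
    except ProcessLookupError:
        return None  # stale PID file: the process is gone
    except PermissionError:
        return pid  # the process exists but belongs to another user
    return pid
```

For example, `server_pid("/home/user/llms/vllm.pid")` returns the PID while the server is up and `None` once it has exited.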

start.sh script:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/.venv/bin/activate"

# Set a real secret here — any client must send: Authorization: Bearer <API_KEY>
API_KEY="${VLLM_API_KEY:-changeme}"

exec vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --quantization fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --api-key "$API_KEY"

Usage: VLLM_API_KEY="mysecret" ./start.sh
