Update the config.yaml file (usually located at ~/.continue/config.yaml):

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Qwen3.6 35B (vLLM)
    provider: openai
    model: Qwen/Qwen3.6-35B-A3B
    apiBase: http://LOCAL_SERVER_IP:PORT/v1
    apiKey: YOUR_API_KEY
    roles:
      - chat
      - edit
      - apply
    requestOptions:
      verifySsl: false

| Component | Value |
|---|---|
| Machine | NVIDIA DGX Spark |
| GPU | GB10 (Blackwell, cc 12.1) |
| Memory | 128 GB unified LPDDR5X (CPU + GPU share the same pool) |
| CPU | aarch64 (Cortex-X925 + Cortex-A725) |
| CUDA | 13.0 |
| Local IP | 192.168.68.83 |
Model weights are ~70 GB in BF16. The MoE design activates only ~3 B parameters per token, so KV cache is cheap and a 128 K context fits comfortably.
uv python install 3.12

Installs a self-contained CPython 3.12 with headers included (no system python3.12-dev needed).

cd /home/user/llms
uv venv .venv --python cpython-3.12.13-linux-aarch64-gnu
source .venv/bin/activate
uv pip install "vllm>=0.19.0" --torch-backend=auto

--torch-backend=auto picks the right CUDA 13 / Blackwell PyTorch wheel automatically.
This installs vLLM 0.20.0 with torch==2.11.0+cu130.
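To confirm which versions actually landed in the virtual environment, a small check along these lines may help. It is only a sketch: `installed_versions` is a hypothetical helper name, and it reads package metadata rather than importing vLLM or PyTorch, so it runs quickly and does not touch the GPU.

```python
from importlib import metadata


def installed_versions(packages=("vllm", "torch")):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the venv activated; the reported versions should match the ones listed above if the install resolved the same wheels.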
./start.sh
# or equivalently:
VLLM_API_KEY="changeme" ./start.sh

start.sh runs:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --quantization fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --api-key "changeme"

--quantization fp8 quantizes weights on the fly to FP8 (~35 GB vs ~70 GB BF16), freeing ~35 GB of unified memory for a larger KV cache. Quality impact is minimal; Blackwell (GB10) accelerates FP8 natively.
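The ~70 GB and ~35 GB figures follow from bytes per parameter. The sketch below reproduces the arithmetic, assuming a round 35 billion parameters (taken from the "35B" in the model name) and ignoring any layers that stay in higher precision as well as runtime overhead.

```python
def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (10**9 bytes), ignoring overhead."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 35e9  # assumption: "35B" read as a round 35 billion parameters

bf16 = weight_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8 = weight_gb(TOTAL_PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"BF16: ~{bf16:.0f} GB, FP8: ~{fp8:.0f} GB, freed: ~{bf16 - fp8:.0f} GB")
# prints: BF16: ~70 GB, FP8: ~35 GB, freed: ~35 GB
```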
The first run downloads the model from Hugging Face (~70 GB). Set HF_HOME to control where it lands:

export HF_HOME=/home/user/llms/hf-cache

API endpoint: http://192.168.68.83:8000/v1 (reachable from any device on the LAN).
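Because the first start includes a large download and weight loading, the endpoint may not answer for a while. A minimal readiness poll could look like the following. It is a sketch: `list_models` and `wait_until_ready` are hypothetical helper names, and it assumes the server exposes the OpenAI-style `GET /v1/models` route behind the same bearer key.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url: str, api_key: str, timeout: float = 5.0) -> list:
    """Return the model IDs reported by GET {base_url}/models."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload["data"]]


def wait_until_ready(base_url, api_key, attempts=60, delay=10.0, probe=list_models):
    """Poll until the server answers; return the model IDs or raise TimeoutError."""
    for _ in range(attempts):
        try:
            return probe(base_url, api_key)
        except urllib.error.HTTPError:
            raise  # server is up but rejected the request (e.g. wrong API key)
        except OSError:
            # Connection refused or timed out: still loading, or not started yet.
            time.sleep(delay)
    raise TimeoutError(f"{base_url} did not become ready")
```

Calling `wait_until_ready("http://192.168.68.83:8000/v1", "changeme")` from a LAN client blocks until the server responds, then returns the served model IDs.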
From the DGX Spark itself:

curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

From another machine on the local network:

curl http://192.168.68.83:8000/v1/chat/completions \
  -H "Authorization: Bearer changeme" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'

Without thinking:
curl http://IP_ADDRESS:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mysecret" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello, are you working?"}],
    "max_tokens": 1000,
    "chat_template_kwargs": {"enable_thinking": false}
  }'

export OPENAI_BASE_URL="http://192.168.68.83:8000/v1"
export OPENAI_API_KEY="changeme"

Then use the OpenAI SDK or any OpenAI-compatible client normally.
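As an illustration of what an OpenAI-compatible client does with these two variables, here is a dependency-free sketch using only the Python standard library instead of the OpenAI SDK. `build_chat_request` and `chat` are hypothetical helper names; the request shape mirrors the curl examples above.

```python
import json
import os
import urllib.request


def build_chat_request(prompt: str, model: str = "Qwen/Qwen3.6-35B-A3B",
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST to {OPENAI_BASE_URL}/chat/completions from the environment."""
    base_url = os.environ["OPENAI_BASE_URL"].rstrip("/")
    api_key = os.environ["OPENAI_API_KEY"]
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt: str):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=300) as response:
        reply = json.load(response)
    # May be None if the token budget is used up before a final answer is produced.
    return reply["choices"][0]["message"]["content"]
```

With both variables exported, `chat("Hello")` sends the same request as the curl example; keeping request construction separate from sending makes the headers and body easy to inspect without a running server.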
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking — general | 1.0 | 0.95 | 20 | 1.5 |
| Thinking — coding/WebDev | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (instruct) | 0.7 | 0.80 | 20 | 1.5 |
Suggested max_tokens: 32 768 for most queries; 81 920 for hard math/coding competitions.
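The table rows can be kept as named presets and merged into the request body. This is a sketch under two assumptions: the preset keys and `chat_payload` are hypothetical names, and `top_k` is sent as a top-level field in the JSON body, which vLLM accepts as an extra sampling parameter (with the OpenAI SDK it would go through `extra_body` instead).

```python
PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def chat_payload(prompt, mode="thinking_general", max_tokens=32768,
                 model="Qwen/Qwen3.6-35B-A3B"):
    """Return a /v1/chat/completions JSON body using one preset row."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **PRESETS[mode],
    }
    if mode == "non_thinking":
        # Same switch as in the "Without thinking" curl example.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload
```

For example, `chat_payload("Refactor this function", mode="thinking_coding")` yields a body with temperature 0.6 and no presence penalty, ready to be serialized with `json.dumps`.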
To disable thinking (non-thinking mode), pass this argument when using the OpenAI Python SDK:

extra_body={"chat_template_kwargs": {"enable_thinking": False}}

The default context length is 131 072 tokens (128 K). It can be increased up to 262 144 natively; going beyond that requires YaRN:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.6-35B-A3B \
  --max-model-len 1010000 \
  --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11,11,10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'

Run the server in the background:

nohup ./start.sh > /home/user/llms/vllm.log 2>&1 &
echo $! > /home/user/llms/vllm.pid

Stop it:

kill "$(cat /home/user/llms/vllm.pid)"

start.sh script:
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/.venv/bin/activate"

# Set a real secret here; any client must send: Authorization: Bearer <API_KEY>
API_KEY="${VLLM_API_KEY:-changeme}"

exec vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --quantization fp8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --api-key "$API_KEY"

Usage:

VLLM_API_KEY="mysecret" ./start.sh
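Since the server listens on 0.0.0.0, a random secret is safer than "changeme" or "mysecret". One way to generate one, sketched with Python's standard `secrets` module (`new_api_key` is a hypothetical helper name):

```python
import secrets


def new_api_key(num_bytes: int = 32) -> str:
    """Return a random URL-safe token suitable for use as a bearer secret."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(new_api_key())
```

Export the printed value as VLLM_API_KEY before running ./start.sh, and use the same value as the bearer token in clients.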