Turn the Radxa Dragon Q6A into a self-hosted AI appliance. Features:
- Brain: Llama 3.2 1B (4096 context) running on NPU (Real-time).
- Ears: Whisper Small running on CPU (Fast).
- Interface: Open WebUI (ChatGPT-style) accessible over WiFi.
- Hardware: Radxa Dragon Q6A (QCS6490)
- OS: Ubuntu 24.04 Noble (T7 image or newer)
- Status: β Production Ready (Jan 2026)
```bash
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 \
    python3-pip python3.12-venv libportaudio2 ffmpeg git docker.io
```
```bash
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
```
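The udev rules above make the DSP device nodes world-accessible. A quick sketch to confirm they took effect (the exact node names, e.g. `fastrpc-adsp`/`fastrpc-cdsp`, vary by SoC, so treat them as assumptions):

```python
import glob
import os
import stat

def find_world_rw_nodes(pattern="/dev/fastrpc-*"):
    """Return paths matching `pattern` that are world-readable and writable (0666)."""
    nodes = []
    for path in sorted(glob.glob(pattern)):
        mode = os.stat(path).st_mode
        # MODE="0666" in the udev rule grants "other" users read+write
        if mode & stat.S_IROTH and mode & stat.S_IWOTH:
            nodes.append(path)
    return nodes

if __name__ == "__main__":
    found = find_world_rw_nodes()
    print(found or "No accessible fastrpc nodes - recheck the udev rules")
```

If this prints an empty result after a reboot, the NPU runner later in this guide will fail with permission errors.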
```bash
python3 -m venv ~/qai-venv
source ~/qai-venv/bin/activate
pip install --upgrade pip
pip install "qai-hub-models[whisper-small]" librosa sounddevice fastapi uvicorn
deactivate  # Exit venv for system-wide modelscope
pip3 install modelscope --break-system-packages
```
```bash
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .
chmod +x genie-t2t-run
```
This allows Open WebUI to talk to the NPU.
Create the API bridge at `~/llama-4k/api_server.py` (e.g. with `nano ~/llama-4k/api_server.py`):

```python
import os, subprocess, re, time
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Optional

app = FastAPI()
MODEL_DIR = os.path.expanduser("~/llama-4k")
RUNNER = "./genie-t2t-run"
CONFIG = "htp-model-config-llama32-1b-gqa.json"

class ChatRequest(BaseModel):
    messages: List[dict]
    model: Optional[str] = "llama3"

def run_npu(prompt):
    env = os.environ.copy()
    env["LD_LIBRARY_PATH"] = f"{MODEL_DIR}:{env.get('LD_LIBRARY_PATH', '')}"
    result = subprocess.run(
        [RUNNER, "-c", CONFIG, "-p", prompt],
        cwd=MODEL_DIR, env=env, capture_output=True, text=True,
    )
    match = re.search(r'\[BEGIN\]:(.*?)\[END\]', result.stdout, re.DOTALL)
    return match.group(1).strip() if match else "Error parsing NPU output."

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{"id": "llama3-npu", "object": "model",
                  "created": int(time.time()), "owned_by": "radxa"}],
    }

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # Simple Llama 3 chat format
    prompt = "<|begin_of_text|>"
    for m in req.messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return {"choices": [{"message": {"role": "assistant", "content": run_npu(prompt)}}]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

The systemd unit below ensures the API bridge starts automatically on boot.
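Before enabling the service, the two pure pieces of the bridge, the Llama 3 chat template and the `[BEGIN]`/`[END]` output parser, can be sanity-checked without touching the NPU. This is a standalone sketch mirroring the logic in `api_server.py`:

```python
import re

def build_llama3_prompt(messages):
    """Assemble the Llama 3 chat template exactly as the bridge does."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open the assistant turn so the model generates the reply
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

def parse_genie_output(stdout):
    """Extract the completion genie-t2t-run prints between [BEGIN]: and [END]."""
    match = re.search(r'\[BEGIN\]:(.*?)\[END\]', stdout, re.DOTALL)
    return match.group(1).strip() if match else None

print(build_llama3_prompt([{"role": "user", "content": "Hi"}]))
print(parse_genie_output("[BEGIN]: Hello there [END]"))  # -> Hello there
```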
```bash
sudo tee /etc/systemd/system/radxa-llm.service << 'EOF'
[Unit]
Description=Radxa NPU LLM Bridge
After=network.target

[Service]
User=radxa
WorkingDirectory=/home/radxa/llama-4k
ExecStart=/home/radxa/qai-venv/bin/python /home/radxa/llama-4k/api_server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now radxa-llm
```
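Once the service is running, a quick health check against the bridge confirms the NPU side is reachable (assumes the bridge listens on `localhost:8000` as configured above):

```python
import json
import urllib.request

def model_ids(payload):
    """Pull model ids out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    # The bridge's models endpoint should report the NPU-backed model
    with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
        print(model_ids(json.load(resp)))  # expect ['llama3-npu']
```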
This runs the beautiful web interface.
```bash
# Add user to docker group
sudo usermod -aG docker radxa
newgrp docker

# Run the container (host.docker.internal resolves to the board via --add-host)
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEY="sk-dummy" \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```
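Open WebUI is optional for testing: the bridge speaks the OpenAI chat API directly, so a minimal client is enough to exercise the NPU end to end. A sketch, assuming the bridge is on `localhost:8000`:

```python
import json
import urllib.request

API = "http://localhost:8000/v1/chat/completions"  # the bridge from api_server.py

def chat_payload(user_text, model="llama3-npu"):
    """Build the minimal OpenAI-style request body the bridge expects."""
    return {"model": model, "messages": [{"role": "user", "content": user_text}]}

def extract_reply(response):
    """Pull the assistant text out of a chat/completions response."""
    return response["choices"][0]["message"]["content"]

if __name__ == "__main__":
    body = json.dumps(chat_payload("What is an NPU?")).encode()
    req = urllib.request.Request(API, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(extract_reply(json.load(resp)))
```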
Open any browser on your WiFi network and go to `http://<BOARD_IP_ADDRESS>:3000` (find the board's IP with `ip addr show wlan0`).
- Text: Select `llama3-npu` from the model dropdown and type.
- Voice: Click the microphone icon in the text box. Your browser (phone/laptop) converts speech to text via the Web Speech API and sends it to the Dragon.
- Check logs: `sudo journalctl -u radxa-llm -f`
- Restart AI: `sudo systemctl restart radxa-llm`
- Restart UI: `docker restart open-webui`
| Component | Processor | Latency |
|---|---|---|
| Llama 3.2 1B | NPU (Hexagon) | ~15 tokens/sec |
| Model Load | Disk -> NPU | ~2.0 seconds (per message) |
| Voice Recog. | Client Device | Instant (Web Speech API) |
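A back-of-envelope latency estimate from the numbers above: since Genie reloads the model on every message, each reply costs the ~2 s load plus generation time at ~15 tokens/sec:

```python
def response_time(tokens, tok_per_sec=15.0, load_sec=2.0):
    """Rough wall-clock time for one reply: per-message model load + generation."""
    return load_sec + tokens / tok_per_sec

print(f"{response_time(100):.1f} s")  # a 100-token answer -> 8.7 s
```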