Turn the Radxa Dragon Q6A into a self-hosted AI appliance. Features:
- Brain: Llama 3.2 1B (4096 context) running on NPU (Real-time).
- Ears: Whisper Small running on CPU (Fast).
- Interface: Open WebUI (ChatGPT-style) accessible over WiFi.
- Hardware: Radxa Dragon Q6A (QCS6490)
- OS: Ubuntu 24.04 Noble (T7 image or newer)
- Status: β Production Ready (Jan 2026)
```bash
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 \
    python3-pip python3.12-venv libportaudio2 ffmpeg git docker.io
```
```bash
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
```
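The udev rules above make the DSP device nodes world-accessible. A quick sketch to confirm they took effect (the exact node names, e.g. `fastrpc-adsp`/`fastrpc-cdsp`, vary by SoC, so treat them as assumptions):

```python
import glob
import os
import stat

def find_world_rw_nodes(pattern="/dev/fastrpc-*"):
    """Return paths matching `pattern` that are world-readable and writable (0666)."""
    nodes = []
    for path in sorted(glob.glob(pattern)):
        mode = os.stat(path).st_mode
        # MODE="0666" in the udev rule grants "other" users read+write
        if mode & stat.S_IROTH and mode & stat.S_IWOTH:
            nodes.append(path)
    return nodes

if __name__ == "__main__":
    found = find_world_rw_nodes()
    print(found or "No accessible fastrpc nodes - recheck the udev rules")
```

If this prints an empty result after a reboot, the NPU runner later in this guide will fail with permission errors.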
```bash
python3 -m venv ~/qai-venv
source ~/qai-venv/bin/activate
pip install --upgrade pip
pip install "qai-hub-models[whisper-small]" librosa sounddevice fastapi uvicorn
deactivate  # Exit venv for system-wide modelscope
pip3 install modelscope --break-system-packages
```
```bash
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .
chmod +x genie-t2t-run
```
This allows Open WebUI to talk to the NPU.
Create the API bridge at `~/llama-4k/api_server.py` (e.g. with `nano ~/llama-4k/api_server.py`):

```python
import os, subprocess, re, time
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Optional

app = FastAPI()
MODEL_DIR = os.path.expanduser("~/llama-4k")
RUNNER = "./genie-t2t-run"
CONFIG = "htp-model-config-llama32-1b-gqa.json"

class ChatRequest(BaseModel):
    messages: List[dict]
    model: Optional[str] = "llama3"

def run_npu(prompt):
    env = os.environ.copy()
    env["LD_LIBRARY_PATH"] = f"{MODEL_DIR}:{env.get('LD_LIBRARY_PATH', '')}"
    result = subprocess.run(
        [RUNNER, "-c", CONFIG, "-p", prompt],
        cwd=MODEL_DIR, env=env, capture_output=True, text=True,
    )
    match = re.search(r'\[BEGIN\]:(.*?)\[END\]', result.stdout, re.DOTALL)
    return match.group(1).strip() if match else "Error parsing NPU output."

@app.get("/v1/models")
async def list_models():
    return {
        "object": "list",
        "data": [{"id": "llama3-npu", "object": "model",
                  "created": int(time.time()), "owned_by": "radxa"}],
    }

@app.post("/v1/chat/completions")
async def chat(req: ChatRequest):
    # Simple Llama 3 chat format
    prompt = "<|begin_of_text|>"
    for m in req.messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return {"choices": [{"message": {"role": "assistant", "content": run_npu(prompt)}}]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

The systemd unit below ensures the API bridge starts automatically on boot.
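Before enabling the service, the two pure pieces of the bridge, the Llama 3 chat template and the `[BEGIN]`/`[END]` output parser, can be sanity-checked without touching the NPU. This is a standalone sketch mirroring the logic in `api_server.py`:

```python
import re

def build_llama3_prompt(messages):
    """Assemble the Llama 3 chat template exactly as the bridge does."""
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # Open the assistant turn so the model generates the reply
    return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n"

def parse_genie_output(stdout):
    """Extract the completion genie-t2t-run prints between [BEGIN]: and [END]."""
    match = re.search(r'\[BEGIN\]:(.*?)\[END\]', stdout, re.DOTALL)
    return match.group(1).strip() if match else None

print(build_llama3_prompt([{"role": "user", "content": "Hi"}]))
print(parse_genie_output("[BEGIN]: Hello there [END]"))  # -> Hello there
```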
```bash
sudo tee /etc/systemd/system/radxa-llm.service << 'EOF'
[Unit]
Description=Radxa NPU LLM Bridge
After=network.target

[Service]
User=radxa
WorkingDirectory=/home/radxa/llama-4k
ExecStart=/home/radxa/qai-venv/bin/python /home/radxa/llama-4k/api_server.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now radxa-llm
```
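Once the service is running, a quick health check against the bridge confirms the NPU side is reachable (assumes the bridge listens on `localhost:8000` as configured above):

```python
import json
import urllib.request

def model_ids(payload):
    """Pull model ids out of an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

if __name__ == "__main__":
    # The bridge's models endpoint should report the NPU-backed model
    with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
        print(model_ids(json.load(resp)))  # expect ['llama3-npu']
```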
This runs the beautiful web interface.
```bash
# Add user to docker group
sudo usermod -aG docker radxa
newgrp docker

# Run the container (host.docker.internal resolves to the board via --add-host)
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEY="sk-dummy" \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```
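Open WebUI is optional for testing: the bridge speaks the OpenAI chat API directly, so a minimal client is enough to exercise the NPU end to end. A sketch, assuming the bridge is on `localhost:8000`:

```python
import json
import urllib.request

API = "http://localhost:8000/v1/chat/completions"  # the bridge from api_server.py

def chat_payload(user_text, model="llama3-npu"):
    """Build the minimal OpenAI-style request body the bridge expects."""
    return {"model": model, "messages": [{"role": "user", "content": user_text}]}

def extract_reply(response):
    """Pull the assistant text out of a chat/completions response."""
    return response["choices"][0]["message"]["content"]

if __name__ == "__main__":
    body = json.dumps(chat_payload("What is an NPU?")).encode()
    req = urllib.request.Request(API, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(extract_reply(json.load(resp)))
```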
Open any browser on your WiFi network and go to `http://<BOARD_IP_ADDRESS>:3000` (find the board's IP with `ip addr show wlan0`).
- Text: Select `llama3-npu` from the model dropdown and type.
- Voice: Click the microphone icon in the text box. Your browser (phone/laptop) converts speech to text via the Web Speech API and sends it to the Dragon.
- Check logs: `sudo journalctl -u radxa-llm -f`
- Restart AI: `sudo systemctl restart radxa-llm`
- Restart UI: `docker restart open-webui`
| Component | Processor | Latency |
|---|---|---|
| Llama 3.2 1B | NPU (Hexagon) | ~15 tokens/sec |
| Model Load | Disk -> NPU | ~2.0 seconds (per message) |
| Voice Recog. | Client Device | Instant (Web Speech API) |
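A back-of-envelope latency estimate from the numbers above: since Genie reloads the model on every message, each reply costs the ~2 s load plus generation time at ~15 tokens/sec:

```python
def response_time(tokens, tok_per_sec=15.0, load_sec=2.0):
    """Rough wall-clock time for one reply: per-message model load + generation."""
    return load_sec + tokens / tok_per_sec

print(f"{response_time(100):.1f} s")  # a 100-token answer -> 8.7 s
```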