Run Llama 3.2 1B (4096 Context) on the 12 TOPS Hexagon NPU.
- Hardware: Radxa Dragon Q6A (QCS6490)
- OS: Ubuntu 24.04 Noble (T7 image or newer)
- Status: ✅ Verified working (Jan 29, 2026)
- Login: `radxa` / `radxa`
- Install NPU drivers & tools:
```shell
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 python3-pip git
```
- Set Permanent NPU Permissions (Fixes "Permission denied" errors after reboot):
```shell
# Create a udev rule to auto-grant permissions on boot
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF

# Apply the rules immediately
sudo udevadm control --reload-rules
sudo udevadm trigger
```
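To confirm the rules actually took effect, a small check like the one below can report each node's mode. The exact device paths (`/dev/fastrpc-cdsp`, `/dev/dma_heap/system`) are assumptions based on typical QCS6490 images; adjust them to whatever `ls /dev/fastrpc-*` shows on your board.

```shell
# check_mode: report whether a device node exists and is world-read/writable (mode 666)
# (device paths below are assumptions for this board; verify with `ls /dev/fastrpc-*`)
check_mode() {
  local path="$1" mode
  if mode=$(stat -c '%a' "$path" 2>/dev/null); then
    if [ "$mode" = "666" ]; then
      echo "ok: $path"
    else
      echo "wrong mode ($mode): $path"
    fi
  else
    echo "missing: $path"
  fi
}

check_mode /dev/fastrpc-cdsp
check_mode /dev/dma_heap/system
```

If either line reports a wrong mode after a reboot, re-run the `udevadm` commands above.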
We use the 4096 context version for longer, more coherent conversations.
```shell
# Install the ModelScope downloader
pip3 install modelscope --break-system-packages

# Create a working directory and download the model
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .

# Verify the download (look for a .bin file of ~1.7 GB)
ls -lh models/
```
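As an extra sanity check, a sketch like the following confirms that a sufficiently large weights file actually landed, wherever it ended up under the download directory. The `has_large_bin` helper name and the 1500 MB threshold are our choices; the ~1.7 GB figure comes from the listing above.

```shell
# has_large_bin: succeed if DIR contains a .bin file of at least MIN_MB megabytes
# (helper name and threshold are ours, chosen to catch truncated downloads)
has_large_bin() {
  local dir="$1" min_mb="$2"
  find "$dir" -name '*.bin' -size +"${min_mb}M" 2>/dev/null | grep -q .
}

if has_large_bin . 1500; then
  echo "model weights look complete"
else
  echo "download looks incomplete - re-run the modelscope command"
fi
```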
Manually typing the run command is error-prone. We will create a chat shortcut script.
- Create the script:
```shell
cd ~/llama-4k
cat << 'EOF' > chat
#!/bin/bash
# NPU Chat Wrapper for Dragon Q6A

# 1. Enter the model directory
cd ~/llama-4k

# 2. Set the library path (crucial for the NPU runtime libraries)
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"

# 3. Format the prompt with Llama 3 special tokens:
#    <|begin_of_text|><|start_header_id|>user...
#    Use real newlines via $'\n' - a "\n" inside double quotes stays a
#    literal backslash-n in bash and would end up in the prompt as text.
NL=$'\n'
FULL_PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>${NL}${NL}$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>${NL}${NL}"

# 4. Run inference
#    Note: do NOT add -m or -t flags. The JSON config handles file paths.
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json -p "$FULL_PROMPT"
EOF
```
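To see exactly what the wrapper hands to the runtime, the Llama 3 chat template can be previewed on its own, with no NPU required. This mirrors the template used in the script; `format_prompt` is a name we introduce for illustration.

```shell
# format_prompt: wrap a user message in Llama 3 chat special tokens
# (illustrative helper; reproduces the template used by the chat script)
format_prompt() {
  local nl=$'\n'
  printf '%s' "<|begin_of_text|><|start_header_id|>user<|end_header_id|>${nl}${nl}$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>${nl}${nl}"
}

format_prompt "Hello there"
```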
- Make it executable:
```shell
chmod +x chat
chmod +x genie-t2t-run
```
- Run it:
```shell
./chat "Explain quantum physics to a 5 year old"
```
| Metric | Llama 3.2 1B (4096 Context) |
|---|---|
| RAM Usage | ~80 MB (NPU Buffer) + ~2GB System RAM |
| Model Load Time | ~1.5 seconds |
| Inference Speed | ~12 - 15 tokens/sec (Real-time) |
| Device | QCS6490 (Proxy) |
| Error Message | Cause | Solution |
|---|---|---|
| `Unknown option: -m` | Using incorrect flags | Do not use `-m` or `-t`. Use only `-c config.json`. |
| `Please provide an embedding file` | Confusion with flags | Remove `-t tokenizer.json` from your command. |
| `Permission denied` (genie-t2t-run) | File execution rights | Run `chmod +x genie-t2t-run`. |
| `Permission denied` (/dev/fastrpc) | Driver access | Re-run the udev permission commands from the setup above. |
| `cannot open shared object file` | Missing library path | Ensure `export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"` is set. |
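The file-related checks from the table can be rolled into one quick diagnostic. This is a sketch under the directory layout used above; `check_setup` is our name, and the `/dev/fastrpc-cdsp` path is an assumption about the board's device naming.

```shell
# check_setup: verify the chat directory has the pieces the runtime needs
# (helper name is ours; the fastrpc device path is an assumption for this board)
check_setup() {
  local dir="$1" rc=0
  [ -x "$dir/genie-t2t-run" ] || { echo "genie-t2t-run missing or not executable"; rc=1; }
  [ -f "$dir/htp-model-config-llama32-1b-gqa.json" ] || { echo "model config JSON missing"; rc=1; }
  # Advisory only: re-run the udev steps if this warns
  [ -w /dev/fastrpc-cdsp ] || echo "note: /dev/fastrpc-cdsp not writable (or absent)"
  return "$rc"
}

check_setup ~/llama-4k && echo "setup looks good" || echo "fix the items above and retry"
```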