Run Llama 3.2 1B on the 12 TOPS Hexagon NPU.
Hardware: Radxa Dragon Q6A (QCS6490)
OS: Ubuntu 24.04 Noble
Last tested: January 2026 (T7 image)
Download the latest image from GitHub Releases.
# Insert SD card and identify device
lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
# Unmount partitions (replace <username> with yours)
sudo umount /media/<username>/config 2>/dev/null
sudo umount /media/<username>/rootfs 2>/dev/null
# Flash image (replace /dev/sdb with your SD card device)
xzcat radxa-dragon-q6a_noble_gnome_t7.output_512.img.xz | sudo dd of=/dev/sdb bs=4M status=progress conv=fsync
sudo sync
⚠️ Warning: Double-check the device path. Using the wrong device will destroy data.
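The warning above can be partly automated. The following is a minimal sketch, not part of any Radxa tooling: `check_target` is a hypothetical helper name. It assumes `lsblk` from util-linux is available on the flashing host. It refuses anything that is not a whole, unmounted disk, but it cannot tell your SD card apart from another unmounted disk, so still compare the printed size against your card.

```bash
# check_target is a hypothetical helper; it only rules out the obvious mistakes.
check_target() {
    local dev="$1"
    # Must exist and be a block device
    if [ ! -b "$dev" ]; then
        echo "Not a block device: $dev" >&2
        return 1
    fi
    # Must be a whole disk (e.g. /dev/sdb), not a partition (e.g. /dev/sdb1)
    local dev_type
    dev_type=$(lsblk -dn -o TYPE "$dev")
    if [ "$dev_type" != "disk" ]; then
        echo "Not a whole disk: $dev (type: $dev_type)" >&2
        return 1
    fi
    # No partition on the disk may still be mounted
    if lsblk -n -o MOUNTPOINT "$dev" | grep -q .; then
        echo "Device still has mounted partitions: $dev" >&2
        return 1
    fi
    # Show what is about to be overwritten so the size can be checked by eye
    lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT "$dev"
}
```

One way to use it is to gate the flash command on the check, for example `check_target /dev/sdb && xzcat radxa-dragon-q6a_noble_gnome_t7.output_512.img.xz | sudo dd of=/dev/sdb bs=4M status=progress conv=fsync`.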
- Insert SD card into Dragon Q6A
- Connect power, monitor, keyboard (or use SSH after boot)
- Login: radxa/radxa
# Install NPU packages
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490
# Set device permissions
sudo chmod 666 /dev/fastrpc-*
sudo chmod 666 /dev/dma_heap/system
# Reboot to apply changes
sudo reboot
After reboot, SSH back in or use the console:
# Set permissions again (resets after reboot)
sudo chmod 666 /dev/fastrpc-*
sudo chmod 666 /dev/dma_heap/system
# Install modelscope
export PATH="$HOME/.local/bin:$PATH"
pip3 install modelscope --break-system-packages
# Download model (~1.7GB)
mkdir -p ~/llama-test && cd ~/llama-test
modelscope download --model radxa/Llama3.2-1B-1024-qairt-v68 --local_dir .
# Run inference
chmod +x genie-t2t-run
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json \
-p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
Expected output:
[INFO] "Using create From Binary List Async"
[INFO] "Allocated total size = 33333760 across 1 buffers"
[PROMPT]: ...
[BEGIN]: Hello! How can I assist you today?[END]
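The prompt passed to `-p` above follows the Llama 3 chat template: each turn is wrapped in `<|start_header_id|>role<|end_header_id|>`, ends with `<|eot_id|>`, and the string finishes with an open assistant header. Typing this by hand is error-prone, so here is a small sketch that assembles it. `build_prompt` is a hypothetical helper name, and it assumes `genie-t2t-run` accepts the literal `\n` sequences exactly as in the command above.

```bash
# build_prompt SYSTEM USER
# Prints a single-turn Llama 3 prompt. The \n sequences stay as literal
# backslash-n characters, matching the single-quoted command above.
build_prompt() {
    local system="$1"
    local user="$2"
    printf '%s' "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n${system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n${user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
}

# Example (run from ~/llama-test with LD_LIBRARY_PATH set as above):
# ./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json \
#     -p "$(build_prompt 'You are a helpful assistant' 'Hello')"
```

Keeping the system and user text as separate arguments makes it easy to change one without touching the template tokens.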
To avoid running chmod after every reboot:
# Create udev rule
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
sudo udevadm control --reload-rules
sudo udevadm trigger
Save as ~/test-npu.sh:
#!/bin/bash
# Stop early if the model directory is missing
cd ~/llama-test || exit 1
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json \
-p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Usage:
chmod +x ~/test-npu.sh
~/test-npu.sh "Write a haiku about Linux"

| Issue | Solution |
|---|---|
| cannot open shared object file | Run export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH" |
| Error 14001 Device Creation Failure | Install libcdsprpc1 and set permissions |
| Permission denied on /dev/fastrpc-* | Run sudo chmod 666 /dev/fastrpc-* |
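The checks behind this table can be run in one pass before starting inference. The sketch below is an illustration only: `preflight` is a hypothetical helper name, and the file names are the ones used earlier in this guide. It compares the directory against `LD_LIBRARY_PATH` as a plain string, so pass the same absolute path that `$(pwd)` produced.

```bash
# preflight MODEL_DIR [DEVICE_NODE...]
# Prints one line per problem found and returns non-zero if there was any.
preflight() {
    local dir="$1"
    shift
    local status=0
    local node
    # The runner must be present and executable (chmod +x genie-t2t-run)
    if [ ! -x "$dir/genie-t2t-run" ]; then
        echo "missing or not executable: $dir/genie-t2t-run"
        status=1
    fi
    # The model config must have been downloaded
    if [ ! -f "$dir/htp-model-config-llama32-1b-gqa.json" ]; then
        echo "missing config: $dir/htp-model-config-llama32-1b-gqa.json"
        status=1
    fi
    # The model directory must be on the library search path
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;
        *) echo "not in LD_LIBRARY_PATH: $dir"; status=1 ;;
    esac
    # Each device node must be readable and writable by the current user
    for node in "$@"; do
        if [ ! -r "$node" ] || [ ! -w "$node" ]; then
            echo "no read/write access: $node"
            status=1
        fi
    done
    return "$status"
}

# Example:
# preflight "$HOME/llama-test" /dev/fastrpc-* /dev/dma_heap/system
```

If `/dev/fastrpc-*` matches nothing, the shell passes the pattern through unchanged and the helper reports it as inaccessible, which usually means the NPU packages or firmware are not installed.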
| Metric | Value |
|---|---|
| Model | Llama 3.2 1B (INT8 quantized) |
| Context length | 1024 tokens |
| Inference speed | ~10-12 tokens/sec |
| First response | ~1.8s (includes model load) |
| Longer generation | ~6s for ~50 tokens |
Tested on Radxa Dragon Q6A 8GB with T7 image (January 2026)
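The table figures can be cross-checked with a little arithmetic. This is a sketch under one stated assumption that the source does not confirm: that roughly 1.8s of the ~6s run is model load and first response, leaving about 4.2s of steady generation. `tokens_per_sec` is a hypothetical helper name.

```bash
# tokens_per_sec TOKENS SECONDS
# Prints throughput rounded to one decimal place.
tokens_per_sec() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

tokens_per_sec 50 6      # whole run, including model load
tokens_per_sec 50 4.2    # assumption: 6s minus ~1.8s of load and first response
```

Under that assumption the second figure falls inside the ~10-12 tokens/sec range, while the first shows what you would measure by timing the whole command.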