Run Llama 3.2 1B (4096 Context) on the 12 TOPS Hexagon NPU.
- Hardware: Radxa Dragon Q6A (QCS6490)
- OS: Ubuntu 24.04 Noble (T7 image or newer)
- Status: ✅ Verified working (Jan 29, 2026)
- Login: `radxa` / `radxa`
- Install NPU drivers & tools:
```shell
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 python3-pip git
```
- Set Permanent NPU Permissions (Fixes "Permission denied" errors after reboot):
```shell
# Create a udev rule to auto-grant permissions on boot
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF

# Apply the rules immediately
sudo udevadm control --reload-rules
sudo udevadm trigger
```
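To confirm the rules actually took effect, a small check like the one below can report each node's mode. The exact device paths (`/dev/fastrpc-cdsp`, `/dev/dma_heap/system`) are assumptions based on typical QCS6490 images; adjust them to whatever `ls /dev/fastrpc-*` shows on your board.

```shell
# check_mode: report whether a device node exists and is world-read/writable (mode 666)
# (device paths below are assumptions for this board; verify with `ls /dev/fastrpc-*`)
check_mode() {
  local path="$1" mode
  if mode=$(stat -c '%a' "$path" 2>/dev/null); then
    if [ "$mode" = "666" ]; then
      echo "ok: $path"
    else
      echo "wrong mode ($mode): $path"
    fi
  else
    echo "missing: $path"
  fi
}

check_mode /dev/fastrpc-cdsp
check_mode /dev/dma_heap/system
```

If either line reports a wrong mode after a reboot, re-run the `udevadm` commands above.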
We use the 4096 context version for longer, more coherent conversations.
```shell
# Install the ModelScope downloader
pip3 install modelscope --break-system-packages

# Create a working directory and download the model
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .

# Verify the download (look for a .bin file of ~1.7 GB)
ls -lh models/
```
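As an extra sanity check, a sketch like the following confirms that a sufficiently large weights file actually landed, wherever it ended up under the download directory. The `has_large_bin` helper name and the 1500 MB threshold are our choices; the ~1.7 GB figure comes from the listing above.

```shell
# has_large_bin: succeed if DIR contains a .bin file of at least MIN_MB megabytes
# (helper name and threshold are ours, chosen to catch truncated downloads)
has_large_bin() {
  local dir="$1" min_mb="$2"
  find "$dir" -name '*.bin' -size +"${min_mb}M" 2>/dev/null | grep -q .
}

if has_large_bin . 1500; then
  echo "model weights look complete"
else
  echo "download looks incomplete - re-run the modelscope command"
fi
```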
Manually typing the run command is error-prone. We will create a chat shortcut script.
- Create the script:
```shell
cd ~/llama-4k
cat << 'EOF' > chat
#!/bin/bash
# NPU Chat Wrapper for Dragon Q6A

# 1. Enter the model directory
cd ~/llama-4k

# 2. Set the library path (crucial for the NPU runtime libraries)
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"

# 3. Format the prompt with Llama 3 special tokens:
#    <|begin_of_text|><|start_header_id|>user...
#    Use real newlines via $'\n' - a "\n" inside double quotes stays a
#    literal backslash-n in bash and would end up in the prompt as text.
NL=$'\n'
FULL_PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>${NL}${NL}$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>${NL}${NL}"

# 4. Run inference
#    Note: do NOT add -m or -t flags. The JSON config handles file paths.
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json -p "$FULL_PROMPT"
EOF
```
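To see exactly what the wrapper hands to the runtime, the Llama 3 chat template can be previewed on its own, with no NPU required. This mirrors the template used in the script; `format_prompt` is a name we introduce for illustration.

```shell
# format_prompt: wrap a user message in Llama 3 chat special tokens
# (illustrative helper; reproduces the template used by the chat script)
format_prompt() {
  local nl=$'\n'
  printf '%s' "<|begin_of_text|><|start_header_id|>user<|end_header_id|>${nl}${nl}$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>${nl}${nl}"
}

format_prompt "Hello there"
```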
- Make it executable:
```shell
chmod +x chat
chmod +x genie-t2t-run
```
- Run it:
```shell
./chat "Explain quantum physics to a 5 year old"
```
| Metric | Llama 3.2 1B (4096 Context) |
|---|---|
| RAM Usage | ~80 MB (NPU Buffer) + ~2GB System RAM |
| Model Load Time | ~1.5 seconds |
| Inference Speed | ~12 - 15 tokens/sec (Real-time) |
| Device | QCS6490 (Proxy) |
| Error Message | Cause | Solution |
|---|---|---|
| `Unknown option: -m` | Using incorrect flags | Do not use `-m` or `-t`. Use only `-c config.json`. |
| `Please provide an embedding file` | Confusion with flags | Remove `-t tokenizer.json` from your command. |
| `Permission denied` (genie-t2t-run) | File execution rights | Run `chmod +x genie-t2t-run`. |
| `Permission denied` (/dev/fastrpc) | Driver access | Re-run the udev permission commands from the setup above. |
| `cannot open shared object file` | Missing library path | Ensure `export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"` is set. |
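The file-related checks from the table can be rolled into one quick diagnostic. This is a sketch under the directory layout used above; `check_setup` is our name, and the `/dev/fastrpc-cdsp` path is an assumption about the board's device naming.

```shell
# check_setup: verify the chat directory has the pieces the runtime needs
# (helper name is ours; the fastrpc device path is an assumption for this board)
check_setup() {
  local dir="$1" rc=0
  [ -x "$dir/genie-t2t-run" ] || { echo "genie-t2t-run missing or not executable"; rc=1; }
  [ -f "$dir/htp-model-config-llama32-1b-gqa.json" ] || { echo "model config JSON missing"; rc=1; }
  # Advisory only: re-run the udev steps if this warns
  [ -w /dev/fastrpc-cdsp ] || echo "note: /dev/fastrpc-cdsp not writable (or absent)"
  return "$rc"
}

check_setup ~/llama-4k && echo "setup looks good" || echo "fix the items above and retry"
```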