Radxa Dragon Q6A - NPU Quick Start Guide

Run Llama 3.2 1B (4096 Context) on the 12 TOPS Hexagon NPU.

Hardware: Radxa Dragon Q6A (QCS6490)
OS: Ubuntu 24.04 Noble (T7 image or newer)
Status: ✅ Verified working (Jan 29, 2026)


Step 1: System Prep (First Boot)

  1. Login: radxa / radxa
  2. Install NPU Drivers & Tools:
sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 python3-pip git
  3. Set Permanent NPU Permissions (fixes "Permission denied" errors after reboot):
# Create a udev rule to auto-grant permissions on boot
sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF

# Apply rules immediately
sudo udevadm control --reload-rules
sudo udevadm trigger
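
To confirm the rules took effect, check that the NPU device nodes exist and are world-readable/writable. The node names below are assumed from the udev rule above; adjust them if your kernel exposes the devices differently:

# Expect mode 0666 (crw-rw-rw-) on the FastRPC nodes and the system DMA heap
ls -l /dev/fastrpc-* /dev/dma_heap/system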

Step 2: Download Llama 3.2 Model

We use the 4096-token context version for longer, more coherent conversations.

# Install ModelScope downloader
pip3 install modelscope --break-system-packages

# Create directory and download
mkdir -p ~/llama-4k && cd ~/llama-4k
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .

# Verify the download (look for a ~1.7 GB .bin file)
ls -lh models/
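
If the models/ listing does not match what you see (repository layouts occasionally change), a quick way to locate the model binary wherever it landed is:

# Find any .bin files under the download directory and show their sizes
find ~/llama-4k -name "*.bin" -exec ls -lh {} +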

Step 3: Run the AI (The Easy Way)

Manually typing the run command is error-prone. We will create a chat shortcut script.

  1. Create the script:
cd ~/llama-4k

cat << 'EOF' > chat
#!/bin/bash
# NPU Chat Wrapper for Dragon Q6A

# 1. Enter the model directory
cd ~/llama-4k

# 2. Set library path (Crucial for NPU drivers)
export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH"

# 3. Format the prompt with Llama 3 special tokens
#    <|begin_of_text|><|start_header_id|>user...
FULL_PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n$1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# 4. Run Inference
#    Note: Do NOT add -m or -t flags. The JSON config handles file paths.
./genie-t2t-run -c htp-model-config-llama32-1b-gqa.json -p "$FULL_PROMPT"
EOF
  2. Make it executable:
chmod +x chat
chmod +x genie-t2t-run
  3. Run it:
./chat "Explain quantum physics to a 5 year old"

Performance Benchmarks

| Metric | Llama 3.2 1B (4096 Context) |
| --- | --- |
| RAM Usage | ~80 MB (NPU buffer) + ~2 GB system RAM |
| Model Load Time | ~1.5 seconds |
| Inference Speed | ~12-15 tokens/sec (real-time) |
| Device | QCS6490 (Proxy) |
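
To sanity-check these figures on your own board, a rough wall-clock measurement of a single run is enough (the prompt is just an example; read the actual token count from the tool's output):

time ./chat "Write one sentence about the ocean."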

Troubleshooting

| Error Message | Cause | Solution |
| --- | --- | --- |
| Unknown option: -m | Using incorrect flags | Do not use -m or -t; use only -c config.json. |
| Please provide an embedding file | Confusion with flags | Remove -t tokenizer.json from your command. |
| Permission denied (genie-t2t-run) | File execution rights | Run chmod +x genie-t2t-run. |
| Permission denied (/dev/fastrpc) | Driver access | Re-run the Step 1 udev commands. |
| cannot open shared object file | Missing library path | Ensure export LD_LIBRARY_PATH="$(pwd):$LD_LIBRARY_PATH" is set. |
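
For the shared-object error specifically, ldd shows which libraries the binary cannot resolve before LD_LIBRARY_PATH is set, which helps confirm the fix:

cd ~/llama-4k
ldd ./genie-t2t-run | grep "not found"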