TL;DR: Whisper-Small runs on the QCS6490 NPU at ~506ms for 30 seconds of audio — 140x faster than CPU inference.
This guide documents how to run OpenAI's Whisper speech recognition model on the Radxa Dragon Q6A's Hexagon NPU using ONNX Runtime with QNN Execution Provider.
- Board: Radxa Dragon Q6A (8GB variant, ~$140)
- SoC: Qualcomm QCS6490
- NPU: Hexagon v68
- OS: Ubuntu 24.04 Noble (Radxa T7 image)
- Date verified: February 2026
| Component | Time | Notes |
|---|---|---|
| Encoder (30s audio) | ~506ms | NPU inference |
| Decoder (per token) | ~32ms | NPU inference |
| Total (typical utterance) | ~2-3s | Depends on output length |
| CPU baseline (HF Transformers) | ~70s | For comparison |
Realtime factor: 0.05x (20x faster than realtime)
- `qai-hub-models` Python package directly on device
  - The non-quantized Whisper models (`whisper_tiny`, `whisper_small`) require float precision
  - The QCS6490 NPU requires quantized (integer) I/O
  - Export fails with: `Tensor 'input_features' has a floating-point type which is not supported`
- `whisper_small_quantized` via qai-hub-models
  - Requires AIMET-ONNX for quantization
  - AIMET-ONNX has no aarch64 Linux wheels
  - Building from source is a dependency nightmare
- AI Hub website downloads
  - QCS6490 wasn't listed as a download target for Whisper models (at time of writing)
Pre-compiled ONNX models with embedded EPContext nodes from HuggingFace!
The key insight (credit to Molly_Sophia on the Radxa forums): the pre-compiled models on HuggingFace already contain `com.microsoft.EPContext` nodes. This means:
- Models are already compiled for QNN — no additional compilation needed
- AIMET-ONNX is only required for creating quantized models, not running them
- You just need ONNX Runtime with QNN EP to run inference
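In practice, loading one of these models is just a standard ONNX Runtime session with the QNN EP selected. A minimal sketch (the `backend_path` value assumes the QAIRT SDK libraries are on `LD_LIBRARY_PATH`, as set up later in this guide):

```python
import onnxruntime as ort

# The com.microsoft.EPContext nodes are dispatched to the QNN HTP (NPU)
# backend; anything unsupported falls back to the CPU provider.
session = ort.InferenceSession(
    "model.onnx",  # pre-compiled model with embedded EPContext nodes
    providers=[
        ("QNNExecutionProvider", {"backend_path": "libQnnHtp.so"}),
        "CPUExecutionProvider",
    ],
)
print(session.get_providers())
```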
Check that fastrpc devices have proper permissions:
```bash
ls -la /dev/fastrpc-*
```

Expected output (note the `rw-rw-rw-` permissions):

```
crw-rw-rw-+ 1 root render 10, 264 Feb  2 18:17 /dev/fastrpc-adsp
crw-rw-rw-+ 1 root render 10, 263 Feb  2 18:17 /dev/fastrpc-cdsp
crw-rw-rw-+ 1 root render 10, 262 Feb  2 18:17 /dev/fastrpc-cdsp-secure
```

If permissions are wrong, create udev rules (see 02_setup_permissions.sh).
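If you prefer to write the rule by hand, a minimal sketch of what the script sets up (the file name is hypothetical; 02_setup_permissions.sh is the authoritative version):

```
# /etc/udev/rules.d/99-fastrpc.rules (hypothetical file name)
# Make the FastRPC DSP device nodes world read/writable, matching the output above
KERNEL=="fastrpc-*", MODE="0666"
```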
```bash
python3 -m venv ~/whisper-npu-venv
source ~/whisper-npu-venv/bin/activate
pip install --upgrade pip
```

Radxa provides a prebuilt wheel:

```bash
pip install https://github.com/ZIFENG278/onnxruntime/releases/download/v1.23.2/onnxruntime_qnn-1.23.2-cp312-cp312-linux_aarch64.whl
```

Verify QNN EP is available:

```bash
python3 -c "import onnxruntime; print(onnxruntime.get_available_providers())"
# Should show: ['QNNExecutionProvider', 'CPUExecutionProvider']
```

Install the remaining Python dependencies:

```bash
pip install numpy librosa soundfile huggingface_hub
```

The QNN libraries are needed for NPU inference:
```bash
cd ~
wget -O v2.37.1.250807.zip "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.37.1.250807/v2.37.1.250807.zip"
unzip v2.37.1.250807.zip
```

This creates `~/qairt/2.37.1.250807/` with all the QNN libraries.
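A quick sanity check that the libraries the environment variables below point at actually exist:

```bash
# Both paths are referenced by the environment setup later in this guide
ls ~/qairt/2.37.1.250807/lib/aarch64-ubuntu-gcc9.4/libQnnHtp.so
ls ~/qairt/2.37.1.250807/lib/hexagon-v68/unsigned/ | head
```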
Use the Python script 03_download_models.py to download from HuggingFace:
```bash
cd ~/whisper-npu-models
python3 03_download_models.py
```

Then extract the ONNX files:

```bash
cd ~/whisper-npu-models/whisper-small-quantized/precompiled/qualcomm-qcs6490/
unzip Whisper-Small-Quantized_WhisperSmallEncoderQuantizable_w8a16.onnx.zip
unzip Whisper-Small-Quantized_WhisperSmallDecoderQuantizable_w8a16.onnx.zip
```

Before running inference, set these environment variables:
```bash
export QNN_SDK_ROOT=~/qairt/2.37.1.250807
export LD_LIBRARY_PATH="$QNN_SDK_ROOT/lib/aarch64-ubuntu-gcc9.4:$LD_LIBRARY_PATH"
export ADSP_LIBRARY_PATH="$QNN_SDK_ROOT/lib/hexagon-v68/unsigned"
```

You can add these to `~/.bashrc` or use the 04_whisper_npu_env.sh script.
Run the benchmark script 05_benchmark_encoder.py:
```bash
source ~/whisper-npu-venv/bin/activate
source ~/04_whisper_npu_env.sh
cd ~/whisper-npu-models
python3 05_benchmark_encoder.py
```

Expected output:
```
=== Whisper Encoder NPU Benchmark ===
Loading encoder on NPU...
✓ Model loaded
Audio duration: 10.44 seconds
Input shape: (1, 80, 3000), dtype: uint16
Running benchmark (5 iterations)...
Run 1: 504.5 ms
Run 2: 506.3 ms
...
=== Results ===
Average encoder time: 505.7 ms
Realtime factor: 0.05x
```
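For reference, the core of the benchmark looks roughly like this. A minimal sketch, not the full 05_benchmark_encoder.py: the uint16 mapping below is a placeholder (the real scale/zero-point must match the model's quantization parameters), and tensor names should be read from the session rather than hard-coded.

```python
import time
import numpy as np
import librosa
import onnxruntime as ort

# Load the pre-compiled encoder on the NPU via the QNN HTP backend
sess = ort.InferenceSession(
    "whisper-small-quantized/precompiled/qualcomm-qcs6490/"
    "job_jpek34y0p_optimized_onnx/model.onnx",
    providers=[("QNNExecutionProvider", {"backend_path": "libQnnHtp.so"})],
)

# Whisper takes an 80-bin log-mel spectrogram of a 30s, 16kHz window
# (3000 frames at a hop of 160 samples)
audio, _ = librosa.load("test_audio.flac", sr=16000)
audio = np.pad(audio, (0, max(0, 16000 * 30 - len(audio))))[: 16000 * 30]
mel = librosa.feature.melspectrogram(
    y=audio, sr=16000, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log10(np.maximum(mel, 1e-10))

# Placeholder uint16 quantization -- replace with the model's actual
# scale/zero-point
scaled = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-9)
x = (scaled * 65535).astype(np.uint16)[None, :, :3000]

input_name = sess.get_inputs()[0].name  # e.g. 'input_features'
for i in range(5):
    t0 = time.perf_counter()
    sess.run(None, {input_name: x})
    print(f"Run {i + 1}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```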
```
~/
├── qairt/
│   └── 2.37.1.250807/
│       ├── lib/
│       │   ├── aarch64-ubuntu-gcc9.4/
│       │   │   ├── libQnnHtp.so
│       │   │   └── ...
│       │   └── hexagon-v68/
│       │       └── unsigned/
│       │           └── ...
│       └── ...
├── whisper-npu-venv/
│   └── ... (Python virtual environment)
└── whisper-npu-models/
    ├── whisper-small-quantized/
    │   └── precompiled/
    │       └── qualcomm-qcs6490/
    │           ├── job_jpek34y0p_optimized_onnx/ (Encoder)
    │           │   ├── model.onnx
    │           │   └── model.bin
    │           └── job_jgzrkvn65_optimized_onnx/ (Decoder)
    │               ├── model.onnx
    │               └── model.bin
    └── test_audio.flac
```
The Whisper model is split into two parts:
Encoder:
- Input: Mel spectrogram `[1, 80, 3000]` as uint16
- Output: Cross-attention K/V caches for all 12 layers
- Time: ~506ms on NPU

Decoder:
- Input: Token IDs + encoder outputs (K/V caches)
- Output: Next token logits
- Time: ~32ms per token on NPU
For full transcription, you run the encoder once, then run the decoder in a loop until the end-of-sequence token is generated.
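Sketched in Python, heavily hedged: the tensor names here (`input_ids`, and the reuse of encoder output names as decoder inputs) are assumptions to be checked against the actual ONNX graphs, and a real implementation must also manage the decoder's own self-attention caches between steps. The special-token IDs are the standard multilingual Whisper values.

```python
import numpy as np

SOT, EOS = 50258, 50257  # <|startoftranscript|>, <|endoftext|>

def transcribe(encoder_sess, decoder_sess, mel_uint16, max_tokens=224):
    # Encoder runs once per 30s window; its outputs are the cross-attention
    # K/V caches the decoder consumes on every step
    enc_names = [o.name for o in encoder_sess.get_outputs()]
    enc_vals = encoder_sess.run(
        None, {encoder_sess.get_inputs()[0].name: mel_uint16}
    )
    enc_feeds = dict(zip(enc_names, enc_vals))  # assumes names match decoder inputs

    tokens = [SOT]
    for _ in range(max_tokens):
        feeds = {"input_ids": np.array([tokens], dtype=np.int64), **enc_feeds}
        logits = decoder_sess.run(None, feeds)[0]
        next_token = int(np.argmax(logits[0, -1]))  # greedy decoding
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens
```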
Run the udev setup script and reboot:
```bash
sudo ./02_setup_permissions.sh
sudo reboot
```

Make sure environment variables are set:

```bash
source ~/04_whisper_npu_env.sh
echo $LD_LIBRARY_PATH  # Should include the qairt path
```

This warning is harmless and can be ignored:
```
GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"
```
Ensure the .bin file is in the same directory as the .onnx file. The ONNX file references the bin file for the actual weights.
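A quick check from the qualcomm-qcs6490 directory:

```bash
# Each job directory should contain the graph plus its external weights
ls job_*/model.onnx job_*/model.bin
```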
To build a complete speech-to-text pipeline:
- Implement the decoder loop — Run decoder autoregressively to generate tokens
- Add tokenizer — Use Whisper's tokenizer to convert token IDs to text (see the sketch after this list)
- Integrate with Llama — Combine with your Llama NPU setup for a full voice assistant
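For step 2, the Hugging Face tokenizer is one option; a sketch assuming `pip install transformers` (the checkpoint name matches the Whisper-Small base model):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

def ids_to_text(tokens: list[int]) -> str:
    # `tokens` is the ID list produced by the decoder loop above
    return tokenizer.decode(tokens, skip_special_tokens=True)
```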
- Radxa Dragon Q6A Wiki
- Radxa QNN Execution Provider Docs
- Qualcomm AI Hub - Whisper-Small-Quantized
- HuggingFace Model Repository
- My GitHub Issue on quic/ai-hub-models
- Molly_Sophia on the Radxa forums for pointing out that EPContext models don't need AIMET at runtime
- Radxa for providing the prebuilt onnxruntime-qnn wheel
- Qualcomm for making the pre-compiled models available on HuggingFace