@Foadsf
Created February 2, 2026 20:50
Running Whisper Speech Recognition on Radxa Dragon Q6A NPU

TL;DR: The Whisper-Small encoder runs on the QCS6490 NPU in ~506ms for 30 seconds of audio — roughly 140x faster than CPU inference.

This guide documents how to run OpenAI's Whisper speech recognition model on the Radxa Dragon Q6A's Hexagon NPU using ONNX Runtime with QNN Execution Provider.

Hardware & Software

  • Board: Radxa Dragon Q6A (8GB variant, ~$140)
  • SoC: Qualcomm QCS6490
  • NPU: Hexagon v68
  • OS: Ubuntu 24.04 Noble (Radxa T7 image)
  • Date verified: February 2026

Performance Results

Component                        Time      Notes
Encoder (30s audio)              ~506ms    NPU inference
Decoder (per token)              ~32ms     NPU inference
Total (typical utterance)        ~2-3s     Depends on output length
CPU baseline (HF Transformers)   ~70s      For comparison

Realtime factor: 0.05x (20x faster than realtime)
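
As a rough sanity check on the totals, assuming a typical utterance produces about 60 output tokens (an assumed figure): 0.506 s (encoder) + 60 × 0.032 s (decoder) ≈ 2.4 s, which lines up with the ~2-3s total above.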


The Journey: What I Tried and What Actually Works

What Doesn't Work (Save Yourself the Time)

  1. qai-hub-models Python package directly on device

    • The non-quantized Whisper models (whisper_tiny, whisper_small) require float precision
    • QCS6490 NPU requires quantized (INT8) I/O
    • Export fails with: Tensor 'input_features' has a floating-point type which is not supported
  2. whisper_small_quantized via qai-hub-models

    • Requires AIMET-ONNX for quantization
    • AIMET-ONNX has no aarch64 Linux wheels
    • Building from source is a nightmare of dependencies
  3. AI Hub website downloads

    • QCS6490 wasn't listed as a download target for Whisper models (at time of writing)

What Actually Works

Pre-compiled ONNX models with embedded EPContext nodes from HuggingFace!

The key insight (credit to Molly_Sophia on the Radxa forums): the pre-compiled models on HuggingFace already contain com.microsoft.EPContext nodes (see the quick check after this list). This means:

  • Models are already compiled for QNN — no additional compilation needed
  • AIMET-ONNX is only required for creating quantized models, not running them
  • You just need ONNX Runtime with QNN EP to run inference
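
To confirm this for yourself, you can inspect a downloaded model for EPContext nodes. This is a minimal check, assuming the plain onnx package is installed in your venv (the steps below don't require it):

import onnx

# Load only the graph structure; the weights live in the external model.bin file
model = onnx.load("model.onnx", load_external_data=False)
ep_nodes = [n for n in model.graph.node if n.op_type == "EPContext"]
print(f"EPContext nodes: {len(ep_nodes)}")  # > 0 means the model is pre-compiled for QNN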

Step-by-Step Setup

Step 1: Verify NPU Access

Check that fastrpc devices have proper permissions:

ls -la /dev/fastrpc-*

Expected output (note the rw-rw-rw- permissions):

crw-rw-rw-+ 1 root render 10, 264 Feb  2 18:17 /dev/fastrpc-adsp
crw-rw-rw-+ 1 root render 10, 263 Feb  2 18:17 /dev/fastrpc-cdsp
crw-rw-rw-+ 1 root render 10, 262 Feb  2 18:17 /dev/fastrpc-cdsp-secure

If permissions are wrong, create udev rules (see 02_setup_permissions.sh).

Step 2: Create Python Virtual Environment

python3 -m venv ~/whisper-npu-venv
source ~/whisper-npu-venv/bin/activate
pip install --upgrade pip

Step 3: Install ONNX Runtime with QNN Support

Radxa provides a prebuilt wheel:

pip install https://github.com/ZIFENG278/onnxruntime/releases/download/v1.23.2/onnxruntime_qnn-1.23.2-cp312-cp312-linux_aarch64.whl

Verify QNN EP is available:

python3 -c "import onnxruntime; print(onnxruntime.get_available_providers())"
# Should show: ['QNNExecutionProvider', 'CPUExecutionProvider']

Step 4: Install Additional Dependencies

pip install numpy librosa soundfile huggingface_hub

Step 5: Download QAIRT SDK

The QNN libraries are needed for NPU inference:

cd ~
wget -O v2.37.1.250807.zip "https://softwarecenter.qualcomm.com/api/download/software/sdks/Qualcomm_AI_Runtime_Community/All/2.37.1.250807/v2.37.1.250807.zip"
unzip v2.37.1.250807.zip

This creates ~/qairt/2.37.1.250807/ with all the QNN libraries.

Step 6: Download Pre-compiled Whisper Models

Use the Python script 03_download_models.py to download from HuggingFace:

cd ~/whisper-npu-models
python3 03_download_models.py

Then extract the ONNX files:

cd ~/whisper-npu-models/whisper-small-quantized/precompiled/qualcomm-qcs6490/
unzip Whisper-Small-Quantized_WhisperSmallEncoderQuantizable_w8a16.onnx.zip
unzip Whisper-Small-Quantized_WhisperSmallDecoderQuantizable_w8a16.onnx.zip

Step 7: Set Environment Variables

Before running inference, set these environment variables:

export QNN_SDK_ROOT=~/qairt/2.37.1.250807
export LD_LIBRARY_PATH="$QNN_SDK_ROOT/lib/aarch64-ubuntu-gcc9.4:$LD_LIBRARY_PATH"
export ADSP_LIBRARY_PATH="$QNN_SDK_ROOT/lib/hexagon-v68/unsigned"

You can add these to ~/.bashrc or use the 04_whisper_npu_env.sh script.

Step 8: Test NPU Inference

Run the benchmark script 05_benchmark_encoder.py:

source ~/whisper-npu-venv/bin/activate
source ~/04_whisper_npu_env.sh
cd ~/whisper-npu-models
python3 05_benchmark_encoder.py

Expected output:

=== Whisper Encoder NPU Benchmark ===

Loading encoder on NPU...
✓ Model loaded

Audio duration: 10.44 seconds
Input shape: (1, 80, 3000), dtype: uint16

Running benchmark (5 iterations)...
  Run 1: 504.5 ms
  Run 2: 506.3 ms
  ...

=== Results ===
Average encoder time: 505.7 ms
Realtime factor: 0.05x

File Structure After Setup

~/
├── qairt/
│   └── 2.37.1.250807/
│       ├── lib/
│       │   ├── aarch64-ubuntu-gcc9.4/
│       │   │   ├── libQnnHtp.so
│       │   │   └── ...
│       │   └── hexagon-v68/
│       │       └── unsigned/
│       │           └── ...
│       └── ...
├── whisper-npu-venv/
│   └── ... (Python virtual environment)
└── whisper-npu-models/
    ├── whisper-small-quantized/
    │   └── precompiled/
    │       └── qualcomm-qcs6490/
    │           ├── job_jpek34y0p_optimized_onnx/  (Encoder)
    │           │   ├── model.onnx
    │           │   └── model.bin
    │           └── job_jgzrkvn65_optimized_onnx/  (Decoder)
    │               ├── model.onnx
    │               └── model.bin
    └── test_audio.flac

Understanding the Model Architecture

The Whisper model is split into two parts:

Encoder (job_jpek34y0p_optimized_onnx/)

  • Input: Mel spectrogram [1, 80, 3000] as uint16
  • Output: Cross-attention K/V caches for all 12 layers
  • Time: ~506ms on NPU

Decoder (job_jgzrkvn65_optimized_onnx/)

  • Input: Token IDs + encoder outputs (K/V caches)
  • Output: Next token logits
  • Time: ~32ms per token on NPU

For full transcription, you run the encoder once, then run the decoder in a loop until the end-of-sequence token is generated.
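
As a rough sketch of that loop (conceptual only: the feed names, token ids, and cache handling below are assumptions; inspect session.get_inputs() on the pre-compiled decoder for its real interface, which uses quantized, fixed-shape inputs and self-attention caches):

import numpy as np

# decoder: an onnxruntime.InferenceSession loaded with the QNN EP, as in Step 8.
# "tokens" and the cross-attention cache feeds are placeholder names.
EOT = 50257  # Whisper <|endoftext|> id (assumed multilingual vocabulary)

def greedy_decode(decoder, cross_kv_feeds, start_tokens, max_tokens=128):
    tokens = list(start_tokens)
    for _ in range(max_tokens):
        feeds = {"tokens": np.array([tokens], dtype=np.int32)}
        feeds.update(cross_kv_feeds)           # K/V caches produced by the encoder
        logits = decoder.run(None, feeds)[0]   # roughly [1, seq_len, vocab]
        next_token = int(np.argmax(logits[0, -1]))
        if next_token == EOT:
            break
        tokens.append(next_token)
    return tokens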


Troubleshooting

"Permission denied" on /dev/fastrpc-*

Run the udev setup script and reboot:

sudo ./02_setup_permissions.sh
sudo reboot

"libQnnHtp.so not found"

Make sure environment variables are set:

source ~/04_whisper_npu_env.sh
echo $LD_LIBRARY_PATH  # Should include qairt path

GPU device discovery warning

This warning is harmless and can be ignored:

GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card1/device/vendor"

Model loading fails

Ensure the .bin file is in the same directory as the .onnx file. The ONNX file references the bin file for the actual weights.
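
A minimal way to check this from Python (assuming the encoder path from the file layout above):

from pathlib import Path

# model.onnx stores only the graph; the weights are external data in model.bin,
# so both files must live side by side.
model = Path.home() / "whisper-npu-models/whisper-small-quantized/precompiled/qualcomm-qcs6490/job_jpek34y0p_optimized_onnx/model.onnx"
print("onnx:", model.exists(), "| bin:", model.with_suffix(".bin").exists())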


Next Steps

To build a complete speech-to-text pipeline:

  1. Implement the decoder loop — Run decoder autoregressively to generate tokens
  2. Add a tokenizer — Use Whisper's tokenizer to convert token IDs to text (see the sketch after this list)
  3. Integrate with Llama — Combine with your Llama NPU setup for a full voice assistant
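
For step 2, a minimal sketch using the Hugging Face tokenizer (an assumption: transformers is not part of the setup above, so it needs a pip install transformers first):

from transformers import WhisperTokenizer

def tokens_to_text(token_ids):
    # token_ids: the ids produced by the decoder loop
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
    # skip_special_tokens drops markers such as <|startoftranscript|>
    return tokenizer.decode(token_ids, skip_special_tokens=True)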



Acknowledgments

  • Molly_Sophia on Radxa forums for pointing out that EPContext models don't need AIMET at runtime
  • Radxa for providing the prebuilt onnxruntime-qnn wheel
  • Qualcomm for making the pre-compiled models available on HuggingFace
#!/bin/bash
# 02_setup_permissions.sh
#
# Set up udev rules for NPU access on Radxa Dragon Q6A.
# Run with sudo, then reboot.
#
# Usage: sudo bash 02_setup_permissions.sh
set -e
RULES_FILE="/etc/udev/rules.d/99-fastrpc.rules"
if [ "$EUID" -ne 0 ]; then
echo "Error: This script must be run with sudo"
echo "Usage: sudo bash $0"
exit 1
fi
echo "Creating udev rules for NPU access..."
cat > "$RULES_FILE" << 'RULES'
# Qualcomm FastRPC devices for NPU access
KERNEL=="fastrpc-*", MODE="0666"
# DMA heap for memory allocation
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
RULES
echo "✓ Created $RULES_FILE"
# Reload rules
udevadm control --reload-rules
udevadm trigger
echo "✓ Reloaded udev rules"
echo ""
echo "Current permissions:"
ls -la /dev/fastrpc-* 2>/dev/null || echo " (fastrpc devices not found - may need reboot)"
echo ""
echo "Reboot recommended to ensure all permissions take effect."
#!/usr/bin/env python3
"""
03_download_models.py
Download pre-compiled Whisper models for QCS6490 from HuggingFace.
These models contain EPContext nodes and run directly on the NPU
without needing AIMET-ONNX for quantization.
Usage:
python3 03_download_models.py
Requirements:
pip install huggingface_hub
"""
import os
from pathlib import Path
try:
    from huggingface_hub import hf_hub_download, list_repo_files
except ImportError:
    print("Error: huggingface_hub not installed")
    print("Run: pip install huggingface_hub")
    exit(1)

REPO_ID = "qualcomm/Whisper-Small-Quantized"
TARGET_DEVICE = "qualcomm-qcs6490"
OUTPUT_DIR = Path.home() / "whisper-npu-models" / "whisper-small-quantized"


def main():
    print("=" * 60)
    print(" Whisper Model Downloader for QCS6490")
    print("=" * 60)
    print()
    print(f"Repository: {REPO_ID}")
    print(f"Target: {TARGET_DEVICE}")
    print(f"Output: {OUTPUT_DIR}")
    print()

    # Create output directory
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    # List all files in the repo
    print("Fetching file list from HuggingFace...")
    all_files = list_repo_files(REPO_ID)

    # Filter for our target device
    target_files = [f for f in all_files if TARGET_DEVICE in f]

    if not target_files:
        print(f"\nError: No files found for {TARGET_DEVICE}")
        print("\nAvailable targets:")
        targets = set()
        for f in all_files:
            if "precompiled/" in f:
                parts = f.split("/")
                if len(parts) >= 2:
                    targets.add(parts[1])
        for t in sorted(targets):
            print(f"  - {t}")
        return

    print(f"\nFound {len(target_files)} files:")
    for f in target_files:
        print(f"  • {f.split('/')[-1]}")

    # Download each file
    print("\nDownloading...")
    for file_path in target_files:
        filename = file_path.split("/")[-1]
        print(f"  ↓ {filename}")
        hf_hub_download(
            repo_id=REPO_ID,
            filename=file_path,
            local_dir=str(OUTPUT_DIR),
        )

    print("\n✓ Download complete!")
    print()
    print("Next steps:")
    print(f"  cd {OUTPUT_DIR}/precompiled/{TARGET_DEVICE}")
    print("  unzip *.zip")


if __name__ == "__main__":
    main()
#!/bin/bash
# 04_whisper_npu_env.sh
#
# Environment setup for Whisper NPU inference on Radxa Dragon Q6A.
# Source this script before running inference.
#
# Usage: source 04_whisper_npu_env.sh
# Activate Python venv
VENV_PATH="$HOME/whisper-npu-venv"
if [ -d "$VENV_PATH" ]; then
source "$VENV_PATH/bin/activate"
echo "✓ Activated: whisper-npu-venv"
else
echo "⚠ Virtual environment not found: $VENV_PATH"
echo " Create with: python3 -m venv $VENV_PATH"
fi
# QAIRT SDK version
QAIRT_VERSION="2.37.1.250807"
export QNN_SDK_ROOT="$HOME/qairt/$QAIRT_VERSION"
# QCS6490 = Hexagon v68
export DSP_ARCH=68
# Library paths for QNN
export LD_LIBRARY_PATH="$QNN_SDK_ROOT/lib/aarch64-ubuntu-gcc9.4:$LD_LIBRARY_PATH"
export ADSP_LIBRARY_PATH="$QNN_SDK_ROOT/lib/hexagon-v${DSP_ARCH}/unsigned"
# Model paths
MODEL_BASE="$HOME/whisper-npu-models/whisper-small-quantized/precompiled/qualcomm-qcs6490"
export WHISPER_ENCODER="$MODEL_BASE/job_jpek34y0p_optimized_onnx/model.onnx"
export WHISPER_DECODER="$MODEL_BASE/job_jgzrkvn65_optimized_onnx/model.onnx"
# Verify setup
echo ""
echo "Environment configured:"
echo " QNN_SDK_ROOT: $QNN_SDK_ROOT"
echo " ADSP_LIBRARY_PATH: $ADSP_LIBRARY_PATH"
echo " WHISPER_ENCODER: $WHISPER_ENCODER"
echo " WHISPER_DECODER: $WHISPER_DECODER"
echo ""
# Quick checks
if [ -f "$QNN_SDK_ROOT/lib/aarch64-ubuntu-gcc9.4/libQnnHtp.so" ]; then
echo "✓ libQnnHtp.so found"
else
echo "⚠ libQnnHtp.so not found - check QAIRT SDK installation"
fi
if [ -f "$WHISPER_ENCODER" ]; then
echo "✓ Encoder model found"
else
echo "⚠ Encoder model not found - run download script first"
fi
#!/usr/bin/env python3
"""
05_benchmark_encoder.py
Benchmark Whisper encoder inference on Radxa Dragon Q6A NPU.
Usage:
python3 05_benchmark_encoder.py [audio_file]
If no audio file specified, downloads a test file from HuggingFace.
Requirements:
- QAIRT SDK installed
- Environment variables set (source 04_whisper_npu_env.sh)
- pip install numpy librosa soundfile
- onnxruntime-qnn wheel installed
"""
import os
import sys
import time
import urllib.request
from pathlib import Path
import numpy as np
try:
    import onnxruntime as ort
except ImportError:
    print("Error: onnxruntime not installed")
    print("Install: pip install <onnxruntime-qnn wheel URL>")
    sys.exit(1)

try:
    import librosa
except ImportError:
    print("Error: librosa not installed")
    print("Install: pip install librosa soundfile")
    sys.exit(1)

# Whisper preprocessing constants
SAMPLE_RATE = 16000
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 80
N_FRAMES = 3000  # 30 seconds @ 100 frames per second


def find_encoder():
    """Locate the encoder model in common paths."""
    candidates = [
        os.environ.get("WHISPER_ENCODER", ""),
        str(Path.home() / "whisper-npu-models/whisper-small-quantized/precompiled/qualcomm-qcs6490/job_jpek34y0p_optimized_onnx/model.onnx"),
        "model.onnx",
    ]
    for path in candidates:
        if path and os.path.exists(path):
            return path
    return None


def download_test_audio(output_path: str):
    """Download test audio from HuggingFace."""
    url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
    print("Downloading test audio from HuggingFace...")
    urllib.request.urlretrieve(url, output_path)
    print(f"✓ Saved to {output_path}")


def preprocess_audio(audio_path: str):
    """
    Convert audio to Whisper input format.

    Returns:
        mel_input: np.ndarray of shape (1, 80, 3000), dtype uint16
        duration: float, audio duration in seconds
    """
    # Load at 16 kHz
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE)
    duration = len(audio) / SAMPLE_RATE

    # Mel spectrogram
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=SAMPLE_RATE,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
        fmin=0,
        fmax=8000,
    )

    # Log scale with Whisper normalization
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
    log_mel = (log_mel + 4.0) / 4.0

    # Pad/trim to 30 seconds
    if log_mel.shape[1] < N_FRAMES:
        log_mel = np.pad(log_mel, ((0, 0), (0, N_FRAMES - log_mel.shape[1])))
    else:
        log_mel = log_mel[:, :N_FRAMES]

    # Quantize to uint16 (model input format)
    mel_uint16 = (log_mel * 65535).astype(np.uint16)
    mel_input = mel_uint16[np.newaxis, :, :]
    return mel_input, duration


def load_encoder_npu(model_path: str):
    """Load the encoder with the QNN Execution Provider."""
    providers = ort.get_available_providers()
    print(f"Available providers: {providers}")
    if "QNNExecutionProvider" not in providers:
        print("\n⚠ QNNExecutionProvider not available, falling back to CPU")
        return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return ort.InferenceSession(
        model_path,
        providers=["QNNExecutionProvider"],
        provider_options=[{"backend_path": "libQnnHtp.so"}],
    )


def benchmark(session, mel_input, num_runs=5):
    """Run the encoder and measure timing."""
    input_name = session.get_inputs()[0].name

    # Warmup
    print("Warm-up run...")
    _ = session.run(None, {input_name: mel_input})

    # Timed runs
    print(f"Running benchmark ({num_runs} iterations)...")
    times = []
    for i in range(num_runs):
        start = time.perf_counter()
        outputs = session.run(None, {input_name: mel_input})
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"  Run {i+1}: {elapsed*1000:.1f} ms")
    return outputs, np.mean(times)


def main():
    print("=" * 50)
    print(" Whisper Encoder NPU Benchmark")
    print(" Radxa Dragon Q6A (QCS6490)")
    print("=" * 50)
    print()

    # Find the encoder model
    encoder_path = find_encoder()
    if not encoder_path:
        print("Error: Encoder model not found")
        print("Set WHISPER_ENCODER env var or run from model directory")
        sys.exit(1)
    print(f"Encoder: {encoder_path}")

    # Get the audio file
    if len(sys.argv) > 1:
        audio_path = sys.argv[1]
    else:
        audio_path = "test_audio.flac"
        if not os.path.exists(audio_path):
            download_test_audio(audio_path)
    print(f"Audio: {audio_path}")
    print()

    # Load the model
    print("Loading encoder on NPU...")
    session = load_encoder_npu(encoder_path)
    print("✓ Model loaded")
    print()

    # Preprocess
    print("Preprocessing audio...")
    mel_input, duration = preprocess_audio(audio_path)
    print(f"Audio duration: {duration:.2f} seconds")
    print(f"Input shape: {mel_input.shape}, dtype: {mel_input.dtype}")
    print()

    # Benchmark
    outputs, avg_time = benchmark(session, mel_input)

    # Results
    print()
    print("=" * 50)
    print(" Results")
    print("=" * 50)
    print(f"Average encoder time: {avg_time*1000:.1f} ms")
    print(f"Audio duration: {duration:.2f} s")
    print(f"Realtime factor: {avg_time/duration:.2f}x")
    print(f"Speedup vs realtime: {duration/avg_time:.0f}x faster")
    print()

    # Output info
    print("Encoder outputs (first 4):")
    for i, out in enumerate(session.get_outputs()[:4]):
        print(f"  {out.name}: {outputs[i].shape}")
    remaining = len(session.get_outputs()) - 4
    if remaining > 0:
        print(f"  ... and {remaining} more")


if __name__ == "__main__":
    main()