
🤖 Radxa Dragon Q6A: Offline "Jarvis" Assistant

Verified: February 2026
Hardware: Radxa Dragon Q6A (Qualcomm QCS6490)
OS: Ubuntu 24.04 (Noble)
Status: ✅ Production Ready

The "Hybrid" Architecture

Building a truly responsive voice assistant on embedded hardware requires balancing workloads. We use the NPU (Hexagon DSP) for the heavy lifting (the LLM) and the CPU (Kryo) for real-time sensory tasks; one full turn through the stack is sketched after the breakdown below.

Component Breakdown

  1. The Brain (NPU): Llama 3.2 1B running on the Hexagon DSP via Qualcomm's genie runtime.
    • Why NPU? It delivers ~15-20 tokens/sec at extremely low power, leaving the CPU free.
  2. The Ears (CPU): OpenAI Whisper (Tiny) running on CPU.
    • Why CPU? While NPU Whisper is possible, the CPU version handles the autoregressive decoder loop (turning speech into text) more robustly and is simpler to integrate. The 'Tiny' model is fast enough (~200ms latency).
  3. The Voice (CPU): Piper TTS (Neural Text-to-Speech).
    • Why Piper? It's highly optimized for ARM64 and generates human-like speech offline.
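
Putting the three together, one conversational turn looks like this. A minimal sketch that reuses the LlamaNPU class and the transformers ASR pipeline from the full listings further down; Piper playback is omitted here (see JarvisVoice below):

# One assistant turn: CPU ears -> NPU brain (TTS omitted; see JarvisVoice below).
import sounddevice as sd
import soundfile as sf
from transformers import pipeline
from llama_engine import LlamaNPU  # full listing below

ears = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device="cpu")
brain = LlamaNPU("~/llama-npu")

audio = sd.rec(5 * 16000, samplerate=16000, channels=1)  # record 5 seconds
sd.wait()
sf.write("turn.wav", audio, 16000)

text = ears("turn.wav", generate_kwargs={"language": "en"})["text"]  # speech -> text on CPU
print("USER:", text)
print("JARVIS:", brain.generate(text))  # text -> text on the Hexagon NPU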

🛠️ Prerequisites & Setup

1. System Dependencies

Install audio and NPU libraries.

sudo apt update
sudo apt install -y fastrpc fastrpc-dev libcdsprpc1 radxa-firmware-qcs6490 \
    python3-pip python3-venv libportaudio2 ffmpeg git alsa-utils
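
If you want to confirm the NPU userspace pieces actually landed, a quick check (package names taken from the apt line above):

# Confirm the fastrpc userspace packages installed (names from the apt line above).
import subprocess

for pkg in ["fastrpc", "libcdsprpc1", "radxa-firmware-qcs6490"]:
    out = subprocess.run(["dpkg", "-s", pkg], capture_output=True)
    print(pkg, "installed" if out.returncode == 0 else "MISSING")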

2. NPU Permissions

Allow users to access the Hexagon DSP without sudo.

sudo tee /etc/udev/rules.d/99-fastrpc.rules << 'EOF'
KERNEL=="fastrpc-*", MODE="0666"
SUBSYSTEM=="dma_heap", KERNEL=="system", MODE="0666"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
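
To confirm the rules took effect, a quick Python check (this assumes the DSP nodes appear as /dev/fastrpc-*; adjust the glob if your kernel names them differently):

# Check that the Hexagon DSP nodes are world-accessible (assumes /dev/fastrpc-* naming).
import glob
import os
import stat

for node in glob.glob("/dev/fastrpc-*"):
    mode = stat.S_IMODE(os.stat(node).st_mode)
    print(node, oct(mode), "OK" if mode == 0o666 else "re-check the udev rule")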

3. Python Environment

Create a clean environment for the AI stack.

python3 -m venv ~/jarvis-venv
source ~/jarvis-venv/bin/activate

# Install Core & UI Libraries
pip install --upgrade pip
pip install rich sounddevice soundfile numpy modelscope

# Install Transformers (Pinned for stability)
pip install "transformers==4.48.1" torch

🚀 Installation

Step 1: Download the NPU Brain

We use the modelscope CLI (installed above) to fetch the pre-compiled Llama model.

mkdir -p ~/llama-npu && cd ~/llama-npu
modelscope download --model radxa/Llama3.2-1B-4096-qairt-v68 --local_dir .
chmod +x genie-t2t-run
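
Before wiring up the full assistant, you can smoke-test the NPU runtime directly. A sketch mirroring what llama_engine.py (below) does, using the config filename shipped with this model:

# One-shot NPU smoke test; mirrors llama_engine.py below.
import os
import subprocess

model_dir = os.path.expanduser("~/llama-npu")
result = subprocess.run(
    [os.path.join(model_dir, "genie-t2t-run"),
     "-c", os.path.join(model_dir, "htp-model-config-llama32-1b-gqa.json"),
     "-p", "Hello"],
    cwd=model_dir,  # genie expects tokenizer.json in the current directory
    capture_output=True, text=True,
)
print(result.stdout)  # the answer appears between [BEGIN]: and [END]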

Step 2: Download the Voice

We fetch Piper and a high-quality voice model.

mkdir -p ~/piper_tts && cd ~/piper_tts
# Download Piper binary
wget -O piper.tar.gz https://github.com/rhasspy/piper/releases/download/2023.11.14-2/piper_linux_aarch64.tar.gz
tar -xvf piper.tar.gz

# Download Voice (Ryan - Medium Quality)
wget -O en_US-ryan-medium.onnx https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/medium/en_US-ryan-medium.onnx
wget -O en_US-ryan-medium.onnx.json https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/ryan/medium/en_US-ryan-medium.onnx.json
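
A quick end-to-end voice test, mirroring the JarvisVoice class below (aplay comes from the alsa-utils package installed earlier):

# Speak a test sentence through Piper, then play it with aplay.
import os
import subprocess

piper_dir = os.path.expanduser("~/piper_tts")
subprocess.run(
    [os.path.join(piper_dir, "piper/piper"),
     "--model", os.path.join(piper_dir, "en_US-ryan-medium.onnx"),
     "--output_file", "test_speech.wav"],
    input=b"Systems online.", check=True,
)
subprocess.run(["aplay", "-q", "test_speech.wav"], check=True)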

Step 3: Run Jarvis

Place llama_engine.py and jarvis_ui.py (full listings below) in your home directory, and run Jarvis from that directory, since jarvis_ui.py imports llama_engine from the current working directory.

source ~/jarvis-venv/bin/activate
python3 ~/jarvis_ui.py

🧠 Lessons Learned (The "Gotchas")

  1. The "Current Working Directory" Bug: The Qualcomm genie binary looks for tokenizer.json in the folder where you run the command, not where the binary sits. We fixed this in llama_engine.py by forcing cwd=self.model_dir in the subprocess call.
  2. Audio Gibberish: Piping raw audio from Piper to aplay caused static noise due to header mismatches. The fix was to generate a temporary WAV file (temp_speech.wav) which ensures perfect playback.
  3. Log Pollution: The NPU binary prints verbose debug info (Using libGenie.so...). We strip it with a [BEGIN]/[END] extractor plus a line-by-line log filter, so the user only sees the AI's answer (see the toy demo after this list).
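
For illustration, here is the extraction step applied to a made-up raw dump (the log lines are fabricated examples of the noise, not genie's exact output):

# Toy demo of the output cleaner used in llama_engine.py (fake log lines).
import re

raw = (
    "Using libGenie.so\n"
    "[INFO] Allocated buffers via rpcmem\n"
    "[BEGIN]: The capital of France is Paris. [END]\n"
)
match = re.search(r'\[BEGIN\]:\s*(.*?)(?:\[END\]|$)', raw, re.DOTALL)
print(match.group(1).strip())  # -> The capital of France is Paris.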
jarvis_ui.py

#!/usr/bin/env python3
import os
import sys
import subprocess
import warnings
import logging

import sounddevice as sd
import soundfile as sf
from rich.console import Console
from rich.panel import Panel
from rich.align import Align
from transformers import pipeline

# --- SILENCE THE MACHINE ---
warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
logging.getLogger("transformers").setLevel(logging.ERROR)

# --- CONFIGURATION ---
LLAMA_PATH = os.path.expanduser("~/llama-npu")
PIPER_DIR = os.path.expanduser("~/piper_tts")
PIPER_BIN = os.path.join(PIPER_DIR, "piper/piper")
PIPER_MODEL = os.path.join(PIPER_DIR, "en_US-ryan-medium.onnx")
SAMPLE_RATE = 16000
RECORD_SECONDS = 5

# Add local path for importing the Llama engine
sys.path.append(os.getcwd())
try:
    from llama_engine import LlamaNPU
except ImportError:
    print("❌ Error: llama_engine.py not found in current folder.")
    sys.exit(1)

console = Console()


class JarvisVoice:
    """Handles Neural TTS via Piper."""

    def __init__(self):
        if not os.path.exists(PIPER_BIN):
            raise FileNotFoundError("Piper binary not found.")

    def speak(self, text):
        if not text or len(text.strip()) == 0:
            return
        try:
            clean_text = text.replace("\n", " ").replace('"', '').strip()
            # Generate audio to a temp file to prevent gibberish/static
            cmd = [PIPER_BIN, "--model", PIPER_MODEL, "--output_file", "temp_speech.wav"]
            # Run silently
            p = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            p.communicate(input=clean_text.encode('utf-8'))
            # Play audio
            subprocess.run(["aplay", "-q", "temp_speech.wav"], stderr=subprocess.DEVNULL)
        except Exception:
            pass  # Fail silently in UI


def record_audio(duration):
    recording = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    return recording


def main():
    console.clear()
    console.print(Panel(Align.center(
        "[bold cyan]RADXA DRAGON NPU[/bold cyan]\n"
        "[bold white]J.A.R.V.I.S. ONLINE[/bold white]",
        vertical="middle"), border_style="cyan", padding=(1, 2)))

    with console.status("[bold green]Booting Systems...", spinner="dots"):
        try:
            brain = LlamaNPU(LLAMA_PATH)
            ears = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device="cpu")
            voice = JarvisVoice()
        except Exception as e:
            console.print(f"[bold red]System Failure:[/bold red] {e}")
            return

    console.print("[bold green]✓ Ready[/bold green]\n")
    voice.speak("Systems online.")

    while True:
        try:
            console.rule("[bold cyan]STANDBY[/bold cyan]")
            console.print("[dim]Press [bold white]ENTER[/bold white] to speak[/dim]", justify="center")
            input()

            console.print(Panel("[bold red]● LISTENING...[/bold red]", border_style="red"))
            audio_data = record_audio(RECORD_SECONDS)
            sf.write("temp_input.wav", audio_data, SAMPLE_RATE)

            with console.status("[bold yellow]Processing...[/bold yellow]", spinner="aesthetic"):
                # Force English to avoid language detection latency
                result = ears("temp_input.wav", generate_kwargs={"language": "en"})
                user_text = result["text"].strip()

            if len(user_text) < 2:
                console.print("[dim italic]...no speech detected...[/dim italic]")
                continue

            console.print(f"\n[bold cyan]USER:[/bold cyan] {user_text}")

            with console.status("[bold magenta]Thinking...[/bold magenta]", spinner="earth"):
                response = brain.generate(user_text)

            console.print(f"[bold green]JARVIS:[/bold green] {response}\n")
            voice.speak(response)

        except KeyboardInterrupt:
            voice.speak("Goodbye.")
            break


if __name__ == "__main__":
    main()
llama_engine.py

import subprocess
import os
import re


class LlamaNPU:
    def __init__(self, model_dir="~/llama-npu"):
        self.model_dir = os.path.expanduser(model_dir)
        self.cmd_path = os.path.join(self.model_dir, "genie-t2t-run")
        self.config = os.path.join(self.model_dir, "htp-model-config-llama32-1b-gqa.json")

        # Validation
        required = ["genie-t2t-run", "tokenizer.json"]
        for f in required:
            if not os.path.exists(os.path.join(self.model_dir, f)):
                raise FileNotFoundError(f"Missing {f} in {self.model_dir}")

    def generate(self, user_prompt):
        """Runs Llama on the NPU and returns clean text."""
        # 1. Prompt engineering (keep it concise)
        full_prompt = (
            f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
            f"You are Jarvis. Answer concisely in one sentence.<|eot_id|>"
            f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt}<|eot_id|>"
            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
        )

        # 2. Environment setup
        env = os.environ.copy()
        env["LD_LIBRARY_PATH"] = f"{self.model_dir}:{env.get('LD_LIBRARY_PATH', '')}"
        cmd = [self.cmd_path, "-c", self.config, "-p", full_prompt]

        try:
            # 3. Execution (fix: run inside the model dir)
            result = subprocess.run(
                cmd,
                cwd=self.model_dir,  # <--- CRITICAL FIX for tokenizer.json
                capture_output=True,
                text=True,
                env=env,
                encoding='utf-8',
                errors='replace'
            )
            raw_output = result.stdout + result.stderr

            # 4. Output cleaning
            # Try to extract content between [BEGIN] and [END] tags
            match = re.search(r'\[BEGIN\]:\s*(.*?)(?:\[END\]|$)', raw_output, re.DOTALL)
            if match:
                clean = match.group(1).strip()
                if clean:
                    return clean

            # Fallback: strip system logs line by line
            clean_lines = []
            for line in raw_output.split('\n'):
                if any(x in line for x in ["libGenie", "[INFO]", "Allocated",
                                           "rpcmem", "PROMPT:", "tokenizer"]):
                    continue
                if line.strip():
                    clean_lines.append(line)
            return "\n".join(clean_lines).strip() or "I heard you, but I have no response."
        except Exception as e:
            return f"System Error: {str(e)}"