@telnet2
Created February 20, 2026 07:32
Qwen3-TTS on Apple Silicon - Lessons Learned

Qwen3-TTS How-To: Lessons Learned

Models

Three models are installed locally at ~/.local/share/qwen3-tts-models/:

| Key | Model | Notes |
|-----|-------|-------|
| 1 | Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit | Default. Best quality. Supports instruct. |
| 4 | Qwen3-TTS-12Hz-0.6B-CustomVoice-8bit | Faster, less RAM. No instruction control. |
| 6 | Qwen3-TTS-12Hz-0.6B-Base-8bit | Voice cloning from reference audio. |

Use aliases: pro-custom (1), lite-custom (4), lite-clone (6).
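
The key-and-alias scheme can be sketched as a small lookup. This is a hypothetical helper, not code taken from mcp_server.py: the names MODELS, ALIASES, and resolve_model are assumptions, and only the keys, aliases, directory names, and install path come from this document.

```python
from pathlib import Path

# Keys, aliases, and directory names come from this document;
# the helper itself is a hypothetical sketch.
MODELS_DIR = Path.home() / ".local/share/qwen3-tts-models"
MODELS = {
    1: "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    4: "Qwen3-TTS-12Hz-0.6B-CustomVoice-8bit",
    6: "Qwen3-TTS-12Hz-0.6B-Base-8bit",
}
ALIASES = {"pro-custom": 1, "lite-custom": 4, "lite-clone": 6}


def resolve_model(key_or_alias, models_dir=MODELS_DIR):
    """Map a key (1, "4") or an alias ("lite-clone") to a model directory."""
    text = str(key_or_alias).strip().lower()
    key = ALIASES.get(text)
    if key is None:
        try:
            key = int(text)
        except ValueError:
            raise ValueError(f"unknown model alias: {key_or_alias!r}") from None
    if key not in MODELS:
        raise ValueError(f"model key {key} is not installed")
    return Path(models_dir) / MODELS[key]
```

Raising on an unknown key, rather than falling back to the default model, makes a typo in --speak-model visible immediately.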


Speakers

For best quality, use the speaker whose native language matches the text.

| Speaker | Native Language | Description |
|---------|-----------------|-------------|
| Vivian | Chinese | Bright, slightly edgy young female |
| Serena | Chinese | Warm, gentle young female |
| Ryan | English | Dynamic male with strong rhythmic drive |
| Aiden | English | Sunny American male with clear midrange |
| Ethan | English | |
| Chelsie | English | |

The full 1.7B model supports 9 speakers (including Ono_Anna for Japanese and Sohee for Korean), but the installed mlx-community quantized version ships with only the 6 listed above.
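
One way to apply the "match the speaker to the text language" rule is a lookup keyed by language code. The sketch below is hypothetical (speakers_for and pick_speaker are not part of mcp_server.py); the speaker names and native languages are those of the six installed speakers, and the Vivian fallback mirrors the CLI default for --voice.

```python
# Native languages are from the speaker table; the helpers are a hypothetical sketch.
SPEAKER_LANGUAGE = {
    "Vivian": "chinese",
    "Serena": "chinese",
    "Ryan": "english",
    "Aiden": "english",
    "Ethan": "english",
    "Chelsie": "english",
}


def speakers_for(lang_code):
    """Return the installed speakers whose native language matches lang_code."""
    wanted = lang_code.strip().lower()
    return [name for name, lang in SPEAKER_LANGUAGE.items() if lang == wanted]


def pick_speaker(lang_code, fallback="Vivian"):
    """Pick the first native speaker, or the fallback when none is installed."""
    matches = speakers_for(lang_code)
    return matches[0] if matches else fallback
```

For languages with no installed native speaker (Japanese or Korean in the quantized build), the fallback is only a placeholder choice; quality for those languages is not something this sketch can guarantee.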


Language Codes

Always pass --lang-code for best results. Auto-detection works but can misidentify the language.

auto, chinese, english, japanese, korean, german, french, russian, portuguese, spanish, italian
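
Because a mistyped code such as "en" is easy to pass by accident, a caller can validate the value before invoking the server. This is a minimal sketch under the assumption that the list above is exhaustive; normalize_lang_code is a hypothetical name, not an mlx_audio or mcp_server.py function.

```python
# The accepted values are copied from the list above.
LANG_CODES = (
    "auto", "chinese", "english", "japanese", "korean", "german",
    "french", "russian", "portuguese", "spanish", "italian",
)


def normalize_lang_code(value):
    """Return a supported language code, defaulting to "auto" when unset."""
    code = (value or "auto").strip().lower()
    if code not in LANG_CODES:
        raise ValueError(
            f"unsupported language code {value!r}; expected one of {', '.join(LANG_CODES)}"
        )
    return code
```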

Instruction Control (instruct)

The 1.7B CustomVoice model supports natural language style control via instruct.

# Excited tone
python mcp_server.py --speak-text "Hello!" --voice Ryan --instruct "speak in an excited and energetic tone"

# Angry tone
python mcp_server.py --speak-text "I told you so." --voice Ryan --instruct "speak in an angry tone"

# Chinese emotional style
python mcp_server.py --speak-text "你好!" --voice Vivian --lang-code chinese --instruct "用特别愉快的语气说"

Default instruction is "normal tone" when omitted.
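
The defaulting rule can be written out as a small function. This sketch is hypothetical and not the logic in mcp_server.py: the "normal tone" default and the fact that model 1 supports instruct come from this document, while treating model 6 the same as model 4 (no instruction control) is an assumption.

```python
import warnings

DEFAULT_INSTRUCT = "normal tone"
# Only key 1 (1.7B CustomVoice) is documented as supporting instruct.
# Excluding key 6 as well as key 4 is an assumption of this sketch.
INSTRUCT_MODELS = {1}


def resolve_instruct(model_key, instruct=None):
    """Return the instruction to send, or None when the model ignores it."""
    if model_key not in INSTRUCT_MODELS:
        if instruct:
            warnings.warn(
                f"model {model_key} has no instruction control; ignoring {instruct!r}"
            )
        return None
    return instruct or DEFAULT_INSTRUCT
```

Warning instead of raising keeps a style hint from blocking synthesis on the smaller model while still surfacing that the hint was dropped.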


MCP Setup

Run via Python (recommended)

Point .mcp.json directly at the venv Python to avoid the PyInstaller startup overhead:

{
  "mcpServers": {
    "qwen3-tts-mcp": {
      "type": "stdio",
      "command": "/path/to/tts/.venv/bin/python",
      "args": [
        "/path/to/tts/mcp_server.py",
        "--models-dir", "/Users/<you>/.local/share/qwen3-tts-models"
      ]
    }
  }
}
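
If the config is generated rather than hand-edited, the same structure can be built from Python. This is a sketch: build_mcp_config and write_mcp_config are hypothetical helper names, and the paths passed in are placeholders exactly as in the JSON above.

```python
import json
from pathlib import Path


def build_mcp_config(python_path, server_path, models_dir):
    """Build the .mcp.json structure shown above as a plain dict."""
    return {
        "mcpServers": {
            "qwen3-tts-mcp": {
                "type": "stdio",
                "command": str(python_path),
                "args": [str(server_path), "--models-dir", str(models_dir)],
            }
        }
    }


def write_mcp_config(path, config):
    """Write the config as indented JSON with a trailing newline."""
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Typical use would be write_mcp_config(".mcp.json", build_mcp_config(venv_python, server_script, models_dir)), with the three paths resolved for the local machine.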

Run via compiled binary

The binary is slower on cold start (~15s) due to PyInstaller bootstrap overhead. Once warm it is fine. Build with:

.venv/bin/pyinstaller qwen3-tts-mcp.spec --noconfirm
rm -f ~/.local/bin/qwen3-tts-mcp
cp dist/qwen3-tts-mcp ~/.local/bin/qwen3-tts-mcp

Important: The binary bundles all Python dependencies. Patches to .venv packages (e.g. mlx_audio) are NOT reflected in the binary until you rebuild.


Audio Buffering

The problem

mlx_audio's AudioPlayer defaults to min_buffer_seconds = 1.5. Combined with a streaming_interval of 2.0 seconds, audio chunks arrive too slowly and gaps appear between chunks during playback.
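
The effect of the start threshold can be illustrated with a toy model. This is not mlx_audio's implementation: count_gaps is a hypothetical function, it assumes playback resumes immediately after an underrun, and the chunk arrival times in the usage note are invented for illustration.

```python
def count_gaps(chunks, min_buffer_seconds):
    """Count playback underruns in a simplified streaming model.

    chunks is a list of (arrival_time, duration) pairs in seconds, sorted by
    arrival time. Playback starts once at least min_buffer_seconds of audio
    has been buffered, then consumes audio in real time.
    """
    buffered = 0.0   # audio received before playback starts
    play_end = None  # time at which queued audio runs out; None until started
    gaps = 0
    for arrival, duration in chunks:
        if play_end is None:
            buffered += duration
            if buffered >= min_buffer_seconds:
                play_end = arrival + buffered  # playback starts now
        elif arrival > play_end:
            gaps += 1                          # the queue ran dry before this chunk
            play_end = arrival + duration
        else:
            play_end += duration               # chunk arrived in time
    return gaps
```

With chunks = [(1.0, 2.0), (2.0, 2.0), (5.5, 2.0), (6.0, 2.0)], a threshold of 1.5 starts playback on the first chunk and underruns once when the third chunk is late, while a threshold of 4.0 waits for two chunks and absorbs the same delay without a gap.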

The fix (applied in mcp_server.py)

from mlx_audio.tts.audio_player import AudioPlayer
AudioPlayer.min_buffer_seconds = 4.0  # wait for larger buffer before starting playback

And in generate_audio():

streaming_interval=5.0  # generate larger chunks before yielding

Monkey-patching AudioPlayer directly in mcp_server.py (rather than editing the library file) survives pip upgrades.
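
A patch applied at import time can still break silently if a future release renames the attribute. One hedged way to guard against that is to check for the attribute before overriding it. The helper below is a hypothetical sketch, and StubPlayer is a stand-in used only so the example runs without mlx_audio installed.

```python
def set_class_default(cls, name, value):
    """Override a class-level default, failing loudly if the attribute is gone."""
    if not hasattr(cls, name):
        raise AttributeError(
            f"{cls.__name__} has no attribute {name!r}; the library may have changed"
        )
    previous = getattr(cls, name)
    setattr(cls, name, value)
    return previous


class StubPlayer:
    """Stand-in for AudioPlayer with the default described above."""
    min_buffer_seconds = 1.5


# Instances created after the call see the new class-level default.
previous_default = set_class_default(StubPlayer, "min_buffer_seconds", 4.0)
```

This only works while the value is read from a class attribute; if a library version assigns it per instance in __init__, the override would have to be applied to each instance instead.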

The drain bug

With min_buffer_seconds = 4.0, short phrases (< 4 seconds) never accumulate enough audio to trigger playback. wait_for_drain() blocks forever — the process hangs.

Fix: patch wait_for_drain to force-start playback if audio is buffered but not yet playing:

_original_wait_for_drain = AudioPlayer.wait_for_drain
def _patched_wait_for_drain(self):
    if not self.playing and self.buffered_samples() > 0:
        self.start_stream()
    return _original_wait_for_drain(self)
AudioPlayer.wait_for_drain = _patched_wait_for_drain
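
The same wrapper can be packaged so that applying it twice does not stack wrappers, and exercised against a stand-in class. This is a sketch: patch_wait_for_drain and FakePlayer are hypothetical, and FakePlayer only mimics the member names the patch above relies on (playing, buffered_samples, start_stream, wait_for_drain).

```python
import functools


def patch_wait_for_drain(player_cls):
    """Wrap wait_for_drain so buffered-but-unstarted audio is played first."""
    original = player_cls.wait_for_drain
    if getattr(original, "_force_start_patched", False):
        return  # already patched; avoid stacking wrappers

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        if not self.playing and self.buffered_samples() > 0:
            self.start_stream()
        return original(self, *args, **kwargs)

    wrapper._force_start_patched = True
    player_cls.wait_for_drain = wrapper


class FakePlayer:
    """Stand-in player; not the mlx_audio class."""

    def __init__(self, samples):
        self.playing = False
        self._samples = samples

    def buffered_samples(self):
        return self._samples

    def start_stream(self):
        self.playing = True

    def wait_for_drain(self):
        # A real player would block here; the fake just reports its state.
        return self.playing
```

Calling patch_wait_for_drain(FakePlayer) and then FakePlayer(100).wait_for_drain() shows the stream being started before the original method runs, while a player with nothing buffered is left untouched.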

CLI Reference

python mcp_server.py \
  --speak-text "Hello world" \
  --voice Ryan \
  --speak-model 1 \
  --lang-code english \
  --instruct "speak in an excited tone" \
  --speak-keep-file \
  --speak-output-dir ./outputs

| Flag | Default | Description |
|------|---------|-------------|
| --speak-text | | Text to synthesize |
| --voice | Vivian | Speaker name |
| --speak-model | 1 | Model key (1-6 or alias) |
| --lang-code | auto | Language hint |
| --instruct | None | Style instruction |
| --speak-speed | 1.0 | Speed (note: not yet implemented in mlx_audio) |
| --speak-keep-file | off | Save WAV to disk |
| --speak-output-dir | outputs/ | Output directory |
| --speak-no-play | off | Disable audio playback |
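
When the CLI is driven from another script, the flags can be assembled as an argument list instead of a shell string, which avoids quoting problems with text that contains spaces or non-ASCII characters. build_speak_argv is a hypothetical helper; only the flag names and defaults are taken from the table above, and --speak-speed is left out because it currently has no effect.

```python
import sys


def build_speak_argv(text, voice="Vivian", model="1", lang_code="auto",
                     instruct=None, keep_file=False, output_dir=None,
                     no_play=False, script="mcp_server.py"):
    """Build an argv list for the one-shot speak mode of mcp_server.py."""
    argv = [
        sys.executable, script,
        "--speak-text", text,
        "--voice", voice,
        "--speak-model", str(model),
        "--lang-code", lang_code,
    ]
    if instruct:
        argv += ["--instruct", instruct]
    if keep_file:
        argv.append("--speak-keep-file")
    if output_dir:
        argv += ["--speak-output-dir", str(output_dir)]
    if no_play:
        argv.append("--speak-no-play")
    return argv
```

The result can be passed to subprocess.run(argv, check=True) from the project directory, using the interpreter of the active virtual environment.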

Known Issues

  • speed parameter has no effect: mlx_audio accepts it but notes "not directly supported yet".
  • Tokenizer warning: transformers 5.0.0rc3 warns about an incorrect regex pattern in the Qwen3 tokenizer. fix_mistral_regex=True is passed in qwen3.py but doesn't fully propagate through AutoTokenizer.from_pretrained. The warning is cosmetic and audio quality is acceptable.
  • Binary vs Python path: the tokenizer warning is emitted in both. It only seems absent in MCP tool output because stderr is not forwarded.