Qwen3-TTS How-To: Lessons Learned

Models

Three models are installed locally at ~/.local/share/qwen3-tts-models/:

Key	Model	Notes
`1`	`Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit`	Default. Best quality. Supports `instruct`.
`4`	`Qwen3-TTS-12Hz-0.6B-CustomVoice-8bit`	Faster, less RAM. No instruction control.
`6`	`Qwen3-TTS-12Hz-0.6B-Base-8bit`	Voice cloning from reference audio.

Each key also has an alias: pro-custom (1), lite-custom (4), lite-clone (6).
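The key and alias mapping above can be sketched as a small lookup. This is only an illustration: `resolve_model` is a hypothetical helper, not part of mcp_server.py, and it assumes each model lives in a subdirectory named after the model.

```python
from pathlib import Path

# Keys, aliases, and model names copied from the table above.
MODELS = {
    "1": "Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    "4": "Qwen3-TTS-12Hz-0.6B-CustomVoice-8bit",
    "6": "Qwen3-TTS-12Hz-0.6B-Base-8bit",
}
ALIASES = {"pro-custom": "1", "lite-custom": "4", "lite-clone": "6"}


def resolve_model(key_or_alias, models_dir="~/.local/share/qwen3-tts-models"):
    """Hypothetical helper: map a key or alias to a model directory path.

    Assumes each model is stored in a subdirectory named after the model.
    """
    key = ALIASES.get(key_or_alias, key_or_alias)
    if key not in MODELS:
        raise ValueError(f"unknown model key or alias: {key_or_alias!r}")
    return Path(models_dir).expanduser() / MODELS[key]


print(resolve_model("lite-clone").name)
```

Passing either "4" or "lite-custom" resolves to the same directory, so scripts can use whichever form reads better.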

Speakers

For best quality, use the speaker whose native language matches the text.

Speaker	Native Language	Description
Vivian	Chinese	Bright, slightly edgy young female
Serena	Chinese	Warm, gentle young female
Ryan	English	Dynamic male with strong rhythmic drive
Aiden	English	Sunny American male with clear midrange
Ethan	English	—
Chelsie	English	—

The full 1.7B model supports 9 speakers (including Ono_Anna for Japanese and Sohee for Korean), but the installed mlx-community quantized version ships with only the 6 listed above.
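To apply the "match the speaker to the text language" rule in a script, a lookup like the following may help. It is a minimal sketch: `speakers_for` is a hypothetical helper, and the mapping only restates the table above.

```python
# Speaker -> native language, copied from the table above.
SPEAKERS = {
    "Vivian": "chinese",
    "Serena": "chinese",
    "Ryan": "english",
    "Aiden": "english",
    "Ethan": "english",
    "Chelsie": "english",
}


def speakers_for(lang_code):
    """Hypothetical helper: list installed speakers native to a language."""
    wanted = lang_code.lower()
    return [name for name, lang in SPEAKERS.items() if lang == wanted]


print(speakers_for("chinese"))   # ['Vivian', 'Serena']
print(speakers_for("japanese"))  # [] because Ono_Anna is not in the installed build
```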

Language Codes

Always pass --lang-code for best results. Auto-detection works but can misidentify the language.

auto, chinese, english, japanese, korean, german, french, russian, portuguese, spanish, italian

Instruction Control (`instruct`)

The 1.7B CustomVoice model supports natural language style control via instruct.

# Excited tone
python mcp_server.py --speak-text "Hello!" --voice Ryan --instruct "speak in an excited and energetic tone"

# Angry tone
python mcp_server.py --speak-text "I told you so." --voice Ryan --instruct "speak in an angry tone"

# Chinese emotional style
python mcp_server.py --speak-text "你好！" --voice Vivian --lang-code chinese --instruct "用特别愉快的语气说"

When --instruct is omitted, the instruction defaults to "normal tone".
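When calling the CLI from another script, it can be convenient to assemble the argument list programmatically. The sketch below uses only the flags shown in the examples above; `build_speak_command` is a hypothetical helper, and the choice of Ryan and english as its defaults is an assumption made for the example.

```python
import shlex

# Language codes as listed in the Language Codes section.
LANG_CODES = {
    "auto", "chinese", "english", "japanese", "korean", "german",
    "french", "russian", "portuguese", "spanish", "italian",
}


def build_speak_command(text, voice="Ryan", lang_code="english", instruct=None):
    """Hypothetical helper: build an argv list for mcp_server.py."""
    if lang_code not in LANG_CODES:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    cmd = ["python", "mcp_server.py", "--speak-text", text,
           "--voice", voice, "--lang-code", lang_code]
    # Only add --instruct when a style was requested.
    if instruct:
        cmd += ["--instruct", instruct]
    return cmd


cmd = build_speak_command("Hello!", instruct="speak in an excited and energetic tone")
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems with instructions that contain spaces or non-ASCII text.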

MCP Setup

Run via Python (recommended)

Point .mcp.json directly at the venv Python, which avoids the PyInstaller startup overhead:

{
  "mcpServers": {
    "qwen3-tts-mcp": {
      "type": "stdio",
      "command": "/path/to/tts/.venv/bin/python",
      "args": [
        "/path/to/tts/mcp_server.py",
        "--models-dir", "/Users/<you>/.local/share/qwen3-tts-models"
      ]
    }
  }
}
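If the config needs to be generated for several machines, the same structure can be produced from Python. This is a sketch only: `mcp_config` is a hypothetical helper, and the two paths are the placeholders from the example above, not real locations.

```python
import json
from pathlib import PurePosixPath


def mcp_config(project_dir, models_dir):
    """Hypothetical helper: build the .mcp.json structure shown above."""
    project = PurePosixPath(project_dir)
    return {
        "mcpServers": {
            "qwen3-tts-mcp": {
                "type": "stdio",
                # Use the venv interpreter directly, not the compiled binary.
                "command": str(project / ".venv" / "bin" / "python"),
                "args": [
                    str(project / "mcp_server.py"),
                    "--models-dir", models_dir,
                ],
            }
        }
    }


config = mcp_config("/path/to/tts", "/Users/<you>/.local/share/qwen3-tts-models")
print(json.dumps(config, indent=2))
```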

Run via compiled binary

The binary is slower on a cold start (~15 s) because of PyInstaller bootstrap overhead; once warm, it performs fine. Build it with:

.venv/bin/pyinstaller qwen3-tts-mcp.spec --noconfirm
rm -f ~/.local/bin/qwen3-tts-mcp
cp dist/qwen3-tts-mcp ~/.local/bin/qwen3-tts-mcp

Important: The binary bundles all Python dependencies. Patches to .venv packages (e.g. mlx_audio) are NOT reflected in the binary until you rebuild.

Audio Buffering

The problem

mlx_audio's AudioPlayer defaults to min_buffer_seconds = 1.5. Combined with a streaming_interval of 2.0 seconds, audio chunks arrive too slowly and gaps appear between chunks during playback.

The fix (applied in `mcp_server.py`)

from mlx_audio.tts.audio_player import AudioPlayer
AudioPlayer.min_buffer_seconds = 4.0  # wait for larger buffer before starting playback

And in generate_audio():

streaming_interval=5.0  # generate larger chunks before yielding

Monkey-patching AudioPlayer directly in mcp_server.py (rather than editing the library file) survives pip upgrades.
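A monkey-patch survives upgrades only while the attribute it targets keeps its name. One possible safeguard is to fail loudly if the attribute has disappeared, instead of silently setting a new one that the library never reads. The sketch below shows the idea on a stand-in class; `patch_class_attr` and `StandInPlayer` are hypothetical names, and the real AudioPlayer is not imported here.

```python
def patch_class_attr(cls, name, value):
    """Set a class attribute, raising if the class does not already define it."""
    if not hasattr(cls, name):
        raise AttributeError(
            f"{cls.__name__} has no attribute {name!r}; the patch needs updating"
        )
    original = getattr(cls, name)
    setattr(cls, name, value)
    return original


class StandInPlayer:
    """Stand-in used only for this example; not the real AudioPlayer."""
    min_buffer_seconds = 1.5


previous = patch_class_attr(StandInPlayer, "min_buffer_seconds", 4.0)
print(previous, StandInPlayer.min_buffer_seconds)  # 1.5 4.0
```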

The drain bug

With min_buffer_seconds = 4.0, short phrases (under 4 seconds of audio) never accumulate enough audio to trigger playback, so wait_for_drain() blocks forever and the process hangs.

Fix: patch wait_for_drain to force-start playback if audio is buffered but not yet playing:

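# Keep the library's original method so the patch can delegate to it.
_original_wait_for_drain = AudioPlayer.wait_for_drain

def _patched_wait_for_drain(self):
    # Audio is buffered but below min_buffer_seconds: start playback before waiting.
    if not self.playing and self.buffered_samples() > 0:
        self.start_stream()
    return _original_wait_for_drain(self)

AudioPlayer.wait_for_drain = _patched_wait_for_drain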

CLI Reference

python mcp_server.py \
  --speak-text "Hello world" \
  --voice Ryan \
  --speak-model 1 \
  --lang-code english \
  --instruct "speak in an excited tone" \
  --speak-keep-file \
  --speak-output-dir ./outputs

Flag	Default	Description
`--speak-text`	—	Text to synthesize
`--voice`	`Vivian`	Speaker name
`--speak-model`	`1`	Model key (1-6 or alias)
`--lang-code`	`auto`	Language hint
`--instruct`	`None`	Style instruction
`--speak-speed`	`1.0`	Speed (note: not yet implemented in mlx_audio)
`--speak-keep-file`	off	Save WAV to disk
`--speak-output-dir`	`outputs/`	Output directory
`--speak-no-play`	off	Disable audio playback
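For reference, the flags and defaults in the table map naturally onto an argparse definition. This is a reconstruction from the table, not the actual parser in mcp_server.py, which may differ in types and validation.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the flag table above (assumed, not the real one)."""
    parser = argparse.ArgumentParser(prog="mcp_server.py")
    parser.add_argument("--speak-text", help="Text to synthesize")
    parser.add_argument("--voice", default="Vivian", help="Speaker name")
    parser.add_argument("--speak-model", default="1", help="Model key or alias")
    parser.add_argument("--lang-code", default="auto", help="Language hint")
    parser.add_argument("--instruct", default=None, help="Style instruction")
    parser.add_argument("--speak-speed", type=float, default=1.0, help="Speed")
    parser.add_argument("--speak-keep-file", action="store_true", help="Save WAV to disk")
    parser.add_argument("--speak-output-dir", default="outputs/", help="Output directory")
    parser.add_argument("--speak-no-play", action="store_true", help="Disable audio playback")
    return parser


args = build_parser().parse_args(["--speak-text", "Hello world", "--voice", "Ryan"])
print(args.speak_model, args.lang_code, args.speak_keep_file)  # 1 auto False
```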

Known Issues

speed parameter has no effect: mlx_audio accepts it but notes it is "not directly supported yet".
Tokenizer warning: transformers 5.0.0rc3 warns about an incorrect regex pattern in the Qwen3 tokenizer. fix_mistral_regex=True is passed in qwen3.py but does not fully propagate through AutoTokenizer.from_pretrained. The warning is cosmetic and audio quality is acceptable.
Binary vs Python path: the tokenizer warning occurs in both. It is only hidden in MCP tool output because stderr is not forwarded.
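Because MCP tool output does not forward stderr, one way to see such warnings is to run the command directly and capture stderr. The sketch below uses a stand-in command so it runs anywhere; `capture_stderr` is a hypothetical helper, and for a real check the stand-in would be replaced by the mcp_server.py command line from the CLI Reference.

```python
import subprocess
import sys


def capture_stderr(cmd):
    """Run a command and return (exit code, stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


# Stand-in command that writes a warning to stderr.
demo = [sys.executable, "-c", "import sys; print('warning: demo', file=sys.stderr)"]
code, err = capture_stderr(demo)
print(code, err.strip())
```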
