Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 10 languages plus automatic language detection, 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

Generate speech fully offline on a Mac
Produce narration, audiobooks, podcasts, or video voiceovers
Create multilingual TTS with controllable style and emotion
Clone any voice from a short audio sample
Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav
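
To drive the script from Python rather than a shell, the commands above could be assembled as an argument list. This is a minimal sketch that assumes scripts/run_tts.py accepts exactly the flags shown in this document; `build_tts_command` is a hypothetical helper, not part of the skill.

```python
import sys
from typing import List, Optional


def build_tts_command(
    text: str,
    voice: str = "Ryan",
    output: str = "output.wav",
    instruct: Optional[str] = None,
    script: str = "scripts/run_tts.py",
) -> List[str]:
    """Build the argv list for a custom-voice run (hypothetical helper)."""
    cmd = [sys.executable, script, "custom-voice",
           "--text", text, "--voice", voice, "--output", output]
    if instruct:
        # --instruct is optional, so only add it when a style is given.
        cmd += ["--instruct", instruct]
    return cmd


cmd = build_tts_command(
    "Breaking news: local AI model achieves human-level speech.",
    voice="Uncle_Fu",
    output="news.wav",
    instruct="news anchor tone, calm and authoritative",
)
print(cmd[1:])
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; using a list instead of a single string avoids shell quoting problems in the text and instruct values.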

Model Variants

| Variant | Model | Size | Memory | Use Case |
|---------|-------|------|--------|----------|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1 GB | ~4 GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2 GB | ~5 GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1 GB | ~4 GB | Voice cloning from reference audio |
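
The variant table can be expressed as a small lookup when choosing a model in code. The sketch below assumes each CLI mode defaults to the variant with the matching use case; the model IDs are copied from the table, while `DEFAULT_MODELS` and `resolve_model` are hypothetical names.

```python
from typing import Optional

# Model IDs copied from the variants table; the mode-to-variant pairing is an assumption.
DEFAULT_MODELS = {
    "custom-voice": "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "voice-design": "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    "voice-clone": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
}


def resolve_model(mode: str, override: Optional[str] = None) -> str:
    """Return the override if given, else the assumed default for the mode."""
    if override:
        return override
    if mode not in DEFAULT_MODELS:
        raise ValueError(f"unknown mode: {mode!r}")
    return DEFAULT_MODELS[mode]


print(resolve_model("voice-clone"))
```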

Supported Languages

| Language | Code | Notes |
|----------|------|-------|
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |
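
Checking a language code before synthesis catches typos early. This is a sketch only: it matches case-insensitively and returns the spelling used in the table, but whether the underlying library itself is case-sensitive is not stated here. `normalize_lang_code` is a hypothetical helper.

```python
# Codes copied from the Supported Languages table.
SUPPORTED_LANG_CODES = (
    "auto", "Chinese", "English", "Japanese", "Korean", "French",
    "German", "Spanish", "Portuguese", "Italian", "Russian",
)


def normalize_lang_code(value: str = "auto") -> str:
    """Return the table spelling of a language code, or raise ValueError."""
    lookup = {code.lower(): code for code in SUPPORTED_LANG_CODES}
    key = value.strip().lower()
    if key not in lookup:
        raise ValueError(f"unsupported language code: {value!r}")
    return lookup[key]


print(normalize_lang_code("english"))
```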

Built-in Voices

| Voice | Language | Character |
|-------|----------|-----------|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
|----------|-------------------|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
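
The selection guide maps directly onto a dictionary lookup. In this sketch the scenario keys are the table's wording in lowercase, and the fallback is Vivian because the CLI lists it as the default voice; `recommend_voice` is a hypothetical helper.

```python
# Scenario-to-voice pairs copied from the Voice Selection Guide.
VOICE_BY_SCENARIO = {
    "chinese news/narration": "Uncle_Fu",
    "chinese casual/lively": "Eric",
    "chinese female, professional": "Vivian",
    "chinese female, storytelling": "Serena",
    "english energetic content": "Ryan",
    "english neutral/educational": "Aiden",
    "japanese content": "Ono_Anna",
    "korean content": "Sohee",
}


def recommend_voice(scenario: str, default: str = "Vivian") -> str:
    """Look up the recommended voice, falling back to the default voice."""
    return VOICE_BY_SCENARIO.get(scenario.strip().lower(), default)


print(recommend_voice("English neutral/educational"))
```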

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

"calm and warm" - Soft, friendly delivery
"news anchor, authoritative" - Professional broadcast style
"excited and energetic" - High energy, enthusiastic
"sad and melancholic" - Emotional, somber tone
"whispering, intimate" - Quiet, close-mic feel

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

"young cheerful female with high pitch"
"elderly wise male with deep resonant voice"
"professional female news anchor, clear articulation"
"friendly young male, casual and relaxed"

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

Use clean audio without background noise
5-10 seconds of speech works best
Provide accurate transcript of the reference
Reference and output language should match
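
The 5-10 second guideline can be checked before cloning. The sketch below uses only the standard-library `wave` module, so it assumes the reference is an uncompressed PCM WAV file; other formats would need converting first (for example with ffmpeg). Both helper names are hypothetical, and the demo writes a silent file purely to exercise the check.

```python
import os
import tempfile
import wave
from typing import Tuple


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path: str, min_s: float = 5.0, max_s: float = 10.0) -> Tuple[bool, float]:
    """Return (within the recommended range?, duration in seconds)."""
    duration = wav_duration_seconds(path)
    return min_s <= duration <= max_s, duration


# Demo: write 6 seconds of 16 kHz mono silence and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 6)

ok, duration = reference_length_ok(demo_path)
print(ok, duration)
```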

CLI Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |
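
For readers who want to see how these flags and defaults fit together, here is a hypothetical `argparse` reconstruction of the table. The real scripts/run_tts.py is not shown in this document, so treat every detail beyond the flag names and defaults listed above as an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI Parameters table."""
    parser = argparse.ArgumentParser(prog="run_tts.py")
    sub = parser.add_subparsers(dest="mode", required=True)
    for mode in ("custom-voice", "voice-design", "voice-clone"):
        p = sub.add_parser(mode)
        p.add_argument("--text", required=True)
        p.add_argument("--lang_code", default="auto")
        p.add_argument("--instruct")
        p.add_argument("--speed", type=float, default=1.0)
        p.add_argument("--temperature", type=float, default=0.7)
        p.add_argument("--model")
        p.add_argument("--output")
        p.add_argument("--out-dir", default="./outputs")
        if mode == "custom-voice":
            # --voice applies to CustomVoice only.
            p.add_argument("--voice", default="Vivian")
        if mode == "voice-clone":
            # The reference audio and transcript are required for cloning.
            p.add_argument("--ref_audio", required=True)
            p.add_argument("--ref_text", required=True)
    return parser


args = build_parser().parse_args(["custom-voice", "--text", "Hello"])
print(args.mode, args.voice, args.lang_code, args.speed, args.out_dir)
```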

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio; the call returns a generator that yields result chunks
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    # Skip chunks that carry no audio
    if getattr(chunk, "audio", None) is not None:
        # Convert to a NumPy array so the chunks can be concatenated and saved
        audio_chunks.append(np.asarray(chunk.audio))

# Combine the chunks and save at the model's 24,000 Hz sample rate
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.

Performance

| Metric | Value |
|--------|-------|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2 GB) |
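
As a rough worked example of these figures: reading the ~0.7x real-time factor as generation time divided by audio duration (an assumption; the table does not define it), one minute of audio would take about 42 seconds to generate. The helper names below are hypothetical, and actual timings will vary by machine and model.

```python
SAMPLE_RATE_HZ = 24_000   # from the Performance table
REAL_TIME_FACTOR = 0.7    # approximate, from the Performance table


def audio_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of a clip in seconds, given its sample count."""
    return num_samples / sample_rate


def estimated_generation_seconds(audio_s: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimated wall-clock time, assuming RTF = generation time / audio time."""
    return audio_s * rtf


one_minute = audio_seconds(1_440_000)  # 1,440,000 samples at 24 kHz
estimate = estimated_generation_seconds(one_minute)
print(f"{one_minute:.0f} s of audio -> about {estimate:.0f} s to generate")
```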

Troubleshooting

| Issue | Solution |
|-------|----------|
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |
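
For the "keep sentences short" advice, long text can be split at sentence boundaries and synthesized piece by piece. This is a minimal sketch: `split_into_short_chunks` is a hypothetical helper, the 120-character limit is an arbitrary choice, and the regular expression only handles common Latin and CJK sentence endings.

```python
import re
from typing import List

# Split after ., ! or ? followed by whitespace, or directly after CJK punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_short_chunks(text: str, max_chars: int = 120) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_short_chunks("First sentence. Second one! Third?", max_chars=20)
print(chunks)
```

A single sentence longer than the limit is kept whole rather than cut mid-sentence, so the limit is a target and not a hard cap.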
