skills/qwen3-tts-mlx/SKILL.md
Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add agiseek/agent-skills qwen3-tts-mlx
Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.
pip install mlx-audio
brew install ffmpeg
python scripts/run_tts.py custom-voice \
--text "Hello, welcome to local text to speech." \
--voice Ryan \
--output output.wav
python scripts/run_tts.py custom-voice \
--text "Breaking news: local AI model achieves human-level speech." \
--voice Uncle_Fu \
--instruct "news anchor tone, calm and authoritative" \
--output news.wav
| Variant | Model | Size | Memory | Use Case |
|---------|-------|------|--------|----------|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1GB | ~4GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2GB | ~5GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1GB | ~4GB | Voice cloning from reference audio |
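
When scripting around these variants, it can help to keep the mode-to-model mapping in one place. The sketch below copies the repository names from the table and keys them by the `run_tts.py` subcommands used in this document. The helper `model_for` is hypothetical and is not part of the skill; whether the script's own per-mode defaults match this mapping exactly is an assumption.

```python
# Repository names are copied from the model table; the keys mirror the
# run_tts.py subcommands shown in this document.
DEFAULT_MODELS = {
    "custom-voice": "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "voice-design": "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    "voice-clone": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
}


def model_for(mode: str) -> str:
    """Hypothetical helper: return the model repository for a mode."""
    try:
        return DEFAULT_MODELS[mode]
    except KeyError:
        expected = ", ".join(sorted(DEFAULT_MODELS))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {expected}") from None
```

The result can be passed as `--model` on the command line or as the `model` argument in the Python examples.
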
| Language | Code | Notes |
|----------|------|-------|
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |
| Voice | Language | Character |
|-------|----------|-----------|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
|----------|-------------------|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
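
The selection guide can also be expressed as a small lookup for scripts that choose a voice automatically. This is only a sketch: the style labels (`"news"`, `"casual"`, and so on) and the `recommend_voice` helper are hypothetical names made up for illustration, while the voice names come from the guide.

```python
# (language, style) -> voice, following the selection guide.
# A style of None marks the only voice listed for that language.
RECOMMENDED = {
    ("Chinese", "news"): "Uncle_Fu",
    ("Chinese", "casual"): "Eric",
    ("Chinese", "professional"): "Vivian",
    ("Chinese", "storytelling"): "Serena",
    ("English", "energetic"): "Ryan",
    ("English", "neutral"): "Aiden",
    ("Japanese", None): "Ono_Anna",
    ("Korean", None): "Sohee",
}


def recommend_voice(language, style=None):
    """Hypothetical helper: pick a built-in voice for a language and style."""
    voice = RECOMMENDED.get((language, style)) or RECOMMENDED.get((language, None))
    if voice is None:
        raise LookupError(f"no recommendation for {language!r} with style {style!r}")
    return voice
```

For example, `recommend_voice("English", "neutral")` returns `"Aiden"`, and any Japanese request falls back to `"Ono_Anna"` because it is the only Japanese voice listed.
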
Use built-in voices with optional emotion/style control via --instruct.
python scripts/run_tts.py custom-voice \
--text "This is amazing news!" \
--voice Vivian \
--instruct "excited and happy" \
--output excited.wav
Style instruction examples:
- "calm and warm" - Soft, friendly delivery
- "news anchor, authoritative" - Professional broadcast style
- "excited and energetic" - High energy, enthusiastic
- "sad and melancholic" - Emotional, somber tone
- "whispering, intimate" - Quiet, close-mic feel

Create a completely new voice by describing it in natural language.
python scripts/run_tts.py voice-design \
--text "Welcome to our podcast." \
--instruct "warm, mature male narrator with low pitch and gentle tone" \
--output podcast_intro.wav
Voice description examples:
- "young cheerful female with high pitch"
- "elderly wise male with deep resonant voice"
- "professional female news anchor, clear articulation"
- "friendly young male, casual and relaxed"

Clone any voice from a reference audio sample (5-10 seconds recommended).
python scripts/run_tts.py voice-clone \
--text "This is my cloned voice speaking new content." \
--ref_audio reference.wav \
--ref_text "The exact transcript of the reference audio" \
--output cloned.wav
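
Since a 5-10 second reference is recommended, it may be worth checking the clip length before cloning. The sketch below uses Python's standard `wave` module, so it assumes an uncompressed PCM WAV file. The function names are hypothetical, and the default bounds simply restate the recommendation above.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_reference_length(path, min_seconds=5.0, max_seconds=10.0):
    """Hypothetical check: True if the clip falls in the recommended range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

Other formats (MP3, M4A) would first need converting to WAV, for example with ffmpeg.
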
Tips for voice cloning:
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |
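
To drive the CLI from another program, the parameters above can be assembled into an argument list. This is a minimal sketch, not part of the skill: `build_tts_command` is a hypothetical helper that assumes the flag spellings in the table (for example `--lang_code` and `--out-dir`) and enforces only the one requirement the table states, namely that voice cloning needs both `--ref_audio` and `--ref_text`.

```python
REQUIRED_FOR_CLONE = ("ref_audio", "ref_text")


def build_tts_command(mode, text, options=None, output=None):
    """Hypothetical helper: build an argv list for scripts/run_tts.py.

    `options` maps flag names without the leading dashes (as spelled in the
    parameter table) to values, e.g. {"voice": "Ryan", "speed": 1.2}.
    """
    options = dict(options or {})
    if mode == "voice-clone":
        missing = [name for name in REQUIRED_FOR_CLONE if not options.get(name)]
        if missing:
            raise ValueError("voice-clone requires: " + ", ".join(missing))

    command = ["python", "scripts/run_tts.py", mode, "--text", text]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    if output is not None:
        command += ["--output", output]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)`. Passing a list rather than a shell string avoids quoting problems when the text contains quotes or punctuation.
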
from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)
from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery",
):
    if hasattr(chunk, "audio") and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save at the model's 24 kHz sample rate
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)
from mlx_audio.tts.generate import generate_audio

# VoiceDesign: the voice is described in `instruct`
generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)
from mlx_audio.tts.generate import generate_audio

# Voice cloning: reference audio plus its exact transcript
generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)
Use scripts/batch_dubbing.py for processing multiple lines:
python scripts/batch_dubbing.py \
--input dubbing.json \
--out-dir outputs
See references/dubbing_format.md for the JSON format.

| Metric | Value |
|--------|-------|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2GB) |

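
These figures allow a rough planning estimate. Reading the real-time factor as generation time divided by audio duration (the usual definition, and consistent with "faster than real-time"), one minute of speech would take roughly 42 seconds to generate and contain 1,440,000 samples. The sketch below just encodes that arithmetic. The helper name is hypothetical, and actual speed will vary with hardware and model.

```python
SAMPLE_RATE_HZ = 24_000   # from the performance table
REAL_TIME_FACTOR = 0.7    # approximate; generation time / audio duration


def estimate_generation(audio_seconds):
    """Hypothetical helper: rough sample count and generation time."""
    return {
        "samples": round(audio_seconds * SAMPLE_RATE_HZ),
        "generation_seconds": audio_seconds * REAL_TIME_FACTOR,
    }
```
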
| Issue | Solution |
|-------|----------|
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |
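
For the "keep sentences short" advice, long passages can be split at sentence boundaries before synthesis. This is a rough sketch using a regular expression rather than a real sentence tokenizer. `split_sentences` is a hypothetical helper, it relies on `re.split` handling empty matches (Python 3.7+), and it will still mis-split abbreviations such as "Dr. Smith".

```python
import re

# Split after Latin end punctuation followed by whitespace, or directly
# after CJK end punctuation (which is usually not followed by a space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_sentences(text):
    """Hypothetical helper: split text into sentence-sized pieces."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```

Each piece can then be synthesized on its own, and decimals such as "3.5" are left intact because no whitespace follows the period.
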