Skills/piper-tts-training/SKILL.md
Train custom TTS voices for Piper (ONNX format) using fine-tuning or from-scratch approaches. Use when creating new synthetic voices, fine-tuning existing Piper checkpoints, preparing audio datasets for TTS training, or deploying voice models to devices like Raspberry Pi or Home Assistant. Covers dataset preparation, Whisper-based validation, training configuration, and ONNX export.
npx skillsauth add sammcj/agentic-coding piper-tts-training
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.
Piper produces fast, offline TTS suitable for embedded devices. Training involves:
Fine-tuning vs from-scratch:
Gather 1,300-1,500+ phrases covering broad phonetic range:
Critical for non-US English: Ensure corpus uses correct regional spelling. See Localisation.
Generate or record training audio at 22050Hz mono WAV.
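The 22050 Hz mono requirement can be checked automatically before any further processing. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and the sketch assumes the files are plain PCM WAV (the `wave` module rejects other encodings).

```python
import wave

def check_wav_format(path, expected_rate=22050, expected_channels=1):
    """Return a list of problems found in a PCM WAV file (empty list means OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate}"
            )
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"{wav.getnchannels()} channels, expected {expected_channels}"
            )
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Running this over every file in `wavs/` and resampling only the files that report problems avoids re-encoding audio that is already in the target format.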
If using voice cloning (e.g., Chatterbox TTS):
sox -v 0.95 input.wav -r 22050 -t wav output.wav
-v 0.95 prevents clipping during resampling
Recording requirements:
Automate quality checks rather than manual listening:
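import whisper
# Assumes the installed piper_phonemize build exports phonemize_text; some builds
# expose phonemize_espeak(text, voice) instead, so adjust the import if needed.
from piper_phonemize import phonemize_text

# Load the model once and reuse it for every sample
model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    """Return True if the transcription matches the expected text phonemically."""
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()
    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")
    return expected_phonemes == transcribed_phonemes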
Retry failed samples up to 3 times. Target 95%+ dataset coverage.
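The retry-and-coverage policy can be sketched as a small loop. This is an illustration under stated assumptions, not part of Piper: `build_dataset`, `generate`, and `validate` are hypothetical names, where `generate(text)` returns the path of a freshly produced clip and `validate(path, text)` returns a boolean, for example the Whisper check above.

```python
def build_dataset(phrases, generate, validate, max_attempts=3):
    """Generate one clip per phrase, retrying clips that fail validation.

    generate(text) -> audio path and validate(path, text) -> bool are
    caller-supplied, hypothetical callables.
    """
    accepted, rejected = {}, []
    for text in phrases:
        for _attempt in range(max_attempts):
            path = generate(text)
            if validate(path, text):
                accepted[text] = path
                break
        else:
            # The loop finished without a break: every attempt failed
            rejected.append(text)
    coverage = len(accepted) / len(phrases) if phrases else 0.0
    return accepted, rejected, coverage
```

A run can then be accepted when `coverage >= 0.95`, with `rejected` kept as the list of phrases to reword or record manually.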
Structure your dataset:
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
metadata.csv format: {id}|{text} (pipe-separated, no headers)
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
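Before preprocessing, it can help to confirm that every metadata row is well formed and has a matching audio file. The sketch below assumes the two-column `{id}|{text}` layout described above and the `wavs/{id}.wav` naming shown in the directory tree; `load_metadata` is a hypothetical helper, not a Piper function.

```python
from pathlib import Path

def load_metadata(dataset_dir):
    """Parse metadata.csv and check each row against the wavs/ directory."""
    dataset_dir = Path(dataset_dir)
    rows, errors = [], []
    lines = (dataset_dir / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first pipe only, so the id is everything before it
        sample_id, sep, text = line.partition("|")
        if not sep or not sample_id.strip() or not text.strip():
            errors.append(f"line {line_no}: expected '{{id}}|{{text}}'")
            continue
        if not (dataset_dir / "wavs" / f"{sample_id}.wav").is_file():
            errors.append(f"line {line_no}: missing wavs/{sample_id}.wav")
            continue
        rows.append((sample_id, text.strip()))
    return rows, errors
```

Calling `load_metadata("dataset/")` returns the usable rows alongside a list of error messages, so problems can be fixed in one pass before running the preprocessing step.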
Convert to PyTorch tensors:
python3 -m piper_train.preprocess \
--language en-gb \
--input-dir dataset/ \
--output-dir piper_training_dir/ \
--dataset-format ljspeech
Use en-gb for Australian/NZ/UK voices (espeak-ng phoneme set).
Fine-tuning (recommended):
python3 -m piper_train \
--dataset-dir piper_training_dir/ \
--accelerator gpu \
--devices 1 \
--batch-size 12 \
--max_epochs 3000 \
--resume_from_checkpoint ljspeech-2000.ckpt \
--checkpoint-epochs 100 \
--quality high \
--precision 32
Key parameters:
- --batch-size: Reduce if VRAM limited (12 works on 8GB)
- --resume_from_checkpoint: Start from LJSpeech high-quality checkpoint
- --precision 32: More stable than mixed precision
- --validation-split 0.0 --num-test-examples 0: Skip validation for small datasets

Monitor with TensorBoard: watch loss_disc_all for convergence.
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
Create metadata file output.onnx.json from training config.json.
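One way to produce the metadata file is to copy the training `config.json` next to the exported model, under the model's filename plus `.json`. The sketch below assumes `config.json` sits in the preprocessing output directory (`piper_training_dir/` in the commands above); `write_onnx_config` is a hypothetical helper name.

```python
import json
from pathlib import Path

def write_onnx_config(training_dir, onnx_path):
    """Copy training config.json to '<onnx_path>.json' and return the new path."""
    config_path = Path(training_dir) / "config.json"
    # Parse first so a corrupt config fails here rather than at synthesis time
    config = json.loads(config_path.read_text(encoding="utf-8"))
    target = Path(str(onnx_path) + ".json")
    target.write_text(
        json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target
```

For example, `write_onnx_config("piper_training_dir/", "output.onnx")` writes `output.onnx.json` alongside the model.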
Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.
Corpus preparation:
- Run scripts/convert_spelling.py on corpus text before training
- Use en-gb or en-au espeak-ng voice for phonemisation

Common spelling conversions:

| American | Australian/UK |
|----------|---------------|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |
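The suffix patterns above are not safe to apply blindly (for example, "monitor" and "water" must stay unchanged), so a conversion pass is usually driven by an explicit word list. The sketch below illustrates that idea only; it is not the repository's `scripts/convert_spelling.py`, and the `US_TO_UK` entries and the `convert_spelling` function are hypothetical.

```python
import re

# Illustrative entries only; a real word list would be far longer
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
    "catalog": "catalogue",
    "defense": "defence",
}

_WORD = re.compile(r"[A-Za-z]+")

def convert_spelling(text, mapping=US_TO_UK):
    """Replace whole words found in the mapping, preserving simple casing."""
    def replace(match):
        word = match.group(0)
        converted = mapping.get(word.lower())
        if converted is None:
            return word  # not in the word list: leave untouched
        if word.isupper():
            return converted.upper()
        if word[0].isupper():
            return converted.capitalize()
        return converted
    return _WORD.sub(replace, text)
```

Matching whole words keeps "monitor" intact while still converting "Color" to "Colour" and "COLOR" to "COLOUR".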
Phoneme considerations:
For complete word lists and phonetic details, see references/localisation.md.
Validation: Use Whisper with language="en" and verify transcriptions match expected regional forms.
Pin versions to avoid API breakage:
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
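A quick check at the start of a training script can catch a mismatched environment early. The sketch below covers only the two constrained pins listed above, `pytorch-lightning==1.9.3` and `torch<2.6.0`; the helper names are hypothetical, and the version parsing is deliberately simple (suffixes such as `+cu121` or `.post1` are ignored rather than handled per PEP 440).

```python
import re
from importlib import metadata

def version_tuple(version):
    """Turn '2.5.1+cu121' into (2, 5, 1); anything after the numeric part is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad to three components so (2, 6) compares equal to (2, 6, 0)
    return parts + (0,) * max(0, 3 - len(parts))

def installed_versions(names=("pytorch-lightning", "torch")):
    """Look up installed versions; packages that are missing map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

def check_pins(installed):
    """Return human-readable violations of the two constrained pins."""
    problems = []
    lightning = installed.get("pytorch-lightning")
    if lightning is None:
        problems.append("pytorch-lightning is not installed")
    elif version_tuple(lightning) != (1, 9, 3):
        problems.append(f"pytorch-lightning is {lightning}, expected 1.9.3")
    torch = installed.get("torch")
    if torch is None:
        problems.append("torch is not installed")
    elif version_tuple(torch) >= (2, 6, 0):
        problems.append(f"torch is {torch}, expected <2.6.0")
    return problems
```

Calling `check_pins(installed_versions())` returns an empty list when the environment matches the pins.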
Docker containerisation recommended for reproducibility.
Minimum (fine-tuning):
From scratch: Multiply time by ~200x.
| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce batch-size (try 8 or 4) |
| Checkpoint won't load | Check pytorch-lightning version matches checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check espeak-ng language code and corpus spelling |