skills/tts/SKILL.md
Use this skill whenever the user wants to convert text into speech, generate audio from text, or produce voiceovers. Triggers include: any mention of 'TTS', 'text to speech', 'speak', 'say', 'voice', 'read aloud', 'audio narration', 'voiceover', 'dubbing', or requests to turn written content into spoken audio. Also use when converting EPUB/PDF/SRT/articles to audio, cloning voices from reference audio, controlling emotion or speed in speech, aligning speech to subtitle timelines, or producing per-segment voice-mapped audio.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add NoizAI/skills tts
Convert any text into speech audio. Supports two backends (Kokoro local, Noiz cloud), two modes (simple or timeline-accurate), and per-segment voice control.
speak is the default — the subcommand can be omitted:
# Basic usage (speak is implicit)
python3 skills/tts/scripts/tts.py -t "Hello world" # add -o path to save
python3 skills/tts/scripts/tts.py -f article.txt -o out.mp3
# Voice cloning — local file path or URL
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio ./ref.wav
python3 skills/tts/scripts/tts.py -t "Hello" --ref-audio https://example.com/my_voice.wav -o clone.wav
# Voice message format
python3 skills/tts/scripts/tts.py -t "Hello" --format opus -o voice.opus
python3 skills/tts/scripts/tts.py -t "Hello" --format ogg -o voice.ogg
Third-party integration (Feishu/Telegram/Discord) is documented in ref_3rd_party.md.
Timeline mode gives precise per-segment timing (dubbing, subtitles, video narration) and renders from an SRT file.
If the user doesn't have an SRT file, generate one from text:
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt
python3 skills/tts/scripts/tts.py to-srt -i article.txt -o article.srt --cps 15 --gap 500
--cps = characters per second (default 4, good for Chinese; ~15 for English). The agent can also write SRT manually.
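To make the effect of --cps concrete, here is a minimal sketch of how a characters-per-second rate translates into a cue duration. The helper name `estimate_duration_ms` and the choice to ignore whitespace are assumptions for illustration; the actual logic inside tts.py may differ.

```python
def estimate_duration_ms(text: str, cps: float = 4.0) -> int:
    """Estimate a cue duration in milliseconds from a characters-per-second rate.

    Hypothetical helper, not part of tts.py.
    """
    if cps <= 0:
        raise ValueError("cps must be positive")
    # Assumption of this sketch: whitespace does not count toward the length.
    chars = sum(1 for ch in text if not ch.isspace())
    return round(chars / cps * 1000)


print(estimate_duration_ms("你好世界", cps=4))       # 4 characters at 4 cps -> 1000
print(estimate_duration_ms("Hello world", cps=15))  # 10 characters at 15 cps -> 667
```

The same text gets a much shorter cue at a higher rate, which is why the default of 4 suits Chinese while English reads better at around 15.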
The voice map is a JSON file that controls default and per-segment voice settings. Keys under segments are either a single index ("3") or a range ("5-8").
Kokoro voice map:
{
"default": { "voice": "zf_xiaoni", "lang": "cmn" },
"segments": {
"1": { "voice": "zm_yunxi" },
"5-8": { "voice": "af_sarah", "lang": "en-us", "speed": 0.9 }
}
}
Noiz voice map (adds emo, reference_audio support). reference_audio can be a local path or a URL (user’s own audio; Noiz only):
{
"default": { "voice_id": "voice_123", "target_lang": "zh" },
"segments": {
"1": { "voice_id": "voice_host", "emo": { "Joy": 0.6 } },
"2-4": { "reference_audio": "./refs/guest.wav" }
}
}
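The segment-key lookup described above could be sketched as follows. This is an illustration only: the function name `resolve_segment` is hypothetical, and it assumes that ranges are inclusive and that a matching segment entry is overlaid on top of the default settings. tts.py may resolve keys differently.

```python
def resolve_segment(voice_map: dict, index: int) -> dict:
    """Return settings for one segment: defaults overlaid with any matching entry.

    Hypothetical helper; assumes inclusive ranges such as "5-8".
    """
    settings = dict(voice_map.get("default", {}))
    for key, overrides in voice_map.get("segments", {}).items():
        start, _, end = key.partition("-")
        lo = int(start)
        hi = int(end) if end else lo  # a single index is a range of length one
        if lo <= index <= hi:
            settings.update(overrides)
    return settings


vm = {
    "default": {"voice": "zf_xiaoni", "lang": "cmn"},
    "segments": {
        "1": {"voice": "zm_yunxi"},
        "5-8": {"voice": "af_sarah", "lang": "en-us", "speed": 0.9},
    },
}
print(resolve_segment(vm, 1))  # segment 1 overrides only the voice
print(resolve_segment(vm, 6))  # inside the 5-8 range
print(resolve_segment(vm, 3))  # no match, so defaults apply
```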
Dynamic Reference Audio Slicing:
When translating or dubbing a video, each sentence can automatically use the original video's audio at the same timestamp as its reference audio. To enable this, pass --ref-audio-track instead of setting reference_audio in the map:
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --ref-audio-track original_video.mp4 -o output.wav
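Conceptually, slicing a reference clip per cue means converting the SRT timestamps to seconds and cutting that window out of the source track. The sketch below builds an ffmpeg command for one cue without running it. The helper names, the output file name, and the mono downmix are assumptions for illustration, not the behavior of tts.py.

```python
import re

_SRT_TS = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _SRT_TS.fullmatch(ts.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {ts!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def slice_command(source: str, start: str, end: str, out_path: str) -> list[str]:
    """Build (but do not run) an ffmpeg command that extracts one cue's audio."""
    begin = srt_time_to_seconds(start)
    duration = srt_time_to_seconds(end) - begin
    if duration <= 0:
        raise ValueError("cue end must be after cue start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{begin:.3f}",    # seek to the cue start
        "-t", f"{duration:.3f}",  # keep only the cue's duration
        "-i", source,
        "-vn",                    # drop the video stream
        "-ac", "1",               # mono output (an assumption of this sketch)
        out_path,
    ]


print(slice_command("original_video.mp4", "00:00:01,000", "00:00:03,500", "ref_0001.wav"))
```

The resulting list can be passed to `subprocess.run` when ffmpeg is available in PATH.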
See examples/ for full samples.
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json -o output.wav
python3 skills/tts/scripts/tts.py render --srt input.srt --voice-map vm.json --backend noiz --auto-emotion -o output.wav
| Need | Recommended |
|------|-------------|
| Just read text aloud, no fuss | Kokoro (default) |
| EPUB/PDF audiobook with chapters | Kokoro (native support) |
| Voice blending ("v1:60,v2:40") | Kokoro |
| Voice cloning from reference audio | Noiz |
| Emotion control (emo param) | Noiz |
| Exact server-side duration per segment | Noiz |
When the user needs emotion control + voice cloning + precise duration together, Noiz is the only backend that supports all three.
When no API key is configured, tts.py automatically falls back to guest mode — a limited Noiz endpoint that requires no authentication. Guest mode only supports --voice-id, --speed, and --format; voice cloning, emotion, duration, and timeline rendering are not available.
# Guest mode (auto-detected when no API key is set)
python3 skills/tts/scripts/tts.py -t "Hello" --voice-id 883b6b7c -o hello.wav
# Explicit backend override to use kokoro instead
python3 skills/tts/scripts/tts.py -t "Hello" --backend kokoro
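Because guest mode accepts only a few options, it can help to check a request before sending it. The following is a minimal sketch under stated assumptions: the option names mirror the CLI flags with underscores, the function `guest_mode_problems` is hypothetical, and text input and output path are treated as outside this check. It is not how tts.py validates arguments.

```python
# Options the document lists as available in guest mode.
GUEST_SUPPORTED = {"voice_id", "speed", "format"}


def guest_mode_problems(api_key, options: dict) -> list[str]:
    """Return the names of requested options that guest mode cannot honor.

    An empty list means the request is fine. With an API key set,
    guest mode does not apply, so nothing is flagged.
    """
    if api_key:
        return []
    return sorted(
        name
        for name, value in options.items()
        if value is not None and name not in GUEST_SUPPORTED
    )


request = {
    "voice_id": "883b6b7c",
    "speed": None,
    "ref_audio": "./ref.wav",
    "emo": {"Joy": 0.6},
}
print(guest_mode_problems(None, request))       # options that need an API key
print(guest_mode_problems("some-key", request))  # nothing flagged
```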
Available guest voices (15 built-in):
| voice_id | name | lang | gender | tone |
|---|---|---|---|---|
| 063a4491 | 販売員(なおみ) | ja | F | Joyful |
| 4252b9c8 | 落ち着いた女性 | ja | F | Calm |
| 578b4be2 | 熱血漢(たける) | ja | M | Angry |
| a9249ce7 | 安らぎ(みなと) | ja | M | Calm |
| f00e45a1 | 旅人(かいと) | ja | M | Calm |
| b4775100 | 悦悦|社交分享 | zh | F | Joyful |
| 77e15f2c | 婉青|情绪抚慰 | zh | F | Calm |
| ac09aeb4 | 阿豪|磁性主持 | zh | M | Calm |
| 87cb2405 | 建国|知识科普 | zh | M | Calm |
| 3b9f1e27 | 小明|科技达人 | zh | M | Joyful |
| 95814add | Science Narration | en | M | Calm |
| 883b6b7c | The Mentor (Alex) | en | M | Joyful |
| a845c7de | The Naturalist (Silas) | en | M | Calm |
| 5a68d66b | The Healer (Serena) | en | F | Calm |
| 0e4ab6ec | The Mentor (Maya) | en | F | Calm |
This skill performs the following file and network operations at runtime:
- When you run config --set-api-key, the key is saved to ~/.config/noiz/api_key (permissions 0600). The NOIZ_API_KEY environment variable is also supported as an alternative.
- If ~/.noiz_api_key exists and ~/.config/noiz/api_key does not, the key is copied (not deleted) to the new location. A message is printed; the old file is left untouched for you to remove manually.
- The Noiz backend sends requests to https://noiz.ai/v1/ for synthesis. No data is sent unless you invoke a Noiz command.
- When --ref-audio is a URL, the file is downloaded to a temp file, used for the API call, then deleted. If no voice-id or ref-audio is provided, a default reference audio is downloaded from storage.googleapis.com or noiz.ai.
- ... render mode to assemble the final audio.

No files outside the output path and ~/.config/noiz/ are modified. The Kokoro backend runs entirely offline with no network access.
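The key-storage and migration behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in tts.py: the function names are hypothetical, and the paths are passed in as parameters so the sketch never touches a real home directory.

```python
import os
from pathlib import Path


def save_api_key(key: str, path: Path) -> None:
    """Write the key to `path`, readable and writable by the owner only (0600)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(key.strip() + "\n")
    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)


def migrate_legacy_key(legacy: Path, current: Path) -> bool:
    """Copy a legacy key file to the new location if only the legacy one exists.

    Returns True when a copy was made. The legacy file is never deleted.
    """
    if current.exists() or not legacy.is_file():
        return False
    save_api_key(legacy.read_text(), current)
    print(f"Copied API key from {legacy} to {current}; the old file was left in place.")
    return True
```

In a real run, `legacy` would be `Path.home() / ".noiz_api_key"` and `current` would be `Path.home() / ".config" / "noiz" / "api_key"`, matching the paths named above.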
- ffmpeg in PATH (timeline mode only)
- requests package: uv pip install requests (required for Noiz backend)
- Noiz API key: python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY (guest mode works without a key but has limited features)
- Pass --backend kokoro to use the local backend

Use only the base64-encoded API key as Authorization, with no prefix (e.g. no APIKEY or Bearer). Any prefix causes 401.
For backend details and full argument reference, see reference.md.