/SKILL.md
Convert text to speech using Kyutai's Pocket TTS. Use when the user asks to "generate speech", "text to speech", "TTS", "convert text to audio", "voice synthesis", "generate voice", "read aloud", or "create audio from text". Supports voice cloning from audio samples and multiple pre-made voices (alba, marius, javert, jean, fantine, cosette, eponine, azelma).
npx skillsauth add kenneropia/text-to-voice text-to-voice
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Convert text to natural speech using Kyutai's Pocket TTS, a lightweight 100M-parameter model that runs efficiently on CPU.
pip install pocket-tts
# or use uvx to run without installing:
uvx pocket-tts generate
Requires Python 3.10+ and PyTorch 2.5+. GPU not required.
# Generate with defaults (saves to ./tts_output.wav)
uvx pocket-tts generate
# Specify text
pocket-tts generate --text "Hello, this is my message."
# Specify output file location
pocket-tts generate --text "Hello" --output-path ./audio/greeting.wav
# Full example with all common options
pocket-tts generate \
  --text "Welcome to the demo." \
  --voice alba \
  --output-path ./output/welcome.wav
| Option | Default | Description |
|--------|---------|-------------|
| --text | "Hello world..." | Text to convert to speech |
| --voice | alba | Voice name, local file path, or HuggingFace URL |
| --output-path | ./tts_output.wav | Where to save the generated audio file |
| --temperature | 0.7 | Generation temperature (higher = more expressive) |
| --lsd-decode-steps | 1 | Quality steps (higher = better quality, slower) |
| --eos-threshold | -4.0 | End detection threshold (lower = finish earlier) |
| --frames-after-eos | auto | Extra frames after end (each frame = 80ms) |
| --device | cpu | Device to use (cpu/cuda) |
| -q, --quiet | false | Disable logging output |
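When calling the CLI from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only flags listed in the table above; `build_generate_command` is a hypothetical helper name, not part of pocket-tts.

```python
import shlex


def build_generate_command(text, voice="alba", output_path="./tts_output.wav",
                           temperature=None, lsd_decode_steps=None, quiet=False):
    """Return an argv list for `pocket-tts generate` (flags taken from the options table)."""
    cmd = ["pocket-tts", "generate",
           "--text", text,
           "--voice", voice,
           "--output-path", output_path]
    # Optional flags are only added when explicitly set, so CLI defaults still apply.
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        cmd += ["--lsd-decode-steps", str(lsd_decode_steps)]
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_generate_command("Welcome to the demo.", output_path="./output/welcome.wav")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually run the command, the list can be passed to `subprocess.run(cmd, check=True)`, which requires `pocket-tts` to be on the PATH.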
# Use a pre-made voice by name
pocket-tts generate --voice alba --text "Hello"
pocket-tts generate --voice javert --text "Hello"
# Use a local audio file for voice cloning
pocket-tts generate --voice ./my_voice.wav --text "Hello"
# Use a voice from HuggingFace
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/merchant.wav" --text "Hello"
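The three kinds of `--voice` value shown above (pre-made name, local file, `hf://` URL) can be told apart before invoking the tool, for example to give an early error on a typo. This is a sketch under my own assumptions: `classify_voice` is a hypothetical helper and does not reflect how pocket-tts itself resolves voices.

```python
from pathlib import Path

# Pre-made voice names listed in this document.
PREMADE_VOICES = {"alba", "marius", "javert", "jean",
                  "fantine", "cosette", "eponine", "azelma"}


def classify_voice(voice: str) -> str:
    """Classify a --voice value as 'huggingface', 'premade', or 'local_file'."""
    if voice.startswith("hf://"):
        return "huggingface"
    if voice in PREMADE_VOICES:
        return "premade"
    # Assumption: anything with a file extension is meant as a local audio file.
    if Path(voice).suffix:
        return "local_file"
    raise ValueError(f"Unrecognized voice: {voice!r}")


for v in ["alba", "./my_voice.wav", "hf://kyutai/tts-voices/alba-mackenna/merchant.wav"]:
    print(v, "->", classify_voice(v))
```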
# Higher quality (more generation steps)
pocket-tts generate --lsd-decode-steps 5 --temperature 0.5 --output-path high_quality.wav
# More expressive/varied output
pocket-tts generate --temperature 1.0 --output-path expressive.wav
# Shorter output (finishes speaking earlier; uses a threshold below the -4.0 default)
pocket-tts generate --eos-threshold -5.0 --output-path shorter.wav
For quick iteration with multiple voices/texts:
uvx pocket-tts serve
# Open http://localhost:8000
Pre-made voices (use name directly with --voice):
| Voice | Gender | License | Description |
|-------|--------|---------|-------------|
| alba | Female | CC BY 4.0 | Casual voice |
| marius | Male | CC0 | Voice donation |
| javert | Male | CC0 | Voice donation |
| jean | Male | CC-NC | EARS dataset |
| fantine | Female | CC BY 4.0 | VCTK dataset |
| cosette | Female | CC-NC | Expresso dataset |
| eponine | Female | CC BY 4.0 | VCTK dataset |
| azelma | Female | CC BY 4.0 | VCTK dataset |
Full voice catalog: https://huggingface.co/kyutai/tts-voices
For detailed voice information, see references/voices.md.
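Because the pre-made voices carry different licenses, a project may want to select voices by license before generating audio. The sketch below copies the table above into a dictionary and filters out the entries marked CC-NC. Reading "CC-NC" as a non-commercial restriction is my assumption; the voice catalog has the exact terms.

```python
# Voice table from this document: name -> (gender, license)
VOICES = {
    "alba": ("Female", "CC BY 4.0"),
    "marius": ("Male", "CC0"),
    "javert": ("Male", "CC0"),
    "jean": ("Male", "CC-NC"),
    "fantine": ("Female", "CC BY 4.0"),
    "cosette": ("Female", "CC-NC"),
    "eponine": ("Female", "CC BY 4.0"),
    "azelma": ("Female", "CC BY 4.0"),
}


def voices_without_nc(voices=VOICES):
    """Return voice names whose license string carries no 'NC' marker, sorted by name."""
    return sorted(name for name, (_gender, lic) in voices.items() if "NC" not in lic)


print(voices_without_nc())
```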
Clone any voice from an audio sample by passing its path to --voice:
pocket-tts generate --voice ./my_recording.wav --text "Hello" --output-path cloned.wav
Generated audio is saved to ./tts_output.wav by default.
For programmatic use:
from pocket_tts import TTSModel
import scipy.io.wavfile
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")
audio = tts_model.generate_audio(voice_state, "Hello world!")
# Save to specific location
scipy.io.wavfile.write("./audio/output.wav", tts_model.sample_rate, audio.numpy())
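If scipy is not available, the standard-library `wave` module can write the same audio as 16-bit PCM. This is a sketch that assumes the samples are floats in the range [-1.0, 1.0], which I have not verified for Pocket TTS output. A synthetic tone stands in for the model's audio so the example runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, sample_rate, samples):
    """Write mono float samples (assumed in [-1.0, 1.0]) as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in for generated audio: 0.5 s of a 440 Hz tone at 24 kHz.
sample_rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate // 2)]
out_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
write_wav_pcm16(out_path, sample_rate, tone)
```

With real model output, something like `write_wav_pcm16(path, tts_model.sample_rate, audio.tolist())` should work for a 1D tensor.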
model = TTSModel.load_model(
    variant="b6369a24",   # Model variant
    temp=0.7,             # Temperature (0.0-1.0)
    lsd_decode_steps=1,   # Generation steps
    noise_clamp=None,     # Max noise value
    eos_threshold=-4.0    # End-of-sequence threshold
)
# Pre-made voice
voice_state = model.get_state_for_audio_prompt("alba")
# Local file
voice_state = model.get_state_for_audio_prompt("./my_voice.wav")
# HuggingFace
voice_state = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
audio = model.generate_audio(voice_state, "Text to speak")
# Returns: torch.Tensor (1D)
for chunk in model.generate_audio_stream(voice_state, "Long text..."):
    # Process each chunk as it's generated
    pass
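A streaming loop like the one above usually does something with each chunk, such as tracking how much audio has arrived. The sketch below shows that pattern with a fake generator standing in for `model.generate_audio_stream(...)`. It assumes each chunk is one-dimensional, so that `len(chunk)` is its sample count. The chunk size used here is arbitrary, and `consume_stream` and `fake_stream` are hypothetical names.

```python
def consume_stream(chunks, sample_rate=24000):
    """Collect streamed chunks and return them with the total duration in seconds."""
    collected = []
    total_samples = 0
    for chunk in chunks:
        total_samples += len(chunk)  # assumes a 1D chunk
        collected.append(chunk)
        print(f"received {len(chunk)} samples, {total_samples / sample_rate:.2f} s so far")
    return collected, total_samples / sample_rate


def fake_stream():
    # Stand-in for the model's stream: three chunks of silence, 1920 samples each.
    for _ in range(3):
        yield [0.0] * 1920


chunks, seconds = consume_stream(fake_stream())
```

With the real model, the chunks could be played back or written out as they arrive instead of being collected in a list.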
model.sample_rate - 24000 Hz
model.device - "cpu" or "cuda"
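The 24000 Hz sample rate and the 80 ms frame length from the options table together give simple conversions between frames, samples, and seconds. The sketch below works through that arithmetic. It assumes the 80 ms frame from `--frames-after-eos` applies to audio at `model.sample_rate`.

```python
FRAME_MS = 80        # frame length stated in the options table
SAMPLE_RATE = 24000  # model.sample_rate as stated above


def frames_to_seconds(frames):
    """Duration in seconds of a given number of frames."""
    return frames * FRAME_MS / 1000


def frames_to_samples(frames, sample_rate=SAMPLE_RATE):
    """Number of audio samples covered by a given number of frames."""
    return frames * sample_rate * FRAME_MS // 1000


def samples_to_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration in seconds of an audio buffer with num_samples samples."""
    return num_samples / sample_rate


print(frames_to_samples(1))       # samples in one frame
print(frames_to_seconds(5))       # seconds added by five extra frames
print(samples_to_seconds(48000))  # duration of a 48000-sample buffer
```

The last helper can be applied to generated audio, for example `samples_to_seconds(len(audio), model.sample_rate)` for a 1D tensor.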