tools/audio/ai-voice-cloning/SKILL.md
AI voice generation, text-to-speech, and voice synthesis via inference.sh CLI. Models: Inworld TTS-2 (100+ languages, emotion/non-verbal steering), Inworld TTS 1.5 (ultra-low latency), ElevenLabs (22+ premium voices, 32 languages), Kokoro TTS, DIA, Chatterbox, Higgs, VibeVoice for natural speech. Capabilities: multiple voices, emotions, accents, long-form narration, conversation, voice transformation, delivery mode control, character voices. Use for: voiceovers, audiobooks, podcasts, video narration, accessibility, gaming NPCs, avatar audio, UGC. Triggers: voice cloning, tts, text to speech, ai voice, voice generation, voice synthesis, voice over, narration, speech synthesis, ai narrator, elevenlabs, eleven labs, natural voice, realistic speech, voice ai, voice changer, inworld, inworld tts, character voice, npc voice
Install the belt CLI skill:
npx skills add belt-sh/cli
Generate natural AI voices via inference.sh CLI.

Requires the inference.sh CLI (belt). See the install instructions.
belt login
# Generate speech
belt app run infsh/kokoro-tts --input '{
"prompt": "Hello! This is an AI-generated voice that sounds natural and engaging.",
"voice": "af_sarah"
}'
| Model | App ID | Best For |
|-------|--------|----------|
| Inworld TTS-2 | inworld/text-to-speech-2 | 100+ languages, emotion/non-verbal steering, delivery modes |
| Inworld TTS 1.5 Max | inworld/text-to-speech-1-5-max | Low latency (<200ms), 15 languages |
| Inworld TTS 1.5 Mini | inworld/text-to-speech-1-5-mini | Ultra-low latency (~120ms), 15 languages, real-time |
| ElevenLabs TTS | elevenlabs/tts | Premium quality, 22+ voices, 32 languages |
| ElevenLabs Voice Changer | elevenlabs/voice-changer | Transform existing voice recordings |
| Kokoro TTS | infsh/kokoro-tts | Natural, multiple voices |
| DIA | infsh/dia-tts | Conversational, expressive |
| Chatterbox | infsh/chatterbox | Casual, entertainment |
| Higgs | infsh/higgs-tts | Professional narration |
| VibeVoice | infsh/vibevoice | Emotional range |

Kokoro voices, American English (af_ = female, am_ = male):

| Voice ID | Gender | Style |
|----------|--------|-------|
| af_sarah | Female | Warm, friendly |
| af_nicole | Female | Professional |
| af_sky | Female | Youthful |
| am_michael | Male | Authoritative |
| am_adam | Male | Conversational |
| am_echo | Male | Clear, neutral |

Kokoro voices, British English (bf_ = female, bm_ = male):

| Voice ID | Gender | Style |
|----------|--------|-------|
| bf_emma | Female | Refined |
| bf_isabella | Female | Warm |
| bm_george | Male | Classic |
| bm_lewis | Male | Modern |
Inworld TTS-2 is purpose-built for character voices, gaming, and expressive speech. Use [brackets] inline for emotion, non-verbals, and delivery control:
# Expressive character voice with emotion steering
belt app run inworld/text-to-speech-2 --input '{
"text": "[excited] Oh wow, you actually found the ancient artifact! [gasp] I cannot believe it... [whisper] We need to keep this between us.",
"voice_id": "Sarah",
"delivery_mode": "CREATIVE"
}'
# Calm narrator with stable delivery
belt app run inworld/text-to-speech-2 --input '{
"text": "The sun set behind the mountains, casting long shadows across the valley. A new chapter was about to begin.",
"voice_id": "Sarah",
"delivery_mode": "STABLE"
}'
Delivery modes: STABLE (consistent, narration), BALANCED (natural, default), CREATIVE (expressive, characters)
Steering examples: [laugh], [sigh], [whisper], [excited], [sad], [angry], [pause], [gasp]
Built-in voices (271+ across 15 languages): Sarah, Alex, Ashley, Dennis, Hana, Blake, Luna, Clive, and many more. Browse all at the Inworld TTS Playground.
# Ultra-fast response for chatbots & game NPCs (~120ms)
belt app run inworld/text-to-speech-1-5-mini --input '{
"text": "Welcome, traveler. What brings you to our village?",
"voice_id": "Clive",
"speaking_rate": 0.9
}'
belt app run infsh/kokoro-tts --input '{
"prompt": "Welcome to our quarterly earnings call. Today we will discuss the financial performance and strategic initiatives for the past quarter.",
"voice": "am_michael",
"speed": 1.0
}'
belt app run infsh/dia-tts --input '{
"text": "Hey, so I was thinking about that project we discussed. What if we tried a different approach?",
"voice": "conversational"
}'
belt app run infsh/kokoro-tts --input '{
"prompt": "Chapter One. The morning mist hung low over the valley as Sarah made her way down the winding path. She had been walking for hours.",
"voice": "bf_emma",
"speed": 0.9
}'
belt app run infsh/kokoro-tts --input '{
"prompt": "Introducing the next generation of productivity. Work smarter, not harder.",
"voice": "af_nicole",
"speed": 1.1
}'
belt app run infsh/kokoro-tts --input '{
"prompt": "Welcome back to Tech Talk! I am your host, and today we are diving deep into the world of artificial intelligence.",
"voice": "am_adam"
}'
# Generate dialogue between two speakers
# Speaker 1
belt app run infsh/kokoro-tts --input '{
"prompt": "Have you seen the latest AI developments? It is incredible how fast things are moving.",
"voice": "am_michael"
}' > speaker1.json
# Speaker 2
belt app run infsh/kokoro-tts --input '{
"prompt": "I know, right? Just last week I tried that new image generator and was blown away.",
"voice": "af_sarah"
}' > speaker2.json
# Merge conversation
belt app run infsh/media-merger --input '{
"audio_files": ["<speaker1-url>", "<speaker2-url>"],
"crossfade_ms": 300
}'
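
The `<speaker1-url>` and `<speaker2-url>` placeholders above have to come from the saved output files. The sketch below is one way to fill them in, assuming each output JSON contains the generated audio URL as a quoted `https://` string; the actual output schema is not documented here, so check a real `speaker1.json` first. `first_url` and `merge_payload` are hypothetical helper names, not part of the belt CLI.

```shell
# Print the first https:// URL found in a JSON file.
# ASSUMPTION: the audio URL is the first quoted https:// string in the output.
first_url() {
  grep -o 'https://[^"]*' "$1" | head -n 1
}

# Build the media-merger input for two audio URLs.
# $3 is the crossfade in milliseconds (defaults to 300, as in the example above).
merge_payload() {
  printf '{"audio_files": ["%s", "%s"], "crossfade_ms": %s}' "$1" "$2" "${3:-300}"
}

# Usage (requires belt and the files produced by the previous step):
#   belt app run infsh/media-merger \
#     --input "$(merge_payload "$(first_url speaker1.json)" "$(first_url speaker2.json)")"
```

If the output turns out to contain several URLs, replace `first_url` with a lookup of the specific field.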
For content over 5000 characters, split into chunks:
# Process long text in chunks
TEXT="Your very long text here..."
# Split and generate
# Chunk 1
belt app run infsh/kokoro-tts --input '{
"prompt": "<chunk-1>",
"voice": "bf_emma"
}' > chunk1.json
# Chunk 2
belt app run infsh/kokoro-tts --input '{
"prompt": "<chunk-2>",
"voice": "bf_emma"
}' > chunk2.json
# Merge chunks
belt app run infsh/media-merger --input '{
"audio_files": ["<chunk1-url>", "<chunk2-url>"],
"crossfade_ms": 100
}'
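
The splitting step itself is left open above. A minimal sketch of one approach is to break the text at sentence ends so that no chunk is cut mid-sentence. `chunk_text` is a hypothetical helper, not a belt command. It treats any word ending in `.`, `!`, or `?` as a sentence end, and it counts length the way the local `awk` does (bytes in some implementations), so leave some headroom below the 5000-character limit for non-ASCII text.

```shell
# Read text on stdin and print one chunk per line, each at most $1 characters
# (default 5000). Chunks are built from whole sentences; a single sentence
# longer than the limit is emitted as its own oversized chunk.
chunk_text() {
  chunk_max="${1:-5000}"
  tr '\n' ' ' | awk -v max="$chunk_max" '
    function flush_sentence() {
      if (sentence == "") return
      if (chunk == "") {
        chunk = sentence
      } else if (length(chunk) + 1 + length(sentence) <= max) {
        chunk = chunk " " sentence
      } else {
        print chunk
        chunk = sentence
      }
      sentence = ""
    }
    {
      for (i = 1; i <= NF; i++) {
        if (sentence == "") sentence = $i
        else sentence = sentence " " $i
        # A word ending in . ! or ? closes the current sentence.
        if ($i ~ /[.!?]$/) flush_sentence()
      }
    }
    END {
      flush_sentence()
      if (chunk != "") print chunk
    }'
}

# Usage: chunk_text 5000 < chapter.txt > chunks.txt
# Each line of chunks.txt can then be sent as one "prompt".
```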
# 1. Generate voiceover
belt app run infsh/kokoro-tts --input '{
"prompt": "This stunning footage shows the beauty of nature in its purest form.",
"voice": "am_michael"
}' > voiceover.json
# 2. Merge with video
belt app run infsh/media-merger --input '{
"video_url": "https://your-video.mp4",
"audio_url": "<voiceover-url>"
}'
# 1. Generate speech
belt app run infsh/kokoro-tts --input '{
"prompt": "Hi, I am excited to share some updates with you today.",
"voice": "af_sarah"
}' > speech.json
# 2. Animate with avatar
belt app run bytedance/omnihuman-1-5 --input '{
"image_url": "https://portrait.jpg",
"audio_url": "<speech-url>"
}'
| Speed | Effect | Use For |
|-------|--------|---------|
| 0.8 | Slow, deliberate | Audiobooks, meditation |
| 0.9 | Slightly slow | Education, tutorials |
| 1.0 | Normal | General purpose |
| 1.1 | Slightly fast | Commercials, energy |
| 1.2 | Fast | Quick announcements |
# Slow narration
belt app run infsh/kokoro-tts --input '{
"prompt": "Take a deep breath. Let yourself relax.",
"voice": "bf_emma",
"speed": 0.8
}'
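
When generating many clips from a script, the speed table can be encoded once as a lookup. This is only a sketch: `speed_for` and its content-type labels are made up for illustration, and the values are copied from the table above.

```shell
# Map a content type to the "speed" value suggested in the speed table.
# Unknown types fall back to 1.0 (normal, general purpose).
speed_for() {
  case "$1" in
    audiobook|meditation) echo 0.8 ;;
    education|tutorial)   echo 0.9 ;;
    commercial)           echo 1.1 ;;
    announcement)         echo 1.2 ;;
    *)                    echo 1.0 ;;
  esac
}

# Usage (requires belt):
#   belt app run infsh/kokoro-tts --input \
#     "{\"prompt\": \"Take a deep breath.\", \"voice\": \"bf_emma\", \"speed\": $(speed_for meditation)}"
```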
Use punctuation to control speech rhythm:
| Punctuation | Effect |
|-------------|--------|
| Period . | Full pause |
| Comma , | Brief pause |
| ... | Extended pause |
| ! | Emphasis |
| ? | Question intonation |
| - | Quick break |
belt app run infsh/kokoro-tts --input '{
"prompt": "Wait... Did you hear that? Something is coming. Something big!",
"voice": "am_adam"
}'
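
Punctuation such as apostrophes and double quotes is awkward inside a single-quoted `--input` string, because an apostrophe ends the shell quote and a double quote ends the JSON string. One way around this, sketched below, is to keep the text in a shell variable and escape it before building the JSON. `json_escape` and `kokoro_payload` are hypothetical helpers, not part of the belt CLI, and the escaping covers only backslashes, double quotes, and newlines (tabs and other control characters are left as-is).

```shell
# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, and newline only.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Build a Kokoro TTS input object from a prompt ($1) and a voice ID ($2).
kokoro_payload() {
  printf '{"prompt": "%s", "voice": "%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (requires belt):
#   TEXT="Wait... Didn't you hear that? She said \"run\"!"
#   belt app run infsh/kokoro-tts --input "$(kokoro_payload "$TEXT" am_adam)"
```

Apostrophes need no JSON escaping at all; they only cause trouble in the single-quoted shell form, which this approach avoids.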
# ElevenLabs TTS (premium, 22+ voices)
npx skills add inference-sh/skills@elevenlabs-tts
# ElevenLabs voice changer (transform recordings)
npx skills add inference-sh/skills@elevenlabs-voice-changer
# All TTS models
npx skills add inference-sh/skills@text-to-speech
# Podcast creation
npx skills add inference-sh/skills@ai-podcast-creation
# AI avatars
npx skills add inference-sh/skills@ai-avatar-video
# Video generation
npx skills add inference-sh/skills@ai-video-generation
# Full platform skill
npx skills add inference-sh/skills@infsh-cli
Browse audio apps: belt app store --category audio