skills/voice-audio-engineer/SKILL.md
Expert in voice synthesis, TTS, voice cloning, podcast production, speech processing, and voice UI design via ElevenLabs integration. Specializes in vocal clarity, loudness standards (LUFS), de-essing, dialogue mixing, and voice transformation. Activate on 'TTS', 'text-to-speech', 'voice clone', 'voice synthesis', 'ElevenLabs', 'podcast', 'voice recording', 'speech-to-speech', 'voice UI', 'audiobook', 'dialogue'. NOT for spatial audio (use sound-engineer), music production (use DAW tools), game audio middleware (use sound-engineer), sound effects generation (use sound-engineer with ElevenLabs SFX), or live concert audio.
Expert in voice synthesis, speech processing, and vocal production using ElevenLabs and professional audio techniques. Specializes in TTS, voice cloning, podcast production, and voice UI design.
✅ Use for:
❌ Do NOT use for:
| MCP Tool | Purpose |
|----------|---------|
| text_to_speech | Generate speech from text with voice selection |
| speech_to_speech | Transform voice recordings to different voices |
| voice_clone | Create instant voice clones from audio samples |
| search_voices | Find voices in ElevenLabs library |
| speech_to_text | Transcribe audio with speaker diarization |
| isolate_audio | Separate voice from background noise |
| create_agent | Build conversational AI agents with voice |
| Topic | Novice | Expert |
|-------|--------|--------|
| TTS quality | "Any voice works" | Matches voice to brand; considers emotion, pace, style |
| Voice cloning | "Upload any audio" | Knows 30s-3min of clean, varied speech needed; single speaker |
| Loudness | "Make it loud" | Targets -16 to -19 LUFS for podcasts; -14 for streaming |
| De-essing | "Doesn't matter" | Knows sibilance lives at 5-8kHz; frequency-selective compression |
| Compression | "Squash it" | Uses 3:1-4:1 for dialogue; slow attack (10-20ms) to preserve transients |
| High-pass | "Never use it" | Always HPF at 80-100Hz for voice; removes rumble, plosives |
| True peak | "Peak is peak" | Knows intersample peaks exceed 0dBFS; targets -1 dBTP |
| ElevenLabs models | "Use default" | eleven_multilingual_v2 for quality; eleven_flash_v2_5 for speed |
What it looks like: Voice clone from phone recording with background noise, echo
Why it's wrong: Clone learns the noise; output has artifacts
What to do instead: Use isolate_audio first; record in quiet space; provide 1-3 min of varied speech
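The length guidance above (30 seconds to 3 minutes of clean speech) can be checked before uploading a sample. The following is a minimal sketch using only the standard library; the function names `wav_duration_seconds` and `check_clone_sample` are hypothetical helpers, not part of any ElevenLabs tool, and the check covers duration only, not noise or speaker count.

```python
import wave

# Bounds taken from the guidance above: 30 s minimum, 3 min maximum.
MIN_SECONDS = 30.0
MAX_SECONDS = 180.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clone_sample(duration_s: float) -> str:
    """Classify a sample length against the 30 s to 3 min guidance."""
    if duration_s < MIN_SECONDS:
        return "too short"
    if duration_s > MAX_SECONDS:
        return "longer than needed"
    return "ok"
```

A typical call would be `check_clone_sample(wav_duration_seconds("sample.wav"))`, where `sample.wav` is a placeholder path.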
What it looks like: Podcast at -6 LUFS, then normalized by platform → crushed dynamics
Why it's wrong: Each platform normalizes differently; too loud = distortion, too quiet = inaudible
What to do instead: Master to -16 LUFS for podcasts; -14 LUFS for streaming; always check true peak < -1 dBTP
What it looks like: Using default robotic voice for premium product
Why it's wrong: Voice IS brand; wrong voice = wrong emotional connection
What to do instead: search_voices to find matching tone; consider custom clone for brand consistency
What it looks like: "SSSSibilant" speech after compression and EQ boost
Why it's wrong: Compression brings up sibilance; EQ boost at 3-5kHz makes it worse
What to do instead: De-ess at 5-8kHz before compression; use frequency-selective compression
What it looks like: Podcast with 20 "ums", breath sounds, long pauses
Why it's wrong: Listeners fatigue; unprofessional; reduces engagement
What to do instead: Edit out filler words; gate or manually cut breaths; tighten pacing
Model comparison:
| Model | Quality | Latency | Languages | Use Case |
|-------|---------|---------|-----------|----------|
| eleven_multilingual_v2 | Best | Higher | 29 | Production, quality-critical |
| eleven_flash_v2_5 | Good | Lowest | 32 | Real-time, voice UI |
| eleven_turbo_v2_5 | Better | Low | 32 | Balanced |
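The comparison table can be turned into a small selection rule. This is only a sketch of one way to read the table; `pick_model` and its two flags are hypothetical, and the model identifiers are the ones listed above.

```python
def pick_model(realtime: bool, quality_critical: bool = False) -> str:
    """Map the table's use cases onto a model identifier."""
    if realtime:
        # Lowest latency row: real-time and voice UI work.
        return "eleven_flash_v2_5"
    if quality_critical:
        # Best quality row: production, quality-critical output.
        return "eleven_multilingual_v2"
    # Balanced row for everything else.
    return "eleven_turbo_v2_5"
```

Latency takes priority here because a voice UI that responds late is unusable regardless of quality; swap the order of the checks if your project weighs that differently.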
Voice parameters:
# Stability: 0-1 (lower = more expressive, higher = more consistent)
# Similarity boost: 0-1 (higher = closer to original voice)
# Style: 0-1 (higher = more exaggerated style)
# For natural speech:
stability = 0.5 # Balanced expression
similarity = 0.75 # Close to voice but natural
style = 0.0 # Neutral (increase for dramatic)
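Since all three parameters share the 0-1 range, it can help to validate them before sending a request. The sketch below assumes a hypothetical `VoiceSettings` container; the field name `similarity_boost` follows the comment above, but how the values are passed to the actual tool is not shown here.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container using the natural-speech defaults above."""

    stability: float = 0.5          # Balanced expression
    similarity_boost: float = 0.75  # Close to voice but natural
    style: float = 0.0              # Neutral (increase for dramatic)

    def __post_init__(self):
        # Reject anything outside the documented 0-1 range.
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```

`asdict(VoiceSettings())` then yields a plain dictionary of the validated values.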
Audio requirements:
Cloning workflow:
isolate_audio to clean source material
voice_clone with cleaned audio
Standard voice chain (order matters!):
[Raw Recording]
↓
[High-Pass Filter @ 80Hz] ← Remove rumble, plosives
↓
[De-esser @ 5-8kHz] ← Before compression!
↓
[Compressor 3:1, 10ms/100ms] ← Smooth dynamics
↓
[EQ: +2dB @ 3kHz presence] ← Clarity boost
↓
[Limiter -1 dBTP] ← Prevent clipping
↓
[Loudness Norm -16 LUFS] ← Target loudness
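To make the first stage of the chain concrete, here is a minimal sketch of a second-order (12 dB/octave) high-pass filter at 80 Hz in pure Python. It uses the widely published Audio EQ Cookbook biquad formulas with a Butterworth Q; the function names and the 48 kHz default sample rate are assumptions for illustration, and a production chain would normally use a DAW plugin or a DSP library instead.

```python
import math


def highpass_coefficients(cutoff_hz, sample_rate, q=1 / math.sqrt(2)):
    """Return normalized biquad coefficients (b0, b1, b2), (a1, a2)."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = (
        (1.0 + cos_w0) / 2.0 / a0,
        -(1.0 + cos_w0) / a0,
        (1.0 + cos_w0) / 2.0 / a0,
    )
    a = (-2.0 * cos_w0 / a0, (1.0 - alpha) / a0)
    return b, a


def highpass(samples, cutoff_hz=80.0, sample_rate=48000):
    """Filter a sequence of float samples and return a new list."""
    (b0, b1, b2), (a1, a2) = highpass_coefficients(cutoff_hz, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        # Direct form I difference equation.
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

A constant (DC) input decays toward zero while a 1 kHz tone passes almost unchanged, which is the behavior the chain relies on to remove rumble without thinning the voice.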
| Platform/Format | Target Loudness | Peak Ceiling |
|-----------------|-----------------|--------------|
| Podcast | -16 to -19 LUFS | -1 dBTP |
| Audiobook (ACX) | -18 to -23 dB RMS | -3 dBFS |
| YouTube | -14 LUFS | -1 dBTP |
| Spotify/Apple Music | -14 LUFS | -1 dBTP |
| Broadcast (EBU R128) | -23 ±1 LUFS | -1 dBTP |
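The arithmetic behind hitting one of these targets is simple: a static gain shifts integrated loudness and peak level by the same number of dB. The sketch below assumes you have already measured integrated loudness and true peak with a meter; the dictionary keys and function names are hypothetical, and only a few rows of the table are included.

```python
# Targets copied from the table above (LUFS); keys are illustrative.
PLATFORM_TARGETS_LUFS = {
    "podcast": -16.0,
    "youtube": -14.0,
    "broadcast_ebu_r128": -23.0,
}
TRUE_PEAK_CEILING_DBTP = -1.0


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


def normalization_plan(measured_lufs, measured_peak_dbtp, platform):
    """Compute the static gain needed and whether a limiter is required."""
    gain_db = PLATFORM_TARGETS_LUFS[platform] - measured_lufs
    peak_after = measured_peak_dbtp + gain_db
    return {
        "gain_db": gain_db,
        "linear_gain": db_to_linear(gain_db),
        "peak_after_dbtp": peak_after,
        "needs_limiting": peak_after > TRUE_PEAK_CEILING_DBTP,
    }
```

For example, a recording measured at -22 LUFS with a -4 dBTP peak needs +6 dB to reach the podcast target, which would push the peak to +2 dBTP, so the limiter stage has to come before final normalization.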
Measurement:
ElevenLabs agent configuration:
create_agent(
name="Support Agent",
first_message="Hi, how can I help you today?",
system_prompt="You are a helpful customer support agent...",
voice_id="your_voice_id",
language="en",
llm="gemini-2.0-flash-001", # Fast for conversation
temperature=0.5,
asr_quality="high", # Speech recognition quality
turn_timeout=7, # Seconds before agent responds
max_duration_seconds=300 # 5 minute call limit
)
Voice UI considerations:
eleven_flash_v2_5) for real-time
eleven_flash_v2_5 model
eleven_multilingual_v2 model
isolate_audio first
De-esser: 5-8kHz, -6dB reduction, Q=2
Compressor: 3:1 ratio, -20dB threshold, 10ms attack, 100ms release
EQ presence: +2-3dB shelf at 3kHz
HPF: 80-100Hz, 12dB/oct
Limiter: -1 dBTP ceiling
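The compressor line above can be read as a static gain curve plus time constants. The following sketch shows the standard hard-knee gain computation for a 3:1 ratio at a -20 dB threshold, and the usual one-pole smoothing coefficient for attack and release times. The function names and the 48 kHz default are assumptions; real compressors add a knee, level detection, and makeup gain that are omitted here.

```python
import math


def compressor_gain_db(level_db, threshold_db=-20.0, ratio=3.0):
    """Gain change in dB (zero or negative) for a given input level."""
    if level_db <= threshold_db:
        return 0.0  # Below threshold: signal passes unchanged.
    # Above threshold: every `ratio` dB of input yields 1 dB of output.
    compressed = threshold_db + (level_db - threshold_db) / ratio
    return compressed - level_db


def smoothing_coefficient(time_ms, sample_rate=48000):
    """One-pole coefficient for an attack or release time in milliseconds."""
    return math.exp(-1.0 / (time_ms * 0.001 * sample_rate))
```

With these settings an input at -8 dB sits 12 dB over the threshold and comes out 4 dB over it, so the compressor applies 8 dB of gain reduction.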
| Type | Characteristics | ASR Impact |
|------|-----------------|------------|
| Stuttering | Repetitions ("I-I-I"), prolongations ("wwwant"), blocks (silent pauses) | Word boundaries confused; repetitions misrecognized |
| Cluttering | Irregular rate, collapsed syllables, filler overload, tangential speech | Words merged; rate changes confuse timing |
Most ASR models are trained on fluent speech. Disfluencies cause:
1. Model selection (best to worst for disfluencies):
2. Pre-processing:
# Normalize speech rate before ASR
# Use librosa to stretch irregular segments toward target rate
import librosa
y, sr = librosa.load("disfluent.wav")
y_stretched = librosa.effects.time_stretch(y, rate=0.9) # Slow down
3. Post-processing:
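One plausible form of transcript post-processing, offered here as an assumption rather than as the method this skill prescribes, is to collapse repetitions and drop fillers after ASR. The sketch below uses only the standard library; `clean_transcript` and the `FILLERS` set are hypothetical, and the rules are deliberately simple.

```python
import re

# Illustrative filler list; extend for your speakers and language.
FILLERS = {"um", "uh", "er"}


def clean_transcript(text: str) -> str:
    """Collapse stutter-style repetitions and remove filler words."""
    # Hyphenated repetitions: "I-I-I" -> "I"
    text = re.sub(r"\b(\w+)(?:-\1)+\b", r"\1", text, flags=re.IGNORECASE)
    # Immediate whole-word repetitions: "want want" -> "want"
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Drop fillers, ignoring trailing commas and periods when matching.
    words = [w for w in text.split() if w.strip(",.").lower() not in FILLERS]
    return " ".join(words)
```

Note that this also collapses legitimate doubled words such as "that that", so review the output when exact wording matters.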
4. Fine-tuning Whisper (advanced):
# Fine-tune on disfluent speech dataset
# Datasets: FluencyBank, UCLASS, SEP-28k (stuttering)
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# Fine-tune on your speech samples with corrected transcripts
# Training loop with disfluent audio → fluent transcript pairs
5. ElevenLabs voice cloning approach:
| Operation | Typical Time |
|-----------|--------------|
| TTS (100 words) | 2-5 seconds |
| Voice clone creation | 10-30 seconds |
| Speech-to-speech | 3-8 seconds |
| Transcription (1 min audio) | 5-15 seconds |
| Audio isolation | 5-20 seconds |
For detailed implementations: See /references/implementations.md
Remember: Voice is intimate—it speaks directly to the listener's brain. Match voice to brand, process for clarity not loudness, and always respect the platform's loudness standards. With ElevenLabs, you have instant access to professional voice synthesis; use it thoughtfully.