skills/voice-ai-development/SKILL.md
Use when implementing voice AI features: real-time voice agents, STT/TTS pipelines, WebRTC audio, voice provider integration (OpenAI Realtime, Deepgram, ElevenLabs, Vapi, LiveKit), latency optimization, barge-in detection, or voice-specific error handling. NEVER for voice agent architecture decisions without implementation (use voice-agents), general audio processing without voice AI context (use standard libraries), phone system configuration (use twilio-communications), transcription-only workflows (use transcribe).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

`npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit voice-ai-development`
| Task | This skill | Other skill |
|------|-----------|-------------|
| Choose S2S vs pipeline architecture | Yes | voice-agents (strategy) |
| Implement STT/TTS streaming code | Yes | -- |
| Select voice provider for a project | Yes | voice-agents (broader evaluation) |
| Wire up WebRTC audio transport | Yes | -- |
| Optimize voice latency budget | Yes | -- |
| Design multi-agent voice system | No | voice-agents |
| Transcribe a recorded audio file | No | transcribe |
| Configure Twilio phone numbers/IVR | No | twilio-communications |
| Build non-voice audio processing | No | standard libraries |
| File | Purpose | When to Load |
|---|---|---|
| SKILL.md | Pipeline architecture, provider selection, latency budgets, edge cases, debugging | Always (auto-loaded) |
| provider-integration-reference.md | Provider-specific API patterns (OpenAI Realtime, Deepgram, ElevenLabs, LiveKit, Vapi), auth, formats, cost comparison | When connecting to a specific provider or troubleshooting provider-specific issues |
| audio-pipeline-cookbook.md | Audio format specs, sample rate conversion, VAD tuning parameters, preprocessing chains, buffering strategies, codec selection | When implementing audio processing, format conversion, or VAD tuning |
| production-monitoring.md | Metrics dashboards, alerting rules, load testing, capacity planning, disaster recovery, session logging | When deploying to production or diagnosing production quality issues |
Follow this sequence for every voice AI implementation. The order matters -- later decisions depend on earlier ones.
Key mindset: The difference between a voice AI demo and a production voice agent is latency discipline. Every architectural decision must be evaluated against the latency budget. If you cannot measure per-stage latency, you cannot optimize it.
Evaluate top-down. Use the FIRST match.
| Signal | Provider | Tradeoff |
|--------|----------|----------|
| Need voice-to-voice with GPT intelligence, single provider | OpenAI Realtime API | Highest per-minute cost ($0.06/min audio). Audio-in audio-out only -- no text intermediary unless you enable transcription. 24kHz PCM16. |
| Need phone-based voice agent deployed in hours, not days | Vapi | Hosted platform, webhook-based. Fastest time-to-deploy. Less control over pipeline internals. Per-minute pricing includes infrastructure. |
| Need best STT accuracy with lowest transcription latency | Deepgram Nova-2 | ~100-200ms streaming latency. Best word error rate for English. 16kHz input. Volume pricing drops fast at scale. |
| Need highest quality synthetic voice (emotional range, cloning) | ElevenLabs | Best voice quality but 150-300ms first-byte latency (turbo model). Higher cost per character. 24kHz output. |
| Need self-hosted real-time infrastructure (WebRTC rooms) | LiveKit | Open-source server, agent framework for server-side bots. Most complex to deploy but full control. No per-minute API costs beyond compute. |
| Need cheapest unified STT+TTS at scale | Deepgram (STT + Aura TTS) | Single provider, single bill, volume discounts. TTS quality below ElevenLabs but adequate for most agents. Lowest total cost at >10K minutes/month. |
| Need browser-only prototype, no server | Web Speech API | Free, zero dependencies. Terrible accuracy, no streaming control, browser-only. Prototype use ONLY. |
Override rule: If the user has already committed to a specific provider (existing contract, integration, or team expertise), use it. Only suggest alternatives if the user reports a specific limitation that their current provider cannot address.
Voice feels "real-time" when total perceived latency stays under 800ms. Natural human turn-taking gap is 200-500ms. Anything above 1200ms feels broken.
| Pipeline Stage | Typical Range | Where Time Goes |
|----------------|--------------|-----------------|
| Audio capture + VAD | 50-200ms | Microphone buffer + silence detection threshold. VAD silence_duration_ms is the main knob (shorter = faster but more false triggers). |
| STT processing | 100-300ms | Model inference + network. Streaming STT (Deepgram) gives interim results in ~100ms. Batch STT (Whisper) waits for full utterance. |
| STT -> LLM network | 10-50ms | Negligible if same cloud region. Add 50-100ms cross-region. |
| LLM first token | 200-800ms | Model size dependent. GPT-4o ~300ms. Claude ~400ms. Smaller models faster. Streaming essential -- do not wait for full completion. |
| LLM -> TTS handoff | 5-20ms | Start TTS on first sentence boundary, not full LLM completion. Sentence detection on streaming tokens saves 500ms+. |
| TTS first audio byte | 100-300ms | ElevenLabs turbo ~150ms. Deepgram Aura ~100ms. OpenAI Realtime ~80ms (integrated). Non-streaming TTS adds full synthesis time. |
| Audio playback buffer | 20-50ms | Jitter buffer + device output latency. Mostly fixed overhead. |
Total budget example (optimized): 50 + 100 + 20 + 300 + 10 + 120 + 30 = ~630ms (acceptable)

Total budget example (naive): 200 + 300 + 50 + 800 + 300 + 300 + 50 = ~2000ms (feels broken)
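The two budget totals above can be reproduced with a few lines of Python. This is a minimal sketch: the stage keys and the middle label `"sluggish"` (for totals between 800ms and 1200ms) are illustrative assumptions, while the 800ms and 1200ms thresholds and the per-stage numbers come from the text.

```python
# Thresholds from the text: under 800 ms feels real-time, above 1200 ms feels broken.
REALTIME_MS = 800
BROKEN_MS = 1200

# Stage keys are hypothetical names for the seven rows of the latency table.
STAGES = (
    "capture_vad", "stt", "stt_to_llm", "llm_first_token",
    "llm_to_tts", "tts_first_byte", "playback_buffer",
)


def total_latency_ms(budget: dict) -> int:
    """Sum per-stage latency, refusing budgets with unmeasured stages."""
    missing = [stage for stage in STAGES if stage not in budget]
    if missing:
        raise ValueError(f"unmeasured stages: {missing}")
    return sum(budget[stage] for stage in STAGES)


def classify(total_ms: int) -> str:
    """Map a total to a verdict; 'sluggish' is an assumed label for the middle band."""
    if total_ms < REALTIME_MS:
        return "acceptable"
    if total_ms <= BROKEN_MS:
        return "sluggish"
    return "feels broken"


optimized = dict(zip(STAGES, (50, 100, 20, 300, 10, 120, 30)))
naive = dict(zip(STAGES, (200, 300, 50, 800, 300, 300, 50)))

for name, budget in (("optimized", optimized), ("naive", naive)):
    total = total_latency_ms(budget)
    print(f"{name}: {total}ms ({classify(total)})")
```

Raising on a missing stage is deliberate: a budget with an unmeasured stage hides exactly the latency you cannot optimize.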
| Factor | Server-to-Server (S2S) | Client-Server |
|--------|----------------------|---------------|
| Audio path | Provider -> your server -> provider (audio never touches browser) | Browser captures audio -> your server -> providers |
| Latency | Lower (server-to-server hops, no browser overhead) | Higher (extra browser-to-server hop for all audio) |
| Complexity | Higher (manage audio streams server-side, need media server or raw WebSocket handling) | Lower (browser Web Audio API handles capture/playback) |
| Control | Full control over pipeline, mixing, routing | Limited by browser APIs and permissions |
| Use case | Production voice agents, phone bots, high-volume | Prototypes, web demos, low-volume |
| Audio transport | WebSocket (simpler, unidirectional or bidirectional) or WebRTC (built-in echo cancellation, jitter buffer, NAT traversal) | WebRTC preferred (browser has native support) or WebSocket with manual audio handling |
Decision rule: If building a production voice agent that handles >100 concurrent calls or needs sub-600ms latency, use S2S. For everything else, start client-server and migrate if needed. If the user has an existing architecture, work within it -- do not propose a full pipeline migration unless they report a latency or scaling problem.
These are the problems that separate working demos from production voice apps.
Barge-in handling -- User interrupts while TTS is playing. Must: (1) detect speech via VAD during playback, (2) immediately stop TTS audio output, (3) clear any queued audio chunks, (4) cancel pending LLM response if still generating, (5) restart STT listening. If you skip step 3 or 4, the bot "talks over" the user with stale content.
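The five barge-in steps above can be sketched as a small controller. Everything here is an assumption for illustration: `BargeInController`, `stop_tts`, `restart_stt`, and `cancel_llm` are hypothetical names, not part of any provider SDK, and the real callbacks would wrap whichever TTS, LLM, and STT clients the pipeline uses.

```python
from collections import deque
from typing import Callable, Optional


class BargeInController:
    """Runs barge-in steps 2-5 when the VAD reports speech during playback.

    Every collaborator is a hypothetical callable, not a provider API.
    """

    def __init__(self, stop_tts: Callable[[], None], restart_stt: Callable[[], None]):
        self.stop_tts = stop_tts
        self.restart_stt = restart_stt
        self.audio_queue = deque()  # TTS chunks synthesized but not yet played
        # Set while an LLM response is generating, e.g. to an asyncio Task's cancel method.
        self.cancel_llm: Optional[Callable[[], None]] = None
        self.playing = False

    def on_speech_detected(self) -> bool:
        """Step 1 has fired. Returns True if an interruption was handled."""
        if not self.playing:
            return False              # user spoke while the bot was silent: a normal turn
        self.stop_tts()               # step 2: stop audio output immediately
        self.audio_queue.clear()      # step 3: drop queued chunks so stale audio never plays
        if self.cancel_llm is not None:
            self.cancel_llm()         # step 4: cancel the in-flight LLM response
            self.cancel_llm = None
        self.playing = False
        self.restart_stt()            # step 5: listen to the interrupting utterance
        return True


# Demo with recording fakes standing in for real TTS/LLM/STT clients.
events = []
controller = BargeInController(
    stop_tts=lambda: events.append("tts.stop"),
    restart_stt=lambda: events.append("stt.restart"),
)
controller.audio_queue.extend([b"chunk-1", b"chunk-2"])
controller.cancel_llm = lambda: events.append("llm.cancel")
controller.playing = True
handled = controller.on_speech_detected()
```

Keeping the queue clear and the LLM cancel in the same method as the TTS stop makes it hard to skip step 3 or 4 by accident, which is the failure the text warns about.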
VAD tuning -- Default thresholds cause two failure modes. Too sensitive (threshold < 0.3): background noise, keyboard clicks, breathing trigger false speech detection. Too lax (threshold > 0.7): clips first syllables, misses quiet speakers. Start at 0.5, tune per environment. Also tune prefix_padding_ms (audio before speech trigger to capture) and silence_duration_ms (how long silence before end-of-turn).
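To make the three knobs concrete, here is a minimal sketch of end-of-turn detection over per-frame speech probabilities. It assumes some VAD model already produces one probability per 20ms frame; `detect_turns` and its frame-based input are hypothetical, and only the parameter names `threshold`, `prefix_padding_ms`, and `silence_duration_ms` mirror the text.

```python
def detect_turns(probs, frame_ms=20, threshold=0.5,
                 prefix_padding_ms=300, silence_duration_ms=500):
    """Return (start_ms, end_ms) spans of detected user turns.

    probs: one speech probability per audio frame, from any VAD model (assumed input).
    """
    silence_frames = max(1, silence_duration_ms // frame_ms)
    pad_frames = prefix_padding_ms // frame_ms
    turns = []
    start = None        # first frame of the current turn, including prefix padding
    last_speech = None  # most recent frame at or above the threshold
    prev_end = 0        # frame where the previous turn ended, so padding never overlaps it
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = max(prev_end, i - pad_frames)
            last_speech = i
        elif start is not None and i - last_speech >= silence_frames:
            # Enough trailing silence has accumulated: close the turn.
            prev_end = last_speech + 1
            turns.append((start * frame_ms, prev_end * frame_ms))
            start = None
    if start is not None:  # stream ended mid-turn
        turns.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return turns


# One utterance containing a 60 ms pause (3 frames) between two bursts of speech.
probs = [0.1] * 10 + [0.9] * 5 + [0.1] * 3 + [0.9] * 5 + [0.1] * 30

patient = detect_turns(probs, prefix_padding_ms=100, silence_duration_ms=200)
eager = detect_turns(probs, prefix_padding_ms=100, silence_duration_ms=40)
print(patient)  # the pause is bridged: one turn
print(eager)    # the pause ends the turn early: two turns
```

With the same audio, the shorter `silence_duration_ms` responds sooner but splits a single utterance in two, which is the latency versus false-trigger tradeoff described above.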
Audio format mismatches -- OpenAI Realtime requires PCM16 at 24kHz mono. Deepgram expects PCM16 at 16kHz. ElevenLabs outputs at 22050Hz or 24000Hz depending on format. Browser MediaRecorder often outputs Opus in WebM. Resampling between sample rates adds 5-20ms latency. Worst case: accidental double-resampling in a mixed pipeline.
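One explicit resampling step at a provider boundary might look like the sketch below. It is a standard-library illustration only: `resample_pcm16` is a hypothetical helper using plain linear interpolation with no anti-aliasing filter, so a production pipeline would more likely use a dedicated resampling library. The 24kHz and 16kHz rates are the ones quoted in the text.

```python
import struct


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample little-endian mono PCM16 by linear interpolation (no low-pass filter)."""
    if src_rate == dst_rate:
        return pcm  # avoid needless work and accidental double-resampling
    n_in = len(pcm) // 2
    if n_in == 0:
        return b""
    src = struct.unpack(f"<{n_in}h", pcm[: n_in * 2])
    n_out = n_in * dst_rate // src_rate
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        j = min(int(pos), n_in - 1)
        frac = pos - j
        a = src[j]
        b = src[min(j + 1, n_in - 1)]
        out.append(int(round(a + (b - a) * frac)))
    return struct.pack(f"<{n_out}h", *out)


# A 20 ms frame at 24 kHz is 480 samples; at 16 kHz it should be 320 samples.
frame_24k = struct.pack("<480h", *([1000] * 480))
frame_16k = resample_pcm16(frame_24k, 24000, 16000)
print(len(frame_24k), "->", len(frame_16k), "bytes")
```

Returning the input untouched when the rates already match is the cheap guard against the double-resampling case: each boundary calls the helper with the rates it actually negotiated, and a no-op costs nothing.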
Echo cancellation -- WebRTC connections include built-in Acoustic Echo Cancellation (AEC). Raw WebSocket audio does not. Without AEC, the bot hears its own TTS output through the user's speakers, creating feedback loops or false barge-in triggers. Solutions: use WebRTC when possible, or implement software AEC (SpeexDSP, WebRTC AEC module), or use headphones-only constraint.
Network jitter -- Streaming audio over unreliable connections. Jitter buffer too small = choppy audio with gaps. Jitter buffer too large = added latency. Adaptive jitter buffers (WebRTC default) work well. For WebSocket audio: implement a 40-80ms playback buffer client-side. Mobile networks are worst -- 4G jitter can hit 100ms+.
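For the WebSocket case, a fixed-depth playback buffer in the 40-80ms range could be sketched as follows. `PlaybackBuffer` is a hypothetical class, the 20ms frame size is an assumption, and unlike WebRTC's jitter buffer this one is not adaptive: it simply waits until the target depth is reached before releasing audio, and re-buffers after an underrun.

```python
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """Fixed-depth playback buffer for WebSocket audio (a sketch, not adaptive)."""

    def __init__(self, frame_ms: int = 20, target_ms: int = 60):
        self.target_frames = max(1, target_ms // frame_ms)
        self.frames = deque()
        self.primed = False  # True once enough audio is buffered to start playback

    def push(self, frame: bytes) -> None:
        """Called when an audio frame arrives from the network."""
        self.frames.append(frame)
        if len(self.frames) >= self.target_frames:
            self.primed = True

    def pop(self) -> Optional[bytes]:
        """Called by the playback clock once per frame; None means 'play silence'."""
        if not self.primed:
            return None
        if not self.frames:
            self.primed = False  # underrun: refill to target depth before resuming
            return None
        return self.frames.popleft()


buf = PlaybackBuffer(frame_ms=20, target_ms=60)
before_audio = buf.pop()                    # nothing buffered yet
buf.push(b"f1")
buf.push(b"f2")
while_priming = buf.pop()                   # 40 ms buffered, target is 60 ms
buf.push(b"f3")
played = [buf.pop(), buf.pop(), buf.pop()]  # buffer is primed, frames drain in order
after_underrun = buf.pop()                  # empty again: drops back to priming
```

A larger `target_ms` absorbs more jitter at the cost of that much added latency on every turn, so the value belongs in the latency budget alongside the other stages.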
Silence and prosody -- Most TTS providers output audio without natural pauses at sentence boundaries. Concatenated TTS chunks sound robotic. Insert 150-300ms silence between sentences. Also: TTS models handle punctuation differently -- commas, ellipses, and em-dashes produce inconsistent pause lengths across providers.
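Inserting the pause is mostly arithmetic on the audio format: PCM16 silence is zero-valued samples, two bytes each. The helpers below are a minimal sketch with hypothetical names (`silence_pcm16`, `join_with_pauses`), assuming mono PCM16 chunks that all share one sample rate; compressed formats such as MP3 or Opus cannot be padded this way.

```python
def silence_pcm16(duration_ms: int, sample_rate: int = 24000) -> bytes:
    """Return mono PCM16 silence: zero-valued samples, 2 bytes per sample."""
    n_samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * n_samples


def join_with_pauses(sentence_audio: list, pause_ms: int = 200,
                     sample_rate: int = 24000) -> bytes:
    """Concatenate per-sentence PCM16 chunks with a fixed pause between them."""
    return silence_pcm16(pause_ms, sample_rate).join(sentence_audio)


# 200 ms at 24 kHz = 4800 samples = 9600 bytes of silence.
pause = silence_pcm16(200, 24000)
# Two stand-in "sentences" of 100 bytes each, joined with one pause.
utterance = join_with_pauses([b"\x01\x00" * 50, b"\x02\x00" * 50], pause_ms=200)
print(len(pause), len(utterance))
```

In a streaming pipeline the same silence bytes would be enqueued between sentence chunks rather than joined up front, so playback of the first sentence does not wait on the second.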
Multi-language mid-conversation -- User switches language mid-conversation. Most STT models do not auto-detect language switches within a stream. Deepgram multi-language mode helps but adds latency. Solution: detect language from STT output, switch STT model if confidence drops, restart TTS with matching language voice.
Simultaneous speaker overlap -- In multi-party voice scenarios, overlapping speakers degrade STT accuracy by 30-60%. Diarization helps identify who spoke but adds latency. For real-time multi-party: separate audio channels per participant when possible.
Non-obvious browser policies that silently break voice AI in web deployments:
| Restriction | Impact | Workaround |
|---|---|---|
| Autoplay policy (all modern browsers) | TTS audio will not play until user has interacted with the page (click/tap). Bot appears completely silent on page load with no error in console. | Require a "Start conversation" button before initiating voice. Call AudioContext.resume() only after a user gesture event. |
| iOS Safari audio routing | Audio output routes to earpiece instead of speaker by default when no <audio> element is used. User can barely hear the bot. | Use an <audio> element for playback instead of raw AudioContext. Set playsInline attribute. Test on physical iOS devices -- simulators do not reproduce this. |
| MediaRecorder codec variance | Chrome outputs Opus in WebM container. Safari outputs AAC in MP4 container. Sending raw browser audio to STT providers without format detection causes silent transcription failures. | Check MediaRecorder.isTypeSupported() at startup and negotiate format. Send PCM16 (universally supported) when cross-browser compatibility is required. |
| getUserMedia permission persistence | Browser may revoke microphone permission after tab backgrounding (especially mobile Safari). Voice session breaks silently when user switches apps and returns. | Monitor MediaStreamTrack.onended event. Re-request permission and reinitialize audio capture when the track ends unexpectedly. |
| Failure | Symptom | Fix |
|---------|---------|-----|
| Non-streaming pipeline | 3-5 second delay per turn. User perceives bot as broken. | Stream all three stages (STT, LLM, TTS). Start TTS on first LLM sentence, not full completion. |
| Ignoring interruptions | Bot finishes old response after user interrupts. Feels like talking to a recording. | Implement barge-in: VAD during playback -> stop TTS -> clear queue -> cancel LLM -> restart STT. |
| Single provider for everything | Suboptimal quality per stage. Single point of failure. | Mix providers: Deepgram STT + your LLM + ElevenLabs TTS. Abstract behind interfaces for swapping. |
| No graceful degradation | TTS provider outage = complete silence. STT outage = bot stops responding. | Fallback chain: ElevenLabs TTS -> Deepgram Aura TTS -> cached "I'm having trouble, please wait" audio. |
| Blocking the audio thread | Synchronous database call or API request in the audio processing loop. Audio stutters or drops. | All I/O in the audio path must be async. Use queues between audio processing and business logic. |
| No WebSocket reconnection | Provider WebSocket drops (common after 10-30 min). Bot goes silent permanently. | Implement reconnect with exponential backoff. Buffer audio during reconnection gap. Resume STT session. |
| Missing audio preprocessing | Background noise, low gain, or clipping degrades STT accuracy by 20-40%. | Apply noise gate, gain normalization, and high-pass filter (remove below 80Hz) before STT. |
| Hardcoded sample rates | Pipeline assumes 16kHz everywhere. OpenAI sends 24kHz. Audio plays at wrong speed (chipmunk or slow-motion effect). | Negotiate format per provider. Resample explicitly at each boundary. Log actual sample rates. |
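The fix for the non-streaming pipeline, starting TTS on the first LLM sentence, depends on detecting sentence boundaries in a token stream. The sketch below shows one naive way to do that. `sentences_from_tokens` is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is an assumed heuristic that will split early on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# Assumed heuristic: a sentence ends at . ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]+\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence  # hand this to TTS now, while the LLM keeps generating
    tail = buffer.strip()
    if tail:
        yield tail  # whatever remains when the stream ends


# Stand-in for a streaming LLM response.
tokens = ["Sure", ",", " I", " can", " help", ".", " What", " time",
          " works", "?", " Morning", " is", " open"]
sentences = list(sentences_from_tokens(tokens))
print(sentences)
```

Because the rule waits for whitespace after the punctuation, each sentence is released one token late. That keeps decimals like "3.5" intact and costs far less than waiting for the full completion.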
When voice quality degrades in production, diagnose systematically:
Before implementing, verify the approach is justified.
| Impulse | Check First | Likely Better |
|---------|-------------|---------------|
| "Build custom voice pipeline from scratch" | Do you need sub-500ms latency or custom audio processing? | Use Vapi for standard voice agents. Build custom only when hosted platforms hit a wall. |
| "Use OpenAI Realtime for everything" | Are you willing to pay ~$0.06/min and accept audio-only interface? | Deepgram STT + your LLM + ElevenLabs TTS gives more control and often lower cost at scale. |
| "Add voice to the existing chatbot" | Does the use case genuinely benefit from voice vs text? Voice adds complexity, cost, and latency. | Text chat with optional audio messages (async) is 10x simpler and often sufficient. |
| "Implement WebRTC from scratch" | Do you need NAT traversal, SRTP, DTLS, STUN/TURN? | Use LiveKit or Daily.co. Raw WebRTC is thousands of lines of edge-case handling. |
| "Use Whisper for real-time STT" | Whisper is batch-only -- waits for full utterance. Adds 1-3 seconds per turn. | Use Deepgram Nova-2 or Deepgram streaming for real-time. Whisper is fine for post-call transcription. |
| "Run TTS on GPU locally" | Do you have reliable GPU infrastructure and model expertise? Latency often worse than API. | Use hosted TTS APIs until you hit >50K minutes/month where self-hosted ROI makes sense. |
Stop and reassess if you encounter any of these.