skills/voice-agents/SKILL.md
Use when building voice AI agents, implementing speech-to-speech or pipeline (STT->LLM->TTS) architectures, optimizing voice latency, integrating voice activity detection, or designing turn-taking and barge-in handling. NEVER use for text-only chatbots, pre-recorded IVR menu trees, or music/audio processing pipelines.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

`npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit voice-agents`
| File | Purpose | When to Load |
|---|---|---|
| SKILL.md | Architecture decisions (S2S vs Pipeline), latency budget, voice edge cases, turn-taking principles | Always (auto-loaded) |
| conversation-design-patterns.md | Turn state machine, multi-turn context management, slot-filling, persona design, voice prompt engineering, tool calling in voice, recovery patterns | When designing conversation flows, implementing turn-taking, or building dialogue systems |
| testing-debugging-guide.md | Test pyramid for voice, conversation test scenarios, latency debugging procedure, STT/LLM quality debugging, evaluation rubrics, A/B testing | When building test suites, debugging quality issues, or establishing QA processes |
| platform-integration-guide.md | Platform selection (Vapi/Retell/LiveKit/Twilio), telephony integration, WebRTC patterns, CRM integration, knowledge base (RAG) for voice, cost estimation | When choosing a platform, integrating telephony, or connecting voice agents to external systems |
| Factor | Speech-to-Speech (S2S) | Pipeline (STT->LLM->TTS) |
|---|---|---|
| Latency | Lowest (~300-500ms) | Higher (~600-1200ms) |
| Emotional nuance | Preserved end-to-end | Lost between components |
| Debuggability | Opaque (audio in/out) | Each step inspectable |
| Cost | Higher per minute | Lower, pay-per-component |
| Controllability | Limited (model decides) | Full (inject, filter, route) |
| Best for | Conversational assistants, therapy bots, empathic agents | Customer service, compliance-heavy, tool-calling flows |
| Primary option | OpenAI Realtime API, Gemini Live | Deepgram STT + GPT-4o + ElevenLabs TTS |
Decision rule: Default to Pipeline unless emotional fidelity is the primary product differentiator. S2S sacrifices control for naturalness.
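The decision rule can be written down as a small function. This is only a sketch of one reading of the rule: the `Requirements` class, its field names, and the choice to let regulated or tool-calling needs override emotional fidelity are illustrative assumptions, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical product requirements; field names are illustrative."""
    emotional_fidelity_is_differentiator: bool
    regulated: bool = False
    needs_tool_calling: bool = False


def choose_architecture(req: Requirements) -> str:
    """Return "pipeline" or "s2s" following the decision rule above."""
    # Assumption: control requirements win over naturalness, because
    # S2S gives up the ability to inspect, filter, or route.
    if req.regulated or req.needs_tool_calling:
        return "pipeline"
    # S2S only when emotional fidelity is the primary differentiator.
    if req.emotional_fidelity_is_differentiator:
        return "s2s"
    # Default to Pipeline in every other case.
    return "pipeline"
```

For example, `choose_architecture(Requirements(True, regulated=True))` returns `"pipeline"`, while `choose_architecture(Requirements(True))` returns `"s2s"`.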
Sub-800ms end-to-end is the threshold for natural conversation. Budget each component.
| Component | Typical Range (ms) | Optimization Target | Primary Lever |
|---|---|---|---|
| Network round-trip (client -> server) | 20-80 | <50ms | Deploy regionally, WebSocket |
| Voice Activity Detection (VAD) | 0-300 | <100ms | Semantic VAD, not silence-only |
| STT (streaming) | 100-400 | <200ms | Streaming transcription, no wait-for-silence |
| LLM first token | 150-600 | <300ms | Streaming, smaller models for fast tasks |
| TTS first audio chunk | 80-300 | <150ms | Streaming TTS, pre-buffer |
| Audio playback start | 20-50 | <30ms | Pre-buffer 50ms before streaming |
| Total target | <800ms | <600ms ideal | Parallelize STT->LLM handoff |
Critical path: STT completion -> LLM first token -> TTS first chunk. Parallelize everything outside this path.
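As a worked example of budgeting, the sketch below sums the table's figures and flags components that exceed their targets. The numbers are copied from the latency table; the component keys, function names, and the sample measurements are hypothetical. It assumes the stages run strictly in series, which is a worst case: overlapping stages, as the critical-path note recommends, lowers the real total.

```python
# (component, typical_min_ms, typical_max_ms, target_ms) from the latency table.
BUDGET = [
    ("network_round_trip", 20, 80, 50),
    ("vad", 0, 300, 100),
    ("stt_streaming", 100, 400, 200),
    ("llm_first_token", 150, 600, 300),
    ("tts_first_chunk", 80, 300, 150),
    ("playback_start", 20, 50, 30),
]
TOTAL_TARGET_MS = 800


def totals(budget=BUDGET):
    """Serial sums of the best case, worst case, and per-component targets."""
    best = sum(row[1] for row in budget)
    worst = sum(row[2] for row in budget)
    target = sum(row[3] for row in budget)
    return best, worst, target


def over_budget(measured_ms, budget=BUDGET):
    """Return (component, overshoot_ms) pairs, largest overshoot first."""
    targets = {name: target for name, _, _, target in budget}
    overs = [
        (name, ms - targets[name])
        for name, ms in measured_ms.items()
        if ms > targets[name]
    ]
    return sorted(overs, key=lambda item: item[1], reverse=True)


# Hypothetical measurements from one call, in milliseconds.
measured = {
    "network_round_trip": 40,
    "vad": 250,
    "stt_streaming": 180,
    "llm_first_token": 500,
    "tts_first_chunk": 120,
    "playback_start": 25,
}
print(totals())               # (370, 1730, 830)
print(over_budget(measured))  # [('llm_first_token', 200), ('vad', 150)]
```

Note that the per-component targets sum to 830 ms when run serially, which is already above the 800 ms total. Hitting the total therefore depends on beating some targets or overlapping stages, not on meeting each ceiling exactly.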
| Scenario | Naive Handling | Correct Handling |
|---|---|---|
| Background noise triggers VAD | False turn start, garbled STT | Use semantic VAD (checks for speech content, not just audio energy) |
| User barges in mid-response | Agent keeps talking, frustrating user | Monitor input channel during output; interrupt and yield on VAD trigger |
| Silence mid-sentence (user thinking) | Premature end-of-turn detection | Set silence threshold to 1.2-2s, not 0.5s; use semantic completeness check |
| Heavy accent or non-native speaker | High STT word error rate | Add STT confidence scoring; fall back to "I didn't catch that" if WER risk is high |
| Phone/PSTN audio quality (8kHz) | STT trained on 16kHz degrades | Use STT model with phone-band support (Deepgram Nova-2 Phone model) |
| User speaks before TTS finishes | Audio collision, echo feedback | AEC (Acoustic Echo Cancellation) at client layer; server-side input muting during playback |
| Short backchannels ("uh-huh", "yeah") | Full LLM round-trip for 200ms filler | Classify short utterances pre-LLM; emit canned acknowledgment tokens without full inference |
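Two of the rows above, the silence threshold and the backchannel shortcut, can be sketched as a pre-LLM router. Everything here is illustrative: the word lists are a crude stand-in for a real semantic completeness model, the function names are hypothetical, and only the 1.2 s and 2 s thresholds come from the table.

```python
# Assumption: a small fixed vocabulary is enough to spot backchannels.
BACKCHANNELS = {"uh-huh", "uh huh", "yeah", "yep", "mm-hmm", "mhm", "ok", "okay", "right"}
# Assumption: a trailing word from this set suggests the user is mid-thought.
DANGLING_WORDS = {"and", "but", "or", "so", "because", "to", "the", "a", "um", "uh"}

SILENCE_COMPLETE_S = 1.2  # lower bound of the table's range
SILENCE_MAX_S = 2.0       # upper bound: end the turn regardless


def normalize(text: str) -> str:
    """Lowercase and trim surrounding spaces and punctuation."""
    return text.lower().strip(" .,!?")


def is_backchannel(transcript: str) -> bool:
    return normalize(transcript) in BACKCHANNELS


def looks_complete(transcript: str) -> bool:
    """Crude completeness check: the last word is not a dangling word."""
    words = normalize(transcript).split()
    return bool(words) and words[-1] not in DANGLING_WORDS


def end_of_turn(transcript: str, silence_s: float) -> bool:
    if silence_s >= SILENCE_MAX_S:
        return True
    return silence_s >= SILENCE_COMPLETE_S and looks_complete(transcript)


def route(transcript: str, silence_s: float) -> str:
    """Return "ack" (canned reply), "llm" (full inference), or "wait"."""
    if is_backchannel(transcript):
        return "ack"
    if end_of_turn(transcript, silence_s):
        return "llm"
    return "wait"
```

With these rules, `route("I want to book a flight to", 1.5)` returns `"wait"` because the utterance ends on a dangling word, while the same text after 2.1 s of silence returns `"llm"`. A production system would likely also condition on dialogue state, since "yeah" is a real answer when the agent has just asked a yes/no question.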
| Excuse | Why It's Wrong |
|---|---|
| "Users will wait a bit longer if the answer is good" | No they won't. >800ms feels like lag. >1200ms feels broken. Latency is UX, not just performance. |
| "We'll add barge-in later once the core works" | Barge-in is architecture, not a feature. Retrofitting it requires redesigning the audio pipeline. Build it day one. |
| "Speech-to-speech is always better because it's more natural" | S2S removes your ability to inspect, filter, or route mid-conversation. For any regulated or tool-calling use case, pipeline is the right choice. |
| "We can use silence detection for turn-taking" | Silence-only VAD fires on pauses, filler words, and thinking time. Semantic VAD (checking for sentence completeness) cuts false positives by 60-80%. |
| "TTS voice quality doesn't matter that much" | Voice is the entire interface. Low-quality TTS destroys trust faster than a visual bug. Use streaming TTS with a tested voice from day one. |
| "We'll handle noise in post-processing" | Noise must be handled before STT, not after. AEC and noise suppression are client-layer concerns. Post-processing is too late. |
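To illustrate why barge-in is an architectural concern, here is a minimal sketch of a turn controller in which the input channel can interrupt output at any time. The class, its method names, and the `stop_playback` callback are hypothetical; a real system would wire these to its VAD events and TTS player.

```python
from enum import Enum, auto


class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInController:
    """Tracks whose turn it is and yields to the user on interruption."""

    def __init__(self, stop_playback):
        self.state = AgentState.LISTENING
        # Hypothetical callback that halts TTS output immediately.
        self._stop_playback = stop_playback
        self.interruptions = 0

    def on_response_start(self):
        """Call when the first TTS audio chunk starts playing."""
        self.state = AgentState.SPEAKING

    def on_response_end(self):
        """Call when TTS playback finishes without interruption."""
        if self.state is AgentState.SPEAKING:
            self.state = AgentState.LISTENING

    def on_user_speech(self):
        """Call when the VAD detects user speech on the input channel."""
        if self.state is AgentState.SPEAKING:
            # Barge-in: stop talking and hand the turn back to the user.
            self._stop_playback()
            self.interruptions += 1
        self.state = AgentState.LISTENING


# Usage: record stop requests in a list instead of driving a real player.
stopped = []
controller = BargeInController(lambda: stopped.append(True))
controller.on_response_start()
controller.on_user_speech()
print(controller.state, len(stopped))  # AgentState.LISTENING 1
```

The point of the sketch is that the input handler must be live while output is playing. If the audio pipeline was built as a strict listen-then-speak loop, there is nowhere to attach `on_user_speech` during playback without restructuring it.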
Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend