sharkitect-solutions/voice-agents

Name: voice-agents
Author: sharkitect-solutions

skills/voice-agents/SKILL.md

npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit voice-agents

Voice Agents

File Index

| File | Purpose | When to Load |
|---|---|---|
| SKILL.md | Architecture decisions (S2S vs Pipeline), latency budget, voice edge cases, turn-taking principles | Always (auto-loaded) |
| conversation-design-patterns.md | Turn state machine, multi-turn context management, slot-filling, persona design, voice prompt engineering, tool calling in voice, recovery patterns | When designing conversation flows, implementing turn-taking, or building dialogue systems |
| testing-debugging-guide.md | Test pyramid for voice, conversation test scenarios, latency debugging procedure, STT/LLM quality debugging, evaluation rubrics, A/B testing | When building test suites, debugging quality issues, or establishing QA processes |
| platform-integration-guide.md | Platform selection (Vapi/Retell/LiveKit/Twilio), telephony integration, WebRTC patterns, CRM integration, knowledge base (RAG) for voice, cost estimation | When choosing a platform, integrating telephony, or connecting voice agents to external systems |

Architecture Decision Table

| Factor | Speech-to-Speech (S2S) | Pipeline (STT->LLM->TTS) |
|---|---|---|
| Latency | Lowest (~300-500ms) | Higher (~600-1200ms) |
| Emotional nuance | Preserved end-to-end | Lost between components |
| Debuggability | Opaque (audio in/out) | Each step inspectable |
| Cost | Higher per minute | Lower, pay-per-component |
| Controllability | Limited (model decides) | Full (inject, filter, route) |
| Best for | Conversational assistants, therapy bots, empathic agents | Customer service, compliance-heavy, tool-calling flows |
| Primary option | OpenAI Realtime API, Gemini Live | Deepgram STT + GPT-4o + ElevenLabs TTS |

Decision rule: Default to Pipeline unless emotional fidelity is the primary product differentiator. S2S sacrifices control for naturalness.
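The decision rule can be written down as a small function. This is only a sketch: `Requirements`, its field names, and `choose_architecture` are hypothetical and not part of the skill. It treats compliance and tool-calling needs as overriding emotional fidelity, which follows the "Best for" row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Hypothetical summary of product requirements (illustrative names)."""
    emotional_fidelity_is_differentiator: bool
    needs_compliance_controls: bool = False
    needs_tool_calling: bool = False


def choose_architecture(req: Requirements) -> str:
    """Default to Pipeline; pick S2S only when emotional fidelity is the
    primary differentiator and nothing requires inspecting or routing the
    conversation mid-stream."""
    if req.needs_compliance_controls or req.needs_tool_calling:
        return "pipeline"
    if req.emotional_fidelity_is_differentiator:
        return "s2s"
    return "pipeline"
```

For example, `choose_architecture(Requirements(True, needs_tool_calling=True))` returns `"pipeline"`, because tool calling needs the control that S2S gives up.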

Latency Budget

Sub-800ms end-to-end is the threshold for natural conversation. Budget each component.

| Component | Typical Range (ms) | Optimization Target | Primary Lever |
|---|---|---|---|
| Network round-trip (client -> server) | 20-80 | <50ms | Deploy regionally, WebSocket |
| Voice Activity Detection (VAD) | 0-300 | <100ms | Semantic VAD, not silence-only |
| STT (streaming) | 100-400 | <200ms | Streaming transcription, no wait-for-silence |
| LLM first token | 150-600 | <300ms | Streaming, smaller models for fast tasks |
| TTS first audio chunk | 80-300 | <150ms | Streaming TTS, pre-buffer |
| Audio playback start | 20-50 | <30ms | Pre-buffer 50ms before streaming |
| Total target | <800ms | <600ms ideal | Parallelize STT->LLM handoff |

Critical path: STT completion -> LLM first token -> TTS first chunk. Parallelize everything outside this path.
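A budget check along these lines can flag which components overshoot. The targets are copied from the table; the function names and the sample measurements are assumptions for illustration. Note that the per-component targets sum to 830 ms if the stages run strictly one after another, so meeting the 800 ms total presumably depends on overlapping stages through streaming.

```python
# Optimization targets (ms), copied from the latency budget table.
TARGETS_MS = {
    "network": 50,
    "vad": 100,
    "stt": 200,
    "llm_first_token": 300,
    "tts_first_chunk": 150,
    "playback_start": 30,
}
TOTAL_TARGET_MS = 800


def over_budget(measured_ms: dict) -> dict:
    """Return {component: overshoot_ms} for components above their target."""
    return {
        name: measured_ms[name] - target
        for name, target in TARGETS_MS.items()
        if measured_ms.get(name, 0) > target
    }


def serial_total(measured_ms: dict) -> float:
    """Worst-case total, assuming no stage overlaps with another."""
    return sum(measured_ms.values())


# Hypothetical measurements for one turn.
measured = {
    "network": 40, "vad": 250, "stt": 180,
    "llm_first_token": 420, "tts_first_chunk": 120, "playback_start": 25,
}
print(over_budget(measured))                      # components to optimize first
print(serial_total(measured) > TOTAL_TARGET_MS)   # True means the turn is over budget
```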

Voice-Specific Edge Cases

| Scenario | Naive Handling | Correct Handling |
|---|---|---|
| Background noise triggers VAD | False turn start, garbled STT | Use semantic VAD (checks for speech content, not just audio energy) |
| User barges in mid-response | Agent keeps talking, frustrating user | Monitor input channel during output; interrupt and yield on VAD trigger |
| Silence mid-sentence (user thinking) | Premature end-of-turn detection | Set silence threshold to 1.2-2s, not 0.5s; use semantic completeness check |
| Heavy accent or non-native speaker | High STT word error rate | Add STT confidence scoring; fall back to "I didn't catch that" if WER risk is high |
| Phone/PSTN audio quality (8kHz) | STT trained on 16kHz degrades | Use STT model with phone-band support (Deepgram Nova-2 Phone model) |
| User speaks before TTS finishes | Audio collision, echo feedback | AEC (Acoustic Echo Cancellation) at client layer; server-side input muting during playback |
| Short backchannels ("uh-huh", "yeah") | Full LLM round-trip for 200ms filler | Classify short utterances pre-LLM; emit canned acknowledgment tokens without full inference |
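Two of the rows above (low STT confidence and short backchannels) amount to a routing step between STT and the LLM. A minimal sketch, assuming the STT engine returns a transcript plus a confidence in the range 0-1; the word list, the `0.6` cutoff, and all names here are hypothetical and need tuning on real traffic.

```python
import re

# Hypothetical backchannel vocabulary; extend per language and domain.
BACKCHANNELS = {"uh-huh", "uh huh", "yeah", "yep", "ok", "okay", "mm-hmm", "mhm", "right"}
MIN_CONFIDENCE = 0.6  # Assumed cutoff, not a value from the skill.


def normalize(text: str) -> str:
    """Lowercase and drop punctuation other than hyphens and apostrophes."""
    return re.sub(r"[^a-z0-9\-' ]", "", text.lower()).strip()


def route_utterance(transcript: str, confidence: float) -> str:
    """Decide what to do with a finalized STT result before calling the LLM.

    Returns "reprompt", "backchannel", or "llm".
    """
    if confidence < MIN_CONFIDENCE:
        return "reprompt"       # e.g. say "I didn't catch that"
    if normalize(transcript) in BACKCHANNELS:
        return "backchannel"    # canned acknowledgment, no LLM round-trip
    return "llm"
```

Checking confidence first is a design choice: a low-confidence "yeah" may be a misrecognized request, so it is safer to reprompt than to acknowledge.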

Rationalization Table

| Excuse | Why It's Wrong |
|---|---|
| "Users will wait a bit longer if the answer is good" | No they won't. >800ms feels like lag. >1200ms feels broken. Latency is UX, not just performance. |
| "We'll add barge-in later once the core works" | Barge-in is architecture, not a feature. Retrofitting it requires redesigning the audio pipeline. Build it day one. |
| "Speech-to-speech is always better because it's more natural" | S2S removes your ability to inspect, filter, or route mid-conversation. For any regulated or tool-calling use case, pipeline is the right choice. |
| "We can use silence detection for turn-taking" | Silence-only VAD fires on pauses, filler words, and thinking time. Semantic VAD (checking for sentence completeness) cuts false positives by 60-80%. |
| "TTS voice quality doesn't matter that much" | Voice is the entire interface. Low-quality TTS destroys trust faster than a visual bug. Use streaming TTS with a tested voice from day one. |
| "We'll handle noise in post-processing" | Noise must be handled before STT, not after. AEC and noise suppression are client-layer concerns. Post-processing is too late. |
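One possible way to combine a silence threshold with a completeness check is sketched below. The trailing-word heuristic is a crude stand-in for a real semantic VAD model, and the thresholds, word list, and function names are assumptions for illustration only.

```python
SHORT_PAUSE_S = 0.5  # where silence-only detection typically fires
LONG_PAUSE_S = 1.5   # inside the 1.2-2s range from the edge-case table

# Words that usually signal an unfinished thought. Illustrative only;
# a production system would use a trained completeness model.
TRAILING_INCOMPLETE = {"and", "but", "or", "so", "because", "um", "uh", "the", "a", "to", "my"}


def looks_complete(partial_transcript: str) -> bool:
    """Heuristic: the utterance is complete if its last word is not a
    connector, filler, or determiner."""
    words = partial_transcript.lower().rstrip(".,?! ").split()
    return bool(words) and words[-1] not in TRAILING_INCOMPLETE


def end_of_turn(partial_transcript: str, silence_s: float) -> bool:
    """End the turn after a short pause only when the utterance looks
    complete; otherwise wait for the longer threshold."""
    if silence_s >= LONG_PAUSE_S:
        return True
    return silence_s >= SHORT_PAUSE_S and looks_complete(partial_transcript)
```

With this sketch, "I want to change my" followed by 0.7 s of silence keeps the turn open, while "What is my balance?" followed by 0.6 s ends it.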

Red Flags Checklist

[ ] Latency is not being measured end-to-end per component - you cannot optimize what you do not measure
[ ] Turn detection is silence-only (no semantic VAD) - will misfire on thinking pauses and filler words
[ ] No barge-in detection - users cannot interrupt the agent, which feels unnatural and frustrating
[ ] LLM responses are not length-constrained for voice - 3-sentence spoken answers feel like essays
[ ] TTS is not streaming - entire response must generate before playback begins, adding 200-800ms
[ ] No AEC or noise suppression at the client layer - echo and background noise corrupt STT
[ ] S2S chosen for a compliance or tool-calling use case - you cannot inspect or intercept audio tokens mid-stream
[ ] STT confidence scores are not monitored - silent degradation on accents or poor audio quality goes undetected
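The first item on the checklist, per-component measurement, can start as a timer wrapped around each stage. A minimal sketch using only the standard library; `TurnTimer` and the stage names are hypothetical. It records wall-clock duration per stage, so overlapping (streamed) stages need their own first-output timestamps rather than this simple wrapper.

```python
import time
from contextlib import contextmanager


class TurnTimer:
    """Collects per-component wall-clock durations for one conversational turn."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so tests can use a fake clock
        self.durations_ms = {}

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and store the result under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            self.durations_ms[name] = (self._clock() - start) * 1000.0

    def total_ms(self) -> float:
        return sum(self.durations_ms.values())


timer = TurnTimer()
with timer.stage("stt"):
    time.sleep(0.01)  # placeholder for the real STT call
print(timer.durations_ms)
```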

NEVER List

NEVER use silence-only turn detection in production. Semantic VAD is required. Silence thresholds alone fire on every thinking pause and "um".
NEVER wait for full LLM response before starting TTS. Stream LLM tokens to TTS in real time. First-chunk latency is what users perceive.
NEVER allow unconstrained LLM response length in voice contexts. Prompt explicitly: "Respond in 1-2 sentences. Be concise. This will be spoken aloud."
NEVER skip AEC (Acoustic Echo Cancellation). Without it, the agent's own TTS audio feeds back into the microphone and corrupts subsequent STT turns.
NEVER treat voice quality as a polish item. Voice is the entire interface. Placeholder TTS voices in demos set wrong expectations and erode stakeholder trust.
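The streaming and length rules above can be sketched as a generator that hands sentence-sized chunks to TTS as tokens arrive. This assumes the LLM client exposes an iterable of text tokens; `sentence_chunks` is a hypothetical helper, and the prompt string is the one quoted in the list.

```python
from typing import Iterable, Iterator

# Prompt text quoted from the NEVER list above.
VOICE_SYSTEM_PROMPT = "Respond in 1-2 sentences. Be concise. This will be spoken aloud."

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks so TTS can start
    on the first sentence while later ones are still being generated."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing partial sentence


# Hypothetical token stream standing in for a streaming LLM response.
stream = ["Your", " order", " has", " shipped", ".", " Anything", " else", "?"]
for chunk in sentence_chunks(stream):
    print(chunk)  # each chunk would be sent to the TTS engine here
```

Splitting on punctuation is deliberately naive: it will break on decimals and abbreviations such as "3.5" or "Dr.", so a production splitter needs more care.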

Related Skills

Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend

Use when building voice AI agents, implementing speech-to-speech or pipeline (STT->LLM->TTS) architectures, optimizing voice latency, integrating voice activity detection, or designing turn-taking and barge-in handling. NEVER use for text-only chatbots, pre-recorded IVR menu trees, or music/audio processing pipelines.

Category: development
Updated: Apr 29, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit voice-agents

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Reported % |
|---|---|---|---|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 29, 2026, 8:54 AM (48.4s, 4 files scanned)

SKILL.md

name: voice-agents
description: Use when building voice AI agents, implementing speech-to-speech or pipeline (STT->LLM->TTS) architectures, optimizing voice latency, integrating voice activity detection, or designing turn-taking and barge-in handling. NEVER use for text-only chatbots, pre-recorded IVR menu trees, or music/audio processing pipelines.
version: 2.0
optimized: true
optimized_date: 2026-03-11

Install manually

# Clone the repo
git clone https://github.com/sharkitect-solutions/sharkitect-claude-toolkit.git

# Make sure the global Claude Code skills folder exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r sharkitect-claude-toolkit/skills/voice-agents ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository: sharkitect-solutions/sharkitect-claude-toolkit

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT