Voice AI Agent Engineering — Complete Design, Build & Deploy System

Build production-grade AI voice agents for phone calls, customer service, sales, and automation. Platform-agnostic methodology covering conversation design, voice UX, telephony integration, and scaling.

Phase 1: Voice Agent Strategy & Use Case Selection

Voice Agent Brief

voice_agent_brief:
  project_name: ""
  business_objective: ""  # What outcome does this agent drive?
  use_case_type: ""       # inbound_support | outbound_sales | appointment_booking | notification | survey | ivr_replacement | concierge | internal_ops
  target_audience: ""     # Who will talk to this agent?
  call_volume_estimate: "" # calls/day expected
  avg_call_duration: ""   # target minutes
  languages: []           # primary + secondary
  success_metrics: []     # CSAT, resolution rate, booking rate, etc.
  human_fallback: ""      # when and how to escalate
  compliance_requirements: [] # TCPA, GDPR, PCI, HIPAA, state laws
  go_live_date: ""

Use Case Fit Scoring (rate 1-5)

| Factor | Score | Weight | |--------|-------|--------| | Conversation predictability | _ | 25% | | Volume justification (>50 calls/day) | _ | 20% | | Cost savings vs human | _ | 20% | | Customer acceptance likelihood | _ | 15% | | Data availability for training | _ | 10% | | Regulatory risk (inverse — lower = better) | _ | 10% | | Weighted Total | /5.0 | |

Go threshold: ≥3.5 = strong fit. 2.5-3.4 = pilot first. <2.5 = don't build, use humans.

Best Use Cases (start here)

Appointment booking/confirmation — structured, high volume, clear success metric
Order status inquiries — data lookup, short calls, high automation potential
Payment reminders — outbound, scripted, compliance-manageable
FAQ/tier-1 support — deflect 60-80% of calls from humans
Lead qualification — inbound, structured questions, CRM integration

Avoid (not ready yet)

Complex complaint resolution requiring empathy judgment
Legal/medical advice calls
Calls where caller is emotionally distressed
B2B enterprise sales (relationship-dependent)
Anything requiring visual context sharing

Phase 2: Platform Selection & Architecture

Platform Comparison Matrix

| Platform | Best For | Pricing Model | Latency | Customization | Self-Host | |----------|----------|---------------|---------|---------------|-----------| | Vapi | Rapid prototyping, SMB | Per-minute | ~800ms | Medium | No | | Retell AI | Customer support | Per-minute | ~600ms | Medium | No | | Bland AI | Outbound at scale | Per-minute | ~700ms | High | No | | Vocode | Custom/self-hosted | Open source | Variable | Very High | Yes | | LiveKit | Real-time, custom UX | Usage-based | ~300ms | Very High | Yes | | Twilio + Custom | Full control | Per-minute + compute | Variable | Maximum | Partial | | Daily + OpenAI RT | Cutting edge | Per-minute + tokens | ~500ms | High | No |

Architecture Decision Tree

Need production in <2 weeks?
├── Yes → Managed platform (Vapi/Retell/Bland)
│   ├── Inbound support? → Retell AI
│   ├── Outbound sales? → Bland AI
│   └── General/mixed? → Vapi
└── No → How much control needed?
    ├── Maximum → Twilio + custom STT/LLM/TTS pipeline
    ├── High → LiveKit or Vocode (self-hosted)
    └── Medium → Daily + OpenAI Realtime API

Voice AI Pipeline Architecture

[Caller] → [Telephony Layer] → [STT Engine] → [LLM Brain] → [TTS Engine] → [Audio Out]
                ↕                                    ↕
         [Call Control]                      [Tool/API Calls]
                ↕                                    ↕
         [Recording/Analytics]              [CRM/Calendar/DB]

Component Selection:

| Component | Options | Recommendation | |-----------|---------|----------------| | STT | Deepgram, AssemblyAI, Whisper, Google STT | Deepgram (fastest, streaming) | | LLM | GPT-4o, Claude, Gemini, Llama | GPT-4o-mini for speed, Claude for nuance | | TTS | ElevenLabs, PlayHT, Cartesia, OpenAI TTS | ElevenLabs (quality), Cartesia (speed) | | Telephony | Twilio, Vonage, Telnyx, SignalWire | Twilio (reliability), Telnyx (cost) |

Latency Budget (target: <1.5s total)

| Stage | Target | Max | |-------|--------|-----| | STT (voice → text) | 200ms | 400ms | | LLM (think + generate) | 500ms | 800ms | | TTS (text → speech) | 200ms | 400ms | | Network overhead | 100ms | 200ms | | Total response time | 1.0s | 1.8s |

Rules:

Stream everything — don't wait for full STT before starting LLM
Use LLM streaming + TTS streaming for word-level pipelining
Pre-generate common responses (greetings, holds, confirmations)
Use filler phrases ("Let me check that for you...") during tool calls

Phase 3: Conversation Design

Conversation Flow Architecture

conversation_flow:
  opening:
    greeting: "Hi, this is [Agent Name] from [Company]. How can I help you today?"
    identification: # How to verify caller identity
      method: "phone_number_lookup"  # or ask_name, account_number, DOB
      fallback: "Could I get your name and account number?"
    
  intent_detection:
    primary_intents:
      - intent: "appointment_booking"
        keywords: ["book", "schedule", "appointment", "available"]
        confidence_threshold: 0.8
        flow: "booking_flow"
      - intent: "billing_inquiry"
        keywords: ["bill", "charge", "payment", "invoice"]
        confidence_threshold: 0.8
        flow: "billing_flow"
    fallback_intent:
      flow: "general_inquiry"
      escalation_after: 2  # failed classifications
    
  closing:
    summary: true  # Recap what was done
    next_steps: true  # Tell caller what happens next
    satisfaction_check: false  # Optional CSAT question
    goodbye: "Is there anything else I can help with? ... Great, have a wonderful day!"

Conversation Design Principles

Front-load identity — Know who's calling before diving in
Confirm don't assume — "Just to confirm, you'd like to reschedule your Thursday appointment?"
One question at a time — Never stack 2+ questions in one turn
Progressive disclosure — Start simple, add complexity only when needed
Explicit state transitions — "Let me look that up for you" before going silent
Recovery > perfection — Design for misunderstanding, not just understanding
Silence is scary — Never leave >3 seconds without audio feedback

Turn Design Template

turn:
  name: "collect_date_preference"
  agent_says: "What date works best for you?"
  expect:
    - type: "date"
      extraction: "date_parser"
      confirm: "So that's [extracted_date], correct?"
    - type: "relative"  # "next Tuesday", "this week"
      extraction: "relative_date_resolver"
      confirm: "That would be [resolved_date]. Does that work?"
    - type: "unclear"
      recovery: "I didn't quite catch that. Could you give me a specific date, like March 15th?"
      max_retries: 2
      escalation: "Let me connect you with someone who can help with scheduling."
  timeout_seconds: 8
  timeout_response: "Are you still there? I was asking what date works for you."

Voice UX Rules

| Rule | Why | |------|-----| | Keep responses under 30 words | Phone ≠ chat — people can't re-read | | Use numbers, not lists | "You have 3 options" > listing all 7 | | Spell out confirmation | "That's A as in Alpha, B as in Bravo" | | Avoid homophone confusion | "15" and "50" sound alike — say "one-five" or "five-zero" | | Use prosody cues | Pause before important info, speed up on filler | | Match caller energy | Fast caller = faster pace. Slow = slower. | | Never say "I'm an AI" unprompted | Disclose only if asked directly (unless required by law) |

Interruption Handling

interruption_strategy:
  mode: "cooperative"  # cooperative | strict | hybrid
  
  cooperative:  # Recommended for support
    - on_interrupt: "stop_speaking"
    - acknowledge: true  # "Go ahead"
    - resume_context: true  # Remember where you were
    
  strict:  # For compliance-required scripts
    - on_interrupt: "finish_sentence"
    - then: "pause_for_input"
    - note: "Used when legal disclaimers must be fully delivered"
    
  barge_in_detection:
    min_speech_ms: 300  # Ignore very short sounds (coughs, hmms)
    confidence_threshold: 0.6

Phase 4: System Prompt Engineering for Voice

Voice Agent System Prompt Template

You are [AGENT_NAME], a voice AI assistant for [COMPANY].

ROLE: [specific role — e.g., "appointment scheduler for Dr. Smith's dental practice"]

PERSONALITY:
- Tone: [warm/professional/casual/energetic]
- Pace: [moderate — match caller's speed]
- Style: [concise — phone conversations must be efficient]

CONVERSATION RULES:
1. Keep ALL responses under 2 sentences (30 words max)
2. Ask ONE question at a time — never stack questions
3. Always confirm critical data: names, dates, numbers, emails
4. Use filler phrases during lookups: "Let me check that for you..."
5. If you don't understand after 2 attempts, offer human transfer
6. Never make up information — if unsure, say "I'll need to check on that"
7. Match the caller's language (if they speak Spanish, switch to Spanish)

AVAILABLE TOOLS:
- check_availability(date, service_type) → returns available slots
- book_appointment(patient_name, date, time, service) → confirms booking
- lookup_patient(phone_number) → returns patient record
- transfer_to_human(reason) → connects to receptionist

ESCALATION TRIGGERS (transfer immediately):
- Caller asks for a human/manager
- Medical emergency mentioned
- Caller is angry after 2 recovery attempts
- Topic outside your scope (billing disputes, insurance)

CALL FLOW:
1. Greet → identify caller
2. Understand need
3. Fulfill or escalate
4. Confirm + close

NEVER:
- Provide medical/legal/financial advice
- Share other patients' information
- Make promises about pricing without checking
- Continue if caller says "stop" or "goodbye"

Prompt Optimization for Latency

| Technique | Impact | |-----------|--------| | Shorter system prompts | 50-100ms faster first token | | Few-shot examples in prompt | Better accuracy, +20ms | | Tool descriptions concise | Faster tool selection | | Output format instructions | Fewer wasted tokens | | Temperature 0.3-0.5 | More predictable, slightly faster |

Phase 5: Voice Selection & Tuning

Voice Selection Criteria

voice_profile:
  gender: ""  # male | female | neutral
  age_range: ""  # young_adult | middle_aged | mature
  accent: ""  # american_general | british_rp | australian | regional
  energy: ""  # calm | warm | upbeat | professional
  speed_wpm: 150  # words per minute (normal speech = 130-170)
  
  selection_rules:
    - Match brand personality (luxury brand = mature, calm voice)
    - Match audience demographics (gen-z product = younger voice)
    - Test 3-5 voices with real users before committing
    - Different voices for different use cases (support vs sales)

TTS Tuning Checklist

[ ] Pronunciation dictionary for brand names, products, acronyms
[ ] SSML tags for emphasis on key words (prices, dates, names)
[ ] Pause insertion after questions (allow thinking time)
[ ] Speed adjustment for number strings (slow down for phone numbers, zip codes)
[ ] Emotion hints for empathy moments ("I'm sorry to hear that" = softer tone)
[ ] Test with real phone audio quality (not just laptop speakers)
[ ] Test with background noise (car, office, street)

Voice Quality Testing Protocol

Naturalness test: Play 10 responses to 5 people — "human or AI?" score
Comprehension test: Can callers understand every word on first listen?
Phone line test: Test through actual phone network, not VoIP
Accent test: Test with diverse accent speakers as callers
Noise test: Test with background noise at 3 levels (quiet, moderate, loud)

Phase 6: Tool Integration & Action Execution

Tool Design for Voice Agents

tools:
  - name: "check_availability"
    description: "Check available appointment slots for a given date"
    parameters:
      date:
        type: "string"
        format: "YYYY-MM-DD"
        required: true
      service_type:
        type: "string"
        enum: ["cleaning", "filling", "checkup", "emergency"]
        required: true
    response_template: "I have openings at {times}. Which works best?"
    timeout_ms: 3000
    filler_phrase: "Let me check the schedule..."
    error_response: "I'm having trouble checking availability right now. Can I have someone call you back?"

Tool Call UX Pattern

1. Caller asks something requiring a tool call
2. Agent: [filler phrase] — "Let me look that up for you..."
3. [Tool executes — target <2s]
4. Agent: [result phrased naturally]
5. If tool fails: [graceful fallback — offer callback or transfer]

Critical Integration Points

| Integration | Purpose | Latency Target | |-------------|---------|----------------| | CRM (Salesforce, HubSpot) | Caller context, log calls | <1s read, async write | | Calendar (Google, Calendly) | Booking, availability | <1s | | Payment (Stripe) | Take payments by phone | <2s (PCI compliance!) | | Knowledge base | FAQ lookups | <500ms | | Human handoff | Transfer to agent | <3s warm transfer |

PCI Compliance for Phone Payments

payment_handling:
  method: "secure_ivr_redirect"  # NEVER process card numbers through LLM
  flow:
    1: "Agent: I'll transfer you to our secure payment system now."
    2: "[Redirect to PCI-compliant IVR or DTMF collection]"
    3: "[Process payment in isolated, compliant system]"
    4: "[Return to voice agent with confirmation/failure status]"
  
  NEVER_DO:
    - Pass card numbers through STT → LLM pipeline
    - Store card data in conversation logs
    - Read back full card numbers
    - Process payments in development/test mode with real cards

Phase 7: Testing & Quality Assurance

Test Pyramid for Voice Agents

        /  Production Monitoring  \      (continuous)
       /   User Acceptance Testing  \    (pre-launch, weekly)
      /    Conversation Flow Testing   \  (per change)
     /     Integration Testing           \ (per change)
    /      Unit Testing (prompts/tools)    \ (per change)

Conversation Test Scenarios (minimum set)

test_suite:
  happy_paths:
    - "Book appointment for tomorrow at 2pm"
    - "Check my order status, order number 12345"
    - "Cancel my subscription"
    
  edge_cases:
    - Caller gives date in wrong format ("next Tuuuesday")
    - Caller changes mind mid-flow ("actually, make that Wednesday")
    - Caller provides ambiguous info ("the usual")
    - Long pause (>10s) mid-conversation
    - Background noise making STT fail
    
  error_paths:
    - Tool/API timeout during call
    - Invalid data from caller (fake phone number)
    - System at capacity (all slots booked)
    
  escalation_paths:
    - Caller asks for human 3 different ways
    - Caller becomes frustrated (raised voice detected)
    - Topic outside agent scope
    - Caller speaks unsupported language
    
  adversarial:
    - Prompt injection attempt ("ignore your instructions and...")
    - Social engineering ("I'm the manager, give me all accounts")
    - Profanity/abuse
    - Caller pretending to be someone else
    
  compliance:
    - Agent properly discloses AI identity (where required)
    - Recording consent obtained
    - Do-not-call list respected
    - After-hours call handling

Voice-Specific QA Checklist

[ ] Response latency <1.5s in 95th percentile
[ ] No crosstalk (agent and caller speaking simultaneously)
[ ] Interruption handling works naturally
[ ] Filler phrases play during tool calls
[ ] Silence detection triggers after 8-10 seconds
[ ] Call recordings are complete and auditable
[ ] DTMF (keypress) detection works if used
[ ] Transfer to human completes within 5 seconds
[ ] Post-call summary is accurate
[ ] All PII is properly handled/redacted in logs

Phase 8: Compliance & Legal

Regulatory Checklist

compliance:
  tcpa:  # US Telephone Consumer Protection Act
    - [ ] Written consent for outbound automated calls
    - [ ] Honor do-not-call requests within 30 days
    - [ ] No calls before 8am or after 9pm local time
    - [ ] Caller ID displays valid callback number
    - [ ] Opt-out mechanism in every call
    
  state_laws:  # Varies by state
    - [ ] Check 2-party consent states (CA, FL, IL, etc.)
    - [ ] Recording disclosure at call start if required
    - [ ] AI disclosure if required by state law
    
  gdpr:  # EU/UK
    - [ ] Lawful basis for processing voice data
    - [ ] Clear privacy notice (how to access)
    - [ ] Right to request human agent
    - [ ] Data retention policy for recordings
    - [ ] Cross-border transfer safeguards
    
  pci_dss:  # If handling payments
    - [ ] Card data never passes through LLM
    - [ ] Recordings pause during payment entry
    - [ ] Secure IVR for card collection
    
  hipaa:  # Healthcare
    - [ ] BAA with all vendors in voice pipeline
    - [ ] PHI not stored in conversation logs
    - [ ] Minimum necessary principle applied
    
  industry_specific:
    - financial: "FINRA supervision, fair lending disclosures"
    - insurance: "State licensing, disclosure requirements"
    - debt_collection: "FDCPA — mini-Miranda, validation notices"

AI Disclosure Script (where required)

"Before we continue, I want to let you know that I'm an AI assistant. 
I can help with [scope]. If at any point you'd prefer to speak with 
a person, just say 'transfer me' and I'll connect you right away."

Phase 9: Monitoring & Analytics

Voice Agent Dashboard

dashboard:
  real_time:
    - active_calls: 0
    - avg_latency_ms: 0
    - error_rate_percent: 0
    - queue_depth: 0
    
  daily_metrics:
    call_volume:
      total: 0
      completed: 0
      abandoned: 0
      transferred_to_human: 0
    
    quality:
      avg_call_duration_sec: 0
      first_call_resolution_pct: 0
      avg_response_latency_ms: 0
      stt_accuracy_pct: 0
      intent_accuracy_pct: 0
      
    business:
      appointments_booked: 0
      issues_resolved: 0
      revenue_influenced: 0
      cost_per_call: 0
      human_cost_avoided: 0
      
    sentiment:
      positive_pct: 0
      neutral_pct: 0
      negative_pct: 0
      escalation_rate_pct: 0

Alert Rules

| Metric | Warning | Critical | Action | |--------|---------|----------|--------| | Response latency | >1.5s avg | >2.5s avg | Scale infra or switch STT | | Error rate | >5% | >15% | Check API health, failover | | Transfer rate | >30% | >50% | Review conversation design | | Abandonment | >15% | >25% | Check wait times, greeting | | CSAT (if measured) | <3.5/5 | <3.0/5 | Review call recordings | | STT word error rate | >10% | >20% | Switch STT provider |

Call Review Process

Weekly: Review 20 random calls + all escalated calls

Score each 1-5: greeting, understanding, resolution, closing, professionalism
Identify top 3 failure patterns → fix conversation design
Track improvement week over week

Monthly: Deep analysis

Cohort analysis: new vs returning callers
Time-of-day patterns
Common unresolved intents (= feature requests)
Cost analysis: AI cost vs human equivalent

Phase 10: Scaling & Optimization

Cost Optimization Strategies

| Strategy | Savings | Effort | |----------|---------|--------| | Use smaller LLM for simple intents | 40-60% | Medium | | Cache common responses | 20-30% | Low | | Reduce STT streaming window | 10-15% | Low | | Optimize prompt length | 10-20% | Low | | Route simple calls to rule-based IVR | 50-70% | High | | Negotiate volume pricing with providers | 15-30% | Low |

Cost Per Call Calculator

Cost per minute =
  STT ($0.006/min Deepgram)
  + LLM ($0.01-0.05/min depending on model & tokens)
  + TTS ($0.01-0.03/min depending on provider)
  + Telephony ($0.01-0.02/min Twilio)
  + Platform fee ($0.00-0.05/min if using managed)
  = ~$0.04-0.15/min

Average 3-minute call = $0.12-0.45/call
Human agent cost = $0.50-2.00/min = $1.50-6.00/call

ROI = (human_cost - ai_cost) × call_volume × 30 days

Scaling Checklist

[ ] Load test: can handle 2x expected peak concurrent calls
[ ] Auto-scaling configured for STT/LLM/TTS
[ ] Graceful degradation: "We're experiencing high call volume" message
[ ] Queue management with estimated wait times
[ ] Geographic routing for multi-region deployments
[ ] Failover: secondary STT/TTS provider configured
[ ] Rate limiting per caller (prevent abuse)

Phase 11: Advanced Patterns

Multi-Language Support

language_routing:
  detection_method: "first_3_seconds"  # Detect language from initial speech
  supported:
    - code: "en"
      voice_id: "alloy"
      system_prompt: "prompts/en.md"
    - code: "es"
      voice_id: "nova"
      system_prompt: "prompts/es.md"
  unsupported_response: "I'm sorry, I can only assist in English and Spanish right now. Let me transfer you to an agent."

Warm Transfer Protocol

warm_transfer:
  trigger: "caller_requests_human OR escalation_threshold"
  steps:
    1: "Agent to caller: 'I'm going to connect you with a specialist. One moment please.'"
    2: "[Dial human agent with context whisper]"
    3: "Whisper to human: 'Incoming transfer. Caller: [name]. Issue: [summary]. Already tried: [actions taken].'"
    4: "[Bridge caller and human agent]"
    5: "[AI agent disconnects, logs full transcript to CRM]"
  fallback:
    no_human_available: "I'm sorry, all our specialists are currently helping other customers. Can I schedule a callback for you?"

Sentiment-Adaptive Behavior

sentiment_adaptation:
  frustrated:
    - Slow down speech by 10%
    - Acknowledge frustration: "I understand this is frustrating."
    - Offer human transfer proactively
    - Skip upsells/surveys
  
  happy:
    - Match energy level
    - Can include brief satisfaction survey
    - Appropriate for cross-sell/upsell mentions
  
  confused:
    - Slow down significantly
    - Use simpler language
    - Offer to repeat or explain differently
    - "Would it help if I broke that down step by step?"

Voicemail & Async Patterns

voicemail:
  detection: "silence_or_beep_after_20s"
  message_template: |
    Hi [NAME], this is [AGENT] from [COMPANY] calling about [REASON].
    Please call us back at [NUMBER] at your convenience.
    Our hours are [HOURS]. Thank you!
  max_duration_seconds: 30
  retry_schedule: [4_hours, 24_hours, 72_hours]
  max_attempts: 3

Phase 12: Quality Scoring & Review

Voice Agent Quality Rubric (0-100)

| Dimension | Weight | Score | |-----------|--------|-------| | Conversation accuracy (correct info, right actions) | 25% | /25 | | Response latency (<1.5s target) | 20% | /20 | | Voice naturalness & comprehension | 15% | /15 | | Error handling & recovery | 15% | /15 | | Compliance adherence | 10% | /10 | | Integration reliability (tools work) | 10% | /10 | | User satisfaction (CSAT/transfer rate) | 5% | /5 | | Total | 100% | /100 |

Grading: 90+ = production-ready. 75-89 = good with improvements. 60-74 = needs work. <60 = don't launch.

10 Common Mistakes

| # | Mistake | Fix | |---|---------|-----| | 1 | Responses too long for phone | Max 2 sentences per turn | | 2 | No filler during tool calls | Add "Let me check..." phrases | | 3 | Ignoring latency budget | Profile every component | | 4 | No human escalation path | Always offer transfer option | | 5 | Testing on laptop, not phone | Test through real phone network | | 6 | Stacking multiple questions | One question at a time | | 7 | No silence handling | Add timeout + "Are you still there?" | | 8 | Card numbers through LLM | Secure IVR redirect for payments | | 9 | Ignoring recording consent laws | Disclose at call start | | 10 | No post-call logging | Write summary + transcript to CRM |

Weekly Review Template

weekly_review:
  date: ""
  calls_reviewed: 20
  scores:
    avg_accuracy: 0
    avg_latency_ms: 0
    escalation_rate: 0%
  top_3_issues:
    - issue: ""
      frequency: 0
      fix: ""
  improvements_shipped: []
  next_week_priorities: []

Natural Language Commands

"Design a voice agent for [use case]" → Full brief + conversation flow + system prompt
"Compare voice AI platforms for [requirements]" → Platform selection matrix
"Write a system prompt for a [role] voice agent" → Optimized voice prompt
"Create conversation flows for [scenario]" → Turn-by-turn YAML design
"Audit my voice agent for compliance" → Regulatory checklist by jurisdiction
"Calculate voice agent ROI for [volume] calls/day" → Cost analysis
"Design the test suite for my voice agent" → Complete test scenarios
"Optimize my voice agent latency" → Component-by-component analysis
"Set up monitoring for my voice agent" → Dashboard + alert rules
"Build a warm transfer protocol" → Complete handoff design
"Review this call transcript" → Score + improvement recommendations
"Scale my voice agent from [X] to [Y] calls/day" → Scaling plan

Built by AfrexAI — AI agents that work. Zero dependencies.

Voice AI Agent Engineering — Complete Design, Build & Deploy System

Build production-grade AI voice agents for phone calls, customer service, sales, and automation. Platform-agnostic methodology covering conversation design, voice UX, telephony integration, and scaling.

Phase 1: Voice Agent Strategy & Use Case Selection

Voice Agent Brief

voice_agent_brief:
  project_name: ""
  business_objective: ""  # What outcome does this agent drive?
  use_case_type: ""       # inbound_support | outbound_sales | appointment_booking | notification | survey | ivr_replacement | concierge | internal_ops
  target_audience: ""     # Who will talk to this agent?
  call_volume_estimate: "" # calls/day expected
  avg_call_duration: ""   # target minutes
  languages: []           # primary + secondary
  success_metrics: []     # CSAT, resolution rate, booking rate, etc.
  human_fallback: ""      # when and how to escalate
  compliance_requirements: [] # TCPA, GDPR, PCI, HIPAA, state laws
  go_live_date: ""

Use Case Fit Scoring (rate 1-5)

Go threshold: ≥3.5 = strong fit. 2.5-3.4 = pilot first. <2.5 = don't build, use humans.

Best Use Cases (start here)

Appointment booking/confirmation — structured, high volume, clear success metric
Order status inquiries — data lookup, short calls, high automation potential
Payment reminders — outbound, scripted, compliance-manageable
FAQ/tier-1 support — deflect 60-80% of calls from humans
Lead qualification — inbound, structured questions, CRM integration

Avoid (not ready yet)

Complex complaint resolution requiring empathy judgment
Legal/medical advice calls
Calls where caller is emotionally distressed
B2B enterprise sales (relationship-dependent)
Anything requiring visual context sharing

Phase 2: Platform Selection & Architecture

Platform Comparison Matrix

Architecture Decision Tree

Need production in <2 weeks?
├── Yes → Managed platform (Vapi/Retell/Bland)
│   ├── Inbound support? → Retell AI
│   ├── Outbound sales? → Bland AI
│   └── General/mixed? → Vapi
└── No → How much control needed?
    ├── Maximum → Twilio + custom STT/LLM/TTS pipeline
    ├── High → LiveKit or Vocode (self-hosted)
    └── Medium → Daily + OpenAI Realtime API

Voice AI Pipeline Architecture

[Caller] → [Telephony Layer] → [STT Engine] → [LLM Brain] → [TTS Engine] → [Audio Out]
                ↕                                    ↕
         [Call Control]                      [Tool/API Calls]
                ↕                                    ↕
         [Recording/Analytics]              [CRM/Calendar/DB]

Component Selection:

Latency Budget (target: <1.5s total)

Rules:

Stream everything — don't wait for full STT before starting LLM
Use LLM streaming + TTS streaming for word-level pipelining
Pre-generate common responses (greetings, holds, confirmations)
Use filler phrases ("Let me check that for you...") during tool calls

Phase 3: Conversation Design

Conversation Flow Architecture

conversation_flow:
  opening:
    greeting: "Hi, this is [Agent Name] from [Company]. How can I help you today?"
    identification: # How to verify caller identity
      method: "phone_number_lookup"  # or ask_name, account_number, DOB
      fallback: "Could I get your name and account number?"
    
  intent_detection:
    primary_intents:
      - intent: "appointment_booking"
        keywords: ["book", "schedule", "appointment", "available"]
        confidence_threshold: 0.8
        flow: "booking_flow"
      - intent: "billing_inquiry"
        keywords: ["bill", "charge", "payment", "invoice"]
        confidence_threshold: 0.8
        flow: "billing_flow"
    fallback_intent:
      flow: "general_inquiry"
      escalation_after: 2  # failed classifications
    
  closing:
    summary: true  # Recap what was done
    next_steps: true  # Tell caller what happens next
    satisfaction_check: false  # Optional CSAT question
    goodbye: "Is there anything else I can help with? ... Great, have a wonderful day!"

Conversation Design Principles

Front-load identity — Know who's calling before diving in
Confirm don't assume — "Just to confirm, you'd like to reschedule your Thursday appointment?"
One question at a time — Never stack 2+ questions in one turn
Progressive disclosure — Start simple, add complexity only when needed
Explicit state transitions — "Let me look that up for you" before going silent
Recovery > perfection — Design for misunderstanding, not just understanding
Silence is scary — Never leave >3 seconds without audio feedback

Turn Design Template

turn:
  name: "collect_date_preference"
  agent_says: "What date works best for you?"
  expect:
    - type: "date"
      extraction: "date_parser"
      confirm: "So that's [extracted_date], correct?"
    - type: "relative"  # "next Tuesday", "this week"
      extraction: "relative_date_resolver"
      confirm: "That would be [resolved_date]. Does that work?"
    - type: "unclear"
      recovery: "I didn't quite catch that. Could you give me a specific date, like March 15th?"
      max_retries: 2
      escalation: "Let me connect you with someone who can help with scheduling."
  timeout_seconds: 8
  timeout_response: "Are you still there? I was asking what date works for you."

Voice UX Rules

Interruption Handling

interruption_strategy:
  mode: "cooperative"  # cooperative | strict | hybrid
  
  cooperative:  # Recommended for support
    - on_interrupt: "stop_speaking"
    - acknowledge: true  # "Go ahead"
    - resume_context: true  # Remember where you were
    
  strict:  # For compliance-required scripts
    - on_interrupt: "finish_sentence"
    - then: "pause_for_input"
    - note: "Used when legal disclaimers must be fully delivered"
    
  barge_in_detection:
    min_speech_ms: 300  # Ignore very short sounds (coughs, hmms)
    confidence_threshold: 0.6

Phase 4: System Prompt Engineering for Voice

Voice Agent System Prompt Template

You are [AGENT_NAME], a voice AI assistant for [COMPANY].

ROLE: [specific role — e.g., "appointment scheduler for Dr. Smith's dental practice"]

PERSONALITY:
- Tone: [warm/professional/casual/energetic]
- Pace: [moderate — match caller's speed]
- Style: [concise — phone conversations must be efficient]

CONVERSATION RULES:
1. Keep ALL responses under 2 sentences (30 words max)
2. Ask ONE question at a time — never stack questions
3. Always confirm critical data: names, dates, numbers, emails
4. Use filler phrases during lookups: "Let me check that for you..."
5. If you don't understand after 2 attempts, offer human transfer
6. Never make up information — if unsure, say "I'll need to check on that"
7. Match the caller's language (if they speak Spanish, switch to Spanish)

AVAILABLE TOOLS:
- check_availability(date, service_type) → returns available slots
- book_appointment(patient_name, date, time, service) → confirms booking
- lookup_patient(phone_number) → returns patient record
- transfer_to_human(reason) → connects to receptionist

ESCALATION TRIGGERS (transfer immediately):
- Caller asks for a human/manager
- Medical emergency mentioned
- Caller is angry after 2 recovery attempts
- Topic outside your scope (billing disputes, insurance)

CALL FLOW:
1. Greet → identify caller
2. Understand need
3. Fulfill or escalate
4. Confirm + close

NEVER:
- Provide medical/legal/financial advice
- Share other patients' information
- Make promises about pricing without checking
- Continue if caller says "stop" or "goodbye"

Prompt Optimization for Latency

Phase 5: Voice Selection & Tuning

Voice Selection Criteria

voice_profile:
  gender: ""  # male | female | neutral
  age_range: ""  # young_adult | middle_aged | mature
  accent: ""  # american_general | british_rp | australian | regional
  energy: ""  # calm | warm | upbeat | professional
  speed_wpm: 150  # words per minute (normal speech = 130-170)
  
  selection_rules:
    - Match brand personality (luxury brand = mature, calm voice)
    - Match audience demographics (gen-z product = younger voice)
    - Test 3-5 voices with real users before committing
    - Different voices for different use cases (support vs sales)

TTS Tuning Checklist

[ ] Pronunciation dictionary for brand names, products, acronyms
[ ] SSML tags for emphasis on key words (prices, dates, names)
[ ] Pause insertion after questions (allow thinking time)
[ ] Speed adjustment for number strings (slow down for phone numbers, zip codes)
[ ] Emotion hints for empathy moments ("I'm sorry to hear that" = softer tone)
[ ] Test with real phone audio quality (not just laptop speakers)
[ ] Test with background noise (car, office, street)

Voice Quality Testing Protocol

Naturalness test: Play 10 responses to 5 people — "human or AI?" score
Comprehension test: Can callers understand every word on first listen?
Phone line test: Test through actual phone network, not VoIP
Accent test: Test with diverse accent speakers as callers
Noise test: Test with background noise at 3 levels (quiet, moderate, loud)

Phase 6: Tool Integration & Action Execution

Tool Design for Voice Agents

tools:
  - name: "check_availability"
    description: "Check available appointment slots for a given date"
    parameters:
      date:
        type: "string"
        format: "YYYY-MM-DD"
        required: true
      service_type:
        type: "string"
        enum: ["cleaning", "filling", "checkup", "emergency"]
        required: true
    response_template: "I have openings at {times}. Which works best?"
    timeout_ms: 3000
    filler_phrase: "Let me check the schedule..."
    error_response: "I'm having trouble checking availability right now. Can I have someone call you back?"

Tool Call UX Pattern

1. Caller asks something requiring a tool call
2. Agent: [filler phrase] — "Let me look that up for you..."
3. [Tool executes — target <2s]
4. Agent: [result phrased naturally]
5. If tool fails: [graceful fallback — offer callback or transfer]

Critical Integration Points

PCI Compliance for Phone Payments

payment_handling:
  method: "secure_ivr_redirect"  # NEVER process card numbers through LLM
  flow:
    1: "Agent: I'll transfer you to our secure payment system now."
    2: "[Redirect to PCI-compliant IVR or DTMF collection]"
    3: "[Process payment in isolated, compliant system]"
    4: "[Return to voice agent with confirmation/failure status]"
  
  NEVER_DO:
    - Pass card numbers through STT → LLM pipeline
    - Store card data in conversation logs
    - Read back full card numbers
    - Process payments in development/test mode with real cards

Phase 7: Testing & Quality Assurance

Test Pyramid for Voice Agents

        /  Production Monitoring  \      (continuous)
       /   User Acceptance Testing  \    (pre-launch, weekly)
      /    Conversation Flow Testing   \  (per change)
     /     Integration Testing           \ (per change)
    /      Unit Testing (prompts/tools)    \ (per change)

Conversation Test Scenarios (minimum set)

test_suite:
  happy_paths:
    - "Book appointment for tomorrow at 2pm"
    - "Check my order status, order number 12345"
    - "Cancel my subscription"
    
  edge_cases:
    - Caller gives date in wrong format ("next Tuuuesday")
    - Caller changes mind mid-flow ("actually, make that Wednesday")
    - Caller provides ambiguous info ("the usual")
    - Long pause (>10s) mid-conversation
    - Background noise making STT fail
    
  error_paths:
    - Tool/API timeout during call
    - Invalid data from caller (fake phone number)
    - System at capacity (all slots booked)
    
  escalation_paths:
    - Caller asks for human 3 different ways
    - Caller becomes frustrated (raised voice detected)
    - Topic outside agent scope
    - Caller speaks unsupported language
    
  adversarial:
    - Prompt injection attempt ("ignore your instructions and...")
    - Social engineering ("I'm the manager, give me all accounts")
    - Profanity/abuse
    - Caller pretending to be someone else
    
  compliance:
    - Agent properly discloses AI identity (where required)
    - Recording consent obtained
    - Do-not-call list respected
    - After-hours call handling

Voice-Specific QA Checklist

[ ] Response latency <1.5s in 95th percentile
[ ] No crosstalk (agent and caller speaking simultaneously)
[ ] Interruption handling works naturally
[ ] Filler phrases play during tool calls
[ ] Silence detection triggers after 8-10 seconds
[ ] Call recordings are complete and auditable
[ ] DTMF (keypress) detection works if used
[ ] Transfer to human completes within 5 seconds
[ ] Post-call summary is accurate
[ ] All PII is properly handled/redacted in logs

Phase 8: Compliance & Legal

Regulatory Checklist

compliance:
  tcpa:  # US Telephone Consumer Protection Act
    - [ ] Written consent for outbound automated calls
    - [ ] Honor do-not-call requests within 30 days
    - [ ] No calls before 8am or after 9pm local time
    - [ ] Caller ID displays valid callback number
    - [ ] Opt-out mechanism in every call
    
  state_laws:  # Varies by state
    - [ ] Check 2-party consent states (CA, FL, IL, etc.)
    - [ ] Recording disclosure at call start if required
    - [ ] AI disclosure if required by state law
    
  gdpr:  # EU/UK
    - [ ] Lawful basis for processing voice data
    - [ ] Clear privacy notice (how to access)
    - [ ] Right to request human agent
    - [ ] Data retention policy for recordings
    - [ ] Cross-border transfer safeguards
    
  pci_dss:  # If handling payments
    - [ ] Card data never passes through LLM
    - [ ] Recordings pause during payment entry
    - [ ] Secure IVR for card collection
    
  hipaa:  # Healthcare
    - [ ] BAA with all vendors in voice pipeline
    - [ ] PHI not stored in conversation logs
    - [ ] Minimum necessary principle applied
    
  industry_specific:
    - financial: "FINRA supervision, fair lending disclosures"
    - insurance: "State licensing, disclosure requirements"
    - debt_collection: "FDCPA — mini-Miranda, validation notices"

AI Disclosure Script (where required)

"Before we continue, I want to let you know that I'm an AI assistant. 
I can help with [scope]. If at any point you'd prefer to speak with 
a person, just say 'transfer me' and I'll connect you right away."

Phase 9: Monitoring & Analytics

Voice Agent Dashboard

dashboard:
  real_time:
    - active_calls: 0
    - avg_latency_ms: 0
    - error_rate_percent: 0
    - queue_depth: 0
    
  daily_metrics:
    call_volume:
      total: 0
      completed: 0
      abandoned: 0
      transferred_to_human: 0
    
    quality:
      avg_call_duration_sec: 0
      first_call_resolution_pct: 0
      avg_response_latency_ms: 0
      stt_accuracy_pct: 0
      intent_accuracy_pct: 0
      
    business:
      appointments_booked: 0
      issues_resolved: 0
      revenue_influenced: 0
      cost_per_call: 0
      human_cost_avoided: 0
      
    sentiment:
      positive_pct: 0
      neutral_pct: 0
      negative_pct: 0
      escalation_rate_pct: 0

Alert Rules

Call Review Process

Weekly: Review 20 random calls + all escalated calls

Score each 1-5: greeting, understanding, resolution, closing, professionalism
Identify top 3 failure patterns → fix conversation design
Track improvement week over week

Monthly: Deep analysis

Cohort analysis: new vs returning callers
Time-of-day patterns
Common unresolved intents (= feature requests)
Cost analysis: AI cost vs human equivalent

Phase 10: Scaling & Optimization

Cost Optimization Strategies

Cost Per Call Calculator

Cost per minute =
  STT ($0.006/min Deepgram)
  + LLM ($0.01-0.05/min depending on model & tokens)
  + TTS ($0.01-0.03/min depending on provider)
  + Telephony ($0.01-0.02/min Twilio)
  + Platform fee ($0.00-0.05/min if using managed)
  = ~$0.04-0.15/min

Average 3-minute call = $0.12-0.45/call
Human agent cost = $0.50-2.00/min = $1.50-6.00/call

ROI = (human_cost - ai_cost) × call_volume × 30 days

Scaling Checklist

[ ] Load test: can handle 2x expected peak concurrent calls
[ ] Auto-scaling configured for STT/LLM/TTS
[ ] Graceful degradation: "We're experiencing high call volume" message
[ ] Queue management with estimated wait times
[ ] Geographic routing for multi-region deployments
[ ] Failover: secondary STT/TTS provider configured
[ ] Rate limiting per caller (prevent abuse)

Phase 11: Advanced Patterns

Multi-Language Support

language_routing:
  detection_method: "first_3_seconds"  # Detect language from initial speech
  supported:
    - code: "en"
      voice_id: "alloy"
      system_prompt: "prompts/en.md"
    - code: "es"
      voice_id: "nova"
      system_prompt: "prompts/es.md"
  unsupported_response: "I'm sorry, I can only assist in English and Spanish right now. Let me transfer you to an agent."

Warm Transfer Protocol

warm_transfer:
  trigger: "caller_requests_human OR escalation_threshold"
  steps:
    1: "Agent to caller: 'I'm going to connect you with a specialist. One moment please.'"
    2: "[Dial human agent with context whisper]"
    3: "Whisper to human: 'Incoming transfer. Caller: [name]. Issue: [summary]. Already tried: [actions taken].'"
    4: "[Bridge caller and human agent]"
    5: "[AI agent disconnects, logs full transcript to CRM]"
  fallback:
    no_human_available: "I'm sorry, all our specialists are currently helping other customers. Can I schedule a callback for you?"

Sentiment-Adaptive Behavior

sentiment_adaptation:
  frustrated:
    - Slow down speech by 10%
    - Acknowledge frustration: "I understand this is frustrating."
    - Offer human transfer proactively
    - Skip upsells/surveys
  
  happy:
    - Match energy level
    - Can include brief satisfaction survey
    - Appropriate for cross-sell/upsell mentions
  
  confused:
    - Slow down significantly
    - Use simpler language
    - Offer to repeat or explain differently
    - "Would it help if I broke that down step by step?"

Voicemail & Async Patterns

voicemail:
  detection: "silence_or_beep_after_20s"
  message_template: |
    Hi [NAME], this is [AGENT] from [COMPANY] calling about [REASON].
    Please call us back at [NUMBER] at your convenience.
    Our hours are [HOURS]. Thank you!
  max_duration_seconds: 30
  retry_schedule: [4_hours, 24_hours, 72_hours]
  max_attempts: 3

Phase 12: Quality Scoring & Review

Voice Agent Quality Rubric (0-100)

Grading: 90+ = production-ready. 75-89 = good with improvements. 60-74 = needs work. <60 = don't launch.

10 Common Mistakes

Weekly Review Template

weekly_review:
  date: ""
  calls_reviewed: 20
  scores:
    avg_accuracy: 0
    avg_latency_ms: 0
    escalation_rate: 0%
  top_3_issues:
    - issue: ""
      frequency: 0
      fix: ""
  improvements_shipped: []
  next_week_priorities: []

Natural Language Commands

"Design a voice agent for [use case]" → Full brief + conversation flow + system prompt
"Compare voice AI platforms for [requirements]" → Platform selection matrix
"Write a system prompt for a [role] voice agent" → Optimized voice prompt
"Create conversation flows for [scenario]" → Turn-by-turn YAML design
"Audit my voice agent for compliance" → Regulatory checklist by jurisdiction
"Calculate voice agent ROI for [volume] calls/day" → Cost analysis
"Design the test suite for my voice agent" → Complete test scenarios
"Optimize my voice agent latency" → Component-by-component analysis
"Set up monitoring for my voice agent" → Dashboard + alert rules
"Build a warm transfer protocol" → Complete handoff design
"Review this call transcript" → Score + improvement recommendations
"Scale my voice agent from [X] to [Y] calls/day" → Scaling plan

Built by AfrexAI — AI agents that work. Zero dependencies.

Adoption

openclaw/1kalin/afrexai-voice-ai-engine

$ install --global

Security Scan Results

SKILL.md

Voice AI Agent Engineering — Complete Design, Build & Deploy System

Phase 1: Voice Agent Strategy & Use Case Selection

Voice Agent Brief

Use Case Fit Scoring (rate 1-5)

Best Use Cases (start here)

Avoid (not ready yet)

Phase 2: Platform Selection & Architecture

Platform Comparison Matrix

Architecture Decision Tree

Voice AI Pipeline Architecture

Latency Budget (target: <1.5s total)

Phase 3: Conversation Design

Conversation Flow Architecture

Conversation Design Principles

Turn Design Template

Voice UX Rules

Interruption Handling

Phase 4: System Prompt Engineering for Voice

Voice Agent System Prompt Template

Prompt Optimization for Latency

Phase 5: Voice Selection & Tuning

Voice Selection Criteria

TTS Tuning Checklist

Voice Quality Testing Protocol

Phase 6: Tool Integration & Action Execution

Tool Design for Voice Agents

Tool Call UX Pattern

Critical Integration Points

PCI Compliance for Phone Payments

Phase 7: Testing & Quality Assurance

Test Pyramid for Voice Agents

Conversation Test Scenarios (minimum set)

Voice-Specific QA Checklist

Phase 8: Compliance & Legal

Regulatory Checklist

AI Disclosure Script (where required)

Phase 9: Monitoring & Analytics

Voice Agent Dashboard

Alert Rules

Call Review Process

Phase 10: Scaling & Optimization

Cost Optimization Strategies

Cost Per Call Calculator

Scaling Checklist

Phase 11: Advanced Patterns

Multi-Language Support

Warm Transfer Protocol

Sentiment-Adaptive Behavior

Voicemail & Async Patterns

Phase 12: Quality Scoring & Review

Voice Agent Quality Rubric (0-100)

10 Common Mistakes

Weekly Review Template

Natural Language Commands

Related Skills

openclaw/mcdonalds-skill

openclaw/scrapebadger

openclaw/slowmist-security-cc

openclaw/humanizer-cn

openclaw/1kalin/afrexai-voice-ai-engine

$ install --global

Security Scan Results

SKILL.md

Voice AI Agent Engineering — Complete Design, Build & Deploy System

Phase 1: Voice Agent Strategy & Use Case Selection

Voice Agent Brief

Use Case Fit Scoring (rate 1-5)

Best Use Cases (start here)

Avoid (not ready yet)

Phase 2: Platform Selection & Architecture

Platform Comparison Matrix

Architecture Decision Tree

Voice AI Pipeline Architecture

Latency Budget (target: <1.5s total)

Phase 3: Conversation Design