skills/audio-transcription-pipeline/SKILL.md
Build audio transcription pipelines with Whisper, Deepgram, and AssemblyAI including speaker diarization and real-time streaming. Activate on: transcription, speech-to-text, diarization, audio processing, meeting transcripts. NOT for: text-to-speech synthesis (voice-audio-engineer), music generation (ai-engineer).
npx skillsauth add curiositech/windags-skills audio-transcription-pipelineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Build production speech-to-text pipelines with Whisper, Deepgram, and AssemblyAI for batch and real-time transcription with speaker diarization.
If requirements include:
├─ Real-time streaming required?
│ ├─ Yes: Use Deepgram Nova-3 WebSocket (fastest)
│ └─ No: Continue to accuracy requirements
├─ Highest accuracy needed + have GPU?
│ ├─ Yes: Use Whisper large-v3 with faster-whisper
│ └─ No: Continue to cost analysis
├─ Budget < $0.0059/min AND have compute?
│ ├─ Yes: Use local Whisper
│ └─ No: Use AssemblyAI Universal-2 (best API diarization)
If audio characteristics:
├─ Duration > 30 minutes?
│ ├─ Yes: Use batch with chunking (split at silence)
│ └─ No: Continue to latency check
├─ Need results in < 10 seconds?
│ ├─ Yes: Use streaming (Deepgram/Whisper.cpp)
│ └─ No: Use batch for better accuracy
If content type:
├─ Live conversation/meeting?
│ ├─ Yes: Enable VAD (saves 40-60% compute on silence)
│ └─ No: Continue to content check
├─ Lecture/presentation with pauses?
│ ├─ Yes: Use conservative VAD (min_silence_duration_ms: 1000)
│ └─ No: Skip VAD for dense speech (audiobooks, etc.)
If language/accent:
├─ Non-English or heavy accent?
│ ├─ Yes: Use Whisper large-v3 (best multilingual)
│ └─ No: Continue to speed check
├─ Need < 1 second latency?
│ ├─ Yes: Use Deepgram Nova-3 streaming
│ └─ No: Use AssemblyAI for business/medical terminology
Detection: Transcripts show repeated phrases during quiet sections or phantom music descriptions
Diagnosis: VAD disabled or threshold too low, model generating text from background noise
Fix: Enable VAD with min_silence_duration_ms: 500, use --no_speech_threshold 0.6 in Whisper
Detection: All speakers labeled as "Speaker 1" after 10-15 minutes, or random speaker switching mid-sentence Diagnosis: Speaker embedding model saturated, overlap confusion in crowded audio Fix: Pre-segment by silence, use AssemblyAI dual_channel if stereo, limit to 8 active speakers max
Detection: Subtitles appear 2-5 seconds before/after corresponding video frames
Diagnosis: Audio preprocessing changed duration, or VAD removed segments without timestamp adjustment
Fix: Use --preserve_timing in preprocessing, sync with original audio timecode, validate against known speech events
Detection: English words transcribed as gibberish when speaker has accent, or code-switching ignored
Diagnosis: Model locked to wrong language in first 30 seconds, or multilingual content confused classifier
Fix: Force language with language="en" for accented English, use task="translate" for non-English to English
Detection: OOM crashes on files >45 minutes, or sudden quality drops after 30 minutes
Diagnosis: Model keeping full context window, GPU VRAM exhausted
Fix: Chunk at 25-minute boundaries with 30-second overlap, use --without_timestamps for RAM efficiency
Scenario: 90-minute board meeting, 6 speakers, need accurate speaker identification and timestamps for minutes
Input Analysis:
Decision Path:
Implementation:
# Step 1: Preprocess (expert catches: normalize audio levels)
ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -filter:a "volume=0.8" meeting_clean.wav
# Step 2: Chunk with overlap (novice would process whole file)
from pydub import AudioSegment
audio = AudioSegment.from_wav("meeting_clean.wav")
chunks = []
for i in range(0, len(audio), 25*60*1000): # 25min chunks
chunk = audio[i:i+27*60*1000] # 27min with 2min overlap
chunks.append(chunk)
# Step 3: Transcribe with diarization
results = []
for chunk in chunks:
response = assemblyai_client.transcribe(
chunk, speaker_labels=True, auto_punctuation=True,
dual_channel=False, speaker_labels_max=8
)
results.append(response.get_paragraphs())
# Step 4: Merge overlapping segments (expert step novice misses)
merged = merge_overlapping_transcripts(results, overlap_seconds=120)
Expert vs Novice:
This skill should NOT be used for:
voice-audio-engineer insteadai-engineer for music AI modelsai-engineer for audio classificationlanguage-translator skillai-engineer for speaker recognition modelsvoice-audio-engineer for audio processingDelegate when:
voice-audio-engineerdata-pipeline-engineervoice-audio-engineer + conversation-designerdata-pipeline-engineer for orchestrationtools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.