curiositech/audio-transcription-pipeline

Name: audio-transcription-pipeline
Author: curiositech

skills/audio-transcription-pipeline/SKILL.md

npx skillsauth add curiositech/windags-skills audio-transcription-pipeline

Audio Transcription Pipeline

Build production speech-to-text pipelines with Whisper, Deepgram, and AssemblyAI for batch and real-time transcription with speaker diarization.

Decision Points

Engine Selection Decision Tree

If requirements include:
├─ Real-time streaming required?
│  ├─ Yes: Use Deepgram Nova-3 WebSocket (fastest)
│  └─ No: Continue to accuracy requirements
├─ Highest accuracy needed + have GPU?
│  ├─ Yes: Use Whisper large-v3 with faster-whisper
│  └─ No: Continue to cost analysis
├─ Budget < $0.0059/min AND have compute?
│  ├─ Yes: Use local Whisper
│  └─ No: Use AssemblyAI Universal-2 (best API diarization)
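The tree above reads top to bottom, with the first matching branch winning. A minimal sketch of that logic in Python follows; the `Requirements` fields and the returned engine labels are hypothetical names chosen for illustration, not identifiers from any vendor SDK.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    realtime: bool          # live streaming needed?
    max_accuracy: bool      # is top accuracy the priority?
    has_gpu: bool           # local GPU available?
    budget_per_min: float   # USD you can spend per audio minute
    has_compute: bool       # any local compute to run Whisper on?


def select_engine(req: Requirements) -> str:
    """Walk the tree top to bottom and return the first matching engine."""
    if req.realtime:
        return "deepgram-nova-3-websocket"
    if req.max_accuracy and req.has_gpu:
        return "faster-whisper-large-v3"
    if req.budget_per_min < 0.0059 and req.has_compute:
        return "local-whisper"
    return "assemblyai-universal-2"
```

Because the checks are ordered, a real-time requirement overrides everything else, which matches the tree's first branch.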

Batch vs Stream Processing

If audio characteristics:
├─ Duration > 30 minutes?
│  ├─ Yes: Use batch with chunking (split at silence)
│  └─ No: Continue to latency check
├─ Need results in < 10 seconds?
│  ├─ Yes: Use streaming (Deepgram/Whisper.cpp)
│  └─ No: Use batch for better accuracy

VAD (Voice Activity Detection) Usage

If content type:
├─ Live conversation/meeting?
│  ├─ Yes: Enable VAD (saves 40-60% compute on silence)
│  └─ No: Continue to content check
├─ Lecture/presentation with pauses?
│  ├─ Yes: Use conservative VAD (min_silence_duration_ms: 1000)
│  └─ No: Skip VAD for dense speech (audiobooks, etc.)
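One way to carry these VAD choices into code is a small helper that returns keyword arguments for a faster-whisper style `transcribe()` call, which accepts `vad_filter` and `vad_parameters`. This is a sketch under assumptions: the content-type labels are hypothetical, and the 500 ms value for meetings is borrowed from the hallucination fix in this document rather than stated in the tree.

```python
def vad_options(content_type: str) -> dict:
    """Return transcribe() keyword arguments for a given content type."""
    if content_type == "meeting":
        # Assumption: 500 ms, the value this document suggests against
        # hallucination on silence.
        return {"vad_filter": True,
                "vad_parameters": {"min_silence_duration_ms": 500}}
    if content_type == "lecture":
        # Conservative setting so natural pauses are not cut.
        return {"vad_filter": True,
                "vad_parameters": {"min_silence_duration_ms": 1000}}
    # Dense speech (audiobooks and similar): skip VAD entirely.
    return {"vad_filter": False}
```

With a loaded faster-whisper model, the result would be passed as `model.transcribe("audio.wav", **vad_options("lecture"))`.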

Model Selection by Domain

If language/accent:
├─ Non-English or heavy accent?
│  ├─ Yes: Use Whisper large-v3 (best multilingual)
│  └─ No: Continue to speed check
├─ Need < 1 second latency?
│  ├─ Yes: Use Deepgram Nova-3 streaming
│  └─ No: Use AssemblyAI for business/medical terminology

Failure Modes

Hallucination on Silence

Detection: Transcripts show repeated phrases during quiet sections or phantom music descriptions
Diagnosis: VAD disabled or threshold too low, model generating text from background noise
Fix: Enable VAD with min_silence_duration_ms: 500, use --no_speech_threshold 0.6 in Whisper
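The detection step for this failure mode can be partly automated. The sketch below flags runs of identical consecutive segments, a common signature of hallucination on silence. It assumes the transcript is available as a list of segment strings; the function name and the `min_repeats` threshold are illustrative choices.

```python
import re


def find_repeated_runs(segments, min_repeats=3):
    """Return (first_index, last_index, text) for runs of identical segments."""
    def norm(s):
        # Ignore case and punctuation when comparing segments
        return re.sub(r"[^\w\s]", "", s).lower().strip()

    runs = []
    i = 0
    while i < len(segments):
        j = i
        while j + 1 < len(segments) and norm(segments[j + 1]) == norm(segments[i]):
            j += 1
        if j - i + 1 >= min_repeats and norm(segments[i]):
            runs.append((i, j, norm(segments[i])))
        i = j + 1
    return runs


segments = ["Hello everyone.", "Thank you.", "Thank you.", "thank you", "Let's begin."]
print(find_repeated_runs(segments))  # [(1, 3, 'thank you')]
```

Flagged runs are candidates for review, not proof of hallucination: a speaker can legitimately repeat a phrase.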

Diarization Collapse at >10 Speakers

Detection: All speakers labeled as "Speaker 1" after 10-15 minutes, or random speaker switching mid-sentence
Diagnosis: Speaker embedding model saturated, overlap confusion in crowded audio
Fix: Pre-segment by silence, use AssemblyAI dual_channel if stereo, limit to 8 active speakers max

Timestamp Sync Drift with Video

Detection: Subtitles appear 2-5 seconds before/after corresponding video frames
Diagnosis: Audio preprocessing changed duration, or VAD removed segments without timestamp adjustment
Fix: Use --preserve_timing in preprocessing, sync with original audio timecode, validate against known speech events
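When VAD has removed silence before transcription, the timestamps refer to the shortened audio and need mapping back to the original timeline. A minimal sketch of that adjustment follows, assuming you kept the list of speech regions the VAD retained; the function name and data layout are hypothetical.

```python
def to_original_time(t, speech_regions):
    """Map time t (seconds, in VAD-compressed audio) to the original timeline.

    speech_regions is a sorted, non-empty list of (start, end) pairs in
    original-audio seconds that the VAD kept and concatenated.
    """
    elapsed = 0.0  # compressed-audio time consumed by earlier regions
    for start, end in speech_regions:
        length = end - start
        if t <= elapsed + length:
            return start + (t - elapsed)
        elapsed += length
    # Past the end of the kept audio: clamp to the end of the last region
    return speech_regions[-1][1]


regions = [(2.0, 5.0), (10.0, 14.0)]
print(to_original_time(1.0, regions))  # 3.0
print(to_original_time(4.0, regions))  # 11.0
```

In the example, four seconds into the compressed audio falls one second into the second kept region, which begins at 10.0 s in the original file.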

Language Auto-Detection Failure

Detection: English words transcribed as gibberish when speaker has accent, or code-switching ignored
Diagnosis: Model locked to wrong language in first 30 seconds, or multilingual content confused classifier
Fix: Force language with language="en" for accented English, use task="translate" for non-English to English

Memory Overflow on Long Files

Detection: OOM crashes on files >45 minutes, or sudden quality drops after 30 minutes
Diagnosis: Model keeping full context window, GPU VRAM exhausted
Fix: Chunk at 25-minute boundaries with 30-second overlap, use --without_timestamps for RAM efficiency
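The chunking fix reduces to a small boundary calculation. The sketch below is one reading of "25-minute boundaries with 30-second overlap": each chunk is at most 25 minutes long and the next one starts 30 seconds before the previous one ends. The function name and defaults are illustrative.

```python
def chunk_bounds(duration_s, chunk_s=25 * 60, overlap_s=30):
    """Return (start, end) pairs in seconds covering [0, duration_s]."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    bounds = []
    step = chunk_s - overlap_s
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end == duration_s:
            break  # the final chunk reached the end of the file
        start += step
    return bounds


print(chunk_bounds(3600))  # [(0, 1500), (1470, 2970), (2940, 3600)]
```

A 60-minute file yields three chunks, with the last one shorter than the rest.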

Worked Examples

Complete Meeting Transcription Walkthrough

Scenario: 90-minute board meeting, 6 speakers, need accurate speaker identification and timestamps for minutes

Input Analysis:

90 minutes → requires chunking
6 speakers → within diarization limits
Business context → use AssemblyAI for terminology

Decision Path:

Duration > 30 min → batch processing required
Speaker count = 6 → diarization feasible
Business domain → AssemblyAI Universal-2 chosen
High accuracy needed → enable speaker labels + smart formatting

Implementation:

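# Step 1: Preprocess in a shell (expert catches: normalize audio levels)
# loudnorm applies EBU R128 loudness normalization; a fixed volume=0.8 gain only scales the signal
ffmpeg -i meeting.mp4 -ar 16000 -ac 1 -af loudnorm meeting_clean.wav

# Steps 2-4 are Python. Step 2: Chunk with overlap (novice would process whole file)
import assemblyai as aai
from pydub import AudioSegment

STRIDE_MS = 25 * 60 * 1000  # a new chunk starts every 25 min
CHUNK_MS = 27 * 60 * 1000   # each chunk is 27 min long, so neighbours overlap by 2 min

audio = AudioSegment.from_wav("meeting_clean.wav")
chunk_paths = []
for n, start in enumerate(range(0, len(audio), STRIDE_MS)):
    path = f"chunk_{n:02d}.wav"
    audio[start:start + CHUNK_MS].export(path, format="wav")  # the API takes a file, not an AudioSegment
    chunk_paths.append(path)

# Step 3: Transcribe with diarization
# Parameter names follow the assemblyai Python SDK; check them against the version you install
aai.settings.api_key = "YOUR_API_KEY"
config = aai.TranscriptionConfig(
    speaker_labels=True, speakers_expected=6, punctuate=True, format_text=True
)
transcriber = aai.Transcriber()
results = []
for path in chunk_paths:
    transcript = transcriber.transcribe(path, config=config)
    results.append(transcript.utterances)  # each utterance carries speaker, start, end, text

# Step 4: Merge overlapping segments (expert step novice misses)
# merge_overlapping_transcripts is not a library function; you have to supply it
merged = merge_overlapping_transcripts(results, overlap_seconds=120)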
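The walkthrough calls `merge_overlapping_transcripts` without defining it. The sketch below is one possible implementation, not the author's: it assumes each chunk's utterances have been converted to dicts with chunk-relative `start` and `end` in seconds, and it adds a hypothetical `stride_seconds` parameter so chunk offsets can be computed. Inside each overlap, the earlier chunk owns the first half and the later chunk owns the second half.

```python
def merge_overlapping_transcripts(results, overlap_seconds=120, stride_seconds=25 * 60):
    """Merge per-chunk utterance lists into one timeline in original-audio seconds."""
    merged = []
    last = len(results) - 1
    for k, utterances in enumerate(results):
        offset = k * stride_seconds  # where chunk k starts in the original audio
        # Window of absolute time that chunk k is responsible for
        lo = offset + overlap_seconds / 2 if k > 0 else 0.0
        hi = offset + stride_seconds + overlap_seconds / 2 if k < last else float("inf")
        for u in utterances:
            start, end = u["start"] + offset, u["end"] + offset
            midpoint = (start + end) / 2
            if lo <= midpoint < hi:
                merged.append({**u, "start": start, "end": end})
    return merged
```

This only removes duplicated utterances. Speaker labels are assigned per chunk, so "Speaker A" in one chunk may be a different person in the next; reconciling them is a separate step.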

Expert vs Novice:

Novice: Processes 90-min file directly → OOM crash or poor accuracy
Expert: Chunks with overlap, validates speaker consistency across chunks, normalizes audio first

Quality Gates

[ ] WER (Word Error Rate) < 5% on test sample with known ground truth
[ ] Speaker accuracy > 90% (correct speaker ID for each utterance when ground truth available)
[ ] Real-time latency < 2 seconds end-to-end for streaming implementations
[ ] SRT/VTT files pass validation (proper timestamps, no overlapping segments)
[ ] Audio preprocessing completed: 16kHz mono WAV/FLAC format confirmed
[ ] VAD parameters tuned: < 10% false positive silence detection on sample
[ ] Memory usage < 8GB for files up to 2 hours (chunking working properly)
[ ] API error handling tested: handles timeouts, retries, and quota exceeded
[ ] Diarization speaker count matches expected (±1 speaker tolerance)
[ ] Output timestamps align with original audio within 100ms accuracy
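The first gate depends on how WER is computed. A minimal sketch using word-level edit distance is shown below; it lowercases and splits on whitespace only, so punctuation handling is left to the caller, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

One dropped word out of four reference words gives 25%, well above the 5% gate.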

NOT-FOR Boundaries

This skill should NOT be used for:

Text-to-speech synthesis → Use voice-audio-engineer instead
Music transcription or lyric extraction → Use ai-engineer for music AI models
Audio classification without transcription → Use ai-engineer for audio classification
Real-time translation (transcribe + translate) → Combine with language-translator skill
Voice biometrics or speaker identification → Use ai-engineer for speaker recognition models
Audio quality enhancement or noise reduction → Use voice-audio-engineer for audio processing

Delegate when:

Need custom wake word detection → voice-audio-engineer
Require audio fingerprinting → data-pipeline-engineer
Building voice assistants → voice-audio-engineer + conversation-designer
Processing >10,000 hours → data-pipeline-engineer for orchestration


Updated Apr 4, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add curiositech/windags-skills audio-transcription-pipeline

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Clean (95%): Trivy, container and dependency vulnerability scanner
Clean (95%): Semgrep, static code analysis for vulnerabilities
Clean (95%): mcp-scan (Snyk), Model Context Protocol security validation
Skipped (50%): Snyk (dep), open source security scanning
Skipped (50%): Socket.dev, supply chain security analysis
Skipped (50%): VirusTotal, multi-engine malware detection
Skipped (50%): CrowdStrike, advanced threat intelligence
Skipped (50%): OSV-Scanner, Open Source Vulnerability database check
Skipped (50%): OWASP Dep-Check

Last scanned: Apr 4, 2026, 1:40 PM (260.5s, 1 file scanned)

SKILL.md

license: Apache-2.0
name: audio-transcription-pipeline
description: Build audio transcription pipelines with Whisper, Deepgram, and AssemblyAI including speaker diarization and real-time streaming. Activate on: transcription, speech-to-text, diarization, audio processing, meeting transcripts. NOT for: text-to-speech synthesis (voice-audio-engineer), music generation (ai-engineer).
allowed-tools: Read,Write,Edit,Bash(python:*,pip:*,npm:*,npx:*)
category: Video & Audio
- skill: data-pipeline-engineer
  reason: Large-scale batch transcription needs pipeline orchestration

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/audio-transcription-pipeline ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT