skills/transcribe/SKILL.md
Use when transcribing audio/video files to text, speech-to-text from recordings, speaker diarization, labeling speakers in interviews/meetings/podcasts, or extracting text from audio. NEVER for real-time STT/TTS pipelines or voice agent implementation (use voice-ai-development). NEVER for voice agent architecture or multi-agent voice systems (use voice-agents). NEVER for audio editing, mixing, or processing (use standard audio tools). NEVER for phone system configuration or IVR (use twilio-communications).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit transcribe
Transcribe audio using OpenAI's transcription API via the bundled CLI. Supports plain text, structured JSON, and speaker-diarized output with optional known-speaker identification.
| File | Purpose |
|------|---------|
| scripts/transcribe_diarize.py | Python CLI (277 lines) - transcription with diarization, known speakers, dry-run mode |
| references/api.md | API quick reference - input formats, size limits, response formats, known speaker notes |
| agents/openai.yaml | Agent interface definition - display name, icon, default prompt |
| assets/transcribe.png | Skill icon (large) |
| assets/transcribe-small.svg | Skill icon (small) |
| Task | This skill? | Use instead |
|------|-------------|-------------|
| Transcribe audio/video file to text | YES | - |
| Label speakers in recorded meeting | YES | - |
| Identify known speakers by voice sample | YES | - |
| Batch transcribe multiple audio files | YES | - |
| Real-time speech-to-text streaming | NO | voice-ai-development |
| Voice agent with conversation flow | NO | voice-agents |
| Text-to-speech synthesis | NO | voice-ai-development |
| Telephony IVR or call routing | NO | twilio-communications |
| Audio noise reduction or editing | NO | standard audio tools |
First match wins. Stop at the first row where Signal is true.
| Signal | Model | Response Format | CLI flags |
|--------|-------|-----------------|-----------|
| Need speaker labels | gpt-4o-transcribe-diarize | diarized_json | --model gpt-4o-transcribe-diarize --response-format diarized_json |
| Known speakers to identify | gpt-4o-transcribe-diarize | diarized_json | above + --known-speaker "Name=path.wav" per speaker |
| Need timestamps in structured output | gpt-4o-mini-transcribe | json | --response-format json |
| Fast plain-text transcription (default) | gpt-4o-mini-transcribe | text | (no extra flags needed) |
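The first-match-wins rule in the table above can be sketched as a small selection function. This is an illustrative sketch, not part of the bundled CLI: the function name `select_model` and its boolean parameters are hypothetical, while the model and format strings come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    model: str
    response_format: str


def select_model(need_speaker_labels: bool = False,
                 known_speakers: bool = False,
                 need_timestamps: bool = False) -> Choice:
    """Walk the table top to bottom and return the first matching row."""
    # Rows 1 and 2: speaker labels or known speakers both require the diarize model.
    if need_speaker_labels or known_speakers:
        return Choice("gpt-4o-transcribe-diarize", "diarized_json")
    # Row 3: timestamps without speaker labels.
    if need_timestamps:
        return Choice("gpt-4o-mini-transcribe", "json")
    # Row 4: default fast plain-text path.
    return Choice("gpt-4o-mini-transcribe", "text")
```

Because the checks run in table order, a request that needs both speaker labels and timestamps resolves to the diarize row and never reaches the `json` row.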
| Condition | Action | Why |
|-----------|--------|-----|
| Clean recording, single speaker | Transcribe directly with mini-transcribe | Fastest, cheapest path |
| Multiple speakers, labels needed | Use diarize model with diarized_json | Only model that produces speaker segments |
| Multiple speakers, labels NOT needed | Use mini-transcribe with text | Faster, diarization unnecessary |
| Audio >30 seconds | Keep --chunking-strategy auto (default) | Mandatory for diarize model; recommended for all long audio |
| Audio file >25MB | Split file before sending | Hard API limit; the request will fail at 25MB |
| Background noise or low quality | Add --language hint for expected language | Helps model compensate; accuracy still degrades |
| Non-standard format | Convert to mp3/wav/m4a first | Supported: mp3, mp4, mpeg, mpga, m4a, wav, webm |
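The size and format rows above lend themselves to a pre-flight check before any API call. The following is a minimal sketch under stated assumptions: `preflight` is a hypothetical helper (not a function of the bundled script), and 25MB is interpreted here as 25 * 1024 * 1024 bytes, which may not match how the API counts the limit.

```python
from pathlib import Path
from typing import List, Optional

# Extensions listed as supported in the table above.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
# Assumption: the 25MB limit is treated as 25 MiB.
MAX_BYTES = 25 * 1024 * 1024


def preflight(path: str, size_bytes: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the file looks sendable."""
    problems = []
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"{p.name}: convert to mp3/wav/m4a first")
    # Read the size from disk unless the caller supplies it.
    size = p.stat().st_size if size_bytes is None else size_bytes
    if size > MAX_BYTES:
        problems.append(f"{p.name}: over 25MB, split the file before sending")
    return problems
```

Passing `size_bytes` explicitly makes it possible to check a planned upload without touching the filesystem.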
Follow this sequence for every transcription request. Do not skip steps.
Use --dry-run first on any non-trivial request (multiple files, known speakers, unfamiliar audio format). This catches configuration errors before consuming API credits.

Key mindset: The most common mistake is jumping to the diarize model for any multi-speaker audio. If the user does not need speaker labels, mini-transcribe is faster, cheaper, and often more accurate for pure text output.
Script location: ~/.claude/skills/transcribe/scripts/transcribe_diarize.py
Prerequisite: uv pip install openai (or pip install openai). OPENAI_API_KEY must be set in environment.
Simple transcription (most common):
python ~/.claude/skills/transcribe/scripts/transcribe_diarize.py recording.mp3 --out transcript.txt
Speaker-labeled meeting notes:
python ~/.claude/skills/transcribe/scripts/transcribe_diarize.py meeting.m4a \
--model gpt-4o-transcribe-diarize \
--response-format diarized_json \
--out-dir output/meeting
Known speaker identification (max 4 speakers):
python ~/.claude/skills/transcribe/scripts/transcribe_diarize.py interview.wav \
--model gpt-4o-transcribe-diarize \
--response-format diarized_json \
--known-speaker "Alice=refs/alice.wav" \
--known-speaker "Bob=refs/bob.wav" \
--out-dir output/interview
Batch transcription (multiple files):
python ~/.claude/skills/transcribe/scripts/transcribe_diarize.py file1.mp3 file2.wav file3.m4a --out-dir output/batch
Dry run (validate inputs, print payload, no API call):
python ~/.claude/skills/transcribe/scripts/transcribe_diarize.py audio.mp3 --dry-run
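When the CLI is driven from another Python program, the invocations above can be assembled as an argument list. This is a sketch only: `build_command` is a hypothetical wrapper, it uses just the flags shown in this document, and it assumes `--dry-run` can be combined with the other flags (the workflow note about dry-running known-speaker requests suggests so).

```python
import os

# Script path as given in this document.
SCRIPT = os.path.expanduser(
    "~/.claude/skills/transcribe/scripts/transcribe_diarize.py"
)
MAX_KNOWN_SPEAKERS = 4  # limit stated in the known-speaker example


def build_command(inputs, known_speakers=None, out_dir=None, dry_run=False):
    """Build the argv list for one CLI call; does not execute anything."""
    known_speakers = known_speakers or {}
    if len(known_speakers) > MAX_KNOWN_SPEAKERS:
        raise ValueError("at most 4 known speakers are supported")
    cmd = ["python", SCRIPT, *inputs]
    if known_speakers:
        # Known speakers require the diarize model and diarized_json output.
        cmd += ["--model", "gpt-4o-transcribe-diarize",
                "--response-format", "diarized_json"]
        for name, ref in known_speakers.items():
            cmd += ["--known-speaker", f"{name}={ref}"]
    if out_dir:
        cmd += ["--out-dir", out_dir]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

A caller could pass the result to `subprocess.run(cmd, check=True)`, first with `dry_run=True` and then without it once the printed payload looks right.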
These are non-obvious constraints that cause silent failures or unexpected results:
- The --prompt flag with gpt-4o-transcribe-diarize causes an error. The CLI blocks this combination. Use the --language hint instead for guidance.
- --response-format diarized_json with gpt-4o-mini-transcribe will fail. The CLI validates this.
- The diarize model requires chunking for audio longer than 30 seconds. --chunking-strategy auto handles this, but if overridden to none, audio >30s will fail with the diarize model.

| Format | Extension | Use for | Contains |
|--------|-----------|---------|----------|
| text | .txt | Direct reading, editing, summarization | Plain transcript text only |
| json | .json | Programmatic access, timestamp extraction | Structured segments with start/end times |
| diarized_json | .json | Meeting notes, interview analysis, attribution | Speaker-labeled segments with timestamps |
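The model and flag constraints described in this section can be expressed as a validation pass over the chosen options. This is a hedged sketch, not the bundled script's actual validation code: `validate_options` and its parameters are hypothetical names, and the second check generalizes from the mini model to any non-diarize model on the assumption that only the diarize model produces speaker segments.

```python
DIARIZE = "gpt-4o-transcribe-diarize"


def validate_options(model, response_format="text", prompt=None,
                     chunking_strategy="auto", duration_seconds=None):
    """Return a list of constraint violations; empty means the combination is allowed."""
    errors = []
    if model == DIARIZE and prompt:
        errors.append("--prompt is not allowed with the diarize model; use --language")
    if response_format == "diarized_json" and model != DIARIZE:
        errors.append("diarized_json requires gpt-4o-transcribe-diarize")
    # Only flag the chunking problem when the duration is known to exceed 30 s.
    if (model == DIARIZE and chunking_strategy == "none"
            and duration_seconds is not None and duration_seconds > 30):
        errors.append("audio >30s needs --chunking-strategy auto with the diarize model")
    return errors
```

Collecting every violation, rather than stopping at the first, lets a dry run report all configuration problems at once.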
| Concept | Why it is HERE and not in general knowledge |
|---------|----------------------------------------------|
| Model selection matrix (mini vs diarize) | OpenAI transcription models are new (2024-2025), selection criteria not widely documented, wrong choice = failed request |
| Known speaker reference mechanics | Underdocumented API feature using extra_body with base64 data URLs -- not guessable from standard SDK docs |
| diarized_json format constraints | Format-model coupling is a hard constraint that causes cryptic errors if violated |
| Chunking strategy requirements | Mandatory for diarize model on long audio but optional for mini -- asymmetric requirement not obvious |
| CLI script architecture | Bundled 277-line script with validation, dry-run, batch support -- must know it exists and how to invoke |
| Speaker attribution confidence | Probabilistic labeling with known failure modes (short utterances, similar voices) -- critical for meeting notes accuracy |
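The known-speaker row mentions extra_body with base64 data URLs. A rough sketch of how a reference clip might be encoded is shown below. The helper names are hypothetical, and the field names `known_speaker_names` and `known_speaker_references` are an assumption not confirmed by this document; check references/api.md or the bundled script before relying on them.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a reference clip from disk and encode it, guessing the MIME type."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    return bytes_to_data_url(Path(path).read_bytes(), mime)


def known_speaker_extra_body(speakers: dict) -> dict:
    """Map {"Alice": "refs/alice.wav"} to an extra_body payload.

    The two key names below are assumed, not taken from this document.
    """
    return {
        "known_speaker_names": list(speakers),
        "known_speaker_references": [file_to_data_url(p) for p in speakers.values()],
    }
```

Names and references are kept in the same order so that each data URL lines up with its speaker label.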