skills/video-transcriber/SKILL.md
Transcribe audio from videos using Whisper (local), OpenAI Whisper API, Google Speech-to-Text, or Gemini API (gemini-flash-lite-latest). Use when you need to convert video/audio to text for further processing, subtitle generation, or content analysis. Supports multiple languages, speaker diarization, and timestamp-accurate transcription. Gemini provides additional features like emotion detection and viral segment analysis.
npx skillsauth add akrindev/trimer-clip video-transcriber
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables AI agents to transcribe audio from video files using Whisper (local processing), OpenAI Whisper API, Google Speech-to-Text, or Gemini API (cloud processing with advanced features).
Pros:
Cons:
Models:
- tiny - Fastest, lower accuracy (~32MB)
- base - Fast, good accuracy (~74MB)
- small - Balanced speed/accuracy (~244MB)
- medium - Good accuracy, slower (~769MB)
- large-v3 - Highest accuracy, slowest (~1550MB)

Local-first testing:
Use tiny when you want the fastest local run for validation.
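The size/accuracy trade-off in the model list can be turned into a small selection helper. This is only a sketch: `pick_whisper_model` and the `budget_mb` parameter are hypothetical names, not part of the skill, and the sizes are the approximate figures quoted above (download size, not runtime memory).

```python
# Approximate model sizes in MB, copied from the model list above.
WHISPER_MODEL_SIZES_MB = {
    "tiny": 32,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v3": 1550,
}


def pick_whisper_model(budget_mb: int) -> str:
    """Return the largest listed model whose size fits within budget_mb.

    Falls back to "tiny" when nothing fits, in line with the
    local-first testing advice. Hypothetical helper for illustration.
    """
    fitting = [
        name for name, size in WHISPER_MODEL_SIZES_MB.items() if size <= budget_mb
    ]
    if not fitting:
        return "tiny"
    # Larger models are listed as more accurate, so prefer the biggest fit.
    return max(fitting, key=WHISPER_MODEL_SIZES_MB.__getitem__)
```

The returned name can be passed to `--whisper-model`.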
Pros:
Cons:
Pros:
Cons:
Pros:
Cons:
scripts/transcribe.py
Transcribe audio from video file.
Usage:
python skills/video-transcriber/scripts/transcribe.py <video_path> [options]
Options:
- --model, -m: Model to use (whisper, gemini, openai, google) - default: auto
- --whisper-model: Whisper model size (tiny, base, small, medium, large-v3) - default: medium
- --openai-model: OpenAI Whisper model (default: whisper-1)
- --google-model: Google Speech model (default: latest_long)
- --use-faster: Use faster-whisper for speed - default: True
- --output, -o: Output file path (default: <video_path>.srt)
- --format: Output format (srt, vtt, json) - default: srt
- --language: Language code (e.g., en, id) - default: auto
- --speaker-diarization: Enable speaker labels (Gemini only)
- --emotion-detection: Enable emotion detection (Gemini only)
- --device: Device for Whisper (auto, cpu, cuda) - default: auto

Examples:
Transcribe with Whisper (default):
python skills/video-transcriber/scripts/transcribe.py video.mp4
Transcribe with Gemini API:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini
Transcribe with OpenAI Whisper API:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model openai --format json
Note: When using --model openai, the system will try Google Speech-to-Text as a fallback if OpenAI fails and Google credentials are available.
Transcribe with Google Speech-to-Text:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model google --format json
Transcribe with speaker diarization and emotion detection (Gemini):
python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini --speaker-diarization --emotion-detection
Transcribe with large Whisper model:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --whisper-model large-v3
Output to JSON:
python skills/video-transcriber/scripts/transcribe.py video.mp4 --format json
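When an agent calls the script from Python rather than a shell, the commands above can be assembled as an argument list. The sketch below uses only flags documented in the options list; `build_transcribe_command` and `run_transcribe` are hypothetical helpers, and rejecting the Gemini-only flags for other models is a conservative assumption, not documented script behavior.

```python
import subprocess
import sys
from typing import List, Optional

SCRIPT = "skills/video-transcriber/scripts/transcribe.py"


def build_transcribe_command(
    video_path: str,
    model: str = "auto",
    output_format: str = "srt",
    whisper_model: Optional[str] = None,
    language: Optional[str] = None,
    output: Optional[str] = None,
    speaker_diarization: bool = False,
    emotion_detection: bool = False,
) -> List[str]:
    """Build the argv list for transcribe.py from documented options."""
    # Assumption: fail early instead of passing Gemini-only flags elsewhere.
    if (speaker_diarization or emotion_detection) and model != "gemini":
        raise ValueError("speaker/emotion flags are documented as Gemini only")

    cmd = [sys.executable, SCRIPT, video_path, "--model", model,
           "--format", output_format]
    if whisper_model:
        cmd += ["--whisper-model", whisper_model]
    if language:
        cmd += ["--language", language]
    if output:
        cmd += ["--output", output]
    if speaker_diarization:
        cmd.append("--speaker-diarization")
    if emotion_detection:
        cmd.append("--emotion-detection")
    return cmd


def run_transcribe(video_path: str, **options) -> None:
    """Run the script and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_transcribe_command(video_path, **options), check=True)
```

For example, `run_transcribe("video.mp4", model="gemini", speaker_diarization=True)` corresponds to the Gemini diarization example, minus the emotion flag.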
scripts/analyze.py
Analyze audio content using Gemini API for viral segments, summary, or emotions.
Usage:
python skills/video-transcriber/scripts/analyze.py <video_path> [options]
Options:
- --analysis-type: Type of analysis (viral, summary, emotions, questions) - default: viral
- --num-segments: Number of segments to identify (for viral analysis) - default: 5
- --model: Model to use (default: gemini)

Examples:
Detect viral segments:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type viral
Get summary:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type summary
Analyze emotions:
python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type emotions
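The documented options and defaults for analyze.py can be mirrored with a small `argparse` parser, which is handy for validating arguments before invoking the script. This is a sketch of the interface as described here, not the script's actual source; `build_parser` is a hypothetical name and the real implementation may differ.

```python
import argparse

# Analysis types listed in the options above.
ANALYSIS_TYPES = ("viral", "summary", "emotions", "questions")


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented analyze.py options and defaults."""
    parser = argparse.ArgumentParser(
        description="Analyze audio content of a video (interface sketch)."
    )
    parser.add_argument("video_path", help="Path to the input video file")
    parser.add_argument("--analysis-type", choices=ANALYSIS_TYPES, default="viral")
    # Documented as applying to viral analysis.
    parser.add_argument("--num-segments", type=int, default=5)
    parser.add_argument("--model", default="gemini")
    return parser
```

Calling `build_parser().parse_args(["video.mp4"])` yields the documented defaults, and an unknown analysis type is rejected by `argparse` before any API call is made.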
SRT output:
1
00:00:00,000 --> 00:00:05,000
This is the first subtitle.

2
00:00:05,500 --> 00:00:10,000
This is the second subtitle.
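For downstream processing, the SRT layout shown above can be read back into plain Python data. The following is a minimal sketch that assumes well-formed cues separated by blank lines; `parse_srt` and `srt_time_to_seconds` are hypothetical helpers, not functions shipped with the skill.

```python
import re
from typing import Dict, List

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:00:05,500 to seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text: str) -> List[Dict]:
    """Parse SRT text into dicts with index, start, end, and text keys."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip incomplete cues
        start, end = (part.strip() for part in lines[1].split("-->"))
        cues.append({
            "index": int(lines[0]),
            "start": srt_time_to_seconds(start),
            "end": srt_time_to_seconds(end),
            "text": "\n".join(lines[2:]),
        })
    return cues
```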
JSON output:
[
  {
    "index": 1,
    "start": 0.0,
    "end": 5.0,
    "text": "This is the first subtitle.",
    "speaker": "Speaker A",
    "emotion": "neutral"
  }
]
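Because the JSON output carries start and end times in seconds, it can be converted to SRT after the fact. The sketch below assumes the field names shown in the example above; `segments_to_srt` is a hypothetical helper, and prefixing the text with the speaker label is an illustrative choice rather than the skill's behavior.

```python
from typing import Dict, List


def seconds_to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render JSON-style segments as SRT text."""
    blocks = []
    for seg in segments:
        text = seg["text"]
        # Assumption: show the optional speaker label as a prefix.
        if seg.get("speaker"):
            text = f'{seg["speaker"]}: {text}'
        start = seconds_to_srt_time(seg["start"])
        end = seconds_to_srt_time(seg["end"])
        blocks.append(f'{seg["index"]}\n{start} --> {end}\n{text}')
    # SRT cues are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```

The segment list would typically come from `json.load` on a file produced with `--format json`.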
When --model auto, the system selects based on:
Note: openai and google models must be selected explicitly. Auto does not switch to them.
# For Gemini API
export GEMINI_API_KEY="your-api-key"
# For OpenAI Whisper API
export OPENAI_API_KEY="your-api-key"
# For Google Speech-to-Text
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Optional: For Vertex AI
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1"
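Before choosing a cloud backend, it can help to check which of the variables above are actually set. This sketch maps each cloud backend to its credential variable and models only the documented OpenAI-to-Google fallback; `configured_backends` and `backend_order` are hypothetical helpers, and the skill's own selection logic may differ.

```python
import os
from typing import List, Mapping

# Credential variable for each cloud backend, from the exports above.
# Local Whisper is omitted because no variable is listed for it.
REQUIRED_ENV = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
}


def configured_backends(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return cloud backends whose credential variable is set and non-empty."""
    return [name for name, var in REQUIRED_ENV.items() if env.get(var)]


def backend_order(model: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Backends that would be tried, in order, for an explicit --model value.

    Only the documented case is modelled: openai falls back to google
    when Google credentials are available.
    """
    order = [model]
    if model == "openai" and env.get(REQUIRED_ENV["google"]):
        order.append("google")
    return order
```

Passing a plain dict instead of `os.environ` makes the helpers easy to test without touching the real environment.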
After transcription, you can use these skills:
- highlight-scanner: Analyze transcript for viral moments
- subtitle-overlay: Add captions to video
- autocut-shorts: Full workflow for creating short clips

Tips:
- Use --use-faster with Whisper for faster processing
- Use --format json for programmatic processing
- Use --analysis-type viral to identify best segments for short-form content