skills/speaker-diarization/SKILL.md
Advanced speaker diarization using pyannote-audio. Identify who speaks when, detect multiple speakers, handle overlapping speech, and create speaker-specific segments. Use when you need accurate speaker identification, multi-speaker content analysis, or speaker-specific clip extraction. More accurate than Gemini's built-in diarization for complex scenarios.
npx skillsauth add akrindev/trimer-clip speaker-diarization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.
Use this skill when:
Don't use when:
Benchmarks (Diarization Error Rate - lower is better):
Advantages:
scripts/diarize.py
Main diarization script.
Usage:
python skills/speaker-diarization/scripts/diarize.py <video_path> [options]
Options:
--output, -o: Output format (json, rttm, srt) - default: json
--min-speakers: Minimum number of speakers to expect
--max-speakers: Maximum number of speakers to expect
--num-speakers: Exact number of speakers (if known)
--device: Processing device (cpu, cuda) - default: auto
--huggingface-token: HuggingFace token (or use env var)
Examples:
Basic diarization:
export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4
Specify speaker count range:
python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3
Output to RTTM format:
python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm
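The speaker-count flags map naturally onto the `num_speakers`, `min_speakers`, and `max_speakers` keyword arguments that pyannote.audio diarization pipelines accept. The following is a minimal sketch of that mapping, assuming an exact count takes precedence over a range; `build_speaker_kwargs` is a hypothetical helper and is not taken from diarize.py.

```python
from typing import Optional


def build_speaker_kwargs(
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> dict:
    """Translate CLI speaker-count options into pipeline keyword arguments.

    Hypothetical helper: assumes an exact count overrides a min/max range.
    """
    for name, value in (
        ("num_speakers", num_speakers),
        ("min_speakers", min_speakers),
        ("max_speakers", max_speakers),
    ):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")

    # An exact count makes the range irrelevant
    if num_speakers is not None:
        return {"num_speakers": num_speakers}

    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers cannot exceed max_speakers")

    kwargs = {}
    if min_speakers is not None:
        kwargs["min_speakers"] = min_speakers
    if max_speakers is not None:
        kwargs["max_speakers"] = max_speakers
    return kwargs


# Equivalent of: --min-speakers 2 --max-speakers 3
print(build_speaker_kwargs(min_speakers=2, max_speakers=3))
```

With a loaded pyannote pipeline, the resulting dictionary would be passed along with the audio path, for example `pipeline(audio_path, **kwargs)`.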
Output (JSON):
{
  "success": true,
  "video_path": "podcast.mp4",
  "num_speakers": 3,
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "duration": 5.2
    },
    {
      "start": 5.2,
      "end": 12.8,
      "speaker": "SPEAKER_01",
      "duration": 7.6
    }
  ],
  "overlapping_segments": [
    {
      "start": 45.2,
      "end": 47.8,
      "speakers": ["SPEAKER_00", "SPEAKER_01"]
    }
  ]
}
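Downstream code can read this JSON directly. As an illustration, the sketch below computes each speaker's share of the attributed speech time from the `speakers` block; the inline sample is a trimmed copy of the output above, and `talk_time_shares` is a hypothetical helper rather than part of the skill.

```python
import json

# Trimmed-down sample in the shape shown above
result = json.loads("""
{
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  }
}
""")


def talk_time_shares(diarization: dict) -> dict:
    """Return each speaker's fraction of the total attributed speech time."""
    durations = {
        speaker: info["duration"]
        for speaker, info in diarization["speakers"].items()
    }
    total = sum(durations.values())
    if total == 0:
        return {speaker: 0.0 for speaker in durations}
    return {speaker: dur / total for speaker, dur in durations.items()}


shares = talk_time_shares(result)
# Print speakers from most to least talkative
for speaker, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{speaker}: {share:.1%}")
```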
scripts/extract_speaker_segments.py
Extract video segments for specific speakers.
Usage:
python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]
Options:
--speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
--min-segment-duration: Minimum segment duration (seconds) - default: 5.0
--context: Add context seconds before/after - default: 2.0
--output-dir: Output directory
Examples:
Extract all speakers separately:
python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json
Extract only SPEAKER_00:
python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00
Extract with 3-second context:
python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0
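The interplay of `--speaker`, `--min-segment-duration`, and `--context` can be sketched as a filter followed by padding. This is an illustration under stated assumptions, not the script's actual code: `select_clips` is a hypothetical helper, the minimum duration is assumed to apply before padding, and padded clips are assumed to be clamped to the bounds of the video.

```python
def select_clips(segments, speaker=None, min_duration=5.0, context=2.0,
                 video_duration=None):
    """Filter diarization segments and pad the survivors with context.

    Returns a list of (start, end, speaker) tuples in seconds.
    """
    clips = []
    for seg in segments:
        # Keep only the requested speaker (None means all speakers)
        if speaker is not None and seg["speaker"] != speaker:
            continue
        # Drop segments shorter than the minimum (measured before padding)
        if seg["end"] - seg["start"] < min_duration:
            continue
        # Pad with context, clamped to the start and end of the video
        start = max(0.0, seg["start"] - context)
        end = seg["end"] + context
        if video_duration is not None:
            end = min(end, video_duration)
        clips.append((start, end, seg["speaker"]))
    return clips


segments = [
    {"start": 0.0, "end": 5.2, "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 12.8, "speaker": "SPEAKER_01"},
    {"start": 12.8, "end": 14.0, "speaker": "SPEAKER_00"},
]
clips = select_clips(segments, speaker="SPEAKER_00", video_duration=1200.5)
print(clips)
```

Each resulting tuple could then be handed to whatever cutting tool the pipeline uses.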
scripts/analyze_speaker_dynamics.py
Analyze speaker interactions and dynamics.
Usage:
python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]
Output:
{
  "speaker_dynamics": {
    "total_speakers": 3,
    "dominant_speaker": "SPEAKER_00",
    "speaker_balance": 0.72,
    "interaction_moments": [
      {
        "type": "debate",
        "start": 120.5,
        "end": 145.2,
        "speakers": ["SPEAKER_00", "SPEAKER_01"],
        "intensity": 0.85
      },
      {
        "type": "overlapping_speech",
        "start": 200.0,
        "end": 202.5,
        "speakers": ["SPEAKER_01", "SPEAKER_02"]
      }
    ]
  }
}
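One way an `overlapping_speech` moment could be derived from a segment list is a pairwise scan over segments sorted by start time. The sketch below is an assumption about how such detection might work, not the method used by analyze_speaker_dynamics.py; `find_overlaps` is a hypothetical helper.

```python
def find_overlaps(segments):
    """Return intervals where segments from two different speakers overlap."""
    ordered = sorted(segments, key=lambda s: s["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                # Segments are sorted by start, so no later one can overlap
                break
            if second["speaker"] == first["speaker"]:
                continue
            overlaps.append({
                "start": second["start"],
                "end": min(first["end"], second["end"]),
                "speakers": sorted([first["speaker"], second["speaker"]]),
            })
    return overlaps


segments = [
    {"start": 40.0, "end": 47.8, "speaker": "SPEAKER_00"},
    {"start": 45.2, "end": 52.0, "speaker": "SPEAKER_01"},
    {"start": 52.0, "end": 60.0, "speaker": "SPEAKER_00"},
]
print(find_overlaps(segments))
```

Segments that merely touch (one ends exactly where the next begins) are not reported as overlapping.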
pip install pyannote.audio torch torchaudio speechbrain
export HUGGINGFACE_TOKEN="your-token-here"
Or use --huggingface-token flag.
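The two ways of supplying the token can be reconciled with a small lookup. This sketch assumes the command-line flag takes precedence over the `HUGGINGFACE_TOKEN` environment variable; `resolve_hf_token` is a hypothetical helper, not a function from the skill's scripts.

```python
import os


def resolve_hf_token(cli_token=None, env=None):
    """Pick the HuggingFace token: explicit flag first, then HUGGINGFACE_TOKEN."""
    env = os.environ if env is None else env
    token = cli_token or env.get("HUGGINGFACE_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found. "
            "Set HUGGINGFACE_TOKEN or pass --huggingface-token."
        )
    return token


# Passing a dict instead of os.environ keeps the example reproducible
print(resolve_hf_token(env={"HUGGINGFACE_TOKEN": "your-token-here"}))
```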
When to use pyannote vs Gemini diarization:
def select_diarization_method(video_info, user_instructions):
    """Return "pyannote" or "gemini" based on the content and the request."""
    # Match keywords case-insensitively
    user_instructions = user_instructions.lower()

    # User explicitly wants pyannote
    if "accurate" in user_instructions or "precise" in user_instructions:
        return "pyannote"

    # Multi-speaker content detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"

    # Podcast/interview format
    if any(word in user_instructions for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"

    # Overlapping speech expected
    if 'overlapping' in user_instructions or 'talk over' in user_instructions:
        return "pyannote"

    # Privacy requirement
    if 'private' in user_instructions or 'offline' in user_instructions:
        return "pyannote"

    # Single speaker or simple case - use Gemini (faster)
    return "gemini"
Agent decision criteria:
video-transcriber
# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
--model whisper \
--diarization pyannote \
--output-format srt-with-speakers
highlight-scanner
# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
--transcript-path video.srt \
--diarization-path video_diarization.json \
--speaker-dynamics
autocut-shorts
# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
--use-speaker-diarization \
--focus-speaker SPEAKER_00 \
--num-clips 5
JSON: Full metadata, including speaker statistics and overlapping segments.
RTTM: Standard diarization format for research and annotation:
SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>
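In an RTTM `SPEAKER` line, the two numeric fields after the channel are the turn onset and the turn duration (not the end time). The sketch below converts JSON-style segments into such lines; `to_rttm` is a hypothetical helper, and the three-decimal formatting is an assumption that differs cosmetically from the sample above.

```python
def to_rttm(segments, file_id, channel=1):
    """Format diarization segments as RTTM SPEAKER lines."""
    lines = []
    for seg in segments:
        # RTTM stores onset and duration, so convert from start/end
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} {channel} {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)


segments = [
    {"start": 0.0, "end": 5.2, "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 12.8, "speaker": "SPEAKER_01"},
]
print(to_rttm(segments, "podcast"))
```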
SRT:
1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone

2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today
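SRT timestamps use the form `HH:MM:SS,mmm` with a comma before the milliseconds. The sketch below formats a speaker-labeled cue from a segment's start and end times; `srt_timestamp` and `srt_cue` are hypothetical helpers, and the cue text is assumed to come from a separate transcription step.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """Build one SRT cue with a [SPEAKER_XX] prefix on the text line."""
    return (
        f"{index}\n"
        f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
        f"[{speaker}]: {text}\n"
    )


cues = [
    srt_cue(1, 0.0, 5.2, "SPEAKER_00", "Welcome to the show everyone"),
    srt_cue(2, 5.2, 12.8, "SPEAKER_01", "Thanks for having me on today"),
]
# Cues are separated by a blank line
print("\n".join(cues))
```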
Processing Speed:
Accuracy: