Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

akrindev/speaker-diarization

Name: speaker-diarization
Author: akrindev

skills/speaker-diarization/SKILL.md

npx skillsauth add akrindev/trimer-clip speaker-diarization

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Speaker Diarization

Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.

When to Use

Use this skill when:

Video has multiple speakers (podcasts, interviews, panels)
You need accurate speaker identification
Content has overlapping speech (people talking over each other)
You want speaker-specific clips
Gemini's diarization isn't accurate enough
Working with multi-language or mixed speakers

Don't use when:

Single speaker content (use basic transcription instead)
Real-time processing needed (this is offline/batch)
Storage is limited (models require ~2GB)

Why pyannote-audio?

Benchmarks (Diarization Error Rate - lower is better):

pyannote community-1: 17.0% on AMI dataset
Gemini/Whisper: ~25-30% error rate
35% accuracy improvement over cloud APIs!

Advantages:

✅ Local processing - Privacy, no API costs
✅ Better accuracy - State-of-the-art neural models
✅ Overlapping speech detection - Identifies when people talk simultaneously
✅ Precise timestamps - Millisecond-accurate speaker boundaries
✅ No internet required - After initial model download
✅ Works with any language - Language-agnostic

Available Scripts

`scripts/diarize.py`

Main diarization script.

Usage:

python skills/speaker-diarization/scripts/diarize.py <video_path> [options]

Options:

--output, -o: Output format (json, rttm, srt) - default: json
--min-speakers: Minimum number of speakers to expect
--max-speakers: Maximum number of speakers to expect
--num-speakers: Exact number of speakers (if known)
--device: Processing device (cpu, cuda) - default: auto
--huggingface-token: HuggingFace token (or use env var)

Examples:

Basic diarization:

export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4

Specify speaker count range:

python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3

Output to RTTM format:

python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm

Output (JSON):

{
  "success": true,
  "video_path": "podcast.mp4",
  "num_speakers": 3,
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "duration": 5.2
    },
    {
      "start": 5.2,
      "end": 12.8,
      "speaker": "SPEAKER_01",
      "duration": 7.6
    }
  ],
  "overlapping_segments": [
    {
      "start": 45.2,
      "end": 47.8,
      "speakers": ["SPEAKER_00", "SPEAKER_01"]
    }
  ]
}

`scripts/extract_speaker_segments.py`

Extract video segments for specific speakers.

Usage:

python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]

Options:

--speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
--min-segment-duration: Minimum segment duration (seconds) - default: 5.0
--context: Add context seconds before/after - default: 2.0
--output-dir: Output directory

Examples:

Extract all speakers separately:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json

Extract only SPEAKER_00:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00

Extract with 3-second context:

python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0

`scripts/analyze_speaker_dynamics.py`

Analyze speaker interactions and dynamics.

Usage:

python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]

Output:

{
  "speaker_dynamics": {
    "total_speakers": 3,
    "dominant_speaker": "SPEAKER_00",
    "speaker_balance": 0.72,
    "interaction_moments": [
      {
        "type": "debate",
        "start": 120.5,
        "end": 145.2,
        "speakers": ["SPEAKER_00", "SPEAKER_01"],
        "intensity": 0.85
      },
      {
        "type": "overlapping_speech",
        "start": 200.0,
        "end": 202.5,
        "speakers": ["SPEAKER_01", "SPEAKER_02"]
      }
    ]
  }
}

Setup

1. Install Dependencies

pip install pyannote.audio torch torchaudio speechbrain

2. Get HuggingFace Token

Create account at huggingface.co
Generate token at hf.co/settings/tokens
Accept terms at pyannote/speaker-diarization-community-1

3. Set Environment Variable

export HUGGINGFACE_TOKEN="your-token-here"

Or use --huggingface-token flag.

How AI Agents Decide

When to use pyannote vs Gemini diarization:

def select_diarization_method(video_info, user_instructions):
    # User explicitly wants pyannote
    if "accurate" in user_instructions or "precise" in user_instructions:
        return "pyannote"
    
    # Multi-speaker content detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"
    
    # Podcast/interview format
    if any(word in user_instructions for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"
    
    # Overlapping speech expected
    if 'overlapping' in user_instructions or 'talk over' in user_instructions:
        return "pyannote"
    
    # Privacy requirement
    if 'private' in user_instructions or 'offline' in user_instructions:
        return "pyannote"
    
    # Single speaker or simple case - use Gemini (faster)
    return "gemini"

Agent decision criteria:

Use pyannote: Multi-speaker, accuracy-critical, offline needed
Use Gemini: Single speaker, speed-critical, simple scenario

Integration with Other Skills

Enhanced `video-transcriber`

# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
  --model whisper \
  --diarization pyannote \
  --output-format srt-with-speakers

Enhanced `highlight-scanner`

# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
  --transcript-path video.srt \
  --diarization-path video_diarization.json \
  --speaker-dynamics

Enhanced `autocut-shorts`

# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
  --use-speaker-diarization \
  --focus-speaker SPEAKER_00 \
  --num-clips 5

Output Formats

JSON (default)

Full metadata including speaker statistics and overlapping segments.

RTTM

Standard diarization format for research/annotation:

SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>

SRT with Speakers

1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone

2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today

Performance

Processing Speed:

CPU: ~30 seconds per hour of audio (Intel i7)
GPU: ~10 seconds per hour of audio (NVIDIA RTX 3060)
First run: +60 seconds (model download)

Accuracy:

2-3 speakers: 95%+ accuracy
4-6 speakers: 85-90% accuracy
7+ speakers: 70-80% accuracy

Tips

Specify speaker count if known - improves accuracy
Use for podcasts/interviews - better than cloud APIs
Combine with transcription - diarization + Whisper = perfect
Check overlapping speech - identifies heated discussions
Export to SRT - easy to import into video editors

References

pyannote.audio: https://github.com/pyannote/pyannote-audio
Model hub: https://huggingface.co/pyannote
Paper: https://arxiv.org/abs/2310.11347

akrindev/speaker-diarization

skills/speaker-diarization/SKILL.md

Advanced speaker diarization using pyannote-audio. Identify who speaks when, detect multiple speakers, handle overlapping speech, and create speaker-specific segments. Use when you need accurate speaker identification, multi-speaker content analysis, or speaker-specific clip extraction. More accurate than Gemini's built-in diarization for complex scenarios.

tools

Updated Apr 3, 2026

$ install --global

skillsauth

npx skillsauth add akrindev/trimer-clip speaker-diarization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 3, 2026, 10:14 PM256.9s4 files scanned

SKILL.md

name:: speaker-diarization
description:: Advanced speaker diarization using pyannote-audio. Identify who speaks when, detect multiple speakers, handle overlapping speech, and create speaker-specific segments. Use when you need accurate speaker identification, multi-speaker content analysis, or speaker-specific clip extraction. More accurate than Gemini's built-in diarization for complex scenarios.
allowed-tools:: Bash(ffmpeg:*) Bash(python:*)
compatibility:: Requires pyannote.audio, torch, and HuggingFace token. GPU optional but recommended.
version:: 1.0
models:: pyannote/speaker-diarization-community-1, pyannote/segmentation-3.0

Speaker Diarization

Advanced speaker diarization using pyannote-audio - state-of-the-art neural network models for speaker identification.

When to Use

Use this skill when:

Video has multiple speakers (podcasts, interviews, panels)
You need accurate speaker identification
Content has overlapping speech (people talking over each other)
You want speaker-specific clips
Gemini's diarization isn't accurate enough
Working with multi-language or mixed speakers

Don't use when:

Single speaker content (use basic transcription instead)
Real-time processing needed (this is offline/batch)
Storage is limited (models require ~2GB)

Why pyannote-audio?

Benchmarks (Diarization Error Rate - lower is better):

pyannote community-1: 17.0% on AMI dataset
Gemini/Whisper: ~25-30% error rate
35% accuracy improvement over cloud APIs!

Advantages:

✅ Local processing - Privacy, no API costs
✅ Better accuracy - State-of-the-art neural models
✅ Overlapping speech detection - Identifies when people talk simultaneously
✅ Precise timestamps - Millisecond-accurate speaker boundaries
✅ No internet required - After initial model download
✅ Works with any language - Language-agnostic

Available Scripts

`scripts/diarize.py`

Main diarization script.

Usage:

python skills/speaker-diarization/scripts/diarize.py <video_path> [options]

Options:

--output, -o: Output format (json, rttm, srt) - default: json
--min-speakers: Minimum number of speakers to expect
--max-speakers: Maximum number of speakers to expect
--num-speakers: Exact number of speakers (if known)
--device: Processing device (cpu, cuda) - default: auto
--huggingface-token: HuggingFace token (or use env var)

Examples:

Basic diarization:

export HUGGINGFACE_TOKEN="your-token"
python skills/speaker-diarization/scripts/diarize.py podcast.mp4

Specify speaker count range:

python skills/speaker-diarization/scripts/diarize.py interview.mp4 --min-speakers 2 --max-speakers 3

Output to RTTM format:

python skills/speaker-diarization/scripts/diarize.py panel.mp4 --output rttm

Output (JSON):

{
  "success": true,
  "video_path": "podcast.mp4",
  "num_speakers": 3,
  "duration": 1200.5,
  "speakers": {
    "SPEAKER_00": {"duration": 450.2, "segments": 45},
    "SPEAKER_01": {"duration": 380.5, "segments": 38},
    "SPEAKER_02": {"duration": 369.8, "segments": 42}
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "SPEAKER_00",
      "duration": 5.2
    },
    {
      "start": 5.2,
      "end": 12.8,
      "speaker": "SPEAKER_01",
      "duration": 7.6
    }
  ],
  "overlapping_segments": [
    {
      "start": 45.2,
      "end": 47.8,
      "speakers": ["SPEAKER_00", "SPEAKER_01"]
    }
  ]
}

`scripts/extract_speaker_segments.py`

Extract video segments for specific speakers.

Usage:

python skills/speaker-diarization/scripts/extract_speaker_segments.py <video_path> <diarization_json> [options]

Options:

--speaker: Speaker ID to extract (SPEAKER_00, SPEAKER_01, etc.) - default: all
--min-segment-duration: Minimum segment duration (seconds) - default: 5.0
--context: Add context seconds before/after - default: 2.0
--output-dir: Output directory

Examples:

Extract all speakers separately:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json

Extract only SPEAKER_00:

python skills/speaker-diarization/scripts/extract_speaker_segments.py podcast.mp4 podcast_diarization.json --speaker SPEAKER_00

Extract with 3-second context:

python skills/speaker-diarization/scripts/extract_speaker_segments.py interview.mp4 diarization.json --context 3.0

`scripts/analyze_speaker_dynamics.py`

Analyze speaker interactions and dynamics.

Usage:

python skills/speaker-diarization/scripts/analyze_speaker_dynamics.py <diarization_json> [options]

Output:

{
  "speaker_dynamics": {
    "total_speakers": 3,
    "dominant_speaker": "SPEAKER_00",
    "speaker_balance": 0.72,
    "interaction_moments": [
      {
        "type": "debate",
        "start": 120.5,
        "end": 145.2,
        "speakers": ["SPEAKER_00", "SPEAKER_01"],
        "intensity": 0.85
      },
      {
        "type": "overlapping_speech",
        "start": 200.0,
        "end": 202.5,
        "speakers": ["SPEAKER_01", "SPEAKER_02"]
      }
    ]
  }
}

Setup

1. Install Dependencies

pip install pyannote.audio torch torchaudio speechbrain

2. Get HuggingFace Token

Create account at huggingface.co
Generate token at hf.co/settings/tokens
Accept terms at pyannote/speaker-diarization-community-1

3. Set Environment Variable

export HUGGINGFACE_TOKEN="your-token-here"

Or use --huggingface-token flag.

How AI Agents Decide

When to use pyannote vs Gemini diarization:

def select_diarization_method(video_info, user_instructions):
    # User explicitly wants pyannote
    if "accurate" in user_instructions or "precise" in user_instructions:
        return "pyannote"
    
    # Multi-speaker content detected
    if video_info.get('num_speakers', 1) > 2:
        return "pyannote"
    
    # Podcast/interview format
    if any(word in user_instructions for word in ['podcast', 'interview', 'panel', 'debate']):
        return "pyannote"
    
    # Overlapping speech expected
    if 'overlapping' in user_instructions or 'talk over' in user_instructions:
        return "pyannote"
    
    # Privacy requirement
    if 'private' in user_instructions or 'offline' in user_instructions:
        return "pyannote"
    
    # Single speaker or simple case - use Gemini (faster)
    return "gemini"

Agent decision criteria:

Use pyannote: Multi-speaker, accuracy-critical, offline needed
Use Gemini: Single speaker, speed-critical, simple scenario

Integration with Other Skills

Enhanced `video-transcriber`

# Transcribe with pyannote diarization
python skills/video-transcriber/scripts/transcribe.py video.mp4 \
  --model whisper \
  --diarization pyannote \
  --output-format srt-with-speakers

Enhanced `highlight-scanner`

# Find highlights considering speaker dynamics
python skills/highlight-scanner/scripts/find_highlights.py video.mp4 \
  --transcript-path video.srt \
  --diarization-path video_diarization.json \
  --speaker-dynamics

Enhanced `autocut-shorts`

# Autocut focusing on specific speaker
python skills/autocut-shorts/scripts/autocut.py podcast.mp4 \
  --use-speaker-diarization \
  --focus-speaker SPEAKER_00 \
  --num-clips 5

Output Formats

JSON (default)

Full metadata including speaker statistics and overlapping segments.

RTTM

Standard diarization format for research/annotation:

SPEAKER podcast 1 0.0 5.2 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER podcast 1 5.2 7.6 <NA> <NA> SPEAKER_01 <NA> <NA>

SRT with Speakers

1
00:00:00,000 --> 00:00:05,200
[SPEAKER_00]: Welcome to the show everyone

2
00:00:05,200 --> 00:00:12,800
[SPEAKER_01]: Thanks for having me on today

Performance

Processing Speed:

CPU: ~30 seconds per hour of audio (Intel i7)
GPU: ~10 seconds per hour of audio (NVIDIA RTX 3060)
First run: +60 seconds (model download)

Accuracy:

2-3 speakers: 95%+ accuracy
4-6 speakers: 85-90% accuracy
7+ speakers: 70-80% accuracy

Tips

Specify speaker count if known - improves accuracy
Use for podcasts/interviews - better than cloud APIs
Combine with transcription - diarization + Whisper = perfect
Check overlapping speech - identifies heated discussions
Export to SRT - easy to import into video editors

References

pyannote.audio: https://github.com/pyannote/pyannote-audio
Model hub: https://huggingface.co/pyannote
Paper: https://arxiv.org/abs/2310.11347

Related Skills

akrindev/youtube-downloader

testing

VerifiedTrustedCommunity

Download videos from YouTube URLs. Use when user wants to download a YouTube video for processing, editing, or transcription. Supports different quality options, audio-only extraction, and playlist downloads.

SKILL.mdUpdated Apr 3, 2026

akrindev/youtube-downloader

akrindev/video-trimmer

tools

VerifiedTrustedCommunity

Trim and cut videos by timestamp with precision. Supports both stream copy (fast) and re-encoding (quality) modes. Use when you need to extract specific segments from videos, create clips from highlights, or cut unwanted portions.

SKILL.mdUpdated Apr 3, 2026

akrindev/video-trimmer

akrindev/video-transcriber

development

VerifiedTrustedCommunity

Transcribe audio from videos using Whisper (local), OpenAI Whisper API, Google Speech-to-Text, or Gemini API (gemini-flash-lite-latest). Use when you need to convert video/audio to text for further processing, subtitle generation, or content analysis. Supports multiple languages, speaker diarization, and timestamp-accurate transcription. Gemini provides additional features like emotion detection and viral segment analysis.

SKILL.mdUpdated Apr 3, 2026

akrindev/video-transcriber

akrindev/subtitle-overlay

tools

VerifiedTrustedCommunity

Add burned-in subtitles/captions to video clips. Supports SRT/VTT/ASS subtitle files, customizable styling (font, size, color, position), and platform-specific presets for TikTok, YouTube Shorts, and Instagram Reels.

SKILL.mdUpdated Apr 3, 2026

akrindev/subtitle-overlay

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/akrindev/trimer-clip.git

# Copy into Claude Code skills folder (global)
cp -r trimer-clip/skills/speaker-diarization ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

akrindev/trimer-clip

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT

Adoption

akrindev/speaker-diarization

$ install --global

Security Scan Results

SKILL.md

Speaker Diarization

When to Use

Why pyannote-audio?

Available Scripts

scripts/diarize.py

scripts/extract_speaker_segments.py

scripts/analyze_speaker_dynamics.py

Setup

1. Install Dependencies

2. Get HuggingFace Token

3. Set Environment Variable

How AI Agents Decide

Integration with Other Skills

Enhanced video-transcriber

Enhanced highlight-scanner

Enhanced autocut-shorts

Output Formats

JSON (default)

RTTM

SRT with Speakers

Performance

Tips

References

Related Skills

akrindev/youtube-downloader

akrindev/video-trimmer

akrindev/video-transcriber

akrindev/subtitle-overlay

akrindev/speaker-diarization

$ install --global

Security Scan Results

SKILL.md

Speaker Diarization

When to Use

Why pyannote-audio?

Available Scripts

scripts/diarize.py

scripts/extract_speaker_segments.py

scripts/analyze_speaker_dynamics.py

Setup

1. Install Dependencies

2. Get HuggingFace Token

3. Set Environment Variable

How AI Agents Decide

Integration with Other Skills

Enhanced video-transcriber

Enhanced highlight-scanner

Enhanced autocut-shorts

Output Formats

JSON (default)

RTTM

SRT with Speakers

Performance

Tips

References

Related Skills

akrindev/youtube-downloader

akrindev/video-trimmer

akrindev/video-transcriber

akrindev/subtitle-overlay

`scripts/diarize.py`

`scripts/extract_speaker_segments.py`

`scripts/analyze_speaker_dynamics.py`

Enhanced `video-transcriber`

Enhanced `highlight-scanner`

Enhanced `autocut-shorts`

`scripts/diarize.py`

`scripts/extract_speaker_segments.py`

`scripts/analyze_speaker_dynamics.py`

Enhanced `video-transcriber`

Enhanced `highlight-scanner`

Enhanced `autocut-shorts`