akrindev/video-transcriber

Name: video-transcriber
Author: akrindev

skills/video-transcriber/SKILL.md

npx skillsauth add akrindev/trimer-clip video-transcriber

Video Transcriber

This skill enables AI agents to transcribe audio from video files using Whisper (local processing), OpenAI Whisper API, Google Speech-to-Text, or Gemini API (cloud processing with advanced features).

When to Use

User wants to transcribe a video or audio file
User needs subtitles/captions for a video
User wants to analyze video content through transcription
User needs to identify viral-worthy segments
User wants speaker diarization or emotion detection

Model Selection

Whisper (Local)

Pros:

Free to use
100% privacy (no cloud upload)
Good for sensitive content
Lower cost for high volume

Cons:

Requires local processing power
No built-in speaker diarization
No emotion detection
Limited to 99 languages

Models:

tiny - Fastest, lower accuracy (~39M parameters)
base - Fast, good accuracy (~74M parameters)
small - Balanced speed/accuracy (~244M parameters)
medium - Good accuracy, slower (~769M parameters)
large-v3 - Highest accuracy, slowest (~1550M parameters)

Local-first testing: Use tiny when you want the fastest local run for validation.

Gemini API (Cloud)

Pros:

High accuracy with gemini-flash-lite-latest
Built-in speaker diarization
Emotion detection from speech
Context understanding
Can identify viral segments
125+ language support
Faster processing (cloud-based)

Cons:

Requires API key
Cloud upload (privacy consideration)
Cost per usage
Internet required

OpenAI Whisper API (Cloud)

Pros:

High accuracy with word-level timestamps
No local GPU/CPU needed
Consistent results

Cons:

Requires API key
Cloud upload (privacy consideration)
Cost per usage
Internet required

Google Speech-to-Text (Cloud)

Pros:

High accuracy with word-level timestamps
Speaker diarization support
Scales well for long audio

Cons:

Requires Google Cloud credentials
Cloud upload (privacy consideration)
Cost per usage
Internet required

Available Scripts

`scripts/transcribe.py`

Transcribe audio from a video file.

Usage:

python skills/video-transcriber/scripts/transcribe.py <video_path> [options]

Options:

--model, -m: Model to use (auto, whisper, gemini, openai, google) - default: auto
--whisper-model: Whisper model size (tiny, base, small, medium, large-v3) - default: medium
--openai-model: OpenAI Whisper model (default: whisper-1)
--google-model: Google Speech model (default: latest_long)
--use-faster: Use faster-whisper for speed - default: True
--output, -o: Output file path (default: <video_path>.srt)
--format: Output format (srt, vtt, json) - default: srt
--language: Language code (e.g., en, id) - default: auto
--speaker-diarization: Enable speaker labels (Gemini only)
--emotion-detection: Enable emotion detection (Gemini only)
--device: Device for Whisper (auto, cpu, cuda) - default: auto
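The options above map naturally onto an argparse parser. The sketch below is a hypothetical reconstruction, not the actual contents of scripts/transcribe.py: flag names and defaults come from the list, while `build_parser`, `resolve_output`, the on/off form of `--use-faster`, and the rule that the output extension follows `--format` are assumptions. It needs Python 3.9 or later for `argparse.BooleanOptionalAction`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Transcribe audio from a video file.")
    p.add_argument("video_path")
    p.add_argument("--model", "-m", default="auto",
                   choices=["auto", "whisper", "gemini", "openai", "google"])
    p.add_argument("--whisper-model", default="medium",
                   choices=["tiny", "base", "small", "medium", "large-v3"])
    p.add_argument("--openai-model", default="whisper-1")
    p.add_argument("--google-model", default="latest_long")
    # Assumption: exposed as --use-faster / --no-use-faster; the real script may differ.
    p.add_argument("--use-faster", action=argparse.BooleanOptionalAction, default=True)
    p.add_argument("--output", "-o", default=None)
    p.add_argument("--format", default="srt", choices=["srt", "vtt", "json"])
    p.add_argument("--language", default="auto")
    p.add_argument("--speaker-diarization", action="store_true")
    p.add_argument("--emotion-detection", action="store_true")
    p.add_argument("--device", default="auto", choices=["auto", "cpu", "cuda"])
    return p


def resolve_output(args: argparse.Namespace) -> str:
    """Return --output if given, else <video_path>.<format> (assumed rule)."""
    return args.output or f"{args.video_path}.{args.format}"
```

Calling `build_parser().parse_args(["video.mp4"])` yields the documented defaults, which makes it easy to see what a bare invocation would request.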

Examples:

Transcribe with default options:

python skills/video-transcriber/scripts/transcribe.py video.mp4

Transcribe with Gemini API:

python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini

Transcribe with OpenAI Whisper API:

python skills/video-transcriber/scripts/transcribe.py video.mp4 --model openai --format json

Note: When using --model openai, the system will try Google Speech-to-Text as a fallback if OpenAI fails and Google credentials are available.
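The fallback described in this note can be pictured as a try/except chain. This is a minimal illustration only: `transcribe_with_fallback` and the two callables passed to it are hypothetical stand-ins for the skill's real OpenAI and Google backends, and the only detail taken from this document is that Google credentials are supplied through GOOGLE_APPLICATION_CREDENTIALS.

```python
import os
from typing import Callable, Mapping, Optional


def transcribe_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Try the primary backend; use the fallback only if credentials exist."""
    env = os.environ if env is None else env
    try:
        return primary(path)
    except Exception as exc:
        # Without Google credentials there is nothing to fall back to.
        if not env.get("GOOGLE_APPLICATION_CREDENTIALS"):
            raise
        print(f"Primary backend failed ({exc}); trying fallback")
        return fallback(path)
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.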

Transcribe with Google Speech-to-Text:

python skills/video-transcriber/scripts/transcribe.py video.mp4 --model google --format json

Transcribe with speaker diarization and emotion detection (Gemini):

python skills/video-transcriber/scripts/transcribe.py video.mp4 --model gemini --speaker-diarization --emotion-detection

Transcribe with large Whisper model:

python skills/video-transcriber/scripts/transcribe.py video.mp4 --whisper-model large-v3

Output to JSON:

python skills/video-transcriber/scripts/transcribe.py video.mp4 --format json
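When an agent drives the script from Python rather than a shell, the same invocations can be assembled with `subprocess`. The sketch below is an assumption-laden convenience wrapper: `build_command` and `run_transcribe` are hypothetical names, the script path is the one shown in the usage line, and nothing is assumed about what the script writes to stdout.

```python
import subprocess
import sys

# Path as given in the usage line; adjust if the skill is installed elsewhere.
SCRIPT = "skills/video-transcriber/scripts/transcribe.py"


def build_command(video_path: str, **options) -> list:
    """Turn keyword options into CLI flags, e.g. whisper_model -> --whisper-model."""
    cmd = [sys.executable, SCRIPT, video_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)  # boolean switch such as --speaker-diarization
        elif value is not False and value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_transcribe(video_path: str, **options) -> subprocess.CompletedProcess:
    """Run the script and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_command(video_path, **options),
        check=True, capture_output=True, text=True,
    )
```

For example, `build_command("video.mp4", model="gemini", speaker_diarization=True)` reproduces the Gemini diarization example as an argument list.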

`scripts/analyze.py`

Analyze audio content using the Gemini API for viral segments, summaries, emotions, or questions.

Usage:

python skills/video-transcriber/scripts/analyze.py <video_path> [options]

Options:

--analysis-type: Type of analysis (viral, summary, emotions, questions) - default: viral
--num-segments: Number of segments to identify (for viral analysis) - default: 5
--model: Model to use (default: gemini)

Examples:

Detect viral segments:

python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type viral

Get summary:

python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type summary

Analyze emotions:

python skills/video-transcriber/scripts/analyze.py video.mp4 --analysis-type emotions

Output Format

SRT Format

1
00:00:00,000 --> 00:00:05,000
This is the first subtitle.

2
00:00:05,500 --> 00:00:10,000
This is the second subtitle.
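The SRT layout above is simple enough to generate by hand. The following sketch shows one way to render segments with `start`, `end`, and `text` fields into that layout; the function names are illustrative and are not taken from the skill's scripts.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm form used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"
```

Working in integer milliseconds avoids rounding surprises when a fractional second sits close to a boundary.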

JSON Format

[
  {
    "index": 1,
    "start": 0.0,
    "end": 5.0,
    "text": "This is the first subtitle.",
    "speaker": "Speaker A",
    "emotion": "neutral"
  }
]
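JSON output is the easiest form to post-process. As a small example, the sketch below groups transcript text by speaker. It assumes the list-of-objects shape shown above and treats `speaker` as optional, on the guess that it is only present when diarization was requested; `text_by_speaker` is a hypothetical helper.

```python
import json
from collections import defaultdict


def text_by_speaker(json_text: str) -> dict:
    """Join each speaker's segments into one string, in transcript order."""
    segments = json.loads(json_text)
    grouped = defaultdict(list)
    for seg in segments:
        # Assumption: segments without diarization carry no "speaker" key.
        grouped[seg.get("speaker", "Unknown")].append(seg["text"])
    return {speaker: " ".join(parts) for speaker, parts in grouped.items()}
```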

Auto Selection Logic

When --model is auto, the system selects a model based on:

Privacy priority: Always use Whisper
Quality needed: Use gemini for highest quality
Content length: Use faster-whisper for long content (> 1 hour)
Feature requirements: Use gemini if speaker diarization or emotion detection needed
Default: Use gemini-flash-lite-latest

Note: openai and google models must be selected explicitly. Auto does not switch to them.
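The rules above can be read as an ordered decision chain. The sketch below encodes them that way under one loud assumption: the list order is taken as the precedence order, which the document does not state, so a real implementation may, for instance, check feature requirements before content length. `Request` and `select_model` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    privacy_required: bool = False
    highest_quality: bool = False
    duration_seconds: float = 0.0
    speaker_diarization: bool = False
    emotion_detection: bool = False


def select_model(req: Request) -> str:
    """Apply the documented rules top to bottom (assumed precedence)."""
    if req.privacy_required:
        return "whisper"
    if req.highest_quality:
        return "gemini"
    if req.duration_seconds > 3600:  # longer than 1 hour
        return "faster-whisper"
    if req.speaker_diarization or req.emotion_detection:
        return "gemini"
    return "gemini-flash-lite-latest"
```

Note that openai and google never appear as return values, matching the rule that they must be selected explicitly.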

Environment Variables

# For Gemini API
export GEMINI_API_KEY="your-api-key"

# For OpenAI Whisper API
export OPENAI_API_KEY="your-api-key"

# For Google Speech-to-Text
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Optional: For Vertex AI
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1"
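Before choosing a backend it can help to check which credentials are actually set. The sketch below does that using the variable names listed above; the mapping from backend name to variable and the `available_backends` helper are assumptions made for illustration, and local Whisper is treated as always available because it needs no key.

```python
import os
from typing import Mapping, Optional

# Backend name -> environment variable it needs (names from the list above).
REQUIRED_ENV = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
}


def available_backends(env: Optional[Mapping[str, str]] = None) -> list:
    """Return backends whose credentials are present; whisper needs none."""
    env = os.environ if env is None else env
    backends = ["whisper"]
    backends += [name for name, var in REQUIRED_ENV.items() if env.get(var)]
    return backends
```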

Integration with Other Skills

After transcription, you can use these skills:

highlight-scanner: Analyze transcript for viral moments
subtitle-overlay: Add captions to video
autocut-shorts: Full workflow for creating short clips

Common Workflow

User provides video file or URL
Download if needed (youtube-downloader)
Transcribe using this skill
Analyze transcript for highlights (highlight-scanner)
Create short clips (autocut-shorts)

Tips

Keep --use-faster enabled (the default) with Whisper for faster processing
Use Gemini when you need speaker diarization
Use --format json for programmatic processing
For long videos, consider splitting into segments
Use --analysis-type viral to identify best segments for short-form content
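For the tip about splitting long videos, the bookkeeping is the main thing to get right: each chunk is transcribed with timestamps relative to its own start, so they have to be shifted back when the pieces are merged. The sketch below shows that arithmetic only. The 10-minute chunk length and both function names are arbitrary illustrative choices, and the actual cutting (for example with FFmpeg) is left out.

```python
def chunk_windows(duration: float, chunk_seconds: float = 600.0) -> list:
    """Split [0, duration) into consecutive (start, end) windows."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        windows.append((start, end))
        start = end
    return windows


def merge_chunks(chunks) -> list:
    """Merge (offset_seconds, segments) pairs into one re-indexed transcript."""
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            # Shift chunk-relative times back onto the full-video timeline.
            merged.append({**seg, "start": seg["start"] + offset,
                           "end": seg["end"] + offset})
    for i, seg in enumerate(merged, start=1):
        seg["index"] = i
    return merged
```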

References

Whisper documentation: https://github.com/openai/whisper
Gemini API: https://ai.google.dev/gemini-api/docs/audio
Language codes: ISO 639-1 codes (en, id, es, etc.)

Category: development

Updated Apr 3, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add akrindev/trimer-clip video-transcriber

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner: Clean (95%)
Semgrep - Static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation: Clean (95%)
Snyk (dep) - Open source security scanning: Skipped (50%)
Socket.dev - Supply chain security analysis: Skipped (50%)
VirusTotal - Multi-engine malware detection: Skipped (50%)
CrowdStrike - Advanced threat intelligence: Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 3, 2026, 10:20 PM (155.4s, 3 files scanned)

SKILL.md

name: video-transcriber
description: Transcribe audio from videos using Whisper (local), OpenAI Whisper API, Google Speech-to-Text, or Gemini API (gemini-flash-lite-latest). Use when you need to convert video/audio to text for further processing, subtitle generation, or content analysis. Supports multiple languages, speaker diarization, and timestamp-accurate transcription. Gemini provides additional features like emotion detection and viral segment analysis.
allowed-tools: Bash(ffmpeg:*) Bash(python:*)
compatibility: Requires FFmpeg, optional OpenAI/Google Cloud API keys
version: 1.0
models: whisper, openai-whisper-api, google-stt, gemini-flash-lite-latest

Related Skills

akrindev/youtube-downloader (testing, updated Apr 3, 2026)

Download videos from YouTube URLs. Use when user wants to download a YouTube video for processing, editing, or transcription. Supports different quality options, audio-only extraction, and playlist downloads.

akrindev/video-trimmer (tools, updated Apr 3, 2026)

Trim and cut videos by timestamp with precision. Supports both stream copy (fast) and re-encoding (quality) modes. Use when you need to extract specific segments from videos, create clips from highlights, or cut unwanted portions.

akrindev/subtitle-overlay (tools, updated Apr 3, 2026)

Add burned-in subtitles/captions to video clips. Supports SRT/VTT/ASS subtitle files, customizable styling (font, size, color, position), and platform-specific presets for TikTok, YouTube Shorts, and Instagram Reels.

akrindev/speaker-diarization (tools, updated Apr 3, 2026)

Advanced speaker diarization using pyannote-audio. Identify who speaks when, detect multiple speakers, handle overlapping speech, and create speaker-specific segments. Use when you need accurate speaker identification, multi-speaker content analysis, or speaker-specific clip extraction. More accurate than Gemini's built-in diarization for complex scenarios.

Install manually

# Clone the repo
git clone https://github.com/akrindev/trimer-clip.git

# Copy into Claude Code skills folder (global)
cp -r trimer-clip/skills/video-transcriber ~/.claude/skills/

Repository: akrindev/trimer-clip

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT