agntswrm/audio-transcribe

Name: audio-transcribe
Author: agntswrm

skills/audio-transcribe/SKILL.md

npx skillsauth add agntswrm/agent-media audio-transcribe

Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

npx agent-media@latest audio transcribe --in <path> [options]

Inputs

| Option | Required | Description |
|--------|----------|-------------|
| --in | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| --diarize | No | Enable speaker identification |
| --language | No | Language code (auto-detected if not provided) |
| --speakers | No | Number of speakers hint for diarization |
| --out | No | Output path, filename or directory (default: ./) |
| --provider | No | Provider to use (local, fal, replicate, runpod) |
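
When the command is driven from a script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of agent-media itself: `build_transcribe_command` is a hypothetical helper, and it only assembles the flags documented in the table so the result can be passed to something like `subprocess.run`.

```python
from typing import List, Optional

# Provider names listed in the Inputs table.
KNOWN_PROVIDERS = {"local", "fal", "replicate", "runpod"}


def build_transcribe_command(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `audio transcribe` (hypothetical helper)."""
    if not in_path:
        raise ValueError("--in is required")
    if provider is not None and provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")

    cmd = ["npx", "agent-media@latest", "audio", "transcribe", "--in", in_path]
    if diarize:
        cmd.append("--diarize")
    if language is not None:
        cmd += ["--language", language]
    if speakers is not None:
        cmd += ["--speakers", str(speakers)]
    if out is not None:
        cmd += ["--out", out]
    if provider is not None:
        cmd += ["--provider", provider]
    return cmd


if __name__ == "__main__":
    # Prints the command line for a diarized English podcast with 3 speakers.
    print(" ".join(build_transcribe_command(
        "podcast.mp3", diarize=True, language="en", speakers=3)))
```

Returning a list rather than a single string avoids shell quoting problems when the input path contains spaces.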

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
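
The JSON object above can be turned into a readable transcript with a few lines of code. This sketch uses the sample output verbatim; `format_transcript` is a hypothetical helper, and the fallback label for a missing "speaker" field is an assumption (the source does not say what segments look like when diarization is off).

```python
import json
from typing import List

# Sample output copied from the documentation above.
SAMPLE = """
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
"""


def format_transcript(result: dict) -> List[str]:
    """Render each segment as '[start-end] SPEAKER: text' (hypothetical helper)."""
    if not result.get("ok"):
        raise ValueError("transcription did not report ok: true")
    lines = []
    for seg in result["transcription"]["segments"]:
        # Assumption: "speaker" may be absent when diarization is disabled.
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}")
    return lines


if __name__ == "__main__":
    for line in format_transcript(json.loads(SAMPLE)):
        print(line)
```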

Examples

Basic transcription (auto-detect language):

npx agent-media@latest audio transcribe --in interview.mp3

Transcription with speaker identification:

npx agent-media@latest audio transcribe --in meeting.wav --diarize

Transcription with specific language and speaker count:

npx agent-media@latest audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use specific provider:

npx agent-media@latest audio transcribe --in audio.wav --provider replicate

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
npx agent-media@latest audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
npx agent-media@latest audio transcribe --in extracted_xxx.mp3
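
Once a video's audio has been transcribed, the segment timings can be converted into subtitles. The sketch below writes segments in SRT format; `srt_timestamp` and `segments_to_srt` are hypothetical helpers, not agent-media features, and they assume segments shaped like the "segments" array in the Output section.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT time format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text']}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample_segments = [
        {"start": 0.0, "end": 2.5, "text": "Hello."},
        {"start": 2.5, "end": 5.0, "text": "Hi there."},
    ]
    print(segments_to_srt(sample_segments))
```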

Providers

local

Runs locally on CPU using Transformers.js, no API key required.

Uses the Moonshine model (5x faster than Whisper)
Models are downloaded on first use (~100MB)
Does NOT support diarization; use fal or replicate for speaker identification
You may see a "mutex lock failed" error. It can be ignored: the output is correct if "ok": true

npx agent-media@latest audio transcribe --in audio.mp3 --provider local
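
Because the local provider may emit a spurious error message, a wrapper script can judge success by the "ok" field rather than by the presence of error text. This is a minimal sketch under one stated assumption: that the JSON result is what the wrapper receives as captured output text (the source does not specify which stream carries it). `transcription_succeeded` is a hypothetical helper.

```python
import json


def transcription_succeeded(output_text: str) -> bool:
    """Return True only if output_text is a JSON object with "ok": true."""
    try:
        result = json.loads(output_text)
    except json.JSONDecodeError:
        # Not valid JSON (for example, only an error message was captured).
        return False
    return isinstance(result, dict) and result.get("ok") is True


if __name__ == "__main__":
    print(transcription_succeeded('{"ok": true, "provider": "local"}'))
    print(transcription_succeeded("mutex lock failed"))
```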

fal

Requires FAL_API_KEY
Uses wizper model for fast transcription (2x faster) when diarization is disabled
Uses whisper model when diarization is enabled (native support)

replicate

Requires REPLICATE_API_TOKEN
Uses whisper-diarization model with Whisper Large V3 Turbo
Native diarization support with word-level timestamps

runpod

Requires RUNPOD_API_KEY
Uses pruna/whisper-v3-large model (Whisper Large V3)
Does NOT support diarization (speaker identification); use fal or replicate instead

npx agent-media@latest audio transcribe --in audio.mp3 --provider runpod
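
The provider notes above boil down to two constraints: which credential each provider needs and whether it supports diarization. The sketch below encodes those facts in a lookup; `pick_provider` is a hypothetical helper, and the preference order (hosted providers first, local as the fallback) is an assumption for illustration, not documented behavior of agent-media.

```python
import os
from typing import Mapping, Optional

# (name, required environment variable or None, supports diarization),
# taken from the provider notes above. The ordering is an assumption.
PROVIDERS = [
    ("fal", "FAL_API_KEY", True),
    ("replicate", "REPLICATE_API_TOKEN", True),
    ("runpod", "RUNPOD_API_KEY", False),
    ("local", None, False),
]


def pick_provider(diarize: bool, env: Mapping[str, str]) -> Optional[str]:
    """Return the first usable provider name, or None if none qualifies."""
    for name, env_var, supports_diarization in PROVIDERS:
        if diarize and not supports_diarization:
            continue  # local and runpod cannot identify speakers
        if env_var is not None and not env.get(env_var):
            continue  # required credential is not set
        return name
    return None


if __name__ == "__main__":
    # Uses the real environment; prints a provider name or None.
    print(pick_provider(diarize=True, env=os.environ))
```

The returned name can be passed to `--provider`; a `None` result with `diarize=True` means neither FAL_API_KEY nor REPLICATE_API_TOKEN is set.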

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

3 stars

content-media

Updated Mar 28, 2026

Install

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add agntswrm/agent-media audio-transcribe

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---------|-------------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 30, 2026, 8:28 PM (53.3s, 1 file scanned)

SKILL.md

name: audio-transcribe
description: Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.
compatibility: Requires Node.js 18+. Run via npx agent-media@latest or npm install -g agent-media.

Related Skills

agntswrm/video-generate
Category: data-ai
Generates video from text prompts or animates static images. Use when you need to create videos from descriptions, animate images, or produce video content using AI.
3 stars, SKILL.md, updated Mar 28, 2026

agntswrm/image-upscale
Category: development
Upscales an image using AI super-resolution to increase resolution with detail generation. Use when you need to enlarge images, improve low-resolution photos, or prepare images for large-format display.
3 stars, SKILL.md, updated Mar 28, 2026

agntswrm/image-resize
Category: testing
Resizes an image to specified dimensions. Use when you need to change image size, create thumbnails, or prepare images for specific display requirements.
3 stars, SKILL.md, updated Mar 28, 2026

agntswrm/image-remove-background
Category: content-media
Removes the background from an image, leaving the foreground subject with transparency. Use when you need to isolate subjects, create cutouts, or prepare images for compositing.
3 stars, SKILL.md, updated Mar 28, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/agntswrm/agent-media.git

# Copy into Claude Code skills folder (global)
cp -r agent-media/skills/audio-transcribe ~/.claude/skills/

Repository

agntswrm/agent-media

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT