skills/audio-transcribe/SKILL.md
Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.
npx skillsauth add agntswrm/agent-media audio-transcribe
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Transcribes audio files to text with timestamps. Supports automatic language detection, speaker identification (diarization), and outputs structured JSON with segment-level timing.
npx agent-media@latest audio transcribe --in <path> [options]
| Option | Required | Description |
|--------|----------|-------------|
| --in | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| --diarize | No | Enable speaker identification |
| --language | No | Language code (auto-detected if not provided) |
| --speakers | No | Number of speakers hint for diarization |
| --out | No | Output path, filename or directory (default: ./) |
| --provider | No | Provider to use (local, fal, replicate, runpod) |
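The option table above maps directly onto a command line. As a minimal sketch (the helper name `build_transcribe_command` and its parameter names are hypothetical, not part of agent-media), a script could assemble the arguments like this:

```python
from typing import List, Optional


def build_transcribe_command(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `audio transcribe` (hypothetical helper).

    Only the flags listed in the option table are used.
    """
    cmd = ["npx", "agent-media@latest", "audio", "transcribe", "--in", in_path]
    if diarize:
        cmd.append("--diarize")
    if language:
        cmd += ["--language", language]
    if speakers is not None:
        cmd += ["--speakers", str(speakers)]
    if out:
        cmd += ["--out", out]
    if provider:
        cmd += ["--provider", provider]
    return cmd


# Print the command for a diarized English podcast with three speakers.
print(" ".join(build_transcribe_command("podcast.mp3", diarize=True, language="en", speakers=3)))
```

The resulting list can be passed to `subprocess.run`. Whether the JSON result is also written to standard output, in addition to the output file, is an assumption to verify against the tool itself.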
Returns a JSON object with transcription data:
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
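Because each segment carries `start` and `end` times in seconds, the result can be turned into subtitles. The following is a minimal sketch that converts the sample object above to SubRip (SRT) text. The helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and the code assumes every segment has `start`, `end`, and `text` fields as in the sample.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(result: dict) -> str:
    """Build SRT text from result['transcription']['segments'] (hypothetical helper)."""
    blocks = []
    for index, seg in enumerate(result["transcription"]["segments"], start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"


# The sample mirrors the documented output shape.
sample = json.loads("""
{
  "ok": true,
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
""")

print(segments_to_srt(sample))
```

In practice the dictionary would be loaded from the file named in `output_path`, on the assumption that the file holds the same structure as the returned object.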
Basic transcription (auto-detect language):
npx agent-media@latest audio transcribe --in interview.mp3
Transcription with speaker identification:
npx agent-media@latest audio transcribe --in meeting.wav --diarize
Transcription with specific language and speaker count:
npx agent-media@latest audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
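With `--diarize`, segments carry a `speaker` label, so consecutive segments from the same speaker can be merged into readable turns. This is a rough sketch only: `merge_speaker_turns` is a hypothetical helper, the `UNKNOWN` fallback label is an assumption, and the third segment in the demo data is invented for illustration.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments that share a speaker label (hypothetical helper)."""
    turns: List[Dict] = []
    for seg in segments:
        # Assumption: segments without a label fall back to "UNKNOWN".
        speaker = seg.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": seg["text"]}
            )
    return turns


segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0"},
    {"start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1"},
    # Invented segment, added only to demonstrate merging.
    {"start": 5.0, "end": 7.0, "text": "How are you?", "speaker": "SPEAKER_1"},
]

for turn in merge_speaker_turns(segments):
    print(f"[{turn['start']:.1f}-{turn['end']:.1f}] {turn['speaker']}: {turn['text']}")
```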
Use specific provider:
npx agent-media@latest audio transcribe --in audio.wav --provider replicate
To transcribe a video file, first extract the audio:
# Step 1: Extract audio from video
npx agent-media@latest audio extract --in video.mp4 --format mp3
# Step 2: Transcribe the extracted audio
npx agent-media@latest audio transcribe --in extracted_xxx.mp3
local: runs on CPU using Transformers.js; no API key required. If a mutex lock failed error appears, ignore it; the output is correct if "ok": true.
npx agent-media@latest audio transcribe --in audio.mp3 --provider local
fal: requires FAL_API_KEY. Uses the wizper model for fast transcription (2x faster) when diarization is disabled, and the whisper model when diarization is enabled (native support).
replicate: requires REPLICATE_API_TOKEN. Uses the whisper-diarization model with Whisper Large V3 Turbo.
runpod: requires RUNPOD_API_KEY. Uses the pruna/whisper-v3-large model (Whisper Large V3).
npx agent-media@latest audio transcribe --in audio.mp3 --provider runpod
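Since each hosted provider is tied to one environment variable and the local provider needs none, a wrapper script could choose a value for `--provider` from whatever credentials are present. The sketch below is an assumption-laden illustration: `pick_provider` is a hypothetical helper, the preference order (fal, replicate, runpod, then local) is arbitrary, and it does not describe how agent-media itself selects a default.

```python
import os
from typing import Mapping

# Provider name -> environment variable holding its credential.
PROVIDER_ENV_VARS = {
    "fal": "FAL_API_KEY",
    "replicate": "REPLICATE_API_TOKEN",
    "runpod": "RUNPOD_API_KEY",
}


def pick_provider(env: Mapping[str, str] = os.environ) -> str:
    """Return the first provider whose key is set, else 'local' (hypothetical helper)."""
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var):
            return provider
    # The local provider needs no API key, so it serves as the fallback.
    return "local"


print(pick_provider())
```

Passing an explicit mapping instead of `os.environ` keeps the function easy to test.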