cli-tool/components/skills/ai-research/multimodal-whisper/SKILL.md
OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.
npx skillsauth add davila7/claude-code-templates whisperInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
OpenAI's multilingual speech recognition model.
Use when:
Metrics:
Use alternatives instead:
# Requires Python 3.8-3.11
pip install -U openai-whisper
# Requires ffmpeg
# macOS: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: choco install ffmpeg
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe
result = model.transcribe("audio.mp3")
# Print text
print(result["text"])
# Access segments
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
# Available models
models = ["tiny", "base", "small", "medium", "large", "turbo"]
# Load specific model
model = whisper.load_model("turbo") # Fastest, good quality
| Model | Parameters | English-only | Multilingual | Speed | VRAM | |-------|------------|--------------|--------------|-------|------| | tiny | 39M | ✓ | ✓ | ~32x | ~1 GB | | base | 74M | ✓ | ✓ | ~16x | ~1 GB | | small | 244M | ✓ | ✓ | ~6x | ~2 GB | | medium | 769M | ✓ | ✓ | ~2x | ~5 GB | | large | 1550M | ✗ | ✓ | 1x | ~10 GB | | turbo | 809M | ✗ | ✓ | ~8x | ~6 GB |
Recommendation: Use turbo for best speed/quality, base for prototyping
# Auto-detect language
result = model.transcribe("audio.mp3")
# Specify language (faster)
result = model.transcribe("audio.mp3", language="en")
# Supported: en, es, fr, de, it, pt, ru, ja, ko, zh, and 89 more
# Transcription (default)
result = model.transcribe("audio.mp3", task="transcribe")
# Translation to English
result = model.transcribe("spanish.mp3", task="translate")
# Input: Spanish audio → Output: English text
# Improve accuracy with context
result = model.transcribe(
"audio.mp3",
initial_prompt="This is a technical podcast about machine learning and AI."
)
# Helps with:
# - Technical terms
# - Proper nouns
# - Domain-specific vocabulary
# Word-level timestamps
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
for word in segment["words"]:
print(f"{word['word']} ({word['start']:.2f}s - {word['end']:.2f}s)")
# Retry with different temperatures if confidence low
result = model.transcribe(
"audio.mp3",
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
)
# Basic transcription
whisper audio.mp3
# Specify model
whisper audio.mp3 --model turbo
# Output formats
whisper audio.mp3 --output_format txt # Plain text
whisper audio.mp3 --output_format srt # Subtitles
whisper audio.mp3 --output_format vtt # WebVTT
whisper audio.mp3 --output_format json # JSON with timestamps
# Language
whisper audio.mp3 --language Spanish
# Translation
whisper spanish.mp3 --task translate
import os
audio_files = ["file1.mp3", "file2.mp3", "file3.mp3"]
for audio_file in audio_files:
print(f"Transcribing {audio_file}...")
result = model.transcribe(audio_file)
# Save to file
output_file = audio_file.replace(".mp3", ".txt")
with open(output_file, "w") as f:
f.write(result["text"])
# For streaming audio, use faster-whisper
# pip install faster-whisper
from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
# Transcribe with streaming
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
import whisper
# Automatically uses GPU if available
model = whisper.load_model("turbo")
# Force CPU
model = whisper.load_model("turbo", device="cpu")
# Force GPU
model = whisper.load_model("turbo", device="cuda")
# 10-20× faster on GPU
# Generate SRT subtitles
whisper video.mp4 --output_format srt --language English
# Output: video.srt
from langchain.document_loaders import WhisperTranscriptionLoader
loader = WhisperTranscriptionLoader(file_path="audio.mp3")
docs = loader.load()
# Use transcription in RAG
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
# Use ffmpeg to extract audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le audio.wav
# Then transcribe
whisper audio.wav
| Model | Real-time factor (CPU) | Real-time factor (GPU) | |-------|------------------------|------------------------| | tiny | ~0.32 | ~0.01 | | base | ~0.16 | ~0.01 | | turbo | ~0.08 | ~0.01 | | large | ~1.0 | ~0.05 |
Real-time factor: 0.1 = 10× faster than real-time
Top-supported languages:
Full list: 99 languages total
tools
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity - these platforms have their own patterns, pitfalls, and breaking points. This skill covers when to use which platform, how to build reliable automations, and when to graduate to code-based solutions. Key insight: Zapier optimizes for simplicity and integrations (7000+ apps), Make optimizes for power
tools
Use only when the user explicitly asks to stage, commit, push, and open a GitHub pull request in one flow using the GitHub CLI (`gh`).
tools
Workflow automation is the infrastructure that makes AI agents reliable. Without durable execution, a network hiccup during a 10-step payment flow means lost money and angry customers. With it, workflows resume exactly where they left off. This skill covers the platforms (n8n, Temporal, Inngest) and patterns (sequential, parallel, orchestrator-worker) that turn brittle scripts into production-grade automation. Key insight: The platforms make different tradeoffs. n8n optimizes for accessibility
development
Trigger.dev expert for background jobs, AI workflows, and reliable async execution with excellent developer experience and TypeScript-first design. Use when: trigger.dev, trigger dev, background task, ai background job, long running task.