skills/transcribe-audio/SKILL.md
ASR with ~30ms timestamp precision using Qwen3-ASR + ForcedAligner
npx skillsauth add nuva-lab/vibecut transcribe-audioInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Transcribe audio/video files with precise timestamps using Qwen3-ASR and Qwen3-ForcedAligner. Runs on CPU for maximum quality.
| Model | Purpose | Precision | |-------|---------|-----------| | Qwen3-ASR-1.7B | Speech recognition (52 languages) | SOTA accuracy | | Qwen3-ForcedAligner-0.6B | Timestamp alignment | ~30ms |
# Full transcription with timestamps (default)
python skills/transcribe-audio/transcribe.py audio.wav
# Save to file
python skills/transcribe-audio/transcribe.py audio.wav --output captions.json
# Specify language (auto-detects if not specified)
python skills/transcribe-audio/transcribe.py audio.wav --language Chinese
# Fast mode (no timestamps)
python skills/transcribe-audio/transcribe.py audio.wav --no-timestamps
# Align existing text to audio (ForcedAligner only)
python skills/transcribe-audio/transcribe.py audio.wav --align-text "Your transcript text here"
Compatible with Remotion captions.json:
{
"segments": [
{"text": "当", "startMs": 0, "endMs": 150},
{"text": "全", "startMs": 150, "endMs": 280},
{"text": "世界", "startMs": 280, "endMs": 520},
...
],
"full_text": "当全世界都在追AI的时候...",
"language": "Chinese",
"model": "Qwen3-ASR-1.7B",
"aligner": "Qwen3-ForcedAligner-0.6B"
}
Transcribes audio and aligns timestamps:
python transcribe.py audio.wav
Use when you already have the transcript:
python transcribe.py audio.wav --align-text "已知的文字内容"
This skips ASR and uses ForcedAligner directly for ~30ms precision.
from transcribe import transcribe_audio, transcribe_and_align
# Full transcription with timestamps
result = transcribe_audio("audio.wav", language="Chinese")
print(result["segments"]) # Word-level timestamps
# Align existing text
result = transcribe_and_align("audio.wav", text="已知的文字", language="Chinese")
print(result["segments"]) # ~30ms precision timestamps
tools
Generate voiceover scripts in Joyce's style for video clips
tools
Clone a voice using qwen3-tts and generate speech from text
development
# Validate Media Skill Pre-flight media validation and diagnostics using ffprobe. ## Purpose Check video/audio files for common issues before rendering: - Duration mismatches between video and audio tracks - Missing audio tracks - Codec compatibility - Volume levels - Potential freeze points ## Usage ```bash python skills/validate-media/validate.py <video_file> [--verbose] ``` ## Output JSON report with issues and recommendations: ```json { "file": "video.mp4", "video_duration": 35.1
tools
Transcribe a video clip using Gemini to get timestamped segments for captions