skills/align-captions/SKILL.md
Generate karaoke-style word-level timestamps by aligning script text to audio using Qwen3-ForcedAligner + jieba for Chinese word segmentation. Use when the user says 'align captions', 'karaoke timestamps', 'word timestamps', 'caption alignment', 'sync text to audio'.
npx skillsauth add nuva-lab/vibecut align-captions
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Align existing script text to audio for karaoke-style captions. Uses Qwen3-ForcedAligner-0.6B (~30ms precision) and jieba for Chinese word segmentation (groups characters into natural words).
Script + Audio
↓
Qwen3-ForcedAligner (character-level timestamps)
↓
Jieba word segmentation (characters → Chinese words)
↓
Position-based phrase matching (words → phrases)
↓
Output: phrases with embedded word timestamps
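The character-to-word step in the diagram above can be sketched as follows. This is a minimal illustration, not the code in align.py: the function name `group_chars_into_words` is hypothetical, and it assumes the aligner returns one timestamp entry per character, in order, using the same `startMs`/`endMs` fields as the output format. In practice the word list would come from something like `jieba.lcut(script)`; here it is passed in directly so the sketch has no dependencies.

```python
from typing import Dict, List


def group_chars_into_words(chars: List[Dict], words: List[str]) -> List[Dict]:
    """Merge character-level timestamps into word-level ones (hypothetical helper).

    chars: one entry per character, e.g. {"text": "当", "startMs": 240, "endMs": 400}
    words: a segmentation of the same text, e.g. the result of jieba.lcut(script)
    """
    merged = []
    pos = 0  # index of the next unconsumed character
    for word in words:
        span = chars[pos:pos + len(word)]
        if not span:
            break  # segmentation is longer than the aligned characters
        merged.append({
            "text": word,
            "startMs": span[0]["startMs"],  # word starts with its first character
            "endMs": span[-1]["endMs"],     # and ends with its last
        })
        pos += len(word)
    return merged


chars = [
    {"text": "当", "startMs": 240, "endMs": 400},
    {"text": "全", "startMs": 400, "endMs": 560},
    {"text": "世", "startMs": 560, "endMs": 720},
    {"text": "界", "startMs": 720, "endMs": 880},
]
words = group_chars_into_words(chars, ["当", "全世界"])
```

Each word takes the start time of its first character and the end time of its last, so no timing information is invented during grouping.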
# Align script to audio (phrase-level output with word timestamps)
python skills/align-captions/align.py voiceover.wav --script "当全世界都在追AI的时候..."
# Save to file
python skills/align-captions/align.py voiceover.wav --script "..." --output captions.json
# Word-level only (no phrase grouping)
python skills/align-captions/align.py voiceover.wav --script "..." --word-level
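For reference, the phrase grouping that `--word-level` skips could work roughly like the sketch below. The source only names the technique ("position-based phrase matching"), so everything here is an assumption: the helper name `group_words_into_phrases`, the punctuation set used to split the script, and the premise that word timestamps cover only the non-punctuation characters, in script order.

```python
import re
from typing import Dict, List

# Assumed phrase delimiters: common Chinese and ASCII punctuation plus whitespace.
PUNCT = r"[，。！？、；：,.!?;:…\s]+"


def group_words_into_phrases(script: str, words: List[Dict]) -> List[Dict]:
    """Assign timestamped words to phrases by character position (hypothetical helper)."""
    phrases = [p for p in re.split(PUNCT, script) if p]
    segments = []
    i = 0  # index of the next unassigned word
    for phrase in phrases:
        taken, consumed = [], 0
        # Consume words until their combined length covers the phrase.
        while i < len(words) and consumed < len(phrase):
            taken.append(words[i])
            consumed += len(words[i]["text"])
            i += 1
        if taken:
            segments.append({
                "text": phrase,
                "startMs": taken[0]["startMs"],
                "endMs": taken[-1]["endMs"],
                "words": taken,
            })
    return segments


demo_words = [
    {"text": "你好", "startMs": 0, "endMs": 500},
    {"text": "世界", "startMs": 600, "endMs": 1100},
]
demo_segments = group_words_into_phrases("你好，世界。", demo_words)
```

Matching by position rather than by string comparison keeps the grouping tolerant of small differences between the script text and the aligner's tokens.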
Designed for Remotion karaoke rendering:
{
  "segments": [
    {
      "text": "当全世界...",
      "startMs": 240, "endMs": 2080,
      "words": [
        {"text": "当", "startMs": 240, "endMs": 400},
        {"text": "全世界", "startMs": 400, "endMs": 880}
      ]
    }
  ],
  "word_segments": [...],
  "language": "Chinese",
  "model": "Qwen3-ForcedAligner-0.6B"
}
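To show how this structure supports karaoke highlighting, here is a small sketch that looks up the word being spoken at a given playback time. It is not the Remotion RollingCaption component, which the source does not show; `active_word` is a hypothetical helper, and treating `endMs` as exclusive is an assumption made so that adjacent words never both match.

```python
from typing import Dict, Optional


def active_word(captions: Dict, t_ms: int) -> Optional[Dict]:
    """Return the word whose [startMs, endMs) interval contains t_ms, if any."""
    for segment in captions["segments"]:
        if segment["startMs"] <= t_ms < segment["endMs"]:
            for word in segment["words"]:
                if word["startMs"] <= t_ms < word["endMs"]:
                    return word
    return None


captions = {
    "segments": [
        {
            "text": "当全世界...",
            "startMs": 240, "endMs": 2080,
            "words": [
                {"text": "当", "startMs": 240, "endMs": 400},
                {"text": "全世界", "startMs": 400, "endMs": 880},
            ],
        }
    ]
}

current = active_word(captions, 500)
```

A renderer would call a lookup like this once per frame, converting the frame number to milliseconds first.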
from align import align_captions

# Get phrases with embedded word timestamps
result = align_captions(
    "voiceover.wav",
    script="当全世界都在追AI的时候...",
    language="Chinese"
)

# Each phrase has a 'words' array for karaoke highlighting
for phrase in result["segments"]:
    print(f"{phrase['text']}: {len(phrase['words'])} words")
The make_video.py script automatically uses align-captions when:
- caption_mode is "auto" or "asr"

The output is passed to Remotion's RollingCaption component for karaoke rendering.
- FileNotFoundError: check project.json for the correct voiceover path.
- [...] ~/.cache/huggingface/).
- pip install jieba. Without it, Chinese text falls back to character-level timestamps (no word grouping).
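The jieba fallback described above can be pictured with a sketch like this one. It is an assumption about the shape of the logic, not the actual code in align.py, and `segment` is a hypothetical name. `jieba.lcut` is the library's real list-returning segmentation call.

```python
from typing import List

try:
    import jieba  # optional dependency: pip install jieba
except ImportError:
    jieba = None


def segment(text: str) -> List[str]:
    """Split text into words with jieba, or into single characters without it."""
    if jieba is None:
        # Fallback: every character becomes its own "word".
        return list(text)
    return jieba.lcut(text)


tokens = segment("当全世界都在追AI")
```

Either way, joining the tokens reproduces the input text, so downstream timestamp grouping still lines up; only the granularity of the highlighted units changes.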