skills/talking-head/SKILL.md
# Talking-Head Video Processing Process direct-to-camera videos with sentence-level precision, rolling captions, and multi-format output. ## Full Pipeline Overview ``` Raw Video (47+ min) ↓ smart_chunk.py Chunks (2.5-3.5 min) ↓ mlx_transcribe.py --word-timestamps Transcript (word-level timestamps) ↓ sentence_split.py (precise cuts, no overlap) Sentence Clips (~10s each, frame-accurate) ↓ analyze_script.py (RECALL-FIRST) Topics with complete arcs (hook → elaborate → conclude)
npx skillsauth add nuva-lab/vibecut skills/talking-headInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Process direct-to-camera videos with sentence-level precision, rolling captions, and multi-format output.
Raw Video (47+ min)
↓ smart_chunk.py
Chunks (2.5-3.5 min)
↓ mlx_transcribe.py --word-timestamps
Transcript (word-level timestamps)
↓ sentence_split.py (precise cuts, no overlap)
Sentence Clips (~10s each, frame-accurate)
↓ analyze_script.py (RECALL-FIRST)
Topics with complete arcs (hook → elaborate → conclude)
↓ [Claude Code: Select topic]
↓ stitch_clips.py (precise cuts)
Raw Topic Video (~60-180s)
↓ precision_trim.py (remove fillers, stutters)
Polished Video (~30-120s)
↓ generate_captions.py (timestamp remapping)
Captions + Section Markers
↓ render_with_captions.py (Remotion)
Final Output: 16:9 + 9:16 with rolling captions
Old approach (precision-first): Find 5-10 "highlight moments" → Often incomplete thoughts
New approach (recall-first): Find complete TOPICS with full arcs → Then trim for precision
For each topic, Gemini identifies:
Why this works:
# Full pipeline (pauses for clip selection)
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/
# Auto-stitch top clips
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/ --auto-stitch
from process_video import run_pipeline, stitch_selected_clips
# Run pipeline (stops at clip review)
state = run_pipeline("raw_footage.mp4", "./output/")
# Present review to user
if state.get("awaiting_user_input"):
print(state["clip_review"]["display"])
# After user says "Use clips 7, 12, 3"
result = stitch_selected_clips("./output/", [7, 12, 3])
Split at sentence boundaries, create sentence index.
python sentence_split.py video.mp4 transcript.json -o clips/
Output: clip_index.json with sentence-to-clip mapping.
Text-first highlight analysis.
python analyze_script.py clips/clip_index.json -o clip_scores.json
Gemini reads full transcript, returns highlights with:
sentence_range: Which sentences (e.g., "15-18")clip_ids: Mapped clip IDsviral_score, hook_potential, standaloneOnly needed if you want video verification of selected clips.
python lowres_convert.py clips/ -o lowres/ --resolution 480p
Video-based verification of specific clips. Use after script analysis to double-check selected clips visually.
python batch_analyze.py lowres/ -o verification.json --clips 7,12,3
Concatenate approved clips.
python stitch_clips.py clip_scores.json --approved 7,12,3 -o final.mp4
{
"analysis_type": "script_first",
"clips": [
{
"clip_id": 7,
"viral_score": 9,
"hook_potential": 10,
"key_quote": "I've raised tens of millions...",
"recommended_use": "opening"
}
],
"highlights": [
{
"sentence_range": "15-18",
"clip_ids": [7],
"quote": "I've raised tens of millions...",
"type": "hook",
"viral_score": 9,
"why": "Strong credibility opener"
}
]
}
The key data structure that enables precise mapping:
{
"sentences": [
{"id": 1, "text": "First sentence.", "clip_id": 1},
{"id": 2, "text": "Second sentence.", "clip_id": 1},
{"id": 3, "text": "Third sentence.", "clip_id": 2}
],
"sentence_to_clip": {1: 1, 2: 1, 3: 2},
"clip_to_sentences": {1: [1, 2], 2: [3]}
}
When Gemini says "sentences 15-18 are great", we look up sentence_to_clip to get the exact clip IDs.
By default, all clips are cut with frame-accurate precision using FFmpeg re-encoding and zero padding. This ensures:
# Default: precise cut (re-encode) with zero padding
python sentence_split.py video.mp4 transcript.json -o clips/
# Fast mode: stream copy (may have keyframe boundary issues)
python sentence_split.py video.mp4 transcript.json -o clips/ --fast
# Add padding only if you want overlap for crossfades
python sentence_split.py video.mp4 transcript.json -o clips/ --padding 100
Why zero padding by default?
end + 100msstart - 100msWhy re-encode (not stream copy)?
-c copy) cuts on keyframes, not exact timestamps-crf 18 maintains quality while ensuring exact cutsWhy text-first? Earlier versions uploaded 60s+ video clips to Gemini:
Why sentence index? Clips are cut at pauses, roughly 1:1 with sentences. The index handles edge cases where sentences span clips.
Why precise cuts? Stream copy creates overlapping frames:
When to use video analysis? Only for final verification of selected clips if you want to check:
After selecting a topic, remove filler words, stutters, and repetitions.
Two-stage approach: ASR for precise timestamps, Gemini for semantic decisions.
# Full pipeline
python precision_trim.py run topic_video.mp4 -o output/
# Step by step
python precision_trim.py transcribe topic_video.mp4 -o word_transcript.json
python precision_trim.py identify topic_video.mp4 word_transcript.json -o cuts.json
python precision_trim.py apply topic_video.mp4 cuts.json word_transcript.json -o trimmed.mp4
Key insight: Gemini identifies WHAT to cut (by word index), ASR provides WHERE to cut (~30ms precision).
Micro-segment fix: min_keep_duration=1.0s prevents jittery playback from tiny segments.
Add karaoke-style captions and render both horizontal and vertical formats.
Maps word timestamps from original video to trimmed video, groups into phrases.
python generate_captions.py word_transcript.json kept_segments.json -o output/
Timestamp remapping logic:
# For each word in original video:
adjusted_time = word_start - segment_start + cumulative_offset
Phrase grouping by aspect ratio: | Aspect | Max Words | Max Chars | Max Duration | |--------|-----------|-----------|--------------| | Horizontal (16:9) | 8 | 50 | 3.5s | | Vertical (9:16) | 5 | 30 | 2.5s |
Output:
captions_horizontal.json: Phrases for 16:9captions_vertical.json: Phrases for 9:16 (shorter)Uses Gemini to create meaningful section titles from the transcript.
python generate_sections.py word_transcript.json kept_segments.json -o sections.json -n 4
Output example:
{
"sections": [
{"title": "VC HYPOCRISY EXPOSED", "startMs": 0, "durationMs": 3000},
{"title": "READING VC SIGNALS", "startMs": 47500, "durationMs": 3000}
]
}
Renders both 16:9 and 9:16 formats using Remotion with cinematic overlays.
python render_with_captions.py trimmed.mp4 captions_dir/ \
--sections sections.json \
--speaker "Speaker Name" \
--speaker-title "Title/Role" \
--speaker-center 0.5 \
-o output/
Output:
| Format | Resolution | Use Case |
|--------|------------|----------|
| *_captioned.mp4 | 1920x1080 | YouTube, LinkedIn |
| *_vertical.mp4 | 1080x1920 | TikTok, Reels, Shorts |
Vertical crop: Centers on speaker position (--speaker-center 0.5 = center).
The pipeline uses these cinematic Remotion components:
| Component | Style |
|-----------|-------|
| TalkingHeadClip.tsx | Main composition with all overlays |
| RollingCaption.tsx | Karaoke captions with gold word highlighting |
| SectionTitle.tsx | Cinematic gradient text + animated lines |
| SpeakerLabel.tsx | Gradient border card with slide-in animation |
Caption styling:
marginRight: 0.3em (CSS inline-block fix)Section title styling:
Speaker label styling:
# 1. Chunk and transcribe
python skills/chunk-process/smart_chunk.py raw.mp4 -o chunks/
python skills/chunk-process/mlx_transcribe.py chunks/ --batch --word-timestamps
# 2. Split into sentences and analyze
python skills/talking-head/sentence_split.py raw.mp4 chunks/transcript.json -o clips/
python skills/talking-head/analyze_script.py clips/clip_index.json -o topics.json
# 3. Select topic and stitch (Claude Code reviews topics, user picks one)
python skills/talking-head/stitch_clips.py topics.json --topic 10 -o topic10_raw.mp4
# 4. Precision trim (removes fillers, stutters)
python skills/talking-head/precision_trim.py run topic10_raw.mp4 -o trimmed/
# 5. Generate captions (aspect-ratio aware)
python skills/talking-head/generate_captions.py \
trimmed/word_transcript.json trimmed/kept_segments.json \
--speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
-o trimmed/
# 6. Generate section titles (Gemini)
python skills/talking-head/generate_sections.py \
trimmed/word_transcript.json trimmed/kept_segments.json \
-o trimmed/sections.json -n 4
# 7. Render both formats
python skills/talking-head/render_with_captions.py \
trimmed/trimmed.mp4 trimmed/ \
--sections trimmed/sections.json \
--speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
-o final/
Result:
final/trimmed_captioned.mp4 (16:9, ~140MB)final/trimmed_vertical.mp4 (9:16, ~178MB)Both with rolling captions, speaker label, and cinematic section titles!
tools
Generate voiceover scripts in Joyce's style for video clips
tools
Clone a voice using qwen3-tts and generate speech from text
development
# Validate Media Skill Pre-flight media validation and diagnostics using ffprobe. ## Purpose Check video/audio files for common issues before rendering: - Duration mismatches between video and audio tracks - Missing audio tracks - Codec compatibility - Volume levels - Potential freeze points ## Usage ```bash python skills/validate-media/validate.py <video_file> [--verbose] ``` ## Output JSON report with issues and recommendations: ```json { "file": "video.mp4", "video_duration": 35.1
tools
Transcribe a video clip using Gemini to get timestamped segments for captions