Talking-Head Video Processing

Process direct-to-camera videos with sentence-level precision, rolling captions, and multi-format output.

Full Pipeline Overview

Raw Video (47+ min)
    ↓ smart_chunk.py
Chunks (2.5-3.5 min)
    ↓ mlx_transcribe.py --word-timestamps
Transcript (word-level timestamps)
    ↓ sentence_split.py (precise cuts, no overlap)
Sentence Clips (~10s each, frame-accurate)
    ↓ analyze_script.py (RECALL-FIRST)
Topics with complete arcs (hook → elaborate → conclude)
    ↓ [Claude Code: Select topic]
    ↓ stitch_clips.py (precise cuts)
Raw Topic Video (~60-180s)
    ↓ precision_trim.py (remove fillers, stutters)
Polished Video (~30-120s)
    ↓ generate_captions.py (timestamp remapping)
Captions + Section Markers
    ↓ render_with_captions.py (Remotion)
Final Output: 16:9 + 9:16 with rolling captions

Key Insight: Recall-First Topic Analysis

Old approach (precision-first): Find 5-10 "highlight moments" → Often incomplete thoughts

New approach (recall-first): Find complete TOPICS with full arcs → Then trim for precision

For each topic, Gemini identifies:

Full clip range: ALL clips that might be relevant (optimize for recall)
Arc structure: Hook → Elaboration → Conclusion
Trimming guide: Essential vs supporting vs skippable clips
Reorder suggestions: If moving clips would improve flow

Why this works:

Complete narratives: Each topic has setup, body, and resolution
No missing context: Better to include extra and trim than miss key content
Flexible editing: Full version for review, trimmed version for final cut

Quick Start

# Full pipeline (pauses for clip selection)
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/

# Auto-stitch top clips
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/ --auto-stitch

Claude Code Integration

from process_video import run_pipeline, stitch_selected_clips

# Run pipeline (stops at clip review)
state = run_pipeline("raw_footage.mp4", "./output/")

# Present review to user
if state.get("awaiting_user_input"):
    print(state["clip_review"]["display"])

# After user says "Use clips 7, 12, 3"
result = stitch_selected_clips("./output/", [7, 12, 3])

Skills Reference

sentence_split.py

Split at sentence boundaries, create sentence index.

python sentence_split.py video.mp4 transcript.json -o clips/

Output: clip_index.json with sentence-to-clip mapping.

analyze_script.py

Text-first highlight analysis.

python analyze_script.py clips/clip_index.json -o clip_scores.json

Gemini reads full transcript, returns highlights with:

sentence_range: Which sentences (e.g., "15-18")
clip_ids: Mapped clip IDs
viral_score, hook_potential, standalone

lowres_convert.py (Optional)

Only needed if you want video verification of selected clips.

python lowres_convert.py clips/ -o lowres/ --resolution 480p

batch_analyze.py (Optional)

Video-based verification of specific clips. Use after script analysis to double-check selected clips visually.

python batch_analyze.py lowres/ -o verification.json --clips 7,12,3

stitch_clips.py

Concatenate approved clips.

python stitch_clips.py clip_scores.json --approved 7,12,3 -o final.mp4

Output: clip_scores.json

{
  "analysis_type": "script_first",
  "clips": [
    {
      "clip_id": 7,
      "viral_score": 9,
      "hook_potential": 10,
      "key_quote": "I've raised tens of millions...",
      "recommended_use": "opening"
    }
  ],
  "highlights": [
    {
      "sentence_range": "15-18",
      "clip_ids": [7],
      "quote": "I've raised tens of millions...",
      "type": "hook",
      "viral_score": 9,
      "why": "Strong credibility opener"
    }
  ]
}

Sentence Index

The key data structure that enables precise mapping:

{
  "sentences": [
    {"id": 1, "text": "First sentence.", "clip_id": 1},
    {"id": 2, "text": "Second sentence.", "clip_id": 1},
    {"id": 3, "text": "Third sentence.", "clip_id": 2}
  ],
  "sentence_to_clip": {1: 1, 2: 1, 3: 2},
  "clip_to_sentences": {1: [1, 2], 2: [3]}
}

When Gemini says "sentences 15-18 are great", we look up sentence_to_clip to get the exact clip IDs.

Precise Cutting (Zero Overlap)

By default, all clips are cut with frame-accurate precision using FFmpeg re-encoding and zero padding. This ensures:

Zero overlapping frames between clips
No repeated words at clip boundaries
Clean audio cuts (no pops or artifacts)

# Default: precise cut (re-encode) with zero padding
python sentence_split.py video.mp4 transcript.json -o clips/

# Fast mode: stream copy (may have keyframe boundary issues)
python sentence_split.py video.mp4 transcript.json -o clips/ --fast

# Add padding only if you want overlap for crossfades
python sentence_split.py video.mp4 transcript.json -o clips/ --padding 100

Why zero padding by default?

Qwen3-ForcedAligner provides precise word timestamps (~30ms accuracy)
Natural speech has gaps between sentences (pauses)
Adding padding (e.g., 100ms) causes clips to overlap:
- Clip N extracts to end + 100ms
- Clip N+1 extracts from start - 100ms
- If gap < 200ms → overlap and repeated words!
Zero padding = clips fit together perfectly with natural pauses

Why re-encode (not stream copy)?

Stream copy (-c copy) cuts on keyframes, not exact timestamps
This causes extra frames at start (previous keyframe) or missing frames at end
Re-encoding with -crf 18 maintains quality while ensuring exact cuts

Key Learnings

Why text-first? Earlier versions uploaded 60s+ video clips to Gemini:

Slow: ~30s upload per 20MB clip
Expensive: Video processing tokens
Limited context: Gemini only saw chunks, not full narrative

Why sentence index? Clips are cut at pauses, roughly 1:1 with sentences. The index handles edge cases where sentences span clips.

Why precise cuts? Stream copy creates overlapping frames:

FFmpeg seeks to nearest keyframe (not exact timestamp)
Adjacent clips may share the same frames at boundaries
Re-encoding fixes this by decoding and re-encoding at exact timestamps

When to use video analysis? Only for final verification of selected clips if you want to check:

Speaker visibility
Audio quality
Visual issues

Phase 2: Precision Trimming

After selecting a topic, remove filler words, stutters, and repetitions.

precision_trim.py

Two-stage approach: ASR for precise timestamps, Gemini for semantic decisions.

# Full pipeline
python precision_trim.py run topic_video.mp4 -o output/

# Step by step
python precision_trim.py transcribe topic_video.mp4 -o word_transcript.json
python precision_trim.py identify topic_video.mp4 word_transcript.json -o cuts.json
python precision_trim.py apply topic_video.mp4 cuts.json word_transcript.json -o trimmed.mp4

Key insight: Gemini identifies WHAT to cut (by word index), ASR provides WHERE to cut (~30ms precision).

Micro-segment fix: min_keep_duration=1.0s prevents jittery playback from tiny segments.

Phase 3: Rolling Captions + Multi-Format

Add karaoke-style captions and render both horizontal and vertical formats.

generate_captions.py

Maps word timestamps from original video to trimmed video, groups into phrases.

python generate_captions.py word_transcript.json kept_segments.json -o output/

Timestamp remapping logic:

# For each word in original video:
adjusted_time = word_start - segment_start + cumulative_offset

Phrase grouping by aspect ratio: | Aspect | Max Words | Max Chars | Max Duration | |--------|-----------|-----------|--------------| | Horizontal (16:9) | 8 | 50 | 3.5s | | Vertical (9:16) | 5 | 30 | 2.5s |

Output:

captions_horizontal.json: Phrases for 16:9
captions_vertical.json: Phrases for 9:16 (shorter)

generate_sections.py

Uses Gemini to create meaningful section titles from the transcript.

python generate_sections.py word_transcript.json kept_segments.json -o sections.json -n 4

Output example:

{
  "sections": [
    {"title": "VC HYPOCRISY EXPOSED", "startMs": 0, "durationMs": 3000},
    {"title": "READING VC SIGNALS", "startMs": 47500, "durationMs": 3000}
  ]
}

render_with_captions.py

Renders both 16:9 and 9:16 formats using Remotion with cinematic overlays.

python render_with_captions.py trimmed.mp4 captions_dir/ \
  --sections sections.json \
  --speaker "Speaker Name" \
  --speaker-title "Title/Role" \
  --speaker-center 0.5 \
  -o output/

Output: | Format | Resolution | Use Case | |--------|------------|----------| | *_captioned.mp4 | 1920x1080 | YouTube, LinkedIn | | *_vertical.mp4 | 1080x1920 | TikTok, Reels, Shorts |

Vertical crop: Centers on speaker position (--speaker-center 0.5 = center).

Remotion Components

The pipeline uses these cinematic Remotion components:

| Component | Style | |-----------|-------| | TalkingHeadClip.tsx | Main composition with all overlays | | RollingCaption.tsx | Karaoke captions with gold word highlighting | | SectionTitle.tsx | Cinematic gradient text + animated lines | | SpeakerLabel.tsx | Gradient border card with slide-in animation |

Caption styling:

Background: Semi-transparent black with rounded corners
Text: White, gold highlight on current word
Word spacing: marginRight: 0.3em (CSS inline-block fix)

Section title styling:

Background dim overlay
Gradient text (white → gold)
Animated top/bottom lines
3-second duration with bounce animation

Speaker label styling:

Gradient border (gold → orange)
Large name (52px) with slide-in
Gradient title text
8-second display at video start

Complete Example

# 1. Chunk and transcribe
python skills/chunk-process/smart_chunk.py raw.mp4 -o chunks/
python skills/chunk-process/mlx_transcribe.py chunks/ --batch --word-timestamps

# 2. Split into sentences and analyze
python skills/talking-head/sentence_split.py raw.mp4 chunks/transcript.json -o clips/
python skills/talking-head/analyze_script.py clips/clip_index.json -o topics.json

# 3. Select topic and stitch (Claude Code reviews topics, user picks one)
python skills/talking-head/stitch_clips.py topics.json --topic 10 -o topic10_raw.mp4

# 4. Precision trim (removes fillers, stutters)
python skills/talking-head/precision_trim.py run topic10_raw.mp4 -o trimmed/

# 5. Generate captions (aspect-ratio aware)
python skills/talking-head/generate_captions.py \
  trimmed/word_transcript.json trimmed/kept_segments.json \
  --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
  -o trimmed/

# 6. Generate section titles (Gemini)
python skills/talking-head/generate_sections.py \
  trimmed/word_transcript.json trimmed/kept_segments.json \
  -o trimmed/sections.json -n 4

# 7. Render both formats
python skills/talking-head/render_with_captions.py \
  trimmed/trimmed.mp4 trimmed/ \
  --sections trimmed/sections.json \
  --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
  -o final/

Result:

final/trimmed_captioned.mp4 (16:9, ~140MB)
final/trimmed_vertical.mp4 (9:16, ~178MB)

Both with rolling captions, speaker label, and cinematic section titles!

Talking-Head Video Processing

Process direct-to-camera videos with sentence-level precision, rolling captions, and multi-format output.

Full Pipeline Overview

Raw Video (47+ min)
    ↓ smart_chunk.py
Chunks (2.5-3.5 min)
    ↓ mlx_transcribe.py --word-timestamps
Transcript (word-level timestamps)
    ↓ sentence_split.py (precise cuts, no overlap)
Sentence Clips (~10s each, frame-accurate)
    ↓ analyze_script.py (RECALL-FIRST)
Topics with complete arcs (hook → elaborate → conclude)
    ↓ [Claude Code: Select topic]
    ↓ stitch_clips.py (precise cuts)
Raw Topic Video (~60-180s)
    ↓ precision_trim.py (remove fillers, stutters)
Polished Video (~30-120s)
    ↓ generate_captions.py (timestamp remapping)
Captions + Section Markers
    ↓ render_with_captions.py (Remotion)
Final Output: 16:9 + 9:16 with rolling captions

Key Insight: Recall-First Topic Analysis

Old approach (precision-first): Find 5-10 "highlight moments" → Often incomplete thoughts

New approach (recall-first): Find complete TOPICS with full arcs → Then trim for precision

For each topic, Gemini identifies:

Full clip range: ALL clips that might be relevant (optimize for recall)
Arc structure: Hook → Elaboration → Conclusion
Trimming guide: Essential vs supporting vs skippable clips
Reorder suggestions: If moving clips would improve flow

Why this works:

Complete narratives: Each topic has setup, body, and resolution
No missing context: Better to include extra and trim than miss key content
Flexible editing: Full version for review, trimmed version for final cut

Quick Start

# Full pipeline (pauses for clip selection)
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/

# Auto-stitch top clips
python skills/talking-head/process_video.py raw_footage.mp4 -o ./output/ --auto-stitch

Claude Code Integration

from process_video import run_pipeline, stitch_selected_clips

# Run pipeline (stops at clip review)
state = run_pipeline("raw_footage.mp4", "./output/")

# Present review to user
if state.get("awaiting_user_input"):
    print(state["clip_review"]["display"])

# After user says "Use clips 7, 12, 3"
result = stitch_selected_clips("./output/", [7, 12, 3])

Skills Reference

sentence_split.py

Split at sentence boundaries, create sentence index.

python sentence_split.py video.mp4 transcript.json -o clips/

Output: clip_index.json with sentence-to-clip mapping.

analyze_script.py

Text-first highlight analysis.

python analyze_script.py clips/clip_index.json -o clip_scores.json

Gemini reads full transcript, returns highlights with:

sentence_range: Which sentences (e.g., "15-18")
clip_ids: Mapped clip IDs
viral_score, hook_potential, standalone

lowres_convert.py (Optional)

Only needed if you want video verification of selected clips.

python lowres_convert.py clips/ -o lowres/ --resolution 480p

batch_analyze.py (Optional)

Video-based verification of specific clips. Use after script analysis to double-check selected clips visually.

python batch_analyze.py lowres/ -o verification.json --clips 7,12,3

stitch_clips.py

Concatenate approved clips.

python stitch_clips.py clip_scores.json --approved 7,12,3 -o final.mp4

Output: clip_scores.json

{
  "analysis_type": "script_first",
  "clips": [
    {
      "clip_id": 7,
      "viral_score": 9,
      "hook_potential": 10,
      "key_quote": "I've raised tens of millions...",
      "recommended_use": "opening"
    }
  ],
  "highlights": [
    {
      "sentence_range": "15-18",
      "clip_ids": [7],
      "quote": "I've raised tens of millions...",
      "type": "hook",
      "viral_score": 9,
      "why": "Strong credibility opener"
    }
  ]
}

Sentence Index

The key data structure that enables precise mapping:

{
  "sentences": [
    {"id": 1, "text": "First sentence.", "clip_id": 1},
    {"id": 2, "text": "Second sentence.", "clip_id": 1},
    {"id": 3, "text": "Third sentence.", "clip_id": 2}
  ],
  "sentence_to_clip": {1: 1, 2: 1, 3: 2},
  "clip_to_sentences": {1: [1, 2], 2: [3]}
}

When Gemini says "sentences 15-18 are great", we look up sentence_to_clip to get the exact clip IDs.

Precise Cutting (Zero Overlap)

By default, all clips are cut with frame-accurate precision using FFmpeg re-encoding and zero padding. This ensures:

Zero overlapping frames between clips
No repeated words at clip boundaries
Clean audio cuts (no pops or artifacts)

# Default: precise cut (re-encode) with zero padding
python sentence_split.py video.mp4 transcript.json -o clips/

# Fast mode: stream copy (may have keyframe boundary issues)
python sentence_split.py video.mp4 transcript.json -o clips/ --fast

# Add padding only if you want overlap for crossfades
python sentence_split.py video.mp4 transcript.json -o clips/ --padding 100

Why zero padding by default?

Qwen3-ForcedAligner provides precise word timestamps (~30ms accuracy)
Natural speech has gaps between sentences (pauses)
Adding padding (e.g., 100ms) causes clips to overlap:
- Clip N extracts to end + 100ms
- Clip N+1 extracts from start - 100ms
- If gap < 200ms → overlap and repeated words!
Zero padding = clips fit together perfectly with natural pauses

Why re-encode (not stream copy)?

Stream copy (-c copy) cuts on keyframes, not exact timestamps
This causes extra frames at start (previous keyframe) or missing frames at end
Re-encoding with -crf 18 maintains quality while ensuring exact cuts

Key Learnings

Why text-first? Earlier versions uploaded 60s+ video clips to Gemini:

Slow: ~30s upload per 20MB clip
Expensive: Video processing tokens
Limited context: Gemini only saw chunks, not full narrative

Why sentence index? Clips are cut at pauses, roughly 1:1 with sentences. The index handles edge cases where sentences span clips.

Why precise cuts? Stream copy creates overlapping frames:

FFmpeg seeks to nearest keyframe (not exact timestamp)
Adjacent clips may share the same frames at boundaries
Re-encoding fixes this by decoding and re-encoding at exact timestamps

When to use video analysis? Only for final verification of selected clips if you want to check:

Speaker visibility
Audio quality
Visual issues

Phase 2: Precision Trimming

After selecting a topic, remove filler words, stutters, and repetitions.

precision_trim.py

Two-stage approach: ASR for precise timestamps, Gemini for semantic decisions.

# Full pipeline
python precision_trim.py run topic_video.mp4 -o output/

# Step by step
python precision_trim.py transcribe topic_video.mp4 -o word_transcript.json
python precision_trim.py identify topic_video.mp4 word_transcript.json -o cuts.json
python precision_trim.py apply topic_video.mp4 cuts.json word_transcript.json -o trimmed.mp4

Key insight: Gemini identifies WHAT to cut (by word index), ASR provides WHERE to cut (~30ms precision).

Micro-segment fix: min_keep_duration=1.0s prevents jittery playback from tiny segments.

Phase 3: Rolling Captions + Multi-Format

Add karaoke-style captions and render both horizontal and vertical formats.

generate_captions.py

Maps word timestamps from original video to trimmed video, groups into phrases.

python generate_captions.py word_transcript.json kept_segments.json -o output/

Timestamp remapping logic:

# For each word in original video:
adjusted_time = word_start - segment_start + cumulative_offset

Output:

captions_horizontal.json: Phrases for 16:9
captions_vertical.json: Phrases for 9:16 (shorter)

generate_sections.py

Uses Gemini to create meaningful section titles from the transcript.

python generate_sections.py word_transcript.json kept_segments.json -o sections.json -n 4

Output example:

{
  "sections": [
    {"title": "VC HYPOCRISY EXPOSED", "startMs": 0, "durationMs": 3000},
    {"title": "READING VC SIGNALS", "startMs": 47500, "durationMs": 3000}
  ]
}

render_with_captions.py

Renders both 16:9 and 9:16 formats using Remotion with cinematic overlays.

python render_with_captions.py trimmed.mp4 captions_dir/ \
  --sections sections.json \
  --speaker "Speaker Name" \
  --speaker-title "Title/Role" \
  --speaker-center 0.5 \
  -o output/

Output: | Format | Resolution | Use Case | |--------|------------|----------| | *_captioned.mp4 | 1920x1080 | YouTube, LinkedIn | | *_vertical.mp4 | 1080x1920 | TikTok, Reels, Shorts |

Vertical crop: Centers on speaker position (--speaker-center 0.5 = center).

Remotion Components

The pipeline uses these cinematic Remotion components:

Caption styling:

Background: Semi-transparent black with rounded corners
Text: White, gold highlight on current word
Word spacing: marginRight: 0.3em (CSS inline-block fix)

Section title styling:

Background dim overlay
Gradient text (white → gold)
Animated top/bottom lines
3-second duration with bounce animation

Speaker label styling:

Gradient border (gold → orange)
Large name (52px) with slide-in
Gradient title text
8-second display at video start

Complete Example

# 1. Chunk and transcribe
python skills/chunk-process/smart_chunk.py raw.mp4 -o chunks/
python skills/chunk-process/mlx_transcribe.py chunks/ --batch --word-timestamps

# 2. Split into sentences and analyze
python skills/talking-head/sentence_split.py raw.mp4 chunks/transcript.json -o clips/
python skills/talking-head/analyze_script.py clips/clip_index.json -o topics.json

# 3. Select topic and stitch (Claude Code reviews topics, user picks one)
python skills/talking-head/stitch_clips.py topics.json --topic 10 -o topic10_raw.mp4

# 4. Precision trim (removes fillers, stutters)
python skills/talking-head/precision_trim.py run topic10_raw.mp4 -o trimmed/

# 5. Generate captions (aspect-ratio aware)
python skills/talking-head/generate_captions.py \
  trimmed/word_transcript.json trimmed/kept_segments.json \
  --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
  -o trimmed/

# 6. Generate section titles (Gemini)
python skills/talking-head/generate_sections.py \
  trimmed/word_transcript.json trimmed/kept_segments.json \
  -o trimmed/sections.json -n 4

# 7. Render both formats
python skills/talking-head/render_with_captions.py \
  trimmed/trimmed.mp4 trimmed/ \
  --sections trimmed/sections.json \
  --speaker "Xiaoyin Qu" --speaker-title "Founder of heyboss.ai" \
  -o final/

Result:

final/trimmed_captioned.mp4 (16:9, ~140MB)
final/trimmed_vertical.mp4 (9:16, ~178MB)

Both with rolling captions, speaker label, and cinematic section titles!

Adoption

nuva-lab/skills/talking-head

$ install --global

Security Scan Results

SKILL.md

Talking-Head Video Processing

Full Pipeline Overview

Key Insight: Recall-First Topic Analysis

Quick Start

Claude Code Integration

Skills Reference

sentence_split.py

analyze_script.py

lowres_convert.py (Optional)

batch_analyze.py (Optional)

stitch_clips.py

Output: clip_scores.json

Sentence Index

Precise Cutting (Zero Overlap)

Key Learnings

Phase 2: Precision Trimming

precision_trim.py

Phase 3: Rolling Captions + Multi-Format

generate_captions.py

generate_sections.py

render_with_captions.py

Remotion Components

Complete Example

Related Skills

nuva-lab/write-script

nuva-lab/voice-clone

nuva-lab/skills/validate-media

nuva-lab/transcribe-clip

nuva-lab/skills/talking-head

$ install --global

Security Scan Results

SKILL.md

Talking-Head Video Processing

Full Pipeline Overview

Key Insight: Recall-First Topic Analysis

Quick Start

Claude Code Integration

Skills Reference

sentence_split.py

analyze_script.py

lowres_convert.py (Optional)

batch_analyze.py (Optional)

stitch_clips.py

Output: clip_scores.json

Sentence Index

Precise Cutting (Zero Overlap)

Key Learnings

Phase 2: Precision Trimming

precision_trim.py

Phase 3: Rolling Captions + Multi-Format

generate_captions.py

generate_sections.py

render_with_captions.py

Remotion Components

Complete Example

Related Skills

nuva-lab/write-script

nuva-lab/voice-clone

nuva-lab/skills/validate-media

nuva-lab/transcribe-clip