skills/video-digest/SKILL.md
Summarize any video by analyzing both audio and visuals. Downloads via yt-dlp, extracts transcript (YouTube captions or Whisper), pulls scene-detected keyframes, and produces a multimodal summary with clickable timestamped YouTube links. Use this skill whenever the user wants to summarize a YouTube video, digest a talk or tutorial, get notes from a video, extract key points from a recording, or says things like "tl;dw", "summarize this video", "what's in this video", or pastes a YouTube URL and asks for a summary. Also triggers for non-YouTube URLs that yt-dlp supports.
npx skillsauth add catalan-adobe/skills video-digestInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Summarize a video by combining audio transcript with visual frame analysis. Produces a markdown summary with clickable timestamped links back to the source video.
Video URL (or local file)
0. Check dependencies (yt-dlp, ffmpeg)
1. Download video + metadata
2. Extract transcript (YouTube captions first, Whisper fallback)
3. Extract keyframes via scene detection
4. Segment into chapters or time-based chunks
5. Parallel subagents analyze chunks (frames + transcript)
6. Synthesize final summary
7. Write markdown file + print stdout preview
All yt-dlp, ffmpeg, and whisper operations go through the helper
script bundled with this skill at scripts/video-digest.sh.
Locating the script:
if [[ -n "${CLAUDE_SKILL_DIR:-}" ]]; then
DIGEST_SH="${CLAUDE_SKILL_DIR}/scripts/video-digest.sh"
else
DIGEST_SH="$(command -v video-digest.sh 2>/dev/null || \
find ~/.claude -path "*/video-digest/scripts/video-digest.sh" -type f 2>/dev/null | head -1)"
fi
if [[ -z "$DIGEST_SH" || ! -f "$DIGEST_SH" ]]; then
echo "Error: video-digest.sh not found. Ask the user for the path." >&2
fi
Store the result in DIGEST_SH and use it for all subsequent commands.
Commands:
deps — check and report dependency statusdownload <url> [workdir] — download video + metadata + thumbnailtranscript <workdir> [--force-whisper] [--lang LANG] — extract captions or transcribeframes <video> [threshold] [workdir] — scene-detect keyframes + contact sheetsinfo <workdir> — parse metadata, report title/duration/chaptersParse these from the user's message or ask if ambiguous:
| Flag | Default | Purpose |
|------|---------|---------|
| --depth | detailed | brief, detailed, or full |
| --force-whisper | off | Skip YouTube captions, transcribe with whisper-ctranslate2 |
| --scene-threshold | 0.3 | ffmpeg scene detection sensitivity (0.1-1.0) |
| --lang | en | Subtitle/transcription language code |
Locate the script and check dependencies:
"$DIGEST_SH" deps
If any required dependency is missing, stop and offer to install:
brew install yt-dlp ffmpeg
Do NOT proceed to Step 1 until both yt-dlp and ffmpeg are confirmed
available. Re-run deps after installation to verify.
Optional: whisper-ctranslate2 (only needed with --force-whisper).
Auto-installed on first use via uv tool install whisper-ctranslate2.
Ask the user for the video URL if not already provided. Then download:
"$DIGEST_SH" download "<url>" "<workdir>"
The work directory defaults to ./video_digest_<video_id>/ in the
current working directory. After download, report to the user:
title, channel, duration, whether chapters are available, file size.
Steps 2 and 3 are independent — run them simultaneously using two Bash tool calls in the same message. This saves significant time, especially on longer videos where both operations are slow.
"$DIGEST_SH" transcript "<workdir>" [--force-whisper] [--lang en]
Default path: Extracts YouTube captions (manual preferred over auto-generated). Parses the VTT file into timestamped segments.
--force-whisper path: Extracts audio as WAV, transcribes with
whisper-ctranslate2 (faster-whisper backend). Produces timestamped SRT
which is converted to our format automatically.
Auto-fallback: If no YouTube captions exist and --force-whisper
was not specified, the script auto-engages Whisper and prints a
notice. Inform the user this is happening and that it takes longer.
The transcript is saved as <workdir>/transcript.txt with timestamps:
[00:00:05] Welcome to this talk about...
[00:00:12] Today we'll cover three topics...
"$DIGEST_SH" frames "<workdir>/<video_file>" [threshold] "<workdir>"
Default scene threshold is 0.3 (tunable by user). The script:
<workdir>/frames/timecodes.txtReport: number of keyframes extracted, number of contact sheets.
If very few frames are extracted (< 5 for a video > 2 minutes),
suggest the user lower the threshold: "Only N frames detected.
Try --scene-threshold 0.2 for more granularity?"
Parse the metadata JSON for chapter information:
"$DIGEST_SH" info "<workdir>"
Short videos (< 10 minutes): Skip chunking entirely. Read the full transcript AND every contact sheet image (using the Read tool) in a single analysis pass (Step 5 without subagents). This avoids unnecessary overhead for content that fits in one context window.
Longer videos with chapters: Use chapters as segment boundaries.
Longer videos without chapters: Split into ~10-minute chunks, aligning boundaries to the nearest keyframe timecode.
For each chunk, prepare the transcript segment, contact sheet(s) covering that time window, and a title (chapter name or "Part N: MM:SS - MM:SS").
For short videos (< 10 min), read the transcript and all contact sheets yourself (Read tool) and produce the summary inline. Flag notable frames for the screenshot gallery. Skip to Step 6.
For longer videos, spawn one Agent per chunk, ALL IN PARALLEL. Each subagent receives both the transcript segment and contact sheet path(s). Never skip frames — on-screen text, graphics, and UI states add context absent from audio.
Mapping contact sheets to chunks: Use burned-in timestamps to determine which sheet(s) cover each chunk. A chunk may span two sheets — include both.
Spawn one Agent per chunk using the template in the subagent prompt. Fill in the placeholders with each chunk's title, time range, transcript segment, and contact sheet path(s).
After all agents complete, read their outputs. If any failed, re-run individually — the pipeline tolerates partial results but flag gaps to the user.
Combine all chunk summaries into a cohesive document. The video URL is needed for timestamp links — extract the video ID from the metadata JSON.
Build YouTube deep links using the format:
https://youtube.com/watch?v=<VIDEO_ID>&t=<SECONDS>
Prepare assets directory:
Create <workdir>/assets/ and populate it:
Thumbnail: Find the downloaded thumbnail in workdir (usually
<id>.webp or <id>.jpg from --write-thumbnail). Copy to
assets/<video_id>_thumbnail.jpg, converting if needed:
ffmpeg -y -loglevel error -i "<workdir>/<thumbnail>" "<workdir>/assets/<video_id>_thumbnail.jpg"
Screenshots: Collect notable frames flagged by subagents (or
from your own analysis for short videos). Match their timestamps
against frames/timecodes.txt (line N = frame_<N zero-padded to 4>.jpg) to find the closest frame file. Copy selected frames
to assets/<video_id>_screenshot_01.jpg, <video_id>_screenshot_02.jpg, etc.
Only include frames with genuine visual importance (diagrams, slides, code, charts, UI states). Aim for 3-8 screenshots. If none are notable, omit the Screenshots section.
For local files (no yt-dlp download), omit the URL line and thumbnail if no thumbnail was downloaded.
Structure the final markdown:
# Video Digest: <Title>
**Channel:** <uploader> | **Duration:** <duration> | **Date:** <date>\
**URL:** <source-url>

## tl;dw
<2-3 sentence overview — always present regardless of depth>
## Contents
- [Chapter Title](#section-anchor) ([MM:SS](youtube-deep-link))
- ...
## <Chapter Title>
<section summary with inline timestamp links>
...
## Key Moments
- [MM:SS](youtube-deep-link) — <description>
- ...
## Screenshots

*[MM:SS](youtube-deep-link) — <description>*
...
For brief depth: tl;dw + Contents with one-line descriptions only.
For detailed depth: full structure as above.
For full depth: full structure plus exhaustive notes per section.
Screenshots are included at all depth levels when notable frames exist.
Save the full summary to <workdir>/digest.md.
Print a condensed preview to stdout:
Video Digest: <Title>
Duration: MM:SS | Sections: N | Frames analyzed: N | Screenshots: N
tl;dw
<2-3 sentence overview>
Sections
- [00:00 - Introduction](https://youtube.com/watch?v=xxx&t=0)
- [05:23 - Setting up the project](https://youtube.com/watch?v=xxx&t=323)
- ...
Full summary: <workdir>/digest.md
Assets: <workdir>/assets/
SKILL.md to ~/.claude/commands/video-digest.mdscripts/video-digest.sh to ~/.local/bin/video-digest.sh
and chmod +x itcommand -v video-digest.shtools
Reduce a webpage to a structural skeleton with semantic tokens. Two-phase pipeline: Phase 1 injects a browser script that tokenizes content ({TEXT}, {HEADING:n}, {IMAGE:WxH}, {CTA:label}, {LINK:label}, {INPUT:type}, {VIDEO}, {ICON}). Phase 2 applies LLM structural reasoning to collapse repeated patterns ({REPEAT:N}), remove decorative wrappers, strip utility classes, and produce skeleton.html + manifest.json. Use when migrating pages to EDS, analyzing page structure, extracting page blueprints, or preparing input for GenAI block generation. Triggers on: reduce page, page skeleton, page blueprint, extract structure, tokenize page, page reduction, structural skeleton, reduce URL.
tools
Capture a spatial hierarchy of rendered DOM elements from any webpage. Injects a pre-built script via playwright-cli that walks the DOM, detects layout grids, extracts backgrounds, prunes invisible nodes, promotes elements rendered outside their DOM parent (overlays, fixed navs, modals), and tags overlay nodes with occlusion metadata. Returns three outputs: LLM-friendly indented text, structured JSON tree, and a nodeMap mapping positional IDs to CSS selectors with background and overlay data. Use before page decomposition, overlay detection, brand extraction, or any workflow that needs structured page analysis. Triggers on: visual tree, capture tree, page structure, page hierarchy, DOM tree, capture visual, page analysis, extract tree.
development
Design and build web UIs with Adobe Spectrum 2 design system. Applies S2 layout principles, visual hierarchy, spacing, and component composition to produce accessible interfaces. Outputs vanilla CSS with Spectrum tokens (static pages) or Spectrum Web Components (interactive apps). Recommends tier based on complexity. Covers sp-theme setup, side-effect imports, overlay system, form patterns, --mod-* token customization, and 14 critical gotchas. Use for: spectrum 2 web, SWC, sp-button, sp-theme, build UI with spectrum, S2 layout, spectrum application, adobe design system, web component form, spectrum overlay.
development
Control Slack via CDP or headless API tokens. Navigate channels, read/send messages, search conversations, check unreads, and manage status. Two modes: CDP (Slack desktop with --remote-debugging-port) for full UI control, or headless (xoxp/xoxb token) for data operations without Slack running. Triggers on: slack, read slack, search slack, slack unreads, send slack message, slack status, navigate slack, check slack, slack messages, go to channel, slack DM.