skills/video-frame-extraction-analysis/SKILL.md
Extract keyframes, detect scenes, and build CLIP-indexed temporal search over video content. Activate on: keyframe extraction, scene detection, video search, video indexing, temporal analysis. NOT for: video editing/rendering (video-processing-editing), video generation (ai-video-production-master).
Extract keyframes, detect scene boundaries, and build CLIP-indexed temporal search systems for video content analysis and retrieval.
Activate on: "keyframe extraction", "scene detection", "video search", "video indexing", "find frame in video", "temporal search", "video content analysis", "scene boundary detection", "CLIP video search"
NOT for: Video editing, trimming, or rendering (video-processing-editing), video generation from text/images (ai-video-production-master), or face recognition in video (face-recognition-system-builder)
| Domain | Technologies | Notes |
|--------|-------------|-------|
| Frame Extraction | ffmpeg, decord, OpenCV, PyAV | decord is fastest for random access |
| Scene Detection | PySceneDetect, TransNetV2 | Cut detection + gradual transition detection |
| Visual Embedding | CLIP, SigLIP, InternVideo2 | Per-frame or pooled-scene embeddings |
| Temporal Search | Vector DB + timestamp metadata | "Find the frame where X happens" |
| Shot Analysis | Shot type classification, motion estimation | Wide/medium/close-up, camera movement |
| OCR on Frames | PaddleOCR, EasyOCR, Tesseract | Extract text from slides, titles, signage |
```
Video File ──→ [Scene Detector] ──→ [Keyframe Selector] ──→ [CLIP Embed] ──→ [Vector DB]
    │                 │                      │                   │                │
input          PySceneDetect        1 frame per scene       SigLIP-large     store with
mp4/mkv        threshold=27         at scene midpoint       384 px input     timestamp
               detect cuts          + every 5 sec in                         + scene_id
               + transitions        long scenes (>30s)                       + metadata
```
```python
# Scene detection + keyframe extraction
import decord
from scenedetect import ContentDetector, detect


def extract_keyframes(video_path: str) -> list[dict]:
    """Extract one keyframe per scene (plus extras for long scenes) with timestamps."""
    # Step 1: Detect scene boundaries
    scene_list = detect(video_path, ContentDetector(threshold=27))

    # Step 2: Extract keyframe at midpoint of each scene
    vr = decord.VideoReader(video_path)
    fps = vr.get_avg_fps()
    keyframes = []

    for i, (start, end) in enumerate(scene_list):
        start_frame = start.get_frames()
        end_frame = end.get_frames()
        scene_dur = (end_frame - start_frame) / fps
        mid_frame = (start_frame + end_frame) // 2

        frame_indices = [mid_frame]
        # For long scenes (>30s), add extra keyframes every 5 sec
        if scene_dur > 30:
            step = max(1, int(5 * fps))  # guard against a zero step at very low fps
            frame_indices += [
                idx for idx in range(start_frame + step, end_frame, step)
                if idx != mid_frame  # do not emit the midpoint frame twice
            ]

        for idx in frame_indices:
            keyframes.append({
                "frame": vr[idx].asnumpy(),  # RGB numpy array
                "frame_index": idx,
                "timestamp_sec": idx / fps,
                "scene_index": i,
                "scene_duration": scene_dur,
            })

    return keyframes
```
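To sanity-check the sampling rule without decoding any video, the index arithmetic can be isolated into a small pure function. This is only a sketch: `plan_keyframe_indices` and its parameter names are hypothetical, and it assumes the midpoint frame should be returned once even when a 5-second sample lands on it.

```python
def plan_keyframe_indices(
    start_frame: int,
    end_frame: int,
    fps: float,
    long_scene_sec: float = 30.0,
    interval_sec: float = 5.0,
) -> list[int]:
    """Return the sorted frame indices to sample for one scene.

    Always includes the scene midpoint. Scenes longer than
    `long_scene_sec` also get one frame every `interval_sec`,
    starting one interval after the scene start.
    """
    indices = {(start_frame + end_frame) // 2}
    duration_sec = (end_frame - start_frame) / fps
    if duration_sec > long_scene_sec:
        step = max(1, int(interval_sec * fps))
        indices.update(range(start_frame + step, end_frame, step))
    return sorted(indices)


# A 10-second scene at 25 fps yields only its midpoint.
short_scene = plan_keyframe_indices(0, 250, fps=25.0)

# A 40-second scene at 25 fps yields the midpoint plus 5-second samples.
long_scene = plan_keyframe_indices(0, 1000, fps=25.0)
```

Here `short_scene` contains a single index and `long_scene` contains seven, because the sample at frame 500 coincides with the midpoint and is counted once.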
```
Indexing (offline):
  Video ──→ [Extract Keyframes] ──→ [SigLIP Embed] ──→ [Vector DB]
                                                            │
                                                  metadata per frame:
                                                  video_id, timestamp,
                                                  scene_id, thumbnail_path
```
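The per-frame metadata above, and the lookup that returns results such as `video_3.mp4 @ 01:23:45`, can be sketched with a small in-memory index. This is an illustration under stated assumptions, not a production design: `FrameRecord`, `TemporalIndex`, and `format_timestamp` are hypothetical names, brute-force cosine similarity stands in for a real vector database, and the embeddings are expected to come from whichever SigLIP or CLIP encoder is in use (no model is called here).

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FrameRecord:
    """One indexed keyframe: metadata plus its embedding vector."""
    video_id: str
    timestamp_sec: float
    scene_id: int
    thumbnail_path: str
    embedding: tuple[float, ...]


def _normalize(vec) -> tuple[float, ...]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec) if norm else tuple(vec)


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


class TemporalIndex:
    """Brute-force stand-in for a vector DB with timestamp metadata."""

    def __init__(self) -> None:
        self._records: list[FrameRecord] = []

    def add(self, record: FrameRecord) -> None:
        # Store a copy with a unit-length embedding.
        self._records.append(replace(record, embedding=_normalize(record.embedding)))

    def search(self, query_embedding, k: int = 5) -> list[tuple[float, str]]:
        """Return the top-k (similarity, "video @ HH:MM:SS") hits."""
        q = _normalize(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(q, r.embedding)), r) for r in self._records
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [
            (score, f"{r.video_id} @ {format_timestamp(r.timestamp_sec)}")
            for score, r in scored[:k]
        ]


# Toy 3-dimensional embeddings; real ones would come from the image/text encoder.
index = TemporalIndex()
index.add(FrameRecord("video_3.mp4", 5025.0, 12, "thumbs/v3_s12.jpg", (1.0, 0.0, 0.0)))
index.add(FrameRecord("video_7.mp4", 2712.0, 4, "thumbs/v7_s04.jpg", (0.0, 1.0, 0.0)))
hits = index.search((0.9, 0.1, 0.0), k=2)
```

For this to work with text queries, the text and image embeddings must come from the same model so that both live in one vector space.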
```
Querying (online):
  Text: "person walking through rain" ──→ [SigLIP Text Embed] ──→ [Vector Search]
                                                                         │
                                                                   top-k frames
                                                                   with timestamps
                                                                         │
                                                              "video_3.mp4 @ 01:23:45"
                                                              "video_7.mp4 @ 00:45:12"
```
```
Video ──┬──→ [Keyframes] ──→ [CLIP Embed] ──→ [Visual Index] ─────────────────┐
        │                                                                     │
        ├──→ [Audio Track] ──→ [Whisper] ──→ [Text Embed] ──→ [Text Index] ───┤──→ [Fusion Search]
        │                                                                     │
        └──→ [Frame OCR] ──→ [Text Extract] ──→ [Text Embed] ─────────────────┘

Fusion search: query hits all three indexes, reciprocal rank fusion combines results

"Explain the sales chart" →
  Visual: frame with chart           → timestamp 15:30
  Audio:  "our Q3 numbers show..."   → timestamp 15:28
  OCR:    "Q3 Revenue: $4.2M"        → timestamp 15:30
  Fused result: 15:28-15:35 segment with high confidence
```
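The fusion step can be sketched with reciprocal rank fusion, which scores each item by summing `1 / (k + rank)` over every ranked list it appears in. The sketch below makes two assumptions that the text above does not state: hits from the visual, audio, and OCR indexes are first mapped to a shared scene identifier so that nearby timestamps (15:28 and 15:30) count as the same item, and `k = 60` is used because it is the conventional default for this method. The function name and the scene labels are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of item ids into one ranking.

    Each list is ordered best-first and is assumed to contain no
    duplicates. An item's fused score is the sum of 1 / (k + rank)
    over the lists that contain it, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-modality results, already mapped to scene ids.
visual_hits = ["scene_42", "scene_7", "scene_13"]
audio_hits = ["scene_42", "scene_13"]
ocr_hits = ["scene_9", "scene_42"]

fused = reciprocal_rank_fusion([visual_hits, audio_hits, ocr_hits])
best_scene, best_score = fused[0]
```

In this toy input `scene_42` ranks first because all three modalities return it. Because only ranks are used, the raw similarity scores of the three indexes never need to be calibrated against each other.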