etanhey/qa-video

Name: qa-video
Author: etanhey

skills/golem-powers/qa-video/SKILL.md

npx skillsauth add etanhey/golems qa-video

/qa-video — Video-Based QA + Gems Pipeline

Record your screen while narrating, or provide a YouTube/local video for knowledge extraction. The pipeline extracts speech, pulls visual context, and produces either structured QA findings or durable gems.

How It Works

Screen Recording (.mov)
  → ffmpeg audio extraction
    → whisper-cli transcription (SRT + TXT)
      → LLM reads SRT, identifies QA-relevant segments
        → Frame extraction at hotspot timestamps + regular intervals
          → Claude Vision reads frames + correlates with transcript
            → Structured QA findings document
              → Agent handoff (Codex/Claude worker via cmux)

YouTube / gems request
  → yt-dlp audio + metadata
    → whisper-cli transcription (SRT + TXT)
      → keyword hotspot detection for insights, claims, data, and examples
        → yt-dlp/ffmpeg frame extraction at hotspot timestamps
          → vision pass over frames + transcript context
            → brain_digest full content
              → brain_store structured gems

The Cardinal Rule: Narrate Before You Act

Tell the user BEFORE every QA recording session:

Narrate your intentions BEFORE clicking. Say "I'm about to click Spin Rare on Sarah" → click → describe what happened. This aligns speech timestamps with actions, making the pipeline 3x more accurate at identifying what you were pointing at.

The 2-5 second gap between clicking and narrating is the #1 accuracy killer. Coaching the user to narrate-first is more impactful than any technical fix.

Workflow Detection & Routing

Read the user's request and route to the right workflow:

| User says | Route to |
|-----------|----------|
| "let's do QA", "test this", "QA round" | workflows/record.md: Pre-QA checklist + recording setup |
| "process this video", "I recorded QA", path to .mov file | workflows/process.md: Stalker pipeline processing |
| "send fixes to Codex", "hand off findings" | workflows/handoff.md: Agent handoff pattern |
| "next round", "retest", "QA round N" | workflows/iterate.md: Multi-round QA cycle |
| "set up click capture", "qa-record" | references/click-capture.md: CGEventTap + qa-record.sh |
| YouTube URL, "video gems", "extract from video", "insights/takeaways from this video" | workflows/gems.md: YouTube/local-video knowledge extraction with transcript + frames |

Override signals: If the user says "gems", "insights", or "takeaways", use the gems workflow regardless of source. If they say "QA", "bugs", or "findings", use the QA workflow regardless of source.

If ambiguous: Ask whether this is a QA recording to process or a video to extract gems from.
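The routing rules above can be sketched as a single `case` statement. This is a minimal illustration, not the skill's actual router: `route_request` and the `ask` sentinel are hypothetical names, the matching is plain substring matching on a lowercased request, and sending generic "QA"/"bugs"/"findings" wording to workflows/process.md is an assumption (the source only says "the QA workflow").

```shell
# Hypothetical helper: maps a free-text request to a workflow file from the routing table.
route_request() {
  # Lowercase and pad with spaces so whole words like " qa " can be matched.
  req=" $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') "
  case "$req" in
    # Gems override: wins regardless of the video source.
    *gems*|*insights*|*takeaways*)                    echo "workflows/gems.md" ;;
    *"click capture"*|*qa-record*)                    echo "references/click-capture.md" ;;
    *"hand off"*|*"send fixes"*)                      echo "workflows/handoff.md" ;;
    *"next round"*|*retest*|*"qa round "[0-9]*)       echo "workflows/iterate.md" ;;
    *.mov*|*"process this video"*|*"i recorded qa"*)  echo "workflows/process.md" ;;
    *"let's do qa"*|*"test this"*|*"qa round"*)       echo "workflows/record.md" ;;
    # QA override (assumption: generic QA wording means "process this recording").
    *" qa "*|*" bugs "*|*" findings "*)               echo "workflows/process.md" ;;
    *youtube.com*|*youtu.be*|*"extract from video"*)  echo "workflows/gems.md" ;;
    # Nothing matched: ask the user which workflow they want.
    *)                                                echo "ask" ;;
  esac
}
```

Pattern order matters here: the gems override is checked first, and "QA round N" is checked before the bare "QA round" so that numbered rounds reach the iterate workflow.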

Key Design Decisions (learned from real usage)

LLM reads the SRT directly — no automated hotspot detection (sox/ImageMagick). Claude reading the transcript is a better hotspot detector than volume spikes or frame diffs. The automated signals (from the original Twitch stalker pipeline) are unnecessary for QA narration.
Regular interval + hotspot frames — Extract one frame every 30 seconds for full visual coverage, PLUS frames at each identified hotspot. Hotspot-only extraction misses visual bugs described after the fact.
Before/after context frames — At each hotspot, extract frames at -5s and +5s in addition to the hotspot timestamp. The user may explain an issue THEN show it, or show it THEN explain.
Whisper model: ggml-small — Fast on Apple Silicon (~14s for 7min video), accurate enough for English QA narration. The ggml-large-v3 is better but 5x slower — not worth it for QA.
Findings live in the PROJECT repo — docs/qa-session-YYYY-MM-DD/ in the project being tested, not in the orchestrator. Each round gets a suffix: qa-findings-round2.md.
BrainLayer storage is mandatory — After every video processing run, brain_store the findings summary with tags ["qa", "<project>", "round-N"].
QA is iterative — Expect 3-6 rounds per feature. The skill supports multi-round workflows with proper round numbering and delta tracking (what was fixed vs. what persists).
Gems use the same media primitives with a different analysis target — QA hotspots look for bugs and UX issues. Gems hotspots look for surprising insights, strong opinions, technical revelations, numbers, examples, and reusable advice.
BrainLayer is the destination for gems — Files are intermediate artifacts. Use brain_digest for full transcripts/notes, then brain_store the structured gems. If BrainLayer is unavailable, write the full output to docs.local/qa-video/[date]-[title].md and flag that persistence failed.
Gemini handles visual-heavy frame batches — Per /agent-routing, use Gemini for bulk frame/OCR/visual reads. Claude wraps up with synthesis, brain_digest, brain_store, ledger updates, and Drive archival.
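To make the frame-selection decisions above concrete, here is a small sketch that merges the 30-second interval frames with the -5s/0/+5s context frames around each hotspot. `frame_timestamps` is a hypothetical helper, not part of the skill, and it assumes the video duration and the hotspots are already known as whole seconds.

```shell
# Hypothetical helper: prints sorted, de-duplicated frame timestamps in whole seconds.
# Usage: frame_timestamps DURATION_SECONDS [HOTSPOT_SECONDS...]
frame_timestamps() {
  duration=$1
  shift
  {
    # Regular coverage: one frame every 30 seconds.
    t=0
    while [ "$t" -le "$duration" ]; do
      echo "$t"
      t=$((t + 30))
    done
    # Context frames: 5 s before, at, and 5 s after each hotspot.
    for h in "$@"; do
      for off in -5 0 5; do
        ts=$((h + off))
        # Skip timestamps that fall outside the video.
        if [ "$ts" -ge 0 ] && [ "$ts" -le "$duration" ]; then
          echo "$ts"
        fi
      done
    done
  } | sort -n -u
}
```

For a 70-second recording with hotspots at 3s and 58s, `frame_timestamps 70 3 58` prints 0, 3, 8, 30, 53, 58, 60, 63 (one per line); the -2s frame is dropped because it falls before the start of the video.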

Prerequisites

| Tool | Check | Install |
|------|-------|---------|
| ffmpeg | which ffmpeg | brew install ffmpeg |
| whisper-cli | which whisper-cli | brew install whisper-cpp |
| whisper model | ls ~/.cache/whisper/ggml-small.bin | whisper-cli --download-model small |
| yt-dlp | which yt-dlp | pip3 install yt-dlp |

Optional (click capture — Phase 2):

| Tool | Check | Install |
|------|-------|---------|
| pyobjc | python3 -c "import Quartz" | pip3 install pyobjc-framework-Quartz pyobjc-framework-ApplicationServices pyobjc-framework-Cocoa |
| qa_click_logger.py | ls ~/Gits/orchestrator/scripts/qa/qa_click_logger.py | Already exists in orchestrator repo |
| qa-record.sh | ls ~/Gits/orchestrator/scripts/qa/qa-record.sh | Already exists in orchestrator repo |
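The required checks can be run in one pass. The sketch below is illustrative only: `missing_tools` is a hypothetical helper, and it uses `command -v` in place of `which`. The model path is the one given in the table.

```shell
# Hypothetical helper: prints each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Report any required tool from the table that is not installed.
missing=$(missing_tools ffmpeg whisper-cli yt-dlp)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
fi

# The whisper model is a file, not a command, so check for it separately.
[ -f "$HOME/.cache/whisper/ggml-small.bin" ] || echo "Missing whisper model: ggml-small.bin"
```

Install anything reported missing with the command in the Install column before starting a session.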

Quick Reference

Start a recording session:

# With click capture (recommended):
bash ~/Gits/orchestrator/scripts/qa/qa-record.sh ~/Gits/<project>/docs/

# Manual (just screen recording):
# Cmd+Shift+5 → Record Selected Portion → narrate while testing

Process a video:

VIDEO="/path/to/recording.mov"
# Use $HOME here: a "~" inside double quotes is not expanded by the shell
WORKDIR="$HOME/Gits/<project>/docs/qa-session-$(date +%Y-%m-%d)"
mkdir -p "$WORKDIR/frames"

# 1. Extract audio
ffmpeg -i "$VIDEO" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$WORKDIR/audio.wav"

# 2. Transcribe
whisper-cli -m ~/.cache/whisper/ggml-small.bin -f "$WORKDIR/audio.wav" \
  --output-srt --output-txt -of "$WORKDIR/transcript" -l auto

# 3. Read transcript.srt → identify hotspot timestamps
# 4. Extract frames (hotspots + every 30s)
# 5. Read frames with Claude Vision
# 6. Compile findings doc

For the full step-by-step, load workflows/process.md.
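As a rough sketch of steps 3 and 4, an SRT timestamp can be converted to seconds and passed to ffmpeg to grab a single frame. Both helper names (`srt_to_seconds`, `extract_frame`) and the output file naming are assumptions made for illustration; workflows/process.md remains the authoritative procedure.

```shell
# Hypothetical helper: converts an SRT timestamp (HH:MM:SS,mmm) to whole seconds.
# awk reads fields such as "08" as decimal, which avoids shell octal pitfalls.
srt_to_seconds() {
  echo "$1" | awk -F'[:,]' '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: saves one frame at SECONDS into OUTDIR.
# Usage: extract_frame VIDEO SECONDS OUTDIR
extract_frame() {
  # -ss before -i seeks in the input; -frames:v 1 writes a single frame.
  ffmpeg -nostdin -loglevel error -y -ss "$2" -i "$1" -frames:v 1 \
    "$3/frame_$(printf '%05d' "$2")s.jpg"
}

# Example (not run here): grab the frame for a cue that starts at 00:01:23,450.
# extract_frame "$VIDEO" "$(srt_to_seconds '00:01:23,450')" "$WORKDIR/frames"
```

`srt_to_seconds '00:01:23,450'` prints 83, so the example would write frame_00083s.jpg into the frames directory.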

Extract YouTube gems:

# Give /qa-video the URL and load workflows/gems.md
/qa-video https://www.youtube.com/watch?v=VIDEO_ID

Video QA and knowledge extraction pipeline for screen recordings, local video, and YouTube URLs. Use for narrated QA sessions, bugs/UX issues, QA checklists, handoffs, iterative QA rounds, qa-record/click capture, stalker pipeline, video gems, YouTube insights, extract from video, process this recording, or frame-based analysis. Routes QA recordings to record/process/handoff/iterate workflows; routes YouTube/gems requests through transcript → keyword hotspots → frames → BrainLayer/Drive archival.

3 stars

tools

Updated Jun 7, 2026

Install this skill globally with one command:

npx skillsauth add etanhey/golems qa-video

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Jun 7, 2026, 3:07 AM (56.7s, 2 files scanned)

SKILL.md

name: qa-video
description: Video QA and knowledge extraction pipeline for screen recordings, local video, and YouTube URLs. Use for narrated QA sessions, bugs/UX issues, QA checklists, handoffs, iterative QA rounds, qa-record/click capture, stalker pipeline, video gems, YouTube insights, extract from video, process this recording, or frame-based analysis. Routes QA recordings to record/process/handoff/iterate workflows; routes YouTube/gems requests through transcript → keyword hotspots → frames → BrainLayer/Drive archival.
execute: scripts/default.sh

Install manually

# Clone the repo
git clone https://github.com/etanhey/golems.git

# Copy into Claude Code skills folder (global)
cp -r golems/skills/golem-powers/qa-video ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

etanhey/golems

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT