skills/design-skills/video-generator/SKILL.md
Professional AI video production workflow. Use when creating videos, short films, commercials, or any video content using AI generation tools.
Install this skill globally with one command: `npx skillsauth add abcnuts/manus-skills video-generator`. Works with Claude Code, Cursor, and Windsurf.
Before starting, memorize these non-negotiable rules:
[PHASE 1 STOP] MUST ask questions to gather information. DO NOT assume or guess missing details—always ask the user. Never proceed without explicit user confirmation.
[DETAILED VIDEO PROMPT] Video prompts must include detailed transition_description (2-4 sentences). One-line prompts are insufficient.
[KEYFRAME DIFFERENCE] Last keyframe must show interpolatable change from first keyframe: subject position/pose, subject state (open/close, appear/disappear), or composition change. Subtle-only changes (lighting, background) while subject stays static cause unnatural video motion.
[PHASE 4 MANDATORY] MUST generate reference images before keyframes. Never skip Phase 4.
[ASPECT RATIO] ALL keyframes must use 16:9 or 9:16, and must be upright (not rotated). Never generate 1:1 or other ratios.
[NO TTS FOR ON-SCREEN] Never use TTS for on-screen dialogue or singing. Video model generates audio with lip sync.
[NARRATION CLIP BY CLIP] Generate off-screen narration separately for each clip, not all at once.
[AUDIO MIXING] When combining audio tracks (video audio, narration, BGM), preserve ALL tracks—overlay, never replace. Narration must be clearly audible and maintain consistent volume across all clips.
| Tool | Use When |
|------|----------|
| generate_image | Create new images (with or without references) |
| generate_image_variation | Edit existing images |
| Field | Description |
|-------|-------------|
| Purpose | Goal and target audience |
| Narrative arc | Story structure and key points |
| Duration | Total length in seconds |
| Aspect ratio | 16:9 or 9:16 only |
| Visual style | Sub-genre aesthetic (e.g., "Makoto Shinkai anime", "Pixar 3D") |
| Reference materials | Reference videos, images, brand guidelines |
| Language | For dialogue and narration |
| Recurring elements | Characters/objects with appearance descriptions |
| Dialogue/singing needs | On-screen character audio |
| Narration needs | Off-screen narrator (gender, tone, pace) |
Use these perspectives to guide your questions:
| Dimension | Expert Role | Key Questions |
|-----------|-------------|---------------|
| Strategy & Audience | Creative Director | Who is this for? What's the goal? What action should viewers take? |
| Narrative & Structure | Screenwriter | What's the story? Key moments? Emotional arc? |
| Visual Style | Director + Art Director | What look and feel? Reference videos/images? Color mood? |
| Shot Execution | Cinematographer | Any specific shots in mind? Product hero shots needed? |
| Sound Design | Sound Designer | Voiceover? Music mood? Dialogue? Sound effects? |
Ask questions across all dimensions. Prioritize based on user's initial description.
[MANDATORY STOP - DO NOT PROCEED WITHOUT USER CONFIRMATION] Summarize gathered information and wait for user confirmation before Phase 2.
Define these 4 dimensions (applied to primary reference images in Phase 4):
| Dimension | Example Values |
|-----------|----------------|
| Sub-genre | Makoto Shinkai anime, Pixar 3D, cyberpunk noir |
| Rendering + Line | 2D hand-drawn with thick outlines, 3D cel-shading |
| Color + Lighting | High saturation neon, soft diffused natural light |
| Detail density | Minimalist, highly detailed backgrounds |
Example specification:
Sub-genre: Cyberpunk anime
Rendering + Line: 2D digital painting, thin glowing outlines
Color + Lighting: High saturation neon (pink, cyan, purple), dark backgrounds, rim lighting
Detail density: Highly detailed backgrounds, moderate character detail
For each character/object:
| Field | Description |
|-------|-------------|
| unique_identifier | Name for reference |
| appearance | Text description for prompts |
| outfit_description | Clothing/accessories (characters) |
| language | Spoken/sung language (if applicable) |
| mechanical_properties | Physical behavior (if applicable) |

| Scenario | BGM Source |
|----------|------------|
| Music video / diegetic music (visible source) | Embedded (in video prompt) |
| Background mood music | Separate (Phase 5 BGM Preparation) |
| No music | None |
If Separate, define: genre, instruments, tempo
| Field | Values |
|-------|--------|
| narrative_purpose | establish / develop / climax / resolve / transition / supplementary (product shot, detail, reaction, insert, B-roll, POV) |
| pacing | slow / moderate / fast |
| scene | Environment description |
| content_action | Subject + action + trajectory |
| transition_description | [REQUIRED] Detailed transition process. Must include: subject appearance, movement trajectory, state changes, existence statements. 2-4 sentences minimum. |
| duration | 4 / 6 / 8 |
| camera_movement | static / pan / tilt / dolly / zoom / crane / arc / handheld |
| first_keyframe_framing | Shot size + angle + composition |
| first_keyframe_visible_content | What's visible |
| last_keyframe_framing | Shot size + angle + composition |
| last_keyframe_visible_content | What's visible |
| last_keyframe_edit_from_first | yes / no (see decision table below) |
| inter_clip_boundary | continuous / scene_cut |
| first_keyframe_reuse | yes / no |
| last_keyframe_required | yes / no |
| on_screen_dialogue | "Name: text" or "Name: [lyrics] (style)" or None |
| sound_effects | Sources or None |
| bgm_source | embedded / separate / none |
| bgm_cue | If embedded: style, BPM, instruments. If separate: emotion, intensity |
| narration_cue | Narrator text or None |
inter_clip_boundary = continuous → next clip's first_keyframe_reuse = yes
first_keyframe_reuse = yes → previous clip must have last_keyframe_required = yes
When planning last_keyframe_visible_content, ensure interpolatable change from first_keyframe_visible_content: subject position/pose, subject state (open/close, appear/disappear), or composition change.
[WARNING] Avoid last keyframes with only lighting or background changes while subject remains static—this causes unnatural video motion.
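The two continuity rules above are easy to get wrong in a long clip list, so a small check can help before any keyframes are generated. The following is a minimal sketch, not part of the skill itself: `ClipPlan` and `check_continuity` are hypothetical names, yes/no fields are modeled as booleans, and `inter_clip_boundary` is assumed to describe the boundary between a clip and the one that follows it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipPlan:
    """Hypothetical subset of the clip-planning fields used by the check."""
    name: str
    inter_clip_boundary: str = "scene_cut"  # "continuous" or "scene_cut"
    first_keyframe_reuse: bool = False
    last_keyframe_required: bool = False


def check_continuity(clips: List[ClipPlan]) -> List[str]:
    """Return a list of violations of the two continuity rules."""
    problems = []
    for prev, nxt in zip(clips, clips[1:]):
        # Rule 1: continuous boundary -> next clip reuses the keyframe.
        if prev.inter_clip_boundary == "continuous" and not nxt.first_keyframe_reuse:
            problems.append(f"{nxt.name}: first_keyframe_reuse must be yes")
        # Rule 2: reused first keyframe -> previous clip must produce a last keyframe.
        if nxt.first_keyframe_reuse and not prev.last_keyframe_required:
            problems.append(f"{prev.name}: last_keyframe_required must be yes")
    # Added assumption: the first clip has no previous keyframe to reuse.
    if clips and clips[0].first_keyframe_reuse:
        problems.append(f"{clips[0].name}: no previous clip to reuse a keyframe from")
    return problems
```

An empty result means the plan satisfies both rules; each returned string names the clip whose field needs to change.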
| Camera Movement | First & Last Keyframe Overlap? | Set to |
|-----------------|-------------------------------|--------|
| static, small pan/tilt, zoom | Yes (same scene area) | yes |
| large pan, dolly, tracking, crane, arc | No (different area) | no |
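The decision table above can be expressed as a small lookup. This is a sketch under stated assumptions: the function name and the `large_sweep` flag are hypothetical, a large tilt is assumed to behave like a large pan, and handheld is rejected because the table does not cover it.

```python
def last_keyframe_edit_from_first(camera_movement: str, large_sweep: bool = False) -> str:
    """Map a camera movement to "yes" (edit from first keyframe) or "no"."""
    always_overlapping = {"static", "zoom"}
    sweep_dependent = {"pan", "tilt"}  # small sweep overlaps, large sweep does not
    never_overlapping = {"dolly", "tracking", "crane", "arc"}

    if camera_movement in always_overlapping:
        return "yes"
    if camera_movement in sweep_dependent:
        return "no" if large_sweep else "yes"
    if camera_movement in never_overlapping:
        return "no"
    # Not covered by the table (e.g. handheld): decide case by case.
    raise ValueError(f"no rule for camera movement: {camera_movement!r}")
```

For example, `last_keyframe_edit_from_first("pan", large_sweep=True)` gives `"no"`, so the last keyframe would be generated rather than edited.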
This field directly becomes part of the video prompt. The more detailed, the better.
Must include: subject appearance, movement trajectory, state changes, and existence statements.
Length guideline: 2-4 sentences minimum. One-line descriptions are insufficient.
| Insufficient | Sufficient |
|--------------|------------|
| "Open box revealing jar" | "The frosted glass jar with gold lid is inside the box from the start, hidden by the closed cream-colored lid. Elegant hands with manicured nails lift the lid upward smoothly. As the lid rises, the jar gradually comes into view - first the gold cap edge, then the full jar nestled in champagne velvet." |
| "Person walks left to right" | "Woman in white dress with brown hair starts at left edge of frame, walks steadily rightward at moderate pace, maintaining upright posture, reaches right edge by end of clip." |
| "Light turns on" | "Room starts in complete darkness. Light gradually increases from the ceiling fixture at center, warm yellow glow spreading outward across the wooden furniture until fully illuminated." |

| Movement | Constraint |
|----------|------------|
| Pan/Tilt/Zoom | Camera fixed, content within rotational/zoom range |
| Dolly/Tracking/Crane | Content physically traversable within duration |
| Arc | Subject centered in both keyframes, environment allows orbit |
| Handheld | Similar to Dolly but allows irregularity |
| Combined | Must satisfy ALL involved movement constraints |
Common Mistakes:
| Mistake | Correction |
|---------|------------|
| "Pan from corridor entrance to middle" | Use "dolly forward" |
| First: room A, Last: room B | Split into two clips |
| 6-second clip covering 100 meters | Extend duration or reduce distance |
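The "physically traversable within duration" constraint is simple arithmetic once a plausible camera speed is chosen. The sketch below is illustrative only: the function name and `max_speed_m_per_s` are hypothetical, and the skill does not specify any speed limit, so the planner has to supply one.

```python
from typing import Optional

ALLOWED_DURATIONS = (4, 6, 8)  # seconds, the values allowed for the `duration` field


def shortest_feasible_duration(distance_m: float, max_speed_m_per_s: float) -> Optional[int]:
    """Return the shortest allowed duration that covers the distance, or None.

    max_speed_m_per_s is an assumption chosen by the planner, not a value
    defined by the skill.
    """
    for duration in ALLOWED_DURATIONS:
        if distance_m <= max_speed_m_per_s * duration:
            return duration
    # Even the longest clip is too short: reduce the distance or split the shot.
    return None
```

With an assumed limit of 2 m/s, a 10-meter dolly fits in a 6-second clip, while 100 meters returns `None`, which matches the "extend duration or reduce distance" correction in the table.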
After all clips planned, list required reference images:
| Element | Clips Using It | Required Images |
|---------|----------------|-----------------|
| (name) | Clip X (MS), Clip Y (CU) | Full body, Face close-up |
[WARNING] Only generate what clips actually need. Do NOT generate all angles by default.
MANDATORY. Do not skip to Phase 5.
Step 1: Primary reference (visual anchor)
generate_image (no references)
Step 2: Additional angles/shots
generate_image with primary reference as reference
[WARNING] Never generate additional refs without using primary ref as reference.
[CRITICAL] ALL keyframes: aspect ratio from Phase 1 (16:9 or 9:16). Never 1:1.
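A quick dimension check can catch a stray 1:1 keyframe before it reaches the video model. This is a minimal sketch: the function name and the `tolerance` default are assumptions, and pixel dimensions alone cannot tell whether the content is upright, so that still needs a visual check.

```python
def is_allowed_aspect_ratio(width: int, height: int, tolerance: float = 0.01) -> bool:
    """Return True if width:height is (approximately) 16:9 or 9:16."""
    if width <= 0 or height <= 0:
        return False
    ratio = width / height
    # tolerance is an assumed allowance for rounding in generated image sizes.
    return any(abs(ratio - target) <= tolerance for target in (16 / 9, 9 / 16))
```

For instance, 1920x1080 and 1080x1920 pass, while 1024x1024 is rejected.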
first_keyframe_reuse = yes → Use previous clip's last keyframe (no generation)
first_keyframe_reuse = no → Generate new keyframe
If generating first keyframe:
generate_image
last_keyframe_required = no → Skip
last_keyframe_required = yes:
last_keyframe_edit_from_first = yes → Edit mode
last_keyframe_edit_from_first = no → Generate mode
If EDIT mode:
generate_image_variation
If GENERATE mode:
generate_image
When generating last keyframe, verify:
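Stepping back to the first/last keyframe decision rules in this phase, the branching can be summarized in one function. This is a sketch, not the skill's implementation: `plan_keyframe_calls` is a hypothetical helper, yes/no fields are modeled as booleans, and `None` stands for "no generation needed".

```python
from typing import Dict, Optional


def plan_keyframe_calls(
    first_keyframe_reuse: bool,
    last_keyframe_required: bool,
    last_keyframe_edit_from_first: bool,
) -> Dict[str, Optional[str]]:
    """Return which tool (if any) to call for each keyframe of a clip."""
    # Reuse means the previous clip's last keyframe is taken as-is.
    first = None if first_keyframe_reuse else "generate_image"

    if not last_keyframe_required:
        last = None  # skip
    elif last_keyframe_edit_from_first:
        last = "generate_image_variation"  # edit mode
    else:
        last = "generate_image"  # generate mode
    return {"first": first, "last": last}
```

A clip that starts fresh and ends on an edited last keyframe would yield `{"first": "generate_image", "last": "generate_image_variation"}`.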
Video prompt should be detailed. Even with keyframes, video models may drift during generation.
Prompt includes:
Audio in prompt:
| Type | Include |
|------|---------|
| On-screen dialogue | "Name says: text" with tone, language |
| On-screen singing | "Name sings: [lyrics]" with style, language |
| Sound effects | Source + quality |
| Embedded BGM | Style, BPM, instruments, mood |
Prompt ending by bgm_source:
Example (music video with embedded BGM):
Hatsune Miku center stage, singing in Japanese with sweet electronic voice:
"ラララ、光の中で踊り出す", energetic J-pop at 140 BPM with synthesizer,
crowd cheering, concert atmosphere
[CRITICAL] Never use TTS for on-screen dialogue/singing. Video model generates audio with lip sync.
Method: Search and download from royalty-free music libraries (e.g., Pixabay, YouTube Audio Library).
[CRITICAL] Generating music with Python or any other tools is strictly prohibited. You must only use pre-existing, royalty-free tracks.
Match the downloaded music to the style defined in Phase 2.
[WARNING] Generate clip by clip, not all at once.
| Type | Method | Output |
|------|--------|--------|
| On-screen dialogue/singing | Video model | Embedded |
| Sound effects | Video model | Embedded |
| Embedded BGM | Video model | Embedded |
| Separate BGM | Search only | Separate track |
| Narration | TTS (clip by clip) | Separate track |
When combining multiple audio sources:
| Track | Source |
|-------|--------|
| Video audio | Embedded in video clips (dialogue, sound effects, embedded BGM) |
| Narration | TTS generated (off-screen narrator) |
| Separate BGM | Searched from royalty-free source |
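[CRITICAL] Mixing rules: preserve ALL tracks (overlay, never replace). Narration must be clearly audible and maintain consistent volume across all clips.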
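As a rough sketch of the overlay-not-replace rule, the three tracks could be mixed per clip with FFmpeg's `amix` filter. The skill does not name a mixing tool, so FFmpeg is an assumption here, as are the function name, the file paths, and the gain values. `normalize=0` (FFmpeg 4.4 or newer) keeps `amix` from rescaling the inputs, so the same narration gain applies to every clip. The function only builds the command and does not run it.

```python
from typing import List


def build_mix_command(
    video: str,
    narration: str,
    bgm: str,
    output: str,
    narration_gain: float = 1.0,  # assumed value; keep identical across clips
    bgm_gain: float = 0.3,        # assumed value; low enough to stay under narration
) -> List[str]:
    """Build an ffmpeg command that overlays narration and BGM on the clip audio."""
    filter_graph = (
        f"[1:a]volume={narration_gain}[narr];"
        f"[2:a]volume={bgm_gain}[bgm];"
        # Mix all three tracks; stop when the clip's own audio (first input) ends.
        "[0:a][narr][bgm]amix=inputs=3:duration=first:normalize=0[mix]"
    )
    return [
        "ffmpeg",
        "-i", video, "-i", narration, "-i", bgm,
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "copy",  # video stream is passed through untouched
        output,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once per clip. Reusing the same gain values for every clip is one way to keep narration volume consistent.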