heygen-translate/SKILL.md
Translate and dub a video into another language with voice cloning and lip-sync, powered by HeyGen Video Translation. The presenter keeps their face, their voice is cloned into the target language, and lips re-sync to the new audio — viewers see the same person speaking natively. Use when: (1) localizing an existing video into one or more languages ("translate this video to Spanish", "make this in French and German", "dub this into Japanese", "I need this in 10 languages for a launch"), (2) the user has a finished video and wants the SAME presenter speaking another language (not a new presenter — that's heygen-video), (3) podcast / audio-only translation ("translate this podcast", "dub the audio but keep my video"), (4) high-stakes translations where the user wants to review/edit subtitles before final render (the proofreads workflow), (5) "translate my video", "dub this", "localize this clip", "make a multilingual version", "subtitle and dub". Returns the translated video URL (or audio file for audio-only mode), one per target language. Chain signal: if the user wants to CREATE a new video in another language (no source video exists yet), route to heygen-video and write the script in the target language — do not use heygen-translate. Use heygen-translate only when there is an existing source video to localize. NOT for: creating new videos from scratch (use heygen-video), avatar creation (use heygen-avatar), TTS-only synthesis (use heygen-video with audio-only output), or text-only translation.
npx skillsauth add heygen-com/skills heygen-translateInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Translate and dub an existing video into 175+ languages. The system clones the presenter's voice into the target language, re-syncs their lips to the new audio, and returns a fully dubbed video. You provide a source video and a target language — the engine handles transcription, translation, voice cloning, lip-sync, and (optionally) burned-in captions.
This is not new-video generation. The presenter, performance, framing, and brand assets in the original video are preserved. Translation rides on top of what's already there.
Pick one transport at session start. Never mix, never switch mid-session, never narrate the choice.
Detect in this order:
video_generate tool exposes a HeyGen translation model, prefer that. Currently the plugin generates videos but does not expose translation directly — fall through to the next tier until HeyGen ships translation through video_generate.HEYGEN_API_KEY is set in the environment AND heygen --version exits 0, use CLI. API-key presence is an explicit signal that the user wants direct API access.HEYGEN_API_KEY AND HeyGen MCP tools are visible (mcp__heygen__*). OAuth auth, runs against the user's plan credits.heygen --version exits 0. Auth via heygen auth login.curl -fsSL https://static.heygen.ai/cli/install.sh | bash then heygen auth login."After mode detection, verify auth actually works before entering Phase 1. This avoids wasting the user's time gathering inputs only to hit an auth error on submit.
heygen auth status (silent). If it exits 0, proceed. If it exits non-zero (no key, expired, invalid):
echo "<key>" | heygen auth login (writes to ~/.heygen/credentials, survives across sessions).heygen auth status. If still failing, surface the error and stop.This is a one-time setup. Once heygen auth login persists the key, future sessions pick it up automatically. Don't ask again if heygen auth status passes.
Hard rules:
curl api.heygen.com/... Every operation in this skill has a CLI command and (where supported) an MCP tool. Use those.mcp__heygen__* tools. If translation isn't exposed via MCP yet, fall through to CLI for translation operations specifically. Do not synthesize raw HTTP calls.heygen ... commands. Run heygen video-translate --help and heygen video-translate <subcommand> --help to discover arguments. Use --request-schema to see the full JSON shape of any create command.create_video_translation (single language). Multi-language and proofreads are
not yet exposed via MCP — fall through to CLI for those. Run mcp__heygen__*
tool listing at session start to confirm what's available; tool surface evolves.
heygen video-translate
├── languages list # supported target languages
├── create # submit a translation job (single or batch)
├── get <id> # check status / fetch result
├── list # list past translations
├── update # update job metadata
├── delete # delete a job
└── proofreads
├── create # extract editable subtitles before final render
├── get <id> # check proofread session status
├── srt get <id> # download the extracted SRT
├── srt update <id> # upload edited SRT
└── generate <id> # render the final video from approved SRT
heygen asset create --file <path> # for local source video uploads (max 32 MB)
Every command supports --help. Use --request-schema on any create to see the full JSON body. CLI output: JSON on stdout, {error:{code,message,hint}} envelope on stderr, exit codes 0 ok · 1 API · 2 usage · 3 auth · 4 timeout. Add --wait on create to block until the job completes (default timeout 20m).
📖 Detailed CLI/MCP error → action mapping → references/troubleshooting.md
The skill runs four phases. Phase 1 (Discovery) is the only place you ask questions. Phase 2 (Pre-flight) is silent. Phase 3 (Submit + Poll) is silent. Phase 4 (Deliver) is one short message.
Phase 1 — Discovery — gather minimum needed inputs from the user
Phase 2 — Pre-flight — validate language, classify content, set flags
Phase 3 — Submit + Poll — kick off, background poll, surface only on done/fail
Phase 4 — Deliver — post the result with one-line summary
Ask only what you don't already have. Communicate in the user's language. Never run a form. One or two questions per turn, max.
Required inputs (block until you have these):
Important inputs (ask if not provided, with smart defaults):
enable_dynamic_duration: true). Only set to false when the user needs frame-exact timing (e.g., syncing to a timeline, ad slot, or external audio track).Optional (ask only if relevant):
start_time and end_time in seconds.📖 Locale-pair gotchas (formality registers, RTL languages, tonal compression, lip-sync ceiling) → references/language-locale-guide.md
Silent. No user-facing chatter. Three checks, in order.
Check 2a: Language validation.
MCP: list_video_translation_languages() (if exposed). Otherwise CLI.
CLI: heygen video-translate languages list | jq -r '.data.languages[]'
The list contains exact strings ("Spanish (Spain)", "Chinese (Mandarin, Simplified)", "Arabic (Saudi Arabia)"). Match the user's input case-insensitively against these exact strings. If they say "Spanish", default to "Spanish (Spain)" and confirm in Phase 4. If they say "Chinese", default to "Chinese (Mandarin, Simplified)". If they specify a region ("Mexican Spanish"), map it ("Spanish (Mexico)"). If no match: ask the user to pick from the closest options.
Check 2b: Source video routing.
| Source the user gave you | Route |
|--------------------------|-------|
| Public HTTPS URL (no auth, returns video MIME on HEAD) | Pass directly as {type: "url", url: "..."} |
| Auth-walled URL, 403, 404, or HTML response | Tell the user, ask for a public URL or local file |
| Local file path | Upload via heygen asset create --file <path> (CLI) or upload_asset (MCP). Max 32 MB. Use the returned asset_id as {type: "asset_id", asset_id: "..."} |
| Existing HeyGen asset_id | Pass directly as {type: "asset_id", asset_id: "..."} |
📖 Asset routing edge cases (very large files, presigned URLs, auth-walled sources) → references/asset-routing.md
Check 2c: Content profile.
Pick one profile based on the source. Don't list all five to the user — propose silently and only ask if the source is genuinely ambiguous (e.g., a music-heavy talking-head where you can't tell if speech enhancement will help).
| Profile | Use when | Flags |
|---------|----------|-------|
| Talking head / presenter (default) | One person speaks to camera; clean audio | mode: precision, enable_speech_enhancement: true, enable_caption: true, enable_dynamic_duration: true, keep_the_same_format: true |
| Podcast / audio-only | The visual is static, doesn't matter, or doesn't exist | mode: precision, translate_audio_only: true, enable_speech_enhancement: true, enable_caption: true |
| Music / high-soundtrack | Background music interferes with speech | mode: precision, disable_music_track: true, enable_speech_enhancement: true, enable_dynamic_duration: true, keep_the_same_format: true |
| Multi-speaker | Two or more distinct speakers | Talking-head defaults + speaker_num: <count>. Speaker count is REQUIRED here — don't guess. |
| Corporate / branded | Brand voice, glossary discipline, high-stakes | Talking-head defaults + (if user has one) brand_voice_id. Strongly consider proofreads for this profile. |
Always:
mode: "precision" unless the user explicitly asks for "fast" / "quick" / "speed".enable_dynamic_duration: set based on the user's answer to the duration flexibility question in Phase 1. Default true (recommended) — lets translated speech breathe instead of being crammed into the source's exact timing. Set false only when the user explicitly needs fixed-length output. Tonal compression makes flexibility especially important for en→zh, en→ja, en→ko (Asian languages run shorter); de→en, ja→en (run longer); ar/he/ur (RTL + register shifts).keep_the_same_format: true for visual translations — preserves the source's resolution and bitrate so the dubbed video matches the original's encoding.enable_watermark: false (the default).Silent. Background work. Surface only on (a) per-language completion, (b) per-language hard failure, (c) >5 min progress check.
Branching:
Submit one job per target language using batch syntax (--output-languages accepts multiple).
MCP (single language only at time of writing):
create_video_translation(
video={type, url|asset_id},
output_languages=["Spanish (Spain)"],
mode="precision",
enable_speech_enhancement=true,
enable_caption=true,
enable_dynamic_duration=true,
keep_the_same_format=true,
speaker_num=<n>, # only when known multi-speaker
)
CLI:
heygen video-translate create \
-d '{"video":{"type":"url","url":"https://..."},"output_languages":["Spanish (Spain)","Japanese (Japan)"]}' \
--mode precision \
--enable-speech-enhancement \
--enable-caption \
--enable-dynamic-duration \
--keep-the-same-format \
--speaker-num 1 \
--title "<short title>"
Response returns one video_translation_id per language. Capture all of them.
Polling (silent, backgrounded):
Use --wait on create to block until completion when running ONE language. For batch, drop --wait and poll each ID:
# CLI mode polling (background)
heygen video-translate get <video-translation-id>
# Returns { data: { status: "pending"|"running"|"succeeded"|"failed", video_url, ... } }
Polling cadence: 30s for the first 3 minutes, then 60s. Most translations complete in 5–15 min; some (long videos, batched languages) take 30+ min. Hard timeout: 60 min per translation — beyond that, treat as stuck and surface the issue.
MCP equivalents: get_video_translation(id) (if exposed). Otherwise fall through to CLI for polling.
📖 Background polling pattern (don't poll in foreground / harness-specific notes) → references/troubleshooting.md#polling
For high-stakes content, run a proofread session first so the user can review/edit the translated subtitles before the engine commits to a final render.
# 1. Create proofread session — returns proofread_ids (one per language)
heygen video-translate proofreads create \
-d '{"video":{"type":"url","url":"https://..."}}' \
--output-languages "Spanish (Spain)" \
--mode precision \
--enable-speech-enhancement \
--keep-the-same-format \
--speaker-num 1 \
--title "<short fileNAME-safe title>"
# → status: processing (3–5 min for short videos)
# 2. Poll until completed (or failed + failure_message)
heygen video-translate proofreads get <proofread-id>
# → status: completed
# 3. Fetch presigned URLs for editable + original SRTs
heygen video-translate proofreads srt get <proofread-id> > /tmp/srt-resp.json
SRT_URL=$(jq -r '.data.srt_url' /tmp/srt-resp.json) # target-lang, edit this
ORIG_URL=$(jq -r '.data.original_srt_url' /tmp/srt-resp.json) # source-lang transcript
curl -s "$SRT_URL" -o /tmp/proofread.srt
# 4. Edit /tmp/proofread.srt by hand or sed (glossary, register, names)
# See references/proofreads-workflow.md for the full edit playbook.
# 5. Host the edited SRT at a public URL, then upload by reference.
# ⚠️ asset_id route is currently BLOCKED for SRTs —
# `heygen asset create` only accepts png/jpeg/mp4/webm/mp3/wav/pdf.
# Use the URL route. (gist raw, S3 public-read, presigned ≥2h, etc.)
EDITED_URL="https://example.com/proofread-edited.srt"
heygen video-translate proofreads srt update <proofread-id> \
-d "{\"srt\":{\"type\":\"url\",\"url\":\"$EDITED_URL\"}}"
# 6. Kick off final render — returns a video_translation_id
heygen video-translate proofreads generate <proofread-id> --captions
# → {"data":{"video_translation_id":"<vid-id>","status":"processing"}}
# 7. Poll the translation to completion (NOT proofreads get — graduates here)
heygen video-translate get <vid-id>
# → status: running → succeeded; data.video_url has the final mp4
📖 When to insist on proofread, common SRT edits, glossary discipline → references/proofreads-workflow.md
One message per completed language. Format:
✅ Spanish (Spain) — <video_url> 1m 47s, precision mode, captions on.
If a language failed: one short line with the cause (from troubleshooting reference). Don't flood the user with retry options unless they ask. If the user batched many languages, deliver each as it completes — don't wait for all to finish before posting any.
Source-quality disclaimer. Translation can't improve on the source. If the source has muffled audio, fast cuts, heavy occlusion of the face, or low resolution, lip-sync and voice quality will degrade. When you detect these conditions in Phase 2 (or the user mentions them), warn upfront. Don't surface this after a bad result.
The defaults above cover the common case. The decisions below are what separate this skill from a generic API wrapper. Use them as judgement calls during the workflow, not as a checklist to recite.
For talking-head: 1 speaker. For interviews / podcasts / panels: count exactly, don't guess. The engine separates voices by speaker_num; wrong count means voices bleed across speakers in the dubbed output. If the user is unsure, ask them to scrub the video and count.
A 30-second triage in Phase 2 saves 10–30 minutes of bad translation. Watch/listen to the first ~10 seconds of the source and check:
enable_speech_enhancement: true. If music dominates, disable_music_track: true. If both, warn the user that quality may be lower regardless of flags.enable_caption: true AND warn about the existing burn-in.📖 Full locale-pair table with register notes and known quirks → references/language-locale-guide.md
Lip-sync is best on:
Lip-sync degrades on:
If the user's source has these conditions, warn them in Phase 1/2: "Heads up — the source has [X], so lip-sync won't be as tight as it would on a static talking-head. Want me to proceed anyway, or switch to audio-only translation?"
enable_caption: true produces captions burned into the video by default. Pros: no separate file, plays anywhere. Cons: not editable later, can collide with source graphics, fixed font/style. For high-stakes content where the user might want to restyle captions (brand kit, language-specific font), prefer the proofreads workflow — it gives an SRT they can use as a sidecar caption file.
translate_audio_only: true skips lip-sync entirely. Use it for:
Output is an audio file (typically MP3). Tell the user how to use it: "This gives you a translated audio track. Composite it back over the original video in your editor, or use it standalone." Do NOT pitch audio-only as a "quality workaround" for bad lip-sync — it's a different deliverable.
Translations bill by source video duration. A 5-minute video translated into 5 languages = 25 billable minutes. Surface time and cost expectations in Phase 1 when the user requests batches: "That's 5 languages × 5 minutes = ~25 min of translation time. Each one will take 10–20 min to render. Sound good?"
Don't quote dollar figures (pricing changes, varies by plan). Quote source minutes × language count, plus an honest render-time range.
Common error responses → human-readable causes → next action:
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| 400 "video URL not accessible" | URL requires auth, returned HTML, or wrong MIME | Ask for public URL or local file → upload route |
| 400 "language not supported" | String didn't match canonical languages list | Re-run languages list, present closest matches |
| failed status with "audio extraction" | Source has no audible speech, very corrupted audio, or wrong codec | Verify the source has speech; consider re-encoding |
| failed with "speaker detection" | speaker_num mismatched actual speakers, or audio is too noisy | Re-submit with correct speaker count or enable_speech_enhancement: true |
| Stuck >30 min in running | Backend queue / occasional stalls | Check status, give it 60 min total, then surface to user |
| Lip-sync looks bad on output | Source face conditions (see lip-sync ceiling) | Re-frame expectation; offer audio-only as alternative |
| Captions in wrong direction | RTL language with burned-in caption colliding with source layout | Switch to proofread + sidecar SRT |
📖 Full error → action table including auth and asset upload errors → references/troubleshooting.md
Signals the user is new to HeyGen translation specifically:
heygen video-translate list)For first-timers, suggest a 30–60 second test clip before committing to a full video. This catches source-quality issues, voice-clone fidelity, and lip-sync ceiling without burning a long-video translation.
When the source is borderline:
"Heads up — the source video has [muffled audio / dim lighting / fast cuts / heavy music / etc.]. The translation engine can't improve on the source, so the dub might inherit some of that. Want to proceed, fix the source first, or test with a short clip?"
Don't surface this after a bad result. Surface it in Phase 2.
Currently best-supported via the proofreads workflow but not yet first-class flags:
brand_voice_id flag is for voice consistency across translations, not for glossaries.srt field (Enterprise plan) on create, or via proofreads srt update for any plan. Use srt_role: "input" to apply YOUR subtitles to the source language; output to apply them as the target-language captions.start_time / end_time in seconds. Useful for "translate just minute 2:00–4:00".--output-languages (CLI) or output_languages array (MCP). One job ID returned per language.--callback-url and --callback-id skip polling entirely. Use when you have a webhook endpoint and want event-driven completion.development
Generate HeyGen presenter videos via the v3 Video Agent pipeline — handles Frame Check (aspect ratio correction), prompt engineering, avatar resolution, and voice selection. Required for any HeyGen video generation. Replaces deprecated endpoints with v3. Use when: (1) generating any HeyGen video (via API or otherwise), (2) sending a personalized video message (outreach, update, announcement, pitch, knowledge), (3) creating a HeyGen presenter-led explainer, tutorial, or product demo with a human face, (4) "make a video of me saying...", "send a video to my leads", "record an update for my team", "create a video pitch", "make a loom-style message", "I want to appear in this video", "generate a HeyGen video", "make a talking head video". Accepts avatar_id from heygen-avatar for identity-first HeyGen videos, or uses a stock presenter. Returns video share URL + HeyGen session URL for iteration. Chain signal: when the user wants to create/design an avatar AND make a video in the same request, run heygen-avatar first, then return here. Conjunctions to watch: "and then", "and immediately", "first...then", "X and make a video", "design [presenter] and record" = always CHAIN. If the user provides a photo AND wants a video, route to heygen-avatar first. NOT for: avatar creation or identity setup (use heygen-avatar first), cinematic footage or b-roll without a presenter, translating videos, TTS-only, or streaming avatars.
development
Create a persistent HeyGen avatar — a reusable face + voice identity for the agent, the user, or any named character — powered by HeyGen Avatar V technology. Prompt-based creation by default (description → HeyGen builds it); photo upload is optional for real-person digital twins. Use when: (1) giving the agent a face + voice so it can present videos ("bring yourself to life", "create your avatar", "give yourself an avatar", "design a presenter", "set up an avatar", "let's make an avatar"), (2) the user wants to appear in videos as themselves ("create my avatar", "I want my face in a video", "digital twin of me", "build me an avatar"), (3) building a named character presenter ("create an avatar called Cleo", "design a character named X"), (4) establishing HeyGen identity before making videos — the correct FIRST step when no avatar exists yet. Chain signal: when the user says both an identity/avatar action AND a video action in the same request ("create an avatar AND make a video", "set up identity THEN create a video", "design a presenter AND immediately record"), run heygen-avatar first, then heygen-video. Returns avatar_id + voice_id — pass directly to heygen-video to create HeyGen videos. NOT for: generating videos (use heygen-video), translating videos, or TTS-only tasks.
development
Create HeyGen avatar videos via the v3 Video Agent pipeline — handles avatar resolution, aspect ratio correction, prompt engineering, and voice selection automatically. Required for any HeyGen API usage (api.heygen.com). Replaces deprecated v1/v2 endpoints with the optimized v3 pipeline. Use when: (1) calling any HeyGen API endpoint (api.heygen.com), (2) creating a HeyGen avatar or digital twin from a photo, (3) making a personalized video message (outreach, pitch, update, announcement, knowledge), (4) "make a video of me", "create my HeyGen avatar", "I want to appear in this video", (5) "send a video to my leads", "record an update for my team", "make a loom-style message", (6) building identity-first videos where the presenter IS the user or agent, Covers: HeyGen API, api.heygen.com, video generate, avatar create, voice list, talking photo, HeyGen avatar creation, voice design, photo → digital twin, HeyGen video generation, identity-first video, messaging-first video, AI presenter, talking head video. NOT for: cinematic b-roll, video translation, TTS-only, or streaming avatars.
tools
# HeyGen Video Agent — NanoClaw Container Skill ## When to Use Use this skill when the user wants to create a video with an AI avatar presenter. Triggers: "make a video", "create a video message", "record a video", "avatar video", "talking head video", "video pitch", "video update". NOT for: image generation, audio-only TTS, video translation, or cinematic b-roll. ## Required Environment - `HEYGEN_API_KEY` — Get from https://app.heygen.com/settings?nav=API - `heygen` CLI — install: `curl -f