guides/video/talking-head-production/SKILL.md
Talking head video production with AI avatars, lipsync, and voiceover. Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS). Also covers OmniHuman, PixVerse, Fabric. Portrait requirements, audio quality, production workflows. Use for: spokesperson videos, course content, social media, presentations, demos. Triggers: talking head, avatar video, lipsync, lip sync, ai spokesperson, virtual presenter, ai presenter, omnihuman, talking avatar, video presenter, ai talking head, presenter video, ai face video, p-video-avatar
Install the belt CLI skill:
```shell
npx skills add belt-sh/cli
```
Create talking head videos with AI avatars and lipsync via inference.sh CLI.
Requires the inference.sh CLI (belt). See the install instructions.
```shell
belt login

# Recommended: P-Video-Avatar (built-in TTS, fastest, cheapest)
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product tour. Today I will show you three features that will save you hours every week.",
  "voice": "Zephyr (Female)"
}'
```
The source portrait is critical: a poor portrait produces poor video output.
| Requirement | Why | Spec |
|------------|-----|------|
| Center-framed | Avatar needs face in predictable position | Face centered in frame |
| Head and shoulders | Body visible for natural gestures | Crop below chest |
| Eyes to camera | Creates connection with viewer | Direct frontal gaze |
| Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning |
| Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions |
| High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |
```shell
# Generate a professional portrait with P-Image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, photorealistic",
  "aspect_ratio": "9:16"
}'
```
| Type | When to Use |
|------|-------------|
| Solid color | Professional, clean, easy to composite |
| Soft bokeh | Natural, lifestyle feel |
| Office/studio | Business context |
| Dynamic (P-Video-Avatar) | Use video_prompt to set background |
Start with P-Video-Avatar: in the comparison below it is 15-18x faster and roughly 6x cheaper than the alternatives, and it has built-in TTS.
| Model | App ID | Built-in TTS | Best For |
|-------|--------|-------------|----------|
| P-Video-Avatar | pruna/p-video-avatar | Yes (30 voices, 10 langs) | Best overall: speed, cost, quality |
| OmniHuman 1.5 | bytedance/omnihuman-1-5 | No | Multi-character, gestures |
| OmniHuman 1.0 | bytedance/omnihuman-1-0 | No | Single character |
| Fabric 1.0 | falai/fabric-1-0 | Yes | Image talks with lipsync |
| PixVerse Lipsync | falai/pixverse-lipsync | No | Realistic lipsync |
| Model | Speed (per sec of video) | Cost per second |
|-------|-------------------------|----------------|
| P-Video-Avatar | ~1.83s/s | $0.025 |
| OmniHuman 1.5 | ~28s/s (15x slower) | $0.16 (6.4x more) |
| Fabric 1.0 | ~34s/s (18x slower) | $0.14 (5.6x more) |
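
To make the per-second figures concrete, the sketch below estimates render time and cost for a clip of a given length. The rates are copied from the table above and are assumed to apply per second of output video. The function name `estimate_clip` is hypothetical and not part of the belt CLI, and actual billing may differ.

```shell
# estimate_clip MODEL SECONDS
# Prints "<render-seconds> <cost-usd>" using the table's per-second figures.
# Hypothetical helper for illustration; not part of the belt CLI.
estimate_clip() {
  model=$1
  seconds=$2
  case "$model" in
    p-video-avatar) speed=1.83; price=0.025 ;;
    omnihuman-1-5)  speed=28;   price=0.16 ;;
    fabric-1-0)     speed=34;   price=0.14 ;;
    *) echo "unknown model: $model" >&2; return 1 ;;
  esac
  # awk handles the floating-point arithmetic that plain sh lacks
  awk -v s="$seconds" -v speed="$speed" -v price="$price" \
    'BEGIN { printf "%.1f %.2f\n", s * speed, s * price }'
}

estimate_clip p-video-avatar 60   # prints: 109.8 1.50
estimate_clip omnihuman-1-5 60    # prints: 1680.0 9.60
```

For a 60-second clip this works out to about two minutes and $1.50 with P-Video-Avatar, against 28 minutes and $9.60 with OmniHuman 1.5.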
P-Video-Avatar has built-in voices, so no separate TTS step is needed:
```shell
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Hi there! I am excited to share something with you today.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'
```

```shell
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news for our community!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone, slightly faster pace",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'
```
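
The examples above wrap the JSON input in single quotes, so a script containing an apostrophe or a double quote would break the command. One way around this, sketched below, is to build the payload in a variable first. `json_escape` and `avatar_payload` are hypothetical helper names, and the escaping assumes single-line text with no control characters.

```shell
# json_escape TEXT
# Escapes backslashes and double quotes so TEXT can sit inside a JSON string.
# Assumes single-line text with no control characters (tabs, newlines).
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# avatar_payload IMAGE_URL SCRIPT VOICE
# Builds the JSON input used in the examples above. Hypothetical helper.
avatar_payload() {
  printf '{"image": "%s", "voice_script": "%s", "voice": "%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

payload=$(avatar_payload "https://portrait.jpg" \
  "Hi, I'm Sam. Today's topic is \"lipsync\"." "Puck (Male)")
printf '%s\n' "$payload"

# Pass the result to the CLI as one double-quoted argument:
# belt app run pruna/p-video-avatar --input "$payload"
```

Where `jq` is installed, `jq -n --arg` is a more robust way to build the same object, since it also handles newlines and other control characters.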
Provide your own audio file:
```shell
# P-Video-Avatar with custom audio
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'

# OmniHuman with custom audio
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'
```
```shell
# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'
```
```shell
# 1. Generate speech
belt app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Create talking head
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'
```
OmniHuman 1.5 supports up to 2 characters:
```shell
# 1. Generate dialogue with two speakers
belt app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://two-person-portrait.png",
  "audio_url": "<audio-url>"
}'
```
For content longer than ~60 seconds, split into segments:
```shell
# Generate clips from the same portrait for consistency
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment one..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment two..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment three..."}' --no-wait

# Merge the segments once all three clips have finished rendering
belt app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
```
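
Splitting a long script by hand is error-prone, so a small helper can do it at sentence boundaries. The sketch below is one possible approach, not part of the belt CLI: `split_script` is a hypothetical name, and the word budget stands in for duration under the assumption of roughly 150 spoken words per minute (so about 150 words per 60-second segment).

```shell
# split_script MAX_WORDS
# Reads a narration script on stdin and prints one segment per line,
# breaking only at sentence ends (., !, ?) and keeping each segment at or
# under MAX_WORDS words where possible. A single sentence longer than
# MAX_WORDS is emitted as its own segment.
split_script() {
  awk -v max="$1" '
    function flush_sentence() {
      if (sent_n == 0) return
      if (seg_n > 0 && seg_n + sent_n > max) { print seg; seg = ""; seg_n = 0 }
      seg = (seg_n > 0 ? seg " " : "") sent
      seg_n += sent_n
      sent = ""; sent_n = 0
    }
    {
      for (i = 1; i <= NF; i++) {
        sent = (sent_n > 0 ? sent " " : "") $i
        sent_n++
        if ($i ~ /[.!?]$/) flush_sentence()
      }
    }
    END {
      flush_sentence()
      if (seg_n > 0) print seg
    }
  '
}

# Small budget of 8 words to keep the demonstration short
printf '%s\n' "Welcome to the tour. First we cover setup. Then we look at reports. Finally we review pricing." \
  | split_script 8
# prints:
# Welcome to the tour. First we cover setup.
# Then we look at reports.
# Finally we review pricing.
```

Each output line can then be used as the `voice_script` for one clip.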
P-Video-Avatar supports 10 languages with built-in TTS:
```shell
# Spanish
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Bienvenidos a nuestra demostración de producto.",
  "voice": "Kore (Female)",
  "voice_language": "Spanish"
}'

# Japanese
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "こんにちは、製品デモへようこそ。",
  "voice": "Leda (Female)",
  "voice_language": "Japanese"
}'
```
```shell
# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}'

# 2. Translate text (manually or with LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}'

# 4. Lipsync original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'
```
When providing your own audio, quality directly impacts lipsync accuracy.
| Parameter | Target | Why |
|-----------|--------|-----|
| Background noise | None/minimal | Noise confuses lipsync timing |
| Volume | Consistent throughout | Prevents sync drift |
| Sample rate | 44.1kHz or 48kHz | Standard quality |
| Format | MP3 128kbps+ or WAV | Compatible with all tools |
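
The sample-rate target in the table can be checked before uploading. The sketch below assumes a local audio file and that `ffprobe` (shipped with FFmpeg) is installed. `sample_rate_ok` and `check_audio` are hypothetical helper names. It covers only the sample-rate row, since noise and volume consistency need a listening pass.

```shell
# sample_rate_ok HZ
# Succeeds when HZ is one of the sample rates recommended in the table above.
sample_rate_ok() {
  case "$1" in
    44100|48000) return 0 ;;
    *) return 1 ;;
  esac
}

# check_audio FILE
# Reads the first audio stream's sample rate with ffprobe (assumed installed)
# and reports whether it matches the targets.
check_audio() {
  rate=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate \
    -of default=noprint_wrappers=1:nokey=1 "$1") || return 2
  if sample_rate_ok "$rate"; then
    echo "ok: $1 is ${rate} Hz"
  else
    echo "warning: $1 is ${rate} Hz; resample to 44100 or 48000" >&2
    return 1
  fi
}

# Usage with a local file:
# check_audio speech.mp3
```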
Female: Zephyr, Kore, Leda, Aoede, Callirrhoe, Autonoe, Despina, Erinome, Laomedeia, Achernar, Gacrux, Pulcherrima, Vindemiatrix, Sulafat
Male: Puck, Charon, Fenrir, Orus, Enceladus, Iapetus, Umbriel, Algenib, Algieba, Schedar, Achird, Zubenelgenubi, Sadachbia, Sadaltager, Alnilam, Rasalgethi
Languages: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi
```
┌─────────────────────────────────┐
│ Headroom (minimal) │
│ ┌───────────────────────────┐ │
│ │ │ │
│ │ ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│ │ /|\ │ │
│ │ | Head & shoulders │ │
│ │ / \ visible │ │
│ │ │ │
│ └───────────────────────────┘ │
│ Crop below chest │
└─────────────────────────────────┘
```
| Mistake | Problem | Fix |
|---------|---------|-----|
| Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region |
| Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal |
| Noisy audio | Lipsync drifts, looks unnatural | Use built-in TTS or record clean |
| Too-long clips | Quality degrades | Split into segments, stitch |
| Sunglasses/obstruction | Face features hidden | Clear face required |
| Inconsistent lighting | Uncanny when animated | Even, soft lighting |
```shell
# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# All avatar models
npx skills add inference-sh/skills@ai-avatar-video

# All video generation models
npx skills add inference-sh/skills@ai-video-generation

# Text-to-speech
npx skills add inference-sh/skills@text-to-speech

# Image generation (for portraits)
npx skills add inference-sh/skills@ai-image-generation
```
Browse all apps: belt app store