guides/video/talking-head-production/SKILL.md
Talking head video production with AI avatars, lipsync, and voiceover. Covers portrait requirements, audio quality, OmniHuman, PixVerse lipsync, Dia TTS. Use for: spokesperson videos, course content, social media, presentations, demos. Triggers: talking head, avatar video, lipsync, lip sync, ai spokesperson, virtual presenter, ai presenter, omnihuman, talking avatar, video presenter, ai talking head, presenter video, ai face video
npx skillsauth add inference-sh/agent-skills-registry talking-head-production
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Create talking head videos with AI avatars and lipsync via inference.sh CLI.
Requires the inference.sh CLI (belt). See the install instructions, then log in:
belt login
# Generate dialogue audio
belt app run falai/dia-tts --input '{
"prompt": "[S1] Welcome to our product tour. Today I will show you three features that will save you hours every week."
}'
# Create talking head video with OmniHuman
belt app run bytedance/omnihuman-1-5 --input '{
"image": "path/to/portrait.png",
"audio": "path/to/dialogue.mp3"
}'
The source portrait image is critical. Poor portraits = poor video output.
| Requirement | Why | Spec |
|-------------|-----|------|
| Center-framed | Avatar needs face in predictable position | Face centered in frame |
| Head and shoulders | Body visible for natural gestures | Crop below chest |
| Eyes to camera | Creates connection with viewer | Direct frontal gaze |
| Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning |
| Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions |
| High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |
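The resolution requirement can be checked before a portrait is submitted. The following is a minimal sketch, assuming the portrait is a PNG file; it reads the width and height from the IHDR header using only the standard library. The function names are illustrative and not part of any inference.sh tooling. It measures the whole image rather than the face region, so a passing result is necessary but not sufficient.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def portrait_resolution_status(width: int, height: int) -> str:
    """Classify image dimensions against the 512 minimum and 1024 ideal."""
    shortest = min(width, height)
    if shortest < 512:
        return "too small"
    if shortest < 1024:
        return "acceptable"
    return "ideal"
```

To use it, read the first 24 bytes of the file in binary mode, pass them to `png_size`, and pass the result to `portrait_resolution_status`.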
| Type | When to Use |
|------|-------------|
| Solid color | Professional, clean, easy to composite |
| Soft bokeh | Natural, lifestyle feel |
| Office/studio | Business context |
| Transparent (via bg removal) | Compositing into other scenes |
# Generate a professional portrait background
belt app run falai/flux-dev-lora --input '{
"prompt": "professional headshot photograph of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, high quality portrait photography"
}'
# Or remove background from existing portrait
belt app run <bg-removal-app> --input '{
"image": "path/to/portrait-with-background.png"
}'
Audio quality directly impacts lipsync accuracy. Clean audio = accurate lip movement.
| Parameter | Target | Why |
|-----------|--------|-----|
| Background noise | None/minimal | Noise confuses lipsync timing |
| Volume | Consistent throughout | Prevents sync drift |
| Sample rate | 44.1kHz or 48kHz | Standard quality |
| Format | MP3 128kbps+ or WAV | Compatible with all tools |
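For recorded audio, the sample-rate target and the clip length can be verified locally. This is a rough sketch, assuming the audio is an uncompressed WAV file (Python's standard `wave` module does not read MP3); `inspect_wav` is a hypothetical helper name, not part of the CLI.

```python
import wave

ACCEPTED_SAMPLE_RATES = (44_100, 48_000)  # the 44.1kHz / 48kHz targets


def inspect_wav(path: str) -> dict:
    """Report the sample rate and duration of the WAV file at `path`."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return {
        "sample_rate": rate,
        "duration_seconds": duration,
        "sample_rate_ok": rate in ACCEPTED_SAMPLE_RATES,
    }
```

The reported duration is also a quick way to spot clips that exceed the roughly 30-second limit noted for OmniHuman.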
# Simple narration
belt app run falai/dia-tts --input '{
"prompt": "[S1] Hi there! I am excited to share something with you today. We have been working on a feature that our users have been requesting for months... and it is finally here."
}'
# With emotion and pacing
belt app run falai/dia-tts --input '{
"prompt": "[S1] You know what is frustrating? Spending hours on tasks that should take minutes. (sighs) We have all been there. But what if I told you... there is a better way?"
}'
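An apostrophe in a script (for example "it's") would terminate the single-quoted `--input` argument in the shell. One way around this is to build the JSON and the quoting programmatically. The sketch below assumes only the `belt app run <app> --input <json>` form used in this guide; `belt_command` is a hypothetical helper that returns the command string and does not run it.

```python
import json
import shlex


def belt_command(app_id: str, payload: dict) -> str:
    """Return a shell-safe `belt app run` command line for the given input."""
    # json.dumps handles JSON escaping; shlex.quote handles shell escaping.
    input_json = json.dumps(payload, ensure_ascii=False)
    parts = ["belt", "app", "run", shlex.quote(app_id), "--input", shlex.quote(input_json)]
    return " ".join(parts)
```

For example, `belt_command("falai/dia-tts", {"prompt": "[S1] It's finally here."})` yields a line that can be pasted into a POSIX shell without hand-editing the quotes.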
| Model | App ID | Best For | Max Duration |
|-------|--------|----------|-------------|
| OmniHuman 1.5 | bytedance/omnihuman-1-5 | Multi-character, gestures, high quality | ~30s per clip |
| OmniHuman 1.0 | bytedance/omnihuman-1-0 | Single character, simpler | ~30s per clip |
| PixVerse Lipsync | falai/pixverse-lipsync | Quick lipsync on existing video | Short clips |
| Fabric | falai/fabric-1-0 | Cloth/fabric animation on portraits | Short clips |
# 1. Generate or prepare audio
belt app run falai/dia-tts --input '{
"prompt": "[S1] Your narration script here."
}'
# 2. Generate talking head
belt app run bytedance/omnihuman-1-5 --input '{
"image": "portrait.png",
"audio": "narration.mp3"
}'
# 1-2. Same as above
# 3. Add captions to the talking head video
belt app run infsh/caption-videos --input '{
"video": "talking-head.mp4",
"caption_file": "captions.srt"
}'
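The `captions.srt` file referenced above follows the SubRip layout: a running index, a `start --> end` time line, the caption text, and a blank line. A minimal sketch of generating one is shown here; the cue timings are placeholders that would have to come from the actual narration audio, and both function names are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, caption) cues."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_line = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_line}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

Writing the returned string to `captions.srt` with UTF-8 encoding gives a file in the shape the caption step expects.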
For content longer than 30 seconds, split into segments:
# Generate audio segments
belt app run falai/dia-tts --input '{"prompt": "[S1] Segment one script."}' --no-wait
belt app run falai/dia-tts --input '{"prompt": "[S1] Segment two script."}' --no-wait
belt app run falai/dia-tts --input '{"prompt": "[S1] Segment three script."}' --no-wait
# Generate talking head for each segment (same portrait for consistency)
belt app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment1.mp3"}' --no-wait
belt app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment2.mp3"}' --no-wait
belt app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment3.mp3"}' --no-wait
# Merge all segments
belt app run infsh/media-merger --input '{
"media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
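Deciding where to cut a long script can be automated. The sketch below splits at sentence boundaries so that each segment stays under an estimated 30 seconds. The speaking rate of 150 words per minute is an assumption made for illustration, not a figure from this guide or from Dia TTS, so the real audio length should still be checked. `split_script` is a hypothetical helper.

```python
import re


def split_script(script: str, max_seconds: float = 30.0,
                 words_per_minute: float = 150.0) -> list[str]:
    """Greedily group sentences into segments under the estimated time limit."""
    max_words = int(max_seconds * words_per_minute / 60)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    segments, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new segment when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        segments.append(" ".join(current))
    return segments
```

Each returned segment can be prefixed with `[S1] ` and used as one `prompt`. A single sentence longer than the word budget is kept whole, so it would need to be shortened by hand.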
OmniHuman 1.5 supports up to 2 characters:
# 1. Generate dialogue with two speakers
belt app run falai/dia-tts --input '{
"prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'
# 2. Create video with two characters
belt app run bytedance/omnihuman-1-5 --input '{
"image": "two-person-portrait.png",
"audio": "dialogue.mp3"
}'
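The two-speaker prompt above is a sequence of turns, each tagged `[S1]` or `[S2]`. When the dialogue is stored as structured data, it could be assembled along these lines; this is a small sketch with an illustrative function name, restricted to the two speaker tags this guide uses.

```python
def dia_dialogue_prompt(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_number, line) turns into a single tagged prompt string."""
    parts = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError("this guide only uses speakers 1 and 2")
        parts.append(f"[S{speaker}] {line.strip()}")
    return " ".join(parts)
```

The result is the string to place in the `prompt` field of the dialogue request.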
┌─────────────────────────────────┐
│        Headroom (minimal)       │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ●  ─ ─ Eyes at 1/3 ─ ─│─ │  ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│        Crop below chest         │
└─────────────────────────────────┘
| Mistake | Problem | Fix |
|---------|---------|-----|
| Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region |
| Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal |
| Noisy audio | Lipsync drifts, looks unnatural | Record clean or use TTS |
| Too-long clips | Quality degrades after 30s | Split into segments, stitch |
| Sunglasses/obstruction | Face features hidden | Clear face required |
| Inconsistent lighting | Uncanny when animated | Even, soft lighting |
| No captions | Loses silent/mobile viewers | Always add captions |
npx skills add inference-sh/skills@ai-avatar-video
npx skills add inference-sh/skills@ai-video-generation
npx skills add inference-sh/skills@text-to-speech
Browse all apps: belt app list