Install the belt CLI skill: npx skills add belt-sh/cli

Talking Head Production

Create talking head videos with AI avatars and lipsync via inference.sh CLI.

Quick Start

Requires inference.sh CLI (belt). Install instructions

belt login

# Recommended: P-Video-Avatar (built-in TTS, fastest, cheapest)
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product tour. Today I will show you three features that will save you hours every week.",
  "voice": "Zephyr (Female)"
}'

Portrait Requirements

The source portrait image is critical. Poor portraits = poor video output.

Must Have

| Requirement | Why | Spec | |------------|-----|------| | Center-framed | Avatar needs face in predictable position | Face centered in frame | | Head and shoulders | Body visible for natural gestures | Crop below chest | | Eyes to camera | Creates connection with viewer | Direct frontal gaze | | Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning | | Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions | | High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |

Generate a Portrait

# Generate a professional portrait with P-Image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, photorealistic",
  "aspect_ratio": "9:16"
}'

Background Options

| Type | When to Use | |------|-------------| | Solid color | Professional, clean, easy to composite | | Soft bokeh | Natural, lifestyle feel | | Office/studio | Business context | | Dynamic (P-Video-Avatar) | Use video_prompt to set background |

Model Selection

Start with P-Video-Avatar — it's 18x faster and 6x cheaper than alternatives, with built-in TTS.

| Model | App ID | Built-in TTS | Best For | |-------|--------|-------------|----------| | P-Video-Avatar | pruna/p-video-avatar | Yes (30 voices, 10 langs) | Best overall: speed, cost, quality | | OmniHuman 1.5 | bytedance/omnihuman-1-5 | No | Multi-character, gestures | | OmniHuman 1.0 | bytedance/omnihuman-1-0 | No | Single character | | Fabric 1.0 | falai/fabric-1-0 | Yes | Image talks with lipsync | | PixVerse Lipsync | falai/pixverse-lipsync | No | Realistic lipsync |

Cost & Speed Comparison

| Model | Speed (per sec of video) | Cost per second | |-------|-------------------------|----------------| | P-Video-Avatar | ~1.83s/s | $0.025 | | OmniHuman 1.5 | ~28s/s (15x slower) | $0.16 (6.4x more) | | Fabric 1.0 | ~34s/s (18x slower) | $0.14 (5.6x more) |

Production Workflows

Simple: Text Script -> Video (P-Video-Avatar)

No separate TTS step needed — P-Video-Avatar has built-in voices:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Hi there! I am excited to share something with you today.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'

With Style Control

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news for our community!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone, slightly faster pace",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'

Audio-Driven (Any Model)

Provide your own audio file:

# P-Video-Avatar with custom audio
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'

# OmniHuman with custom audio
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Full Workflow: Generate Portrait + Avatar

# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'

With Separate TTS (for non-TTS models)

# 1. Generate speech
belt app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Create talking head
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'

Multi-Character Conversation

OmniHuman 1.5 supports up to 2 characters:

# 1. Generate dialogue with two speakers
belt app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://two-person-portrait.png",
  "audio_url": "<audio-url>"
}'

Long-Form (Stitched Clips)

For content longer than ~60 seconds, split into segments:

# Generate clips with same portrait for consistency
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment one..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment two..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment three..."}' --no-wait

# Merge all segments
belt app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'

Multilingual Content

P-Video-Avatar supports 10 languages with built-in TTS:

# Spanish
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Bienvenidos a nuestra demostración de producto.",
  "voice": "Kore (Female)",
  "voice_language": "Spanish"
}'

# Japanese
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "こんにちは、製品デモへようこそ。",
  "voice": "Leda (Female)",
  "voice_language": "Japanese"
}'

Dub Existing Video

# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}'

# 2. Translate text (manually or with LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}'

# 4. Lipsync original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'

Audio Quality (for non-TTS workflows)

When providing your own audio, quality directly impacts lipsync accuracy.

| Parameter | Target | Why | |-----------|--------|-----| | Background noise | None/minimal | Noise confuses lipsync timing | | Volume | Consistent throughout | Prevents sync drift | | Sample rate | 44.1kHz or 48kHz | Standard quality | | Format | MP3 128kbps+ or WAV | Compatible with all tools |

Available Voices (P-Video-Avatar)

Female: Zephyr, Kore, Leda, Aoede, Callirrhoe, Autonoe, Despina, Erinome, Laomedeia, Achernar, Gacrux, Pulcherrima, Vindemiatrix, Sulafat

Male: Puck, Charon, Fenrir, Orus, Enceladus, Iapetus, Umbriel, Algenib, Algieba, Schedar, Achird, Zubenelgenubi, Sadachbia, Sadaltager, Alnilam, Rasalgethi

Languages: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi

Framing Guidelines

┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│       Crop below chest          │
└─────────────────────────────────┘

Common Mistakes

| Mistake | Problem | Fix | |---------|---------|-----| | Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region | | Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal | | Noisy audio | Lipsync drifts, looks unnatural | Use built-in TTS or record clean | | Too-long clips | Quality degrades | Split into segments, stitch | | Sunglasses/obstruction | Face features hidden | Clear face required | | Inconsistent lighting | Uncanny when animated | Even, soft lighting |

Related Skills

# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# All avatar models
npx skills add inference-sh/skills@ai-avatar-video

# All video generation models
npx skills add inference-sh/skills@ai-video-generation

# Text-to-speech
npx skills add inference-sh/skills@text-to-speech

# Image generation (for portraits)
npx skills add inference-sh/skills@ai-image-generation

Browse all apps: belt app store

Install the belt CLI skill: npx skills add belt-sh/cli

Talking Head Production

Create talking head videos with AI avatars and lipsync via inference.sh CLI.

Quick Start

Requires inference.sh CLI (belt). Install instructions

belt login

# Recommended: P-Video-Avatar (built-in TTS, fastest, cheapest)
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product tour. Today I will show you three features that will save you hours every week.",
  "voice": "Zephyr (Female)"
}'

Portrait Requirements

The source portrait image is critical. Poor portraits = poor video output.

Must Have

Generate a Portrait

# Generate a professional portrait with P-Image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, photorealistic",
  "aspect_ratio": "9:16"
}'

Background Options

Model Selection

Start with P-Video-Avatar — it's 18x faster and 6x cheaper than alternatives, with built-in TTS.

Cost & Speed Comparison

Production Workflows

Simple: Text Script -> Video (P-Video-Avatar)

No separate TTS step needed — P-Video-Avatar has built-in voices:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Hi there! I am excited to share something with you today.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'

With Style Control

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news for our community!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone, slightly faster pace",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'

Audio-Driven (Any Model)

Provide your own audio file:

# P-Video-Avatar with custom audio
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'

# OmniHuman with custom audio
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Full Workflow: Generate Portrait + Avatar

# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'

With Separate TTS (for non-TTS models)

# 1. Generate speech
belt app run falai/dia-tts --input '{
  "prompt": "[S1] Your narration script here."
}'

# 2. Create talking head
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'

Multi-Character Conversation

OmniHuman 1.5 supports up to 2 characters:

# 1. Generate dialogue with two speakers
belt app run falai/dia-tts --input '{
  "prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'

# 2. Create video with two characters
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://two-person-portrait.png",
  "audio_url": "<audio-url>"
}'

Long-Form (Stitched Clips)

For content longer than ~60 seconds, split into segments:

# Generate clips with same portrait for consistency
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment one..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment two..."}' --no-wait
belt app run pruna/p-video-avatar --input '{"image": "https://portrait.jpg", "voice_script": "Segment three..."}' --no-wait

# Merge all segments
belt app run infsh/media-merger --input '{
  "media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'

Multilingual Content

P-Video-Avatar supports 10 languages with built-in TTS:

# Spanish
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Bienvenidos a nuestra demostración de producto.",
  "voice": "Kore (Female)",
  "voice_language": "Spanish"
}'

# Japanese
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "こんにちは、製品デモへようこそ。",
  "voice": "Leda (Female)",
  "voice_language": "Japanese"
}'

Dub Existing Video

# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}'

# 2. Translate text (manually or with LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}'

# 4. Lipsync original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'

Audio Quality (for non-TTS workflows)

When providing your own audio, quality directly impacts lipsync accuracy.

Available Voices (P-Video-Avatar)

Female: Zephyr, Kore, Leda, Aoede, Callirrhoe, Autonoe, Despina, Erinome, Laomedeia, Achernar, Gacrux, Pulcherrima, Vindemiatrix, Sulafat

Male: Puck, Charon, Fenrir, Orus, Enceladus, Iapetus, Umbriel, Algenib, Algieba, Schedar, Achird, Zubenelgenubi, Sadachbia, Sadaltager, Alnilam, Rasalgethi

Languages: English (US), English (UK), Spanish, French, German, Italian, Portuguese (Brazil), Japanese, Korean, Hindi

Framing Guidelines

┌─────────────────────────────────┐
│          Headroom (minimal)     │
│  ┌───────────────────────────┐  │
│  │                           │  │
│  │     ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│  │    /|\                    │  │
│  │     |   Head & shoulders  │  │
│  │    / \  visible           │  │
│  │                           │  │
│  └───────────────────────────┘  │
│       Crop below chest          │
└─────────────────────────────────┘

Common Mistakes

Related Skills

# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# All avatar models
npx skills add inference-sh/skills@ai-avatar-video

# All video generation models
npx skills add inference-sh/skills@ai-video-generation

# Text-to-speech
npx skills add inference-sh/skills@text-to-speech

# Image generation (for portraits)
npx skills add inference-sh/skills@ai-image-generation

Browse all apps: belt app store

Adoption

inference-sh-8/talking-head-production

$ install --global

SKILL.md

Talking Head Production

Quick Start

Portrait Requirements

Must Have

Generate a Portrait

Background Options

Model Selection

Cost & Speed Comparison

Production Workflows

Simple: Text Script -> Video (P-Video-Avatar)

With Style Control

Audio-Driven (Any Model)

Full Workflow: Generate Portrait + Avatar

With Separate TTS (for non-TTS models)

Multi-Character Conversation

Long-Form (Stitched Clips)

Multilingual Content

Dub Existing Video

Audio Quality (for non-TTS workflows)

Available Voices (P-Video-Avatar)

Framing Guidelines

Common Mistakes

Related Skills

Related Skills

inference-sh-8/ai-podcast

inference-sh-8/widgets-ui

inference-sh-8/tools-ui

inference-sh-8/chat-ui

inference-sh-8/talking-head-production

$ install --global

SKILL.md

Talking Head Production

Quick Start

Portrait Requirements

Must Have

Generate a Portrait

Background Options

Model Selection

Cost & Speed Comparison

Production Workflows

Simple: Text Script -> Video (P-Video-Avatar)

With Style Control

Audio-Driven (Any Model)

Full Workflow: Generate Portrait + Avatar

With Separate TTS (for non-TTS models)

Multi-Character Conversation

Long-Form (Stitched Clips)

Multilingual Content

Dub Existing Video

Audio Quality (for non-TTS workflows)

Available Voices (P-Video-Avatar)

Framing Guidelines

Common Mistakes

Related Skills

Related Skills

inference-sh-8/ai-podcast

inference-sh-8/widgets-ui

inference-sh-8/tools-ui

inference-sh-8/chat-ui