inference-sh-3/ai-avatar-video

Name: ai-avatar-video
Author: inference-sh-3

tools/video/ai-avatar-video/SKILL.md

npx skillsauth add inference-sh-3/skills ai-avatar-video

Install the belt CLI skill: npx skills add belt-sh/cli

AI Avatar & Talking Head Videos

Create AI avatars and talking head videos via inference.sh CLI.

Quick Start

Requires the inference.sh CLI (belt); see the install instructions.

belt login

# Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS)
belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Hello, welcome to our product demo!",
  "voice": "Zephyr (Female)"
}'

Available Models

Start with P-Video-Avatar: it is up to 18x faster and roughly 6x cheaper than the alternatives compared below, with built-in TTS, dynamic backgrounds, and 1080p support.

| Model | App ID | Best For | Built-in TTS |
|-------|--------|----------|-------------|
| P-Video-Avatar | pruna/p-video-avatar | Best overall: speed, cost, quality, control | Yes (30 voices, 10 languages) |
| OmniHuman 1.5 | bytedance/omnihuman-1-5 | Multi-character, audio-driven | No |
| Fabric 1.0 | falai/fabric-1-0 | Image talks with lipsync | Yes |
| PixVerse Lipsync | falai/pixverse-lipsync | Highly realistic lipsync | No |

Cost & Speed Comparison

| Model | Generation time (per second of video) | Cost (per second of video) |
|-------|-------------------------|----------------|
| P-Video-Avatar | ~1.83s/s | $0.025 |
| OmniHuman 1.5 | ~28s/s (15x slower) | $0.16 (6.4x more) |
| Fabric 1.0 | ~34s/s (18x slower) | $0.14 (5.6x more) |
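
To make the per-second figures concrete, here is a rough sketch that scales them to a clip of a given length. It assumes time and cost grow linearly with clip length, which the table implies but does not guarantee. `estimate_clip` is a hypothetical helper, not part of the belt CLI.

```shell
# estimate_clip SECONDS TIME_PER_SEC COST_PER_SEC
# Hypothetical helper: prints estimated generation time and cost for a clip,
# assuming both scale linearly with the length of the output video.
estimate_clip() {
  LC_ALL=C awk -v s="$1" -v t="$2" -v c="$3" \
    'BEGIN { printf "%.1f s, $%.2f\n", s * t, s * c }'
}

# 30-second clip, using the figures from the comparison table
estimate_clip 30 1.83 0.025   # P-Video-Avatar
estimate_clip 30 28 0.16      # OmniHuman 1.5
estimate_clip 30 34 0.14      # Fabric 1.0
```

For a 30-second clip this works out to about 55 seconds and $0.75 with P-Video-Avatar, against 840 seconds and $4.80 with OmniHuman 1.5.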

Examples

P-Video-Avatar (Recommended)

Generate avatar from portrait + text script with built-in TTS:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "Welcome to our product walkthrough. Today I will show you three key features.",
  "voice": "Puck (Male)",
  "voice_language": "English (US)",
  "resolution": "720p"
}'

With custom style control:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "voice_script": "This is exciting news!",
  "voice": "Aoede (Female)",
  "voice_prompt": "Enthusiastic and energetic tone",
  "video_prompt": "The person is presenting on stage with dramatic lighting",
  "resolution": "1080p"
}'

With audio file instead of TTS:

belt app run pruna/p-video-avatar --input '{
  "image": "https://portrait.jpg",
  "audio": "https://speech.mp3"
}'

Full Workflow: Generate Portrait + Avatar

Use Pruna P-Image to generate the portrait, then create the avatar:

# 1. Generate a portrait image
belt app run pruna/p-image --input '{
  "prompt": "professional headshot portrait of a young woman, neutral background, looking at camera, studio lighting, photorealistic",
  "aspect_ratio": "9:16"
}'

# 2. Create avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Hi there! Let me walk you through our latest features.",
  "voice": "Zephyr (Female)"
}'

OmniHuman 1.5 (Multi-Character)

belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Supports specifying which character to drive in multi-person images.

Fabric 1.0 (Image Talks)

belt app run falai/fabric-1-0 --input '{
  "image_url": "https://face.jpg",
  "audio_url": "https://audio.mp3"
}'

PixVerse Lipsync

belt app run falai/pixverse-lipsync --input '{
  "image_url": "https://portrait.jpg",
  "audio_url": "https://speech.mp3"
}'

Full Workflow: TTS + Avatar (Non-TTS Models)

For models without built-in TTS (OmniHuman, PixVerse), generate speech first:

# 1. Generate speech — Inworld TTS-2 for expressive character voices
belt app run inworld/text-to-speech-2 --input '{
  "text": "[friendly] Welcome to our product demo! [excited] Let me show you three features that will change how you work.",
  "voice_id": "Sarah",
  "delivery_mode": "CREATIVE"
}' > speech.json

# 2. Create avatar video with the speech
belt app run bytedance/omnihuman-1-5 --input '{
  "image_url": "https://presenter-photo.jpg",
  "audio_url": "<audio-url-from-step-1>"
}'

Tip: For most use cases, P-Video-Avatar with built-in TTS is simpler — no separate audio step needed. Use this workflow only when you specifically need OmniHuman (multi-character) or PixVerse (realistic lipsync).
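
Step 1 above saves the TTS result to speech.json, and step 2 needs the audio URL from that file. The output schema is not shown here, so the sketch below does not assume a field name: it takes the first https URL it finds in the file. `first_url` is a hypothetical helper, and the sample JSON is a stand-in, not real belt output.

```shell
# first_url is a hypothetical helper, not part of the belt CLI.
# It prints the first https:// URL found in a file, whatever key holds it.
first_url() {
  grep -oE 'https://[^"[:space:]]+' "$1" | head -n 1
}

# Stand-in for speech.json; the real output schema may differ.
sample=$(mktemp)
printf '{"audio": "https://example.com/speech.mp3"}\n' > "$sample"
first_url "$sample"
rm -f "$sample"

# Intended use with the workflow above (not run here):
#   audio_url=$(first_url speech.json)
#   belt app run bytedance/omnihuman-1-5 --input "{
#     \"image_url\": \"https://presenter-photo.jpg\",
#     \"audio_url\": \"$audio_url\"
#   }"
```

Inspect speech.json by eye before relying on this: if the file contains several URLs, the first one may not be the audio.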

Full Workflow: Dub Video in Another Language

# 1. Transcribe original video
belt app run infsh/fast-whisper-large-v3 --input '{"audio_url": "https://video.mp4"}' > transcript.json

# 2. Translate text (manually or with an LLM)

# 3. Generate speech in new language
belt app run infsh/kokoro-tts --input '{"text": "<translated-text>"}' > new_speech.json

# 4. Lipsync the original video with new audio
belt app run infsh/latentsync-1-6 --input '{
  "video_url": "https://original-video.mp4",
  "audio_url": "<new-audio-url>"
}'

Avatar UGC Generation

Create UGC-style content with P-Video-Avatar — built-in TTS, no separate audio step needed:

# 1. Generate a relatable UGC-style portrait
belt app run pruna/p-image --input '{
  "prompt": "casual selfie-style photo of a young woman in a cozy room, natural lighting, looking at camera, warm smile, authentic feel",
  "aspect_ratio": "9:16"
}'

# 2. Create UGC avatar video with built-in TTS
belt app run pruna/p-video-avatar --input '{
  "image": "<image-url-from-step-1>",
  "voice_script": "Okay so I just tried this product and honestly? It is a game changer. I was not expecting to love it this much but here we are!",
  "voice": "Zephyr (Female)",
  "voice_prompt": "Excited, casual, authentic tone like talking to a friend",
  "video_prompt": "The person is talking casually to camera in their room, natural gestures",
  "resolution": "1080p"
}'

Why P-Video-Avatar for UGC

All-in-one — built-in TTS means no separate audio generation step
30 voices, 10 languages — match your target audience
Voice + video prompts — control tone, emotion, body language, and background independently
18x faster, 6x cheaper — produce UGC at scale vs. Fabric/OmniHuman/HeyGen
1080p support — platform-ready vertical video from a single portrait image

Batch UGC: Same Product, Multiple Presenters

# Generate 3 different presenters
for voice in "Zephyr (Female)" "Puck (Male)" "Aoede (Female)"; do
  belt app run pruna/p-video-avatar --input "{
    \"image\": \"https://portrait.jpg\",
    \"voice_script\": \"This changed my morning routine completely. Five minutes and I am done.\",
    \"voice\": \"$voice\",
    \"voice_prompt\": \"Casual, authentic, like a real testimonial\",
    \"video_prompt\": \"Person talking to camera in a bright kitchen\",
    \"resolution\": \"1080p\"
  }"
done
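
The loop above escapes every quote by hand, which is easy to get wrong once a script contains quotes of its own. One possible alternative is to build the payload in a small function. This is a sketch only: `json_escape` and `avatar_input` are hypothetical helpers, the field names are copied from the examples above, and the escaping handles double quotes and backslashes but not newlines or control characters.

```shell
# Hypothetical helper: escape backslashes first, then double quotes,
# so a value can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: avatar_input IMAGE_URL SCRIPT VOICE
# Prints a JSON payload using field names from the examples above.
avatar_input() {
  printf '{"image": "%s", "voice_script": "%s", "voice": "%s", "resolution": "1080p"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# Print the payload for each presenter voice (dry run, nothing is submitted).
for voice in "Zephyr (Female)" "Puck (Male)" "Aoede (Female)"; do
  avatar_input "https://portrait.jpg" 'She said "five minutes" and I am done.' "$voice"
done

# To submit instead of printing (not run here):
#   belt app run pruna/p-video-avatar --input "$(avatar_input "$image" "$script" "$voice")"
```

Printing the payloads first makes it easy to check them before spending credits on a batch.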

Use Cases

UGC & Marketing: Product demos, UGC-style ads with AI presenters
Education: Course videos, explainers
Localization: Dub content across 10 languages from one image
Social Media: Consistent virtual influencer content
Corporate: Training videos, announcements
Gaming: Character avatars, NPC dialogue

Tips

Use high-quality portrait photos (front-facing, good lighting)
Audio should be clear with minimal background noise
P-Video-Avatar supports built-in TTS — no need for a separate speech generation step
P-Video-Avatar output aspect ratio matches the input image
Generate portraits with pruna/p-image using 9:16 aspect ratio for vertical videos
OmniHuman 1.5 supports multiple people in one image
LatentSync is best for syncing existing videos to new audio

Related Skills

# Dedicated P-Video-Avatar skill
npx skills add inference-sh/skills@p-video-avatar

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli

# Text-to-speech (generate audio for non-TTS avatar models)
npx skills add inference-sh/skills@text-to-speech

# Speech-to-text (transcribe for dubbing)
npx skills add inference-sh/skills@speech-to-text

# Video generation
npx skills add inference-sh/skills@ai-video-generation

# Image generation (create avatar images)
npx skills add inference-sh/skills@ai-image-generation

Browse all video apps: belt app store --category video

Documentation

Running Apps - How to run apps via CLI
Content Pipeline Example - Building media workflows
Streaming Results - Real-time progress updates

Create AI avatar and talking head videos via inference.sh CLI. Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS). Also: OmniHuman, Fabric, PixVerse. Audio: Inworld TTS-2 (100+ languages, emotion steering for characters), ElevenLabs, Kokoro. Capabilities: audio-driven avatars, text-to-avatar, lipsync videos, talking head generation, virtual presenters, UGC content. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos, UGC ads, gaming avatars, NPC dialogue. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human, ugc, ugc video, ugc ad, avatar ugc

450 stars

tools

Updated May 19, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add inference-sh-3/skills ai-avatar-video

SKILL.md

name: ai-avatar-video
description: Create AI avatar and talking head videos via inference.sh CLI. Recommended: P-Video-Avatar (fastest, cheapest, built-in TTS). Also: OmniHuman, Fabric, PixVerse. Audio: Inworld TTS-2 (100+ languages, emotion steering for characters), ElevenLabs, Kokoro. Capabilities: audio-driven avatars, text-to-avatar, lipsync videos, talking head generation, virtual presenters, UGC content. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos, UGC ads, gaming avatars, NPC dialogue. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human, ugc, ugc video, ugc ad, avatar ugc
allowed-tools: Bash(belt *)

Install manually

# Clone the repo
git clone https://github.com/inference-sh-3/skills.git

# Copy into Claude Code skills folder (global)
cp -r skills/tools/video/ai-avatar-video ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

inference-sh-3/skills


Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT