ai-podcast/SKILL.md
Generate multi-person talking head podcast videos from scratch using AI — character creation, TTS, avatar animation, and video stitching. Use when the user wants to create a podcast, talking head video, or multi-speaker conversation video.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

```shell
npx skillsauth add inference-sh-8/skills ai-podcast
```
Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.
Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.
Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)
Choose the right tool per character type:
| Character Type | Tool | Notes |
|---------------|------|-------|
| Real human (new) | pruna/p-image | 16:9, prompt_upsampling: true. Quick, no training needed, but identity won't be consistent across multiple generations. |
| Real human (consistent ID) | phota/generate with [[profile_id]] | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | google/gemini-3-pro-image-preview | Pass logo + character sheet as reference images |
| Illustrated / stylized | google/gemini-3-pro-image-preview | Pass style reference as input image |
Training a Phota identity (optional but recommended for humans):
If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:
```shell
infsh app run phota/train --input '{
  "images": ["url1.jpg", "url2.jpg", ...],
  "wait": true
}' --save profile.json
```
With wait: true, the run returns a profile_id that you then use in phota/generate as [[profile_id]] in prompts.

If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), pruna/p-image is simpler and cheaper.
Character sheets first, podcast frames second:
For branded characters (logo on clothing):
Use phota/edit with the logo as a second reference image to add the logo.

Generate at least 2 angles per character for visual variety:

| Angle | When to use |
|-------|-------------|
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |
For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).
Identity consistency rules:
- Use phota/generate or phota/edit for new angles; Gemini does not preserve facial identity and will produce a different person.
- Without a trained profile, generate all angles in a single batch with pruna/p-image, or consider training a Phota profile if you need many shots.

Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.
Before proceeding, visually inspect all frames for issues such as a double mic, a wrong logo, or empty chairs in frame.
Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.
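Rules for natural conversation: write the script as a conversation, not as ad copy read in turns. Include reactions, interruptions, varied turn lengths, and genuine questions, and give the host a personality.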
Duration guide:

| Target | Words |
|--------|-------|
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
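The guide above works out to roughly 2.5 words per second of finished audio. A minimal sketch of that arithmetic, assuming the rate implied by the table also holds for other durations (`words_for_seconds` is a hypothetical helper, not part of any CLI):

```shell
# Hypothetical helper: approximate word budget for a target duration.
# Assumes ~2.5 words per second, the rate implied by the duration guide.
words_for_seconds() {
  seconds=$1
  # Integer arithmetic for seconds * 2.5, rounded half up.
  echo $(( (seconds * 5 + 1) / 2 ))
}
```

For example, `words_for_seconds 30` prints 75, matching the table. Actual pacing also depends on the voice and on speaking_rate, so treat the result as a starting budget.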
Use inworld/text-to-speech-2 for each turn.
```shell
infsh app run inworld/text-to-speech-2 --input '{
  "text": "...",
  "voice_id": "...",
  "speaking_rate": 1.05,
  "audio_encoding": "MP3"
}' --save output.json
```
Voice selection:
Use inworld/text-to-speech-2:voices to list all available voices.

Speaking rate: default to 1.05. Never set it below 1.0; slower rates sound artificial and disengaging.
All TTS turns can run in parallel (cheap, fast ~2-8s each).
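Because the turns are independent, one way to fan them out is with shell background jobs. A minimal sketch, assuming each turn's input JSON has already been written to its own file and that `--input` accepts a file path here as it does in the merger step (the file names and the `run_tts_parallel` helper are hypothetical):

```shell
# Hypothetical helper: run one TTS job per input file, all in parallel.
# Usage: run_tts_parallel turn_1.json turn_2.json ...
run_tts_parallel() {
  pids=""
  for input in "$@"; do
    # Each job saves its result next to its input, e.g. turn_1.out.json.
    infsh app run inworld/text-to-speech-2 \
      --input "$input" --save "${input%.json}.out.json" &
    pids="$pids $!"
  done

  # Wait on each job individually so a failed turn is not silently ignored.
  status=0
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}
```

A non-zero return means at least one turn failed and should be re-run before moving on to avatar generation.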
Use pruna/p-video-avatar for each turn.
```shell
infsh app run pruna/p-video-avatar --input '{
  "image": "<character_frame_url>",
  "audio": "<tts_audio_url>",
  "resolution": "720p",
  "video_prompt": "..."
}' --save output.json
```
Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.
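The sequential rule can be enforced with a plain loop that finishes one clip before starting the next. A minimal sketch, assuming one input JSON file per turn (the file names and the `run_avatar_sequential` helper are hypothetical):

```shell
# Hypothetical helper: render avatar clips one at a time, in turn order.
# Usage: run_avatar_sequential clip_1.json clip_2.json ...
run_avatar_sequential() {
  for input in "$@"; do
    # No trailing "&": each run must finish before the next one starts.
    if ! infsh app run pruna/p-video-avatar \
      --input "$input" --save "${input%.json}.out.json"; then
      echo "avatar run failed for $input" >&2
      return 1
    fi
  done
}
```

Stopping at the first failure keeps a broken clip from going unnoticed until the merge step.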
Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:
T1: Speaker A — front
T2: Speaker B — front
T3: Speaker C — front (or close-up)
T4: Speaker A — close-up
T5: Speaker B — close-up
T6: Speaker A — front
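One rule that reproduces the example plan is to alternate per speaker: each speaker opens on the front shot and switches angle on every later turn. A minimal sketch of that rule (the `assign_angles` helper is hypothetical, and the per-speaker reading is only one possible interpretation of "alternate"):

```shell
# Hypothetical helper: print an angle plan for a sequence of speakers.
# Assumes speaker names contain no spaces.
# Usage: assign_angles A B C A B A
assign_angles() {
  seen=""
  turn=1
  for speaker in "$@"; do
    # Count how many earlier turns this speaker already had.
    count=0
    for s in $seen; do
      if [ "$s" = "$speaker" ]; then count=$((count + 1)); fi
    done
    # Even count (0, 2, ...) gets the front shot, odd count gets the close-up.
    if [ $((count % 2)) -eq 0 ]; then angle="front"; else angle="close-up"; fi
    echo "T$turn: Speaker $speaker - $angle"
    seen="$seen $speaker"
    turn=$((turn + 1))
  done
}
```

Calling `assign_angles A B C A B A` yields the same six assignments as the plan above.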
Use infsh/media-merger to stitch all clips into the final video.
Save the input JSON as merger_input.json:

```json
{
  "media_files": [
    {"file": "<clip1_url>"},
    {"file": "<clip2_url>"},
    ...
  ],
  "fps": 24,
  "output_format": "mp4"
}
```

Then run the merger:

```shell
infsh app run infsh/media-merger --input merger_input.json --save final.json
```
Merger is free and takes 2-6 minutes depending on total duration.
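Writing the merger input by hand gets error-prone as the clip count grows. A minimal sketch that prints the same JSON shape from a list of clip URLs, in turn order (the `build_merger_input` helper is hypothetical, and it assumes the URLs contain no double quotes or backslashes since no JSON escaping is done):

```shell
# Hypothetical helper: print merger input JSON for the given clip URLs.
# Assumes URLs contain no double quotes or backslashes (no JSON escaping).
# Usage: build_merger_input <clip1_url> <clip2_url> ... > merger_input.json
build_merger_input() {
  printf '{\n  "media_files": [\n'
  first=1
  for url in "$@"; do
    # Separate entries with a comma, but emit none before the first one.
    if [ "$first" -eq 1 ]; then first=0; else printf ',\n'; fi
    printf '    {"file": "%s"}' "$url"
  done
  printf '\n  ],\n  "fps": 24,\n  "output_format": "mp4"\n}\n'
}
```

The clips appear in the output in the order the URLs are passed.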
Gemini does not preserve human facial identity — generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained profile_id, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
NEVER run p-video-avatar clips in parallel — they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
NEVER set speaking_rate below 1.0 — it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that 0.9 rate "felt weird and disengaging."
ALWAYS QA frames before generating video — video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
ALWAYS use tight framing for individual speaker shots — wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
ALWAYS pass the logo as a reference image when generating branded characters — describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
ALWAYS get voice approval before full production — generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
Script should read like a conversation, not an ad — people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.
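The voice-approval rule above can be prepared with a small loop that writes one TTS input file per candidate voice, all reading the same sample line. A minimal sketch, reusing the input fields from the TTS step (the `write_voice_samples` helper, the file naming, and any voice IDs passed in are hypothetical; it assumes the text and IDs contain no double quotes, backslashes, or slashes):

```shell
# Hypothetical helper: write one TTS input file per candidate voice,
# all reading the same sample line so the voices are directly comparable.
# Assumes text and voice IDs contain no double quotes, backslashes, or slashes.
# Usage: write_voice_samples <out_dir> "<sample line>" <voice_id>...
write_voice_samples() {
  out_dir=$1
  text=$2
  shift 2
  mkdir -p "$out_dir"
  for voice in "$@"; do
    printf '{\n  "text": "%s",\n  "voice_id": "%s",\n  "speaking_rate": 1.05,\n  "audio_encoding": "MP3"\n}\n' \
      "$text" "$voice" > "$out_dir/sample_$voice.json"
  done
}
```

Each generated file can then be passed to inworld/text-to-speech-2 through `--input`, and the resulting audio shared with the user for a pick.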
| App | Purpose |
|-----|---------|
| pruna/p-image | Generate portraits from text |
| phota/train | Train identity profile from 30-50 face images |
| phota/generate | Generate images with trained identity via [[profile_id]] |
| phota/edit | Edit images preserving identity of known subjects |
| google/gemini-3-pro-image-preview | Image gen/edit, mascots, style transfer |
| inworld/text-to-speech-2 | Text to speech, 100+ languages, voice steering |
| pruna/p-video-avatar | Portrait + audio → talking head video |
| infsh/media-merger | Concatenate video clips into one video |
Use belt task cost <task-id> to check the cost of any individual task.