ai-podcast/SKILL.md
Generate multi-person talking head podcast videos from scratch using AI — character creation, TTS, avatar animation, and video stitching. Use when the user wants to create a podcast, talking head video, or multi-speaker conversation video.
npx skillsauth add inference-sh-2/skills ai-podcast
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.
Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.
Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)
Choose the right tool per character type:
| Character Type | Tool | Notes |
|---------------|------|-------|
| Real human (new) | pruna/p-image | 16:9, prompt_upsampling: true. Quick, no training needed, but identity won't be consistent across multiple generations. |
| Real human (consistent ID) | phota/generate with [[profile_id]] | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | google/gemini-3-pro-image-preview | Pass logo + character sheet as reference images |
| Illustrated / stylized | google/gemini-3-pro-image-preview | Pass style reference as input image |
Training a Phota identity (optional but recommended for humans):
If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:
infsh app run phota/train --input '{
"images": ["url1.jpg", "url2.jpg", ...],
"wait": true
}' --save profile.json
The run above sets wait: true. Training produces a profile_id, which you then use in phota/generate as [[profile_id]] in prompts.

If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), pruna/p-image is simpler and cheaper.
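As a minimal sketch, the `--input` payload for phota/train could be assembled and sanity-checked before the run. The `images` and `wait` fields come from the command above, and the 30-50 range comes from the apps table in this document. The function name `build_train_input` and the example URLs are hypothetical.

```python
import json

# Image-count range taken from this document's apps table ("30-50 face images").
MIN_IMAGES, MAX_IMAGES = 30, 50


def build_train_input(image_urls):
    """Return the JSON string to pass to `infsh app run phota/train --input`."""
    count = len(image_urls)
    if not MIN_IMAGES <= count <= MAX_IMAGES:
        raise ValueError(
            f"expected {MIN_IMAGES}-{MAX_IMAGES} face images, got {count}"
        )
    # Field names mirror the example command; nothing else is assumed.
    return json.dumps({"images": list(image_urls), "wait": True})
```

Failing early on the image count is a design choice for this sketch, not documented behavior of phota/train.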
Character sheets first, podcast frames second:
For branded characters (logo on clothing):
Use phota/edit with the logo as a second reference image to add the logo.

Generate at least 2 angles per character for visual variety:

| Angle | When to use |
|-------|-------------|
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |
For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).
Identity consistency rules:
- Use phota/generate or phota/edit for new angles. Gemini does not preserve facial identity and will produce a different person.
- Without a trained profile, generate all angles in a single batch with pruna/p-image, or consider training a Phota profile if you need many shots.

Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.
Before proceeding, visually inspect all frames for:
Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.
Rules for natural conversation:
Duration guide:

| Target | Words |
|--------|-------|
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
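The duration guide works out to roughly 150 words per minute (2.5 words per second). That rate is inferred from the table rather than stated by the source, and the helper names below are hypothetical. A small sketch for budgeting a script under that assumption:

```python
import math

# Assumption: the duration guide implies about 150 words per minute.
WORDS_PER_SECOND = 150 / 60


def target_words(seconds):
    """Approximate word budget for a clip of the given length."""
    return math.ceil(seconds * WORDS_PER_SECOND)


def estimated_seconds(script_text):
    """Rough spoken duration, counting whitespace-separated words."""
    return len(script_text.split()) / WORDS_PER_SECOND
```

For example, `target_words(15)` gives 38 and `target_words(60)` gives 150, matching the table. Actual duration will vary with the chosen voice and speaking rate.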
Use inworld/text-to-speech-2 for each turn.
infsh app run inworld/text-to-speech-2 --input '{
"text": "...",
"voice_id": "...",
"speaking_rate": 1.05,
"audio_encoding": "MP3"
}' --save output.json
Voice selection:
Use inworld/text-to-speech-2:voices to list all available voices.

Speaking rate: keep it at 1.0 or above; 1.05 is the default used here.
All TTS turns can run in parallel (cheap, fast ~2-8s each).
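Because the TTS turns are independent, they could be fanned out with a thread pool. The sketch below only assembles the `infsh app run` argument lists shown above and hands them to a runner. The function names, the output file naming, and the `max_workers` value are assumptions for illustration.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tts_command(text, voice_id, save_path, speaking_rate=1.05):
    """Build the infsh argument list for one turn (fields as in the example)."""
    if speaking_rate < 1.0:
        # This document's rule: never set speaking_rate below 1.0.
        raise ValueError("speaking_rate must be 1.0 or higher")
    payload = {
        "text": text,
        "voice_id": voice_id,
        "speaking_rate": speaking_rate,
        "audio_encoding": "MP3",
    }
    return ["infsh", "app", "run", "inworld/text-to-speech-2",
            "--input", json.dumps(payload), "--save", save_path]


def run_tts_turns(turns, run=subprocess.run, max_workers=4):
    """Run all turns concurrently; `turns` is a list of (text, voice_id)."""
    commands = [
        tts_command(text, voice_id, f"tts_turn_{i:02d}.json")
        for i, (text, voice_id) in enumerate(turns, start=1)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the turns.
        return list(pool.map(run, commands))
```

The `run` parameter defaults to `subprocess.run` and can be swapped for a stub when dry-running the script. This sketch does not inspect exit codes or the saved output files.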
Use pruna/p-video-avatar for each turn.
infsh app run pruna/p-video-avatar --input '{
"image": "<character_frame_url>",
"audio": "<tts_audio_url>",
"resolution": "720p",
"video_prompt": "..."
}' --save output.json
Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.
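In contrast to TTS, avatar clips should be rendered strictly one at a time. A plain loop is enough, since `subprocess.run` blocks until each command exits. The payload fields mirror the example above; the function names and output file naming are hypothetical.

```python
import json
import subprocess


def avatar_command(image_url, audio_url, video_prompt, save_path):
    """Build the infsh argument list for one avatar clip."""
    payload = {
        "image": image_url,
        "audio": audio_url,
        "resolution": "720p",
        "video_prompt": video_prompt,
    }
    return ["infsh", "app", "run", "pruna/p-video-avatar",
            "--input", json.dumps(payload), "--save", save_path]


def run_avatar_clips(clips, run=subprocess.run):
    """Render clips in order; `clips` holds (image, audio, prompt) tuples."""
    results = []
    for i, (image_url, audio_url, video_prompt) in enumerate(clips, start=1):
        command = avatar_command(
            image_url, audio_url, video_prompt, f"clip_{i:02d}.json"
        )
        # The next clip starts only after this call has returned.
        results.append(run(command))
    return results
```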
Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:
T1: Speaker A — front
T2: Speaker B — front
T3: Speaker C — front (or close-up)
T4: Speaker A — close-up
T5: Speaker B — close-up
T6: Speaker A — front
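One simple rule that reproduces the example plan is to alternate angles per speaker, starting each speaker on a front shot. This is an interpretation of the example, not a rule stated by the source, and `assign_angles` is a hypothetical helper.

```python
from collections import defaultdict

ANGLES = ("front", "close-up")


def assign_angles(speaker_order):
    """Return (speaker, angle) pairs, alternating angles for each speaker."""
    appearances = defaultdict(int)
    plan = []
    for speaker in speaker_order:
        # The first appearance is front, the second close-up, then repeat.
        angle = ANGLES[appearances[speaker] % len(ANGLES)]
        plan.append((speaker, angle))
        appearances[speaker] += 1
    return plan
```

Calling `assign_angles(["A", "B", "C", "A", "B", "A"])` yields front shots for T1-T3, close-ups for T4 and T5, and a front shot for T6.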
Use infsh/media-merger to stitch all clips into the final video.
# Build input JSON
{
"media_files": [
{"file": "<clip1_url>"},
{"file": "<clip2_url>"},
...
],
"fps": 24,
"output_format": "mp4"
}
infsh app run infsh/media-merger --input merger_input.json --save final.json
Merger is free and takes 2-6 minutes depending on total duration.
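The merger input file can be generated from the ordered list of clip URLs rather than written by hand. This sketch uses only the fields shown in the JSON above; the helper names and example URLs are hypothetical.

```python
import json


def build_merger_input(clip_urls, fps=24, output_format="mp4"):
    """Return the merger payload, keeping clips in playback order."""
    return {
        "media_files": [{"file": url} for url in clip_urls],
        "fps": fps,
        "output_format": output_format,
    }


def write_merger_input(clip_urls, path="merger_input.json"):
    """Write the payload to disk for `infsh app run ... --input <path>`."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_merger_input(clip_urls), handle, indent=2)
    return path
```

After `write_merger_input(urls)`, the `infsh app run infsh/media-merger --input merger_input.json --save final.json` command above can be run unchanged.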
Gemini does not preserve human facial identity — generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained profile_id, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
NEVER run p-video-avatar clips in parallel — they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
NEVER set speaking_rate below 1.0 — it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that 0.9 rate "felt weird and disengaging."
ALWAYS QA frames before generating video — video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
ALWAYS use tight framing for individual speaker shots — wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
ALWAYS pass the logo as a reference image when generating branded characters — describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
ALWAYS get voice approval before full production — generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
Script should read like a conversation, not an ad — people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.
| App | Purpose |
|-----|---------|
| pruna/p-image | Generate portraits from text |
| phota/train | Train identity profile from 30-50 face images |
| phota/generate | Generate images with trained identity via [[profile_id]] |
| phota/edit | Edit images preserving identity of known subjects |
| google/gemini-3-pro-image-preview | Image gen/edit, mascots, style transfer |
| inworld/text-to-speech-2 | Text to speech, 100+ languages, voice steering |
| pruna/p-video-avatar | Portrait + audio → talking head video |
| infsh/media-merger | Concatenate video clips into one video |
Use belt task cost <task-id> to check the cost of any individual task.