inferencesh/ai-podcast

Name: ai-podcast
Author: inferencesh

ai-podcast/SKILL.md

npx skillsauth add inferencesh/skills ai-podcast

AI Podcast Generator

Create multi-person talking head podcast videos using the inference.sh pipeline: portrait generation → TTS audio → avatar video → merge. Supports real humans (via Phota), 3D mascots, illustrated characters, and mixed casts.

Use when the user wants to create a podcast, talking head video, demo reel, promotional conversation, or any multi-speaker video content.

Pipeline Overview

Characters (images) → TTS (audio per turn) → Avatar (video per turn) → Merge (final video)
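One way to keep track of the four stages is a per-turn record that fills in as each stage completes. This is only a minimal sketch: the `Turn` class, its field names, and `pending_stage` are hypothetical bookkeeping, not part of the inference.sh API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    """One line of dialogue and the artifact each stage produces for it."""
    speaker: str
    text: str
    frame_url: Optional[str] = None  # stage 1: character image
    audio_url: Optional[str] = None  # stage 2: TTS audio
    clip_url: Optional[str] = None   # stage 3: avatar video


def pending_stage(turn: Turn) -> str:
    """Return the first stage that still has to run for this turn."""
    if turn.frame_url is None:
        return "characters"
    if turn.audio_url is None:
        return "tts"
    if turn.clip_url is None:
        return "avatar"
    return "merge"
```

A turn is ready for the merge step only once all three URLs are set.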

Process

Step 1: Character Creation

Choose the right tool per character type:

| Character Type | Tool | Notes |
|---------------|------|-------|
| Real human (new) | pruna/p-image | 16:9, prompt_upsampling: true. Quick, no training needed, but identity won't be consistent across multiple generations. |
| Real human (consistent ID) | phota/generate with [[profile_id]] | Consistent identity across all shots. Requires a trained Phota profile first (see below). |
| Brand mascot / logo character | google/gemini-3-pro-image-preview | Pass logo + character sheet as reference images |
| Illustrated / stylized | google/gemini-3-pro-image-preview | Pass style reference as input image |
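The table above can be read as a simple lookup. The sketch below assumes three hypothetical character-type labels (`real_human`, `mascot`, `illustrated`); only the app names come from the table.

```python
def pick_image_app(character_type: str, has_phota_profile: bool = False) -> str:
    """Map a character type to the image app named in the table."""
    if character_type == "real_human":
        # A trained Phota profile is what buys cross-shot identity consistency.
        return "phota/generate" if has_phota_profile else "pruna/p-image"
    if character_type in ("mascot", "illustrated"):
        return "google/gemini-3-pro-image-preview"
    raise ValueError(f"unknown character type: {character_type!r}")
```

For example, `pick_image_app("real_human", has_phota_profile=True)` selects `phota/generate`, while the same call without a profile falls back to `pruna/p-image`.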

Training a Phota identity (optional but recommended for humans):

If you need a real human character with consistent identity across multiple angles and shots, train a Phota profile first:

infsh app run phota/train --input '{
  "images": ["url1.jpg", "url2.jpg", ...],
  "wait": true
}' --save profile.json

Requires 30-50 face images of the subject
Training takes a few minutes with wait: true
Returns a profile_id you then use in phota/generate as [[profile_id]] in prompts
The profile is reusable forever — train once, generate unlimited shots
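If the image list is assembled programmatically, the training command can be built and checked before it is run. This is a sketch under assumptions: the helper name `phota_train_command` is hypothetical, and it only reproduces the `images` and `wait` fields and the CLI flags shown above.

```python
import json
import shlex


def phota_train_command(image_urls, save_path="profile.json"):
    """Build the `infsh app run phota/train` command line for a set of face images."""
    if not 30 <= len(image_urls) <= 50:
        raise ValueError(f"expected 30-50 face images, got {len(image_urls)}")
    payload = json.dumps({"images": list(image_urls), "wait": True})
    # shlex.quote keeps the JSON intact as a single shell argument.
    return (
        "infsh app run phota/train "
        f"--input {shlex.quote(payload)} --save {shlex.quote(save_path)}"
    )
```

Validating the image count up front avoids starting a training run that does not meet the 30-50 image requirement.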

If you don't need cross-shot consistency (e.g. single-speaker video, one angle only), pruna/p-image is simpler and cheaper.

Character sheets first, podcast frames second:

Generate a character sheet (plain white background, multiple angles) for each character
Then place characters into the podcast studio setting using the sheet as reference

For branded characters (logo on clothing):

Generate the character with a plain version of the garment
Use phota/edit with the logo as a second reference image to add the logo
Always pass the logo image alongside character references when generating new angles

Step 2: Alternate Angles

Generate at least 2 angles per character for visual variety:

| Angle | When to use |
|-------|-------------|
| Front/medium | Establishing shots, opening, closing |
| Close-up | Reactions, emotional moments, punchy lines |

For close-ups, prompt for "tight framing, chest up, shallow depth of field" — not "turned to the side" (which just makes them look away).

Identity consistency rules:

For real humans with a Phota profile: use phota/generate or phota/edit for new angles — Gemini does not preserve facial identity and will produce a different person
For real humans without a Phota profile: try to generate all needed angles in one go with pruna/p-image, or consider training a Phota profile if you need many shots
For mascots/illustrations: Gemini 3 Pro is fine, pass the established frame as reference

Framing rule: Use tight framing on individual speakers. Wide shots with multiple seats show empty chairs when only one person is on screen.

Step 3: QA Frames

Before proceeding, visually inspect all frames for:

Extra people in the background
Multiple microphones (should be single mic per shot)
Wrong or distorted logos
Inconsistent character identity across angles
Weird artifacts (extra limbs, merged objects)

Fix issues before generating video — re-rendering video is the most expensive step in the pipeline.

Step 4: Write the Script

Rules for natural conversation:

Write it like a real conversation, NOT like people reading ad copy in turns
Include reactions ("wait, hold on", "that is wild"), interruptions, and follow-up questions
Vary turn length — short reactions (1 sentence) mixed with longer explanations (2-3 sentences)
The host should ask real questions, not set up obvious talking points
Keep total duration target in mind: ~2.5 words/second for natural speech at 1.05x rate

Duration guide:

| Target | Words |
|--------|-------|
| 15s | ~38 words |
| 30s | ~75 words |
| 60s | ~150 words |
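The duration guide follows directly from the ~2.5 words/second figure. A minimal sketch of that arithmetic, with hypothetical helper names and a script represented as (speaker, text) pairs:

```python
import math

WORDS_PER_SECOND = 2.5  # pacing figure from the guide, at speaking_rate 1.05


def word_budget(target_seconds: float) -> int:
    """Approximate word count for a target duration, rounded up."""
    return math.ceil(target_seconds * WORDS_PER_SECOND)


def estimated_seconds(script_turns) -> float:
    """Estimate total spoken duration of a script given as (speaker, text) pairs."""
    total_words = sum(len(text.split()) for _speaker, text in script_turns)
    return total_words / WORDS_PER_SECOND
```

`word_budget(15)`, `word_budget(30)`, and `word_budget(60)` give 38, 75, and 150, matching the guide. Running `estimated_seconds` on a draft is a quick check before any audio is generated.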

Step 5: Generate TTS Audio

Use inworld/text-to-speech-2 for each turn.

infsh app run inworld/text-to-speech-2 --input '{
  "text": "...",
  "voice_id": "...",
  "speaking_rate": 1.05,
  "audio_encoding": "MP3"
}' --save output.json

Voice selection:

Generate samples with the same line across candidate voices BEFORE committing
Let the user listen and approve voices
Good podcast voices: Tyler, Nate, Lauren, Kelsey, Naomi, Anjali (EN_US)
Use inworld/text-to-speech-2:voices to list all available voices
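An audition pass amounts to one TTS input per candidate voice, all reading the same line. The sketch below assumes the voice names listed above can be passed directly as `voice_id` values, which the source does not confirm; `audition_inputs` is a hypothetical helper.

```python
def audition_inputs(sample_line, voice_ids, speaking_rate=1.05):
    """Build one TTS input per candidate voice, all reading the same line."""
    return [
        {
            "text": sample_line,
            "voice_id": voice_id,  # assumption: the voice name doubles as its id
            "speaking_rate": speaking_rate,
            "audio_encoding": "MP3",
        }
        for voice_id in voice_ids
    ]
```

Each returned dict can be serialized to JSON and passed to `--input`, so the user hears identical text in every candidate voice.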

Speaking rate:

Default to 1.05 for natural podcast pacing
Use 1.1 for short snappy reactions
NEVER go below 1.0 — sounds slow and disengaging
Keep rate consistent per character across all their turns

All TTS turns can run in parallel (cheap, fast ~2-8s each).
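The rate rules and the parallel execution can be combined in a small driver. This is a sketch, not the skill's implementation: `tts_inputs` and `run_parallel` are hypothetical names, and `run_one` stands for whatever callable actually invokes the TTS app for a single input.

```python
from concurrent.futures import ThreadPoolExecutor


def tts_inputs(turns, voices, rates):
    """Build one TTS input per (speaker, text) turn.

    `voices` and `rates` map speaker -> voice_id / speaking_rate, so each
    character keeps one voice and one rate across all of their turns.
    """
    for speaker, rate in rates.items():
        if rate < 1.0:
            raise ValueError(f"speaking_rate for {speaker} is {rate}; must be >= 1.0")
    return [
        {
            "text": text,
            "voice_id": voices[speaker],
            "speaking_rate": rates.get(speaker, 1.05),  # default podcast pacing
            "audio_encoding": "MP3",
        }
        for speaker, text in turns
    ]


def run_parallel(inputs, run_one, max_workers=4):
    """Call `run_one(input)` for every input concurrently, preserving turn order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, inputs))
```

`pool.map` returns results in input order, so the audio for turn N stays at index N even when later turns finish first.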

Step 6: Generate Video Clips

Use pruna/p-video-avatar for each turn.

infsh app run pruna/p-video-avatar --input '{
  "image": "<character_frame_url>",
  "audio": "<tts_audio_url>",
  "resolution": "720p",
  "video_prompt": "..."
}' --save output.json

Critical: Run clips SEQUENTIALLY, not in parallel. Parallel runs hit the same GPU and cause CUDA OOM failures. Each clip takes 15-90s depending on audio length.
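A plain loop is enough to enforce the sequential rule. The sketch below builds the argument list from the command shown above and runs one clip at a time; the helper names are hypothetical, and the `run` parameter exists only so the loop can be exercised without the CLI installed.

```python
import json
import subprocess


def avatar_command(image_url, audio_url, save_path, video_prompt=""):
    """Build the argv list for one pruna/p-video-avatar run."""
    payload = {
        "image": image_url,
        "audio": audio_url,
        "resolution": "720p",
        "video_prompt": video_prompt,
    }
    return ["infsh", "app", "run", "pruna/p-video-avatar",
            "--input", json.dumps(payload), "--save", save_path]


def run_sequentially(commands, run=subprocess.run):
    """Run each command only after the previous one has finished."""
    results = []
    for argv in commands:
        # subprocess.run blocks until the process exits, so clips never overlap.
        results.append(run(argv, check=True))
    return results
```

With `check=True`, a failed clip raises immediately, so the remaining clips are not queued behind a broken one.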

Angle assignment plan: Alternate between front and close-up shots across turns for visual variety. Example for 6 turns:

T1: Speaker A — front
T2: Speaker B — front
T3: Speaker C — front (or close-up)
T4: Speaker A — close-up
T5: Speaker B — close-up
T6: Speaker A — front
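The example plan above is consistent with cycling each speaker through the available angles, one step per appearance. A minimal sketch of that rule, with the hypothetical helper `assign_angles`:

```python
from collections import defaultdict


def assign_angles(speakers, angles=("front", "close-up")):
    """Cycle each speaker through `angles`, one step per appearance."""
    seen = defaultdict(int)  # how many turns each speaker has had so far
    plan = []
    for speaker in speakers:
        plan.append((speaker, angles[seen[speaker] % len(angles)]))
        seen[speaker] += 1
    return plan
```

Calling `assign_angles(["A", "B", "C", "A", "B", "A"])` reproduces the six-turn plan: A, B, and C open on front shots, A and B get close-ups on their second turns, and A returns to the front shot on the third.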

Step 7: Merge

Use infsh/media-merger to stitch all clips into the final video.

# Build the input JSON and save it as merger_input.json
{
  "media_files": [
    {"file": "<clip1_url>"},
    {"file": "<clip2_url>"},
    ...
  ],
  "fps": 24,
  "output_format": "mp4"
}

infsh app run infsh/media-merger --input merger_input.json --save final.json

Merger is free and takes 2-6 minutes depending on total duration.
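The merger input file can be generated from the ordered list of clip URLs. This sketch uses only the `media_files`, `fps`, and `output_format` fields shown above; the helper name `write_merger_input` is hypothetical.

```python
import json


def write_merger_input(clip_urls, path="merger_input.json", fps=24):
    """Write the media-merger input file, keeping clips in playback order."""
    if not clip_urls:
        raise ValueError("need at least one clip to merge")
    payload = {
        "media_files": [{"file": url} for url in clip_urls],
        "fps": fps,
        "output_format": "mp4",
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    return payload
```

The order of `clip_urls` determines the order of the final video, so pass the clips in turn order.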

Rules

Gemini does not preserve human facial identity — generating alternate angles of a real human with Gemini will produce a different person. For identity-consistent human shots, use Phota with a trained profile_id, or generate all angles in a single batch. This was learned after Gemini produced an entirely different face for a close-up that was supposed to match the front shot.
NEVER run p-video-avatar clips in parallel — they compete for GPU memory and fail with CUDA OOM. Run them sequentially. This was learned after 2 of 3 parallel runs failed.
NEVER set speaking_rate below 1.0 — it sounds artificial and disengaging. Default to 1.05. Learned from user feedback that 0.9 rate "felt weird and disengaging."
ALWAYS QA frames before generating video — video generation is the most expensive step in the pipeline. Catching a double mic or wrong logo in the image stage is cheap to fix. Catching it after video generation means re-rendering the entire clip.
ALWAYS use tight framing for individual speaker shots — wide/establishing shots show empty seats where other speakers should be. Frame from waist or chest up so no empty chairs are visible.
ALWAYS pass the logo as a reference image when generating branded characters — describing a logo in text produces wrong results. Pass the actual logo file as a second image input.
ALWAYS get voice approval before full production — generate samples with the same line across 5-8 candidate voices and let the user pick before committing to the full script.
Script should read like a conversation, not an ad — people reading ad copy in turns sounds fake. Include reactions, interruptions, varied turn lengths, and genuine questions. The host should have personality, not just set up talking points.

App Reference

| App | Purpose |
|-----|---------|
| pruna/p-image | Generate portraits from text |
| phota/train | Train identity profile from 30-50 face images |
| phota/generate | Generate images with trained identity via [[profile_id]] |
| phota/edit | Edit images preserving identity of known subjects |
| google/gemini-3-pro-image-preview | Image gen/edit, mascots, style transfer |
| inworld/text-to-speech-2 | Text to speech, 100+ languages, voice steering |
| pruna/p-video-avatar | Portrait + audio → talking head video |
| infsh/media-merger | Concatenate video clips into one video |

Use belt task cost <task-id> to check the cost of any individual task.

Generate multi-person talking head podcast videos from scratch using AI — character creation, TTS, avatar animation, and video stitching. Use when the user wants to create a podcast, talking head video, or multi-speaker conversation video.

442 stars

data-ai

Updated May 17, 2026

npx skillsauth add inferencesh/skills ai-podcast

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: May 17, 2026, 7:08 AM · 35.4s · 1 file scanned

SKILL.md

name: ai-podcast
description: Generate multi-person talking head podcast videos from scratch using AI: character creation, TTS, avatar animation, and video stitching. Use when the user wants to create a podcast, talking head video, or multi-speaker conversation video.
allowed-tools: Bash, Read, Write, Agent, Glob, Grep

Install manually

# Clone the repo
git clone https://github.com/inferencesh/skills.git

# Copy into Claude Code skills folder (global)
cp -r skills/ai-podcast ~/.claude/skills/

Repository

inferencesh/skills

442 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT