Podcast Edit Skill

Process raw podcast/meeting recordings into polished podcast episodes.

Capabilities

Smart trimming — Find where the actual podcast starts/ends by transcribing and detecting intros/outros
Filler word removal — Remove verbal tics: 嗯, 呃, 啊, 哦, 对对对, um, uh, etc.
Silence trimming — Cut long dead air (>2s) down to natural pauses (~0.6s)
Audio enhancement — Noise reduction, EQ, multi-speaker volume balancing, loudness normalization to podcast standard (−16 LUFS)

Prerequisites

ffmpeg and ffprobe installed
OPENAI_API_KEY in environment (for Whisper API transcription)
Python 3 with stdlib only (no extra deps for the helper script)

Workflow

Step 1: Inspect the audio file

ffprobe -v quiet -print_format json -show_format -show_streams "INPUT_FILE"

Note: duration, sample rate, channels, codec, bitrate.
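
As a minimal sketch, the fields noted above can be pulled out of the ffprobe JSON with the standard library. The function name summarize_probe is hypothetical (it is not part of this skill), and it assumes the first audio stream is the one of interest.

```python
import json


def summarize_probe(probe_json):
    """Extract the Step 1 fields from `ffprobe -print_format json` output.

    Hypothetical helper: assumes the first audio stream is the relevant one.
    ffprobe reports duration, bit_rate and sample_rate as strings.
    """
    info = json.loads(probe_json)
    fmt = info.get("format", {})
    audio = next(
        (s for s in info.get("streams", []) if s.get("codec_type") == "audio"),
        {},
    )
    return {
        "duration": float(fmt["duration"]) if "duration" in fmt else None,
        "sample_rate": int(audio["sample_rate"]) if "sample_rate" in audio else None,
        "channels": audio.get("channels"),
        "codec": audio.get("codec_name"),
        "bit_rate": int(fmt["bit_rate"]) if "bit_rate" in fmt else None,
    }
```

The sample rate returned here is what later decides the lowpass cutoff in Step 4.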

Step 2: Find podcast start/end (if user says to trim front/back)

Split into 5-minute chunks and transcribe via OpenAI Whisper API with segment-level timestamps:

# Extract chunk
ffmpeg -y -i "INPUT_FILE" -ss OFFSET -t 300 -ar 16000 -ac 1 /tmp/chunk_OFFSET.mp3

# Transcribe
curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@/tmp/chunk_OFFSET.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F language="LANG" \
  -F 'timestamp_granularities[]=segment' > /tmp/transcript_OFFSET.json

Scan transcriptions for:

Start markers: "welcome", "hello everyone", "大家好", "欢迎", intro music, first substantive topic sentence
End markers: "see you next time", "bye", "下期见", "感谢收听", followed by post-show chat

Do an initial trim with -ss START -to END and -c copy (no re-encode) to create a working file.
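
The marker scan described above could be sketched roughly as follows. This is an illustration, not the skill's own logic: find_bounds is a hypothetical name, the marker tuples only repeat the text markers listed above, and the input is assumed to be the parsed verbose_json of each chunk paired with that chunk's offset. Intro music and "first substantive topic sentence" cannot be detected this way, so treat the result as a candidate to review.

```python
START_MARKERS = ("welcome", "hello everyone", "大家好", "欢迎")
END_MARKERS = ("see you next time", "bye", "下期见", "感谢收听")


def find_bounds(transcripts):
    """Return (start, end) candidates in seconds on the full-file timeline.

    transcripts: iterable of (chunk_offset_seconds, verbose_json_dict).
    Start is the first segment containing a start marker; end is the end of
    the last segment containing an end marker. Either may be None.
    """
    start = end = None
    for offset, data in sorted(transcripts, key=lambda item: item[0]):
        for seg in data.get("segments", []):
            text = seg["text"].lower()
            if start is None and any(m in text for m in START_MARKERS):
                start = offset + seg["start"]
            if any(m in text for m in END_MARKERS):
                end = offset + seg["end"]
    return start, end
```

Segment timestamps are relative to each chunk, which is why the chunk offset is added back before the values are used with -ss and -to.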

Step 3: Remove filler words

Split the trimmed file into 5-minute chunks and transcribe each with word-level timestamps:

# Extract chunks
for i in $(seq 0 300 DURATION); do
  ffmpeg -y -i "TRIMMED_FILE" -ss $i -t 300 -ar 16000 -ac 1 /tmp/wchunk_${i}.mp3
done

# Transcribe each chunk (can run in parallel)
for i in $(seq 0 300 DURATION); do
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file="@/tmp/wchunk_${i}.mp3" \
    -F model="whisper-1" \
    -F response_format="verbose_json" \
    -F language="LANG" \
    -F 'timestamp_granularities[]=word' \
    -F 'timestamp_granularities[]=segment' > /tmp/wtranscript_${i}.json &
done
wait
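
One edge case in the loops above: when DURATION is an exact multiple of 300, seq also emits the final offset, which produces an empty chunk. A small sketch of computing the offsets ahead of time (chunk_offsets is a hypothetical helper, not something the skill ships):

```python
import math


def chunk_offsets(duration, chunk=300):
    """Start offsets (seconds) of fixed-length chunks that contain audio.

    An offset equal to the total duration is never emitted, so no empty
    trailing chunk is created.
    """
    if duration <= 0:
        return []
    return [i * chunk for i in range(math.ceil(duration / chunk))]
```

The resulting list can be joined with commas and passed as --chunk-offsets.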

Then run the filler removal script that ships with this skill:

python3 ./filler_removal.py \
  --total-duration DURATION \
  --end-at END_TIMESTAMP \
  --cut START1:END1 --cut START2:END2 \
  --chunk-offsets 0,300,600,900,...

Arguments:

--total-duration: Duration of the trimmed input file in seconds (required)
--end-at: Cut everything after this timestamp (e.g., post-show chat start)
--cut START:END: Cut a specific range. Can be repeated.
--chunk-offsets: Comma-separated chunk offsets (default: auto 0,300,600,…)

The script outputs /tmp/ffmpeg_filter.txt with an atrim+concat filter.
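
For orientation, an atrim+concat filter of this kind can be generated from a list of cut ranges as sketched below. This is not the shipped filler_removal.py; the function names are hypothetical and the exact layout of the real script's output may differ. The only things taken from this document are the atrim+concat approach and the [out] label used by -map in the next step.

```python
def keep_ranges(cuts, total):
    """Invert (start, end) cut ranges into the ranges to keep.

    Cuts are clamped to [0, total]; overlapping cuts are merged implicitly.
    """
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        start = min(max(start, 0.0), total)
        end = min(max(end, 0.0), total)
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        keeps.append((cursor, total))
    return keeps


def atrim_concat(keeps):
    """Build a filter_complex script: one atrim per kept range, then concat."""
    if not keeps:
        raise ValueError("nothing left to keep")
    parts = [
        f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]"
        for i, (s, e) in enumerate(keeps)
    ]
    labels = "".join(f"[a{i}]" for i in range(len(keeps)))
    parts.append(f"{labels}concat=n={len(keeps)}:v=0:a=1[out]")
    return ";\n".join(parts)
```

In this sketch an --end-at value would simply be one more cut, from that timestamp to the total duration.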

Apply the filter in two passes:

# Step A: Cut fillers → intermediate WAV (avoids re-encoding artifacts)
ffmpeg -y -i "TRIMMED_FILE" \
  -filter_complex_script /tmp/ffmpeg_filter.txt \
  -map '[out]' -c:a pcm_s16le -ar 44100 /tmp/podcast_cut.wav

# Step B: Enhance audio → final MP3
ffmpeg -y -i /tmp/podcast_cut.wav \
  -af "ENHANCEMENT_CHAIN" \
  -c:a libmp3lame -b:a 192k "OUTPUT_FILE"

Limitations: Whisper word-level timestamps for Chinese can miss fillers that are blended into adjacent speech. The script catches standalone fillers reliably but may miss ~10–20% of embedded ones.
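
To make "standalone filler" concrete, here is a rough sketch of how word-level timestamps can be matched against a filler list. It is an assumption about the general technique, not the shipped script: filler_cuts is a hypothetical name, the filler set only repeats the examples listed under Capabilities, and the input is assumed to be the "words" array of a verbose_json response.

```python
FILLERS = {"嗯", "呃", "啊", "哦", "对对对", "um", "uh"}
PUNCTUATION = ",.!?，。！？、 "


def filler_cuts(words, chunk_offset=0.0):
    """Return (start, end) cut ranges for words that are exactly a filler.

    words: list of {"word", "start", "end"} dicts with chunk-relative times.
    A filler fused into a longer token is not matched, which is the
    "embedded filler" case described above.
    """
    cuts = []
    for w in words:
        token = w["word"].strip(PUNCTUATION).lower()
        if token in FILLERS:
            cuts.append((chunk_offset + w["start"], chunk_offset + w["end"]))
    return cuts
```

Because the match is on whole tokens, a word like "嗯我们" passes through untouched.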

Step 4: Audio enhancement filter chain

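Default chain (guest-friendly: handles multi-speaker volume imbalance). The biggest mistake in past runs was adding a noise gate (agate), which silenced the quieter guest entirely. Never add agate back to the default chain. The # comments in the listing are annotations only; strip them and join the filters into one comma-separated string before passing it to -af.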

highpass=f=80,                                    # Remove room rumble
lowpass=f=12000,                                  # Remove hiss (use 7500 for 16kHz sources)
afftdn=nf=-25:nr=8:nt=w,                         # Gentle FFT noise reduction
equalizer=f=180:t=q:w=1.5:g=-2,                  # Cut mud
equalizer=f=2500:t=q:w=1.2:g=3,                  # Boost presence
equalizer=f=4500:t=q:w=1.5:g=1.5,                # Boost clarity
dynaudnorm=f=200:g=5:p=0.95:m=5:s=0,             # Rolling-window normalization — lifts the quieter speaker independently
acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1,  # Gentle glue
loudnorm=I=-16:TP=-1.5:LRA=13                    # Podcast standard loudness

Why dynaudnorm is the star: f=200 splits the audio into 200 ms frames and g=5 smooths the gain over a 5-frame Gaussian window, so stretches where the guest is speaking get lifted independently of the host's louder stretches. Order matters: run dynaudnorm BEFORE acompressor so the compressor sees a balanced signal.

Never add these to the default chain:

agate (noise gate) — cuts off any speaker quieter than the threshold; kills the guest.
Heavy compression (ratio >3:1, makeup >2 dB) — flattens dynamics and makes the guest sound pumped.
Narrow LRA (<12) in loudnorm — crushes natural speech dynamics.

Adjust lowpass based on source sample rate:

16kHz source → lowpass=7500
44.1kHz+ source → lowpass=12000 (or skip)
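
A small sketch of assembling the default chain into the single string that replaces ENHANCEMENT_CHAIN, with the lowpass cutoff chosen from the source sample rate. The name enhancement_chain is hypothetical. The document only covers 16 kHz and 44.1 kHz+ sources; treating every rate at or below 16 kHz as the 7500 Hz case and everything else as 12000 Hz is an assumption.

```python
def enhancement_chain(sample_rate):
    """Return the default chain as one comma-separated -af string.

    Assumption: rates <= 16000 Hz use lowpass 7500, all others 12000.
    Filter order is kept as documented (dynaudnorm before acompressor).
    """
    lowpass = 7500 if sample_rate <= 16000 else 12000
    filters = [
        "highpass=f=80",
        f"lowpass=f={lowpass}",
        "afftdn=nf=-25:nr=8:nt=w",
        "equalizer=f=180:t=q:w=1.5:g=-2",
        "equalizer=f=2500:t=q:w=1.2:g=3",
        "equalizer=f=4500:t=q:w=1.5:g=1.5",
        "dynaudnorm=f=200:g=5:p=0.95:m=5:s=0",
        "acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1",
        "loudnorm=I=-16:TP=-1.5:LRA=13",
    ]
    return ",".join(filters)
```

Building the string from a list keeps the annotations out of the value handed to ffmpeg and makes it harder for a gate to slip back in unnoticed.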

Verify guest audibility after rendering: run ffmpeg -i OUTPUT -af "ebur128=peak=true" -f null - and check I: is near −16 LUFS and LRA: is 4–6 LU (tighter LRA is fine because dynaudnorm did per-window balancing first). If the output sounds like the guest was cut, suspect a gate or aggressive compressor crept back in.
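
The check above can be automated by reading the summary that ebur128 prints to stderr. The sketch below is illustrative: the function names are hypothetical, the 1 LU tolerance for "near −16" is an assumption, and the regular expressions assume the usual "I: … LUFS" and "LRA: … LU" summary lines.

```python
import re

_I_RE = re.compile(r"\bI:\s*(-?\d+(?:\.\d+)?)\s*LUFS")
_LRA_RE = re.compile(r"\bLRA:\s*(-?\d+(?:\.\d+)?)\s*LU\b")


def parse_ebur128(stderr_text):
    """Return (integrated_lufs, lra_lu) from ebur128 log output.

    The last match of each pattern is used, since the summary is printed
    after the per-frame lines.
    """
    i_vals = _I_RE.findall(stderr_text)
    lra_vals = _LRA_RE.findall(stderr_text)
    if not i_vals or not lra_vals:
        raise ValueError("no ebur128 measurements found")
    return float(i_vals[-1]), float(lra_vals[-1])


def loudness_ok(integrated, lra, target=-16.0, tolerance=1.0):
    """True when I is within `tolerance` of target and LRA is 4 to 6 LU."""
    return abs(integrated - target) <= tolerance and 4.0 <= lra <= 6.0
```

A failed check is a prompt to listen to the guest's passages, not a replacement for listening.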

Step 5: Verify output

ls -lh "OUTPUT_FILE"
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "OUTPUT_FILE"

Report: duration, file size, what was removed (filler count, silence count, time saved).

Output conventions

Format: MP3, 192 kbps, mono (unless source is stereo with separate speakers per channel)
Loudness: −16 LUFS (podcast standard)
Always two-pass: cut to WAV first, then enhance to MP3

Show notes — bilingual writing (if applicable)

If the host is producing bilingual Chinese/English show notes, the Chinese section must be written in actual Chinese — not Chinese grammar with English verbs and nouns sprinkled in. Code-switching like "close 了一个 deal", "build 出来的 agent", or "PR 不是 buy 来的" reads like a draft and is the #1 mistake to avoid.

Translation rules

Translate these common startup/tech English loanwords into Chinese:

close deal → 拿下订单 / 成交 / 签下
build (a product) → 搭建 / 做出 / 打造
integration → 集成
view (video/page views) → 播放 / 浏览
stack (tech stack) → 体系 / 技术栈
category leader → 品类领导者
front-end / front end (product sense) → 外壳 / 前端
success story → 客户案例 / 成功故事
SMB → 中小企业
Enterprise (segment) → 大型企业 / 企业级
aha moment → 顿悟时刻
onboarding → 上手 / 入门
retention → 留存
churn → 流失
pipeline → 销售漏斗 / 业务线

What to KEEP in English inside Chinese text

Brand and product names — company / product / person names stay as-is
Very common startup acronyms — CEO, CTO, CMO, PMF, ARR, MRR, PR, AI, AI Agent, SaaS, API
Currency with numeric prefix — $20K, $200K, or 200 美金 (either form is fine when paired with a number)

Before finalizing

Re-read the Chinese section as a Chinese reader. If any sentence feels like it was half-translated — e.g., contains "build", "close", "deal", "view", "stack", "leader" as standalone English words — rewrite those words in Chinese. The only English that should survive a re-read is brand names and the acronyms above.
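
A mechanical pre-check can flag leftover English before the human re-read. The sketch below is only an aid, under stated assumptions: leftover_english is a hypothetical helper, the allow-list copies the acronyms listed above, confirmed brand names must be supplied by the caller, and letters attached to a number (such as the K in $200K) are ignored.

```python
import re

ALLOWED_ACRONYMS = {"CEO", "CTO", "CMO", "PMF", "ARR", "MRR", "PR", "AI", "SaaS", "API"}
ALLOWED_PHRASES = ("AI Agent",)
# An English word that is not glued to a preceding letter, digit, or "$".
_WORD_RE = re.compile(r"(?<![A-Za-z0-9$])[A-Za-z][A-Za-z-]*")


def leftover_english(text, brand_names=()):
    """Return English words in `text` that are neither allowed nor brands."""
    for phrase in ALLOWED_PHRASES + tuple(brand_names):
        text = text.replace(phrase, " ")
    return [w for w in _WORD_RE.findall(text) if w not in ALLOWED_ACRONYMS]
```

An empty result does not prove the Chinese reads naturally; it only rules out the most obvious half-translated words.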

Name verification (CRITICAL)

Whisper frequently mangles company names, product names, and personal names. Before generating show notes or any output that includes names and links:

After transcription, extract all proper nouns — company names, product names, personal names, URLs mentioned.
Ask the user to confirm/correct them — Whisper hears similar-sounding but wrong tokens for brand names.
Never guess URLs from transcribed names — a name that sounds like "Acme" could be acme.com, acmehq.com, or something else entirely. Always ask.
Use confirmed names consistently in show notes, titles, episode metadata, and all outputs.

This is especially important when generating backlinks or social posts — a misspelled domain is a wasted link.

Show notes structure (recommended)

Two separate sections — Chinese first, then English (or whichever languages the show targets). Do NOT interleave or put them side-by-side.

Heading rule: only use H2 (##). Avoid H3 or deeper — flatten all sub-sections to H2.

Timestamp format: always MM:SS with leading zeros (e.g., 08:25, 00:00, 42:10). Never 0:00 or 1:05.
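
A minimal formatter for this rule, as a sketch. format_timestamp is a hypothetical name, and letting the minutes field run past 59 for episodes longer than an hour is an assumption, since the rule above only specifies MM:SS.

```python
def format_timestamp(seconds):
    """Format a non-negative offset in seconds as zero-padded MM:SS."""
    total = int(seconds)
    if total < 0:
        raise ValueError("timestamp must be non-negative")
    minutes, secs = divmod(total, 60)
    return f"{minutes:02d}:{secs:02d}"
```

For example, format_timestamp(505) gives 08:25 and format_timestamp(65) gives 01:05.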

EP{NNN}: {Episode title}

---

## 中文

**嘉宾：** {中文姓名 English Name}, {中文职位} {公司} (URL)

## 简介
{完整中文段落}

## 时间轴
- 00:00 — {中文描述}
- 08:25 — {中文描述}

## 核心要点
- {中文要点}

## 相关链接
- {品牌名}：{URL}

---

## English

**Guest:** {English Name}, {Title} at {Company} (URL)

## Summary
{Full English paragraph}

## Timestamps
- 00:00 — {English description}
- 08:25 — {English description}

## Key Takeaways
- {English takeaway}

## Links
- {Brand}: {URL}

Why two sections instead of bilingual bullets: Chinese readers want clean Chinese prose, English readers want clean English prose. Alternating "中文 / English" on every bullet makes both halves harder to read. Write each section as if it were the only one.

Quick trim (no filler removal)

If the user just wants a simple trim (e.g., "cut the first 3s"):

ffmpeg -y -i "INPUT" -ss 3 -c copy "OUTPUT"

Use -c copy for instant lossless trim when no audio processing is needed.
