lecture-stt/SKILL.md
Transcribe audio lectures into structured markdown notes using LLM-based STT (Gemini 3 Flash Preview) with contextual prompting. Supports PDF slide guides, domain-aware term hints, and local Whisper fallback. Triggers: "강의 전사", "STT", "lecture transcription", "오디오 전사", "강의 녹음", "audio to text", "lecture notes", "음성 변환", "녹음 텍스트", "전사해줘", "transcribe", "whisper"
npx skillsauth add lidge-jun/cli-jaw-skills lecture-stt
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Convert lecture audio into near-verbatim structured markdown using LLM-based STT. The skill relies on contextual prompting: PDF slides identify which slide is being discussed, while the output captures only what the professor actually said. Without proper prompting, models tend to copy-paste slide text instead of transcribing speech.
Required:
- Audio file: .m4a, .mp3, .wav (up to ~8 hours)
Optional (strongly recommended):
GOOGLE_APPLICATION_CREDENTIALS or ADC configured?
├─ Yes → Vertex AI + Gemini 3 Flash Preview (recommended, location='global')
GEMINI_API_KEY exists?
├─ Yes → Gemini API + Gemini 3 Flash Preview (recommended)
│ └─ Fallback: Gemini 3.1 Flash-Lite (faster but may copy slide text)
├─ No, OPENAI_API_KEY exists?
│ └─ Yes → OpenAI gpt-4o-transcribe (supports diarization)
└─ No → mlx-whisper local (Apple Silicon, offline, free)
Vertex AI requires `location='global'` for gemini-3.* models. Using `us-central1` returns 404.
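The credential checks above can be sketched as a small selector. This is a minimal sketch, not part of the skill: the function name `choose_backend` and the returned labels are hypothetical, and it only inspects environment variables, so ADC configured through `gcloud` without `GOOGLE_CLOUD_PROJECT` set would not be detected.

```python
import os
from typing import Mapping


def choose_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return a backend label following the credential order described above.

    The labels are hypothetical names used only for illustration.
    """
    # Vertex AI is preferred when service-account or project settings exist.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS") or env.get("GOOGLE_CLOUD_PROJECT"):
        return "vertex-ai"
    # Otherwise fall back to the Gemini API key, then OpenAI.
    if env.get("GEMINI_API_KEY"):
        return "gemini-api"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # No cloud credentials at all: local Whisper on Apple Silicon.
    return "mlx-whisper"
```

Passing a plain dict instead of `os.environ` makes the order easy to check without touching real credentials.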
Choose prompt level based on available inputs:
Transcribe this audio in its original language. Output transcription only.
Transcribe this audio in its original language. Output transcription only.
Context: {domain description}
Key Terms: {comma-separated terms}
Transcribe this audio in its original language.
Context: {domain}
Key Terms: {terms}
Speaker: {speaker info, if known}
Output format:
- Sentence-level line breaks
- Numbers in Arabic numerals (e.g., 12,000)
- Preserve the original language throughout
- Transcription text only
You are a speech-to-text transcription assistant. You are given a PDF slide
deck and an audio recording of a {subject} university lecture.
## Rule 1: Original Language Only
Never translate. If the professor speaks English, output in English.
If Korean, output in Korean. If mixed, preserve the mix exactly as spoken.
## Rule 2: Verbatim STT — Speech Only
This is STT, not summarization or note-taking.
Write down what the professor actually said, word by word,
preserving their exact phrasing and speaking style.
The PDF is only for identifying which slide is being discussed.
Output only the professor's spoken words.
If the professor reads a slide aloud, transcribe their spoken version
(which may differ from slide text).
- Include filler words, false starts, self-corrections
- Include every tangent, joke, anecdote, aside, and digression
- Keep natural speaking style (colloquial, formal, rambling — all of it)
- A 10-minute audio should produce 2000+ words
- More text is always better than less
## Page-by-Page Structure
- Structure by PDF page: `## p.{N} — {slide title or topic}`
- Go through every page in order
- Skipped pages: `*[No lecture content — slide only]*`
- Content before first slide → `## p.0 — Pre-lecture`
- Content after last slide → `## Closing`
## Beyond the PDF
Capture all off-slide content under nearest page with 💬 marker:
- Verbal explanations, intuitions, reasoning
- Real-world examples, anecdotes, case studies
- Exam tips, common mistakes, warnings
- Q&A with students
- Administrative announcements
## Output Format
- `---` dividers between page sections
- `## p.{N} — {title}` for every page
- `> ` blockquote for extended examples
- 💬 for beyond-PDF verbal content
- Math in KaTeX: $Y = C + I + G$
- One sentence per line
## Key Terms
{auto-inferred + user-supplied terms}
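Because the prompt above asks for a `## p.{N} — {title}` heading on every page, a returned transcript can be checked mechanically for pages the model skipped. The following is a rough sketch under that assumption; `missing_pages` is a hypothetical helper, not something the skill ships.

```python
import re

# Matches headings such as "## p.12 — Fiscal policy" at the start of a line.
PAGE_HEADING = re.compile(r"^## p\.(\d+) — (.*)$", re.MULTILINE)


def missing_pages(transcript: str, page_count: int) -> list[int]:
    """Return the PDF page numbers (1-based) that have no heading in the transcript."""
    found = {int(match.group(1)) for match in PAGE_HEADING.finditer(transcript)}
    return [page for page in range(1, page_count + 1) if page not in found]
```

A non-empty result suggests the model did not go through every page in order, which may be a reason to retry the request.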
Transcribe this audio verbatim. Include all speech as-is.
Sentence-level line breaks. No formatting.
Use when exact wording is needed without page structure (prefer Whisper for this).
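For the text-only prompts (no PDF), the choice based on available inputs might be sketched as below. The selection rule is an assumption: the sketch uses the bare prompt when nothing is known, the Context/Key Terms prompt when a domain or terms are given, and the variant with output-format rules once speaker info is supplied. `build_prompt` and `FORMAT_RULES` are hypothetical names.

```python
from typing import Optional, Sequence

BASE = "Transcribe this audio in its original language."

# Output-format rules copied from the structured prompt above.
FORMAT_RULES = [
    "- Sentence-level line breaks",
    "- Numbers in Arabic numerals (e.g., 12,000)",
    "- Preserve the original language throughout",
    "- Transcription text only",
]


def build_prompt(domain: Optional[str] = None,
                 terms: Sequence[str] = (),
                 speaker: Optional[str] = None) -> str:
    """Assemble a prompt from whichever hints are available."""
    context = []
    if domain:
        context.append(f"Context: {domain}")
    if terms:
        context.append("Key Terms: " + ", ".join(terms))
    if speaker:
        # Speaker info known: use the variant with explicit format rules.
        return "\n".join([BASE, *context, f"Speaker: {speaker}",
                          "Output format:", *FORMAT_RULES])
    # With or without context lines, the short form ends the first line the same way.
    return "\n".join([f"{BASE} Output transcription only.", *context])
```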
When the user provides only a subject title, infer domain terms:
| Title pattern | Context | Auto terms |
| -------------- | ---------------------- | ------------------------------------------ |
| 경제, 거시경제 | Macroeconomics | GDP, multiplier, MPC, IS-LM, fiscal policy |
| 물리 | Physics | Newton's laws, energy conservation, E=mc² |
| 법학, 헌법 | Constitutional law | 위헌법률심판, 헌법소원, 기본권 |
| CS, 프로그래밍 | Computer Science | algorithm, Big-O, data structure |
| (no match) | Ask model to infer | — |
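The table can be read as a lookup from title keywords to a context and term list. A minimal sketch follows, assuming plain substring matching on the title (the source does not say how patterns are matched); `DOMAIN_HINTS` and `infer_domain` are hypothetical names.

```python
from typing import List, Optional, Tuple

# (title keywords, context, auto terms), copied from the table above.
DOMAIN_HINTS = [
    (("경제", "거시경제"), "Macroeconomics",
     ["GDP", "multiplier", "MPC", "IS-LM", "fiscal policy"]),
    (("물리",), "Physics",
     ["Newton's laws", "energy conservation", "E=mc²"]),
    (("법학", "헌법"), "Constitutional law",
     ["위헌법률심판", "헌법소원", "기본권"]),
    (("CS", "프로그래밍"), "Computer Science",
     ["algorithm", "Big-O", "data structure"]),
]


def infer_domain(title: str) -> Tuple[Optional[str], List[str]]:
    """Return (context, terms) for the first matching row, or (None, []) on no match."""
    for keywords, context, terms in DOMAIN_HINTS:
        if any(keyword in title for keyword in keywords):
            return context, list(terms)
    # No match: the caller should ask the model to infer the domain itself.
    return None, []
```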
| Duration | Strategy |
| --------- | --------------------------------------- |
| < 2 hours | Single request |
| 2–8 hours | Split into 2–4 chunks with 10s overlap |
| 8+ hours | 30-minute chunks, sequential processing |
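One way to turn the duration table into chunk boundaries is sketched below. Several details are assumptions rather than rules from the skill: the 2–8 hour range is cut into roughly 2-hour pieces (which yields 2 to 4 chunks), exactly 8 hours is treated as part of that range, and the 10-second overlap is also applied to the 30-minute chunks. `plan_chunks` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def plan_chunks(duration_s: float, overlap_s: float = 10) -> List[Tuple[float, float]]:
    """Return (start_seconds, length_seconds) pairs for each request."""
    hours = duration_s / 3600
    if hours < 2:
        return [(0.0, float(duration_s))]      # single request
    if hours <= 8:
        count = max(2, math.ceil(hours / 2))    # 2 to 4 chunks
        step = duration_s / count
    else:
        step = 30 * 60                          # 30-minute chunks
        count = math.ceil(duration_s / step)
    chunks = []
    for index in range(count):
        start = index * step
        # Extend each chunk by the overlap, but never past the end of the audio.
        length = min(step + overlap_s, duration_s - start)
        chunks.append((start, length))
    return chunks
```

Each pair maps directly onto ffmpeg's `-ss` (start) and `-t` (length) options.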
Multiple audio files from the same lecture can be sent in a single request: add one `types.Part.from_bytes()` part per file.
Split with ffmpeg when needed:
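DUR=900      # chunk length in seconds (15 minutes)
OVERLAP=10   # seconds shared between consecutive chunks
TOTAL=$(ffprobe -v error -show_entries format=duration -of csv=p=0 input.m4a | cut -d. -f1)
# Step by DUR and record DUR + OVERLAP seconds, so neighbours overlap by exactly OVERLAP seconds
for i in $(seq 0 "$DUR" "$((TOTAL - 1))"); do
  ffmpeg -i input.m4a -ss "$i" -t $((DUR + OVERLAP)) -y "chunk_${i}.mp3"
done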
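Overlapping chunks transcribe the shared seconds twice, so the per-chunk transcripts need to be joined with the duplicate removed. The sketch below is one simple approach and not something the skill prescribes: it assumes one sentence per line and that the overlapping lines match exactly, which real transcripts may not do (a fuzzy comparison could be needed). `stitch` is a hypothetical helper.

```python
from typing import List


def stitch(prev_lines: List[str], next_lines: List[str], max_overlap: int = 20) -> List[str]:
    """Join two chunk transcripts, dropping lines repeated across the boundary."""
    limit = min(max_overlap, len(prev_lines), len(next_lines))
    # Try the longest possible overlap first, then shrink.
    for size in range(limit, 0, -1):
        if prev_lines[-size:] == next_lines[:size]:
            return prev_lines + next_lines[size:]
    # No exact overlap found: keep everything rather than risk losing speech.
    return prev_lines + next_lines
```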
from google import genai
from google.genai import types
import os

if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS") or os.environ.get("GOOGLE_CLOUD_PROJECT"):
    # Vertex AI (service account / ADC): location="global" is required for gemini-3-*
    client = genai.Client(
        vertexai=True,
        project=os.environ.get("GOOGLE_CLOUD_PROJECT", "your-project-id"),
        location="global",
    )
else:
    # Gemini API (API key)
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
Inline bytes avoids the Korean filename UnicodeEncodeError entirely.
# `prompt_text` holds the chosen prompt; `pdf_path` is None when no slides are provided.
with open("lecture.m4a", "rb") as f:
    audio_bytes = f.read()

contents = [
    types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp4"),
    prompt_text,
]

# Put the PDF first so the model sees the slides before the audio.
if pdf_path:
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    contents.insert(0, types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"))

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=contents,
    config=types.GenerateContentConfig(max_output_tokens=65536),
)
print(response.text)
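The inline example hardcodes `audio/mp4` for an `.m4a` file. For the other accepted formats, the MIME type could be picked from the extension, as in this sketch. The `.m4a` mapping comes from the example above; the `.mp3` and `.wav` values are assumptions that should be checked against the API's list of supported audio types. `audio_mime_type` is a hypothetical helper.

```python
from pathlib import Path

# .m4a follows the example above; the other two values are assumptions.
AUDIO_MIME = {
    ".m4a": "audio/mp4",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
}


def audio_mime_type(path: str) -> str:
    """Map a lecture audio filename to the MIME type passed to Part.from_bytes()."""
    suffix = Path(path).suffix.lower()
    if suffix not in AUDIO_MIME:
        raise ValueError(f"Unsupported audio format: {suffix or path}")
    return AUDIO_MIME[suffix]
```

Since only the suffix is inspected, non-ASCII filenames are handled without any special casing.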
File API is only available with API key auth, not Vertex AI.
Korean/non-ASCII filenames cause `UnicodeEncodeError` in httpx. Use `_safe_upload()`:
import os
import shutil
import tempfile
import time

from google.genai.types import FileState


def _safe_upload(client, filepath, ascii_name):
    """Upload with ASCII-safe temp name to avoid httpx UnicodeEncodeError."""
    tmpdir = tempfile.mkdtemp(prefix="stt_")
    try:
        safe_path = os.path.join(tmpdir, ascii_name)
        shutil.copy2(filepath, safe_path)
        return client.files.upload(file=safe_path)
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)


# `client`, `types`, `prompt_text`, and `pdf_path` come from the earlier setup.
audio_file = _safe_upload(client, "노시론1.m4a", "lecture.m4a")

# Poll until the uploaded audio has finished server-side processing.
while audio_file.state == FileState.PROCESSING:
    time.sleep(2)
    audio_file = client.files.get(name=audio_file.name)
if audio_file.state == FileState.FAILED:
    raise RuntimeError("File API could not process the uploaded audio")

# Order of parts: PDF first (if any), then audio, then the prompt.
contents = [audio_file]
if pdf_path:
    pdf_file = _safe_upload(client, pdf_path, "slides.pdf")
    contents.insert(0, pdf_file)
contents.append(prompt_text)

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=contents,
    config=types.GenerateContentConfig(max_output_tokens=65536),
)
print(response.text)

# Delete the uploaded files once the transcript has been printed.
client.files.delete(name=audio_file.name)
if pdf_path:
    client.files.delete(name=pdf_file.name)
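The polling loop in the File API example waits indefinitely while the file is processing. A bounded variant might look like the sketch below. It is deliberately independent of the SDK: `wait_until_ready` is a hypothetical helper that takes any zero-argument callable returning the current state as a string, and the 600-second timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def wait_until_ready(get_state: Callable[[], str],
                     processing: str = "PROCESSING",
                     timeout_s: float = 600.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() until it stops returning `processing`, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state == processing:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {processing} after {timeout_s}s")
        sleep(interval_s)
        state = get_state()
    # The caller decides what to do with the final state (for example ACTIVE or FAILED).
    return state
```

With the SDK, the callable could be something like `lambda: client.files.get(name=audio_file.name).state.name`, assuming the state is exposed as an enum as in the example above.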
import mlx_whisper

# Local, offline transcription on Apple Silicon.
result = mlx_whisper.transcribe(
    "lecture.m4a",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="ko",
)
print(result["text"])
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("lecture.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        response_format="text",
        language="ko",
    )
print(transcript)
# {Subject} — Lecture Notes
- Model: gemini-3-flash-preview (Vertex AI, global)
- Time: {elapsed}s
- Tokens: {in} → {out}
- Source: {audio_file} {+ pdf_file}
- Mode: STT (Level 4, verbatim, original language)
---
{transcribed content}
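Filling in this template can be done with a small formatter. The sketch below assumes the caller already has the elapsed time, token counts, and source filenames at hand; `render_notes` and its parameter names are hypothetical.

```python
from typing import Sequence


def render_notes(subject: str, model: str, elapsed_s: float,
                 tokens_in: int, tokens_out: int,
                 sources: Sequence[str], body: str) -> str:
    """Render the metadata header from the template above, followed by the transcript."""
    header = [
        f"# {subject} — Lecture Notes",
        f"- Model: {model}",
        f"- Time: {elapsed_s:.0f}s",
        f"- Tokens: {tokens_in} → {tokens_out}",
        # Audio first, then the PDF if one was provided.
        f"- Source: {' + '.join(sources)}",
        "- Mode: STT (Level 4, verbatim, original language)",
        "---",
    ]
    return "\n".join(header + [body])
```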
Use `gemini-3-flash-preview` for best instruction following. With Flash-Lite, add extra emphasis on speech-only transcription.

| Model | Speed | Verbatim Quality | Notes |
| -------------------------- | ----------- | ---------------- | --------------------------------------- |
| Gemini 3 Flash Preview | ~50s/44min | ⭐ Best | Recommended. Speech-only, follows rules |
| Gemini 3.1 Flash-Lite | ~30s/44min | ⚠️ Copies slides | Faster but ignores speech-only rules |
| mlx-whisper turbo | ~4min local | Good (raw) | Offline fallback, no page structure |
| OpenAI gpt-4o-transcribe | — | Good | Diarization support, expensive |
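The Speed column can be read as a real-time factor, meaning seconds of audio handled per second of processing. A small worked sketch for the two Gemini rows follows. It uses only the figures from the table, which are approximate, and `realtime_factor` is a hypothetical helper.

```python
def realtime_factor(audio_minutes: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    return audio_minutes * 60 / processing_seconds


# Figures from the table: a 44-minute lecture in about 50 s and about 30 s.
flash = realtime_factor(44, 50)
flash_lite = realtime_factor(44, 30)
print(f"Flash: ~{flash:.0f}x real time, Flash-Lite: ~{flash_lite:.0f}x real time")
```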
- Non-ASCII filenames: use `_safe_upload()` or inline bytes to avoid `UnicodeEncodeError`
- Vertex AI: `location='global'` is required for gemini-3-* models
# Required
pip install google-genai # Gemini API + Vertex AI (unified SDK)
# Optional
pip install mlx-whisper # Local Whisper (Apple Silicon)
pip install openai # OpenAI diarization
brew install ffmpeg # Audio splitting for long files
# Vertex AI auth
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS