claude/skills/gemini-audio/SKILL.md
Guide for implementing Google Gemini API audio capabilities - analyze audio with transcription, summarization, and understanding (up to 9.5 hours), plus generate speech with controllable TTS. Use when processing audio files, creating transcripts, analyzing speech/music/sounds, or generating natural speech from text.
Install this skill globally with one command: `npx skillsauth add einverne/dotfiles gemini-audio`. Works with Claude Code, Cursor, and Windsurf.
Process audio with transcription, analysis, and understanding, plus generate natural speech using Google's Gemini API. Supports up to 9.5 hours of audio per request with multiple formats.
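Use this skill when you need to process audio files, create transcripts, analyze speech, music, or other sounds, or generate natural speech from text.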
The skill automatically detects your GEMINI_API_KEY in this order:
1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Skill directory: `.claude/skills/gemini-audio/.env`
3. Project root: `./.env`

Get your API key: Visit Google AI Studio
Create .env file with:
GEMINI_API_KEY=your_api_key_here
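The lookup order above can be sketched in a few lines. This is only an illustration of the idea, not the skill's actual implementation: `read_env_file` and `resolve_api_key` are hypothetical names, and the parser assumes plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence


def read_env_file(path: Path, name: str) -> Optional[str]:
    """Return the value of `name` from a simple KEY=VALUE file, if present."""
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing
            return value.strip().strip("'\"") or None
    return None


def resolve_api_key(
    environ: Mapping[str, str] = os.environ,
    env_files: Sequence[Path] = (
        Path(".claude/skills/gemini-audio/.env"),  # skill directory
        Path(".env"),                              # project root
    ),
    name: str = "GEMINI_API_KEY",
) -> Optional[str]:
    """Check the process environment first, then each .env file in order."""
    if environ.get(name):
        return environ[name]
    for path in env_files:
        value = read_env_file(path, name)
        if value:
            return value
    return None
```

A caller could pass the result to `genai.Client(api_key=...)` and raise a clear error when it is `None`.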
Install required package:
pip install google-genai
from google import genai
import os
# API key auto-detected from environment
client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))
# Upload audio file
myfile = client.files.upload(file='podcast.mp3')
# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)
# Summarize
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize the key points in 5 bullets.', myfile]
)
print(response.text)
# Transcribe audio
python .claude/skills/gemini-audio/scripts/transcribe.py audio.mp3
# Summarize audio
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
"Summarize key points"
# Analyze specific segment (timestamps in MM:SS format)
python .claude/skills/gemini-audio/scripts/analyze.py audio.mp3 \
"What is discussed from 02:30 to 05:15?"
# Generate speech
python .claude/skills/gemini-audio/scripts/generate-speech.py \
"Welcome to our podcast" \
--output welcome.wav
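When segment boundaries come from code rather than being typed by hand, a small helper can build the MM:SS prompt used in the segment example above. This is a sketch only: `to_mmss` and `segment_prompt` are hypothetical helpers, and minutes are allowed to run past 59 for long recordings.

```python
def to_mmss(seconds: int) -> str:
    """Format a non-negative offset in seconds as MM:SS."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def segment_prompt(start_seconds: int, end_seconds: int,
                   question: str = "What is discussed") -> str:
    """Build a prompt that points the model at one time range."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    return f"{question} from {to_mmss(start_seconds)} to {to_mmss(end_seconds)}?"


print(segment_prompt(150, 315))  # What is discussed from 02:30 to 05:15?
```

The resulting string can be passed as the second argument to `analyze.py` or placed in `contents` next to the uploaded file.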
| Format | MIME Type | Best Use |
|--------|-----------|----------|
| WAV | audio/wav | Uncompressed, highest quality |
| MP3 | audio/mp3 | Compressed, widely compatible |
| AAC | audio/aac | Compressed, good quality |
| FLAC | audio/flac | Lossless compression |
| OGG Vorbis | audio/ogg | Open format |
| AIFF | audio/aiff | Apple format |
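The table above translates naturally into a lookup for the `mime_type` argument when sending audio inline. The sketch below is a hypothetical helper; the file-extension keys are an assumption, and only the MIME strings come from the table.

```python
from pathlib import Path

# MIME types from the table above; the extension keys are assumed
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aac": "audio/aac",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
    ".aiff": "audio/aiff",
}


def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {filename!r}")
    return AUDIO_MIME_TYPES[suffix]


print(audio_mime_type("podcast.mp3"))  # audio/mp3
```

Failing early on an unknown extension avoids sending a request that the API would reject.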
| Model | Quality | Speed | Cost/1M tokens |
|-------|---------|-------|----------------|
| gemini-2.5-flash-native-audio-preview-09-2025 | High | Fast | $10 |
| gemini-2.5-pro TTS mode | Premium | Slower | $20 |
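import wave
from google.genai import types
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio: Welcome to today\'s episode, in a warm, friendly tone.',
    config=types.GenerateContentConfig(response_modalities=['AUDIO'])
)
# Save audio output: the audio arrives as raw PCM bytes (assumed 24 kHz, 16-bit, mono)
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open('output.wav', 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(pcm)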
# Upload and reuse
myfile = client.files.upload(file='large-audio.mp3')
# Use file multiple times
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
from google.genai import types
# Read the audio into memory and send it inline with the prompt
with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
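Choosing between the upload path and the inline path can be reduced to a size check. The sketch below is illustrative: `choose_transfer` is a hypothetical helper, and the 20 MB threshold is an assumption to verify against the current Gemini API limits. The limit applies to the whole request, so a lower cutoff may be safer.

```python
from pathlib import Path

# Assumption: 20 MB ceiling for inline requests; check the current API limits
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_transfer(size_bytes: int, inline_limit: int = INLINE_LIMIT_BYTES) -> str:
    """Return 'inline' for small payloads and 'upload' for everything else."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "inline" if size_bytes < inline_limit else "upload"


def choose_transfer_for_file(path: str, inline_limit: int = INLINE_LIMIT_BYTES) -> str:
    """Apply the same rule to a file on disk, using its size in bytes."""
    return choose_transfer(Path(path).stat().st_size, inline_limit)
```

With this in place, `'upload'` would route to `client.files.upload(...)` and `'inline'` to `types.Part.from_bytes(...)`. Uploading is also the better choice when the same file is queried more than once.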
python scripts/transcribe.py meeting.mp3 --include-timestamps
python scripts/analyze.py interview.wav "Extract main topics and key quotes"
python scripts/analyze.py discussion.mp3 "Identify speakers and extract dialogue"
python scripts/analyze.py podcast.mp3 "Summarize content from 10:30 to 15:45"
python scripts/analyze.py ambient.wav "Identify all sounds: voices, music, ambient"
- `gemini-2.5-flash` ($1/1M tokens) for most tasks
- `gemini-2.5-pro` ($3/1M tokens) for complex analysis

Audio input is tokenized at 32 tokens per second.
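The 32 tokens/second rate and the per-model prices quoted above are enough for a rough input-cost estimate. The sketch below uses those figures as given in this guide; the prices should be treated as illustrative and checked against current pricing, and the function names are hypothetical.

```python
import math

TOKENS_PER_SECOND = 32  # audio input rate quoted in this guide

# USD per 1M tokens, as quoted in this guide (illustrative)
PRICE_PER_MILLION_USD = {
    "gemini-2.5-flash": 1.00,
    "gemini-2.5-pro": 3.00,
}


def audio_tokens(duration_seconds: float) -> int:
    """Number of input tokens consumed by audio of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def estimate_input_cost_usd(duration_seconds: float, model: str) -> float:
    """Estimated cost of the audio input alone, excluding prompt and output tokens."""
    return audio_tokens(duration_seconds) * PRICE_PER_MILLION_USD[model] / 1_000_000


one_hour = 60 * 60
print(audio_tokens(one_hour))                                          # 115200
print(f"{estimate_input_cost_usd(one_hour, 'gemini-2.5-flash'):.4f}")  # 0.1152
print(audio_tokens(9.5 * one_hour))                                    # 1094400
```

At this rate the 9.5-hour maximum corresponds to roughly 1.09 million input tokens.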
Model Pricing:
TTS Pricing:
For detailed information, see:
- `references/api-reference.md` - Complete API specifications
- `references/code-examples.md` - Comprehensive code examples
- `references/tts-guide.md` - Text-to-speech implementation guide
- `references/best-practices.md` - Advanced optimization strategies

All scripts support the 3-step API key detection described above.
Run any script with --help for detailed usage.