skills/ai-multimodal/SKILL.md
Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
npx skillsauth add devattom/.claude ai-multimodalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation | |------|:-----:|:-----:|:-----:|:--------:|:----------:| | Transcription | ✓ | - | ✓ | - | - | | Summarization | ✓ | ✓ | ✓ | ✓ | - | | Q&A | ✓ | ✓ | ✓ | ✓ | - | | Object Detection | - | ✓ | ✓ | - | - | | Text Extraction | - | ✓ | - | ✓ | - | | Structured Output | ✓ | ✓ | ✓ | ✓ | - | | Creation | TTS | - | - | - | ✓ | | Timestamps | ✓ | - | ✓ | - | - | | Segmentation | - | ✓ | - | - | - |
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
export GEMINI_API_KEY="your-key".env.claude/.env.claude/skills/.env.claude/skills/ai-multimodal/.envGet API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
Install SDK (auto-managed via venv):
# The wrapper script auto-creates and manages a virtual environment
# No manual installation needed - just run the scripts via run.sh
Use the wrapper script scripts/run.sh which auto-manages the venv:
Transcribe Audio:
./scripts/run.sh gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image:
./scripts/run.sh gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video:
./scripts/run.sh gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF:
./scripts/run.sh gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
./scripts/run.sh gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
Generate Architecture Diagram (best with gemini-3-pro-image-preview):
./scripts/run.sh gemini_batch_process.py \
--task generate \
--prompt "Create an architecture diagram showing [system name] for developers.
Topic: [What to visualize]
Layout: Isometric view with clear layers
Visual Style: Flat vector art, modern tech aesthetic
Format: 16:9
Key Elements: [Component list with icons]
Text Labels: [Exact labels to include]
Color Coding: Blue=core, Green=data, Orange=external" \
--output docs/assets/architecture-diagram \
--model gemini-3-pro-image-preview \
--aspect-ratio 16:9
Tip: For best text label results, keep labels short (max 15 chars) and generate a legend backup.
Optimize Media:
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
Convert Documents to Markdown:
# Convert to PDF
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
For detailed implementation guidance, see:
references/audio-processing.md - Transcription, analysis, TTS
references/vision-understanding.md - Captioning, detection, OCR
references/video-analysis.md - Scene detection, temporal understanding
references/document-extraction.md - PDF processing, structured output
references/image-generation.md - Text-to-image, editing
Input Pricing:
Token Rates:
TTS Pricing:
gemini-2.5-flash for most tasks (best price/performance)media_optimizer.py)Free Tier:
YouTube Limits:
Storage Limits:
Common errors and solutions:
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
media_optimizer.py: Prepare media for Gemini API
document_converter.py: Convert documents to PDF
Run any script with --help for detailed usage.
development
Use when you want to audit a project wiki for quality issues — stale version claims, contradictions between pages, orphan pages, broken wiki links, missing cross-references, or misalignment between wiki content and the actual codebase state.
development
Systematic error debugging with analysis, solution discovery, and verification
development
Structured adversarial debate between AI councillors using Agent Teams to evaluate ideas, plans, or decisions. ALWAYS use when the user says "council", "debate this", "evaluate this idea", "challenge my plan", "stress-test", "devil's advocate", "multiple perspectives", "évaluer cette idée", "débattre", "challenger mon plan", "tester cette décision", or when the user wants rigorous multi-perspective analysis of a proposal, architecture decision, or strategic choice. Each councillor (visionary, critic, pragmatist, innovator, ethicist, domain expert) represents a distinct perspective and they challenge each other through cross-examination and peer exchange, producing a nuanced verdict (PROCEED / PROCEED WITH CONDITIONS / RECONSIDER / DO NOT PROCEED). Do NOT use for divergent brainstorming or idea generation — use workflow-brainstorm instead.
testing
Automated CI/CD pipeline fixer - watches CI, fixes errors locally, commits, and loops until green. Use when CI is failing and you want to automatically fix and verify changes.