skills_all/ai-multimodal/SKILL.md
Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
npx skillsauth add activer007/ordinary-claude-skills ai-multimodalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation | |------|:-----:|:-----:|:-----:|:--------:|:----------:| | Transcription | ✓ | - | ✓ | - | - | | Summarization | ✓ | ✓ | ✓ | ✓ | - | | Q&A | ✓ | ✓ | ✓ | ✓ | - | | Object Detection | - | ✓ | ✓ | - | - | | Text Extraction | - | ✓ | - | ✓ | - | | Structured Output | ✓ | ✓ | ✓ | ✓ | - | | Creation | TTS | - | - | - | ✓ | | Timestamps | ✓ | - | ✓ | - | - | | Segmentation | - | ✓ | - | - | - |
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
export GEMINI_API_KEY="your-key".env.claude/.env.claude/skills/.env.claude/skills/ai-multimodal/.envGet API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
Install SDK:
pip install google-genai python-dotenv pillow
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
python scripts/gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
Optimize Media:
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
Convert Documents to Markdown:
# Convert to PDF
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
For detailed implementation guidance, see:
references/audio-processing.md - Transcription, analysis, TTS
references/vision-understanding.md - Captioning, detection, OCR
references/video-analysis.md - Scene detection, temporal understanding
references/document-extraction.md - PDF processing, structured output
references/image-generation.md - Text-to-image, editing
Input Pricing:
Token Rates:
TTS Pricing:
gemini-2.5-flash for most tasks (best price/performance)media_optimizer.py)Free Tier:
YouTube Limits:
Storage Limits:
Common errors and solutions:
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
media_optimizer.py: Prepare media for Gemini API
document_converter.py: Convert documents to PDF
Run any script with --help for detailed usage.
tools
Generate typed TypeScript SDKs for AI agents to interact with MCP servers. Converts verbose JSON-RPC curl commands to clean function calls (docs.createDocument() vs curl). Auto-detects MCP tools from server modules, generates TypeScript types and client methods, creates runnable example scripts. Use when: building MCP-enabled applications, need typed programmatic access to MCP tools, want Claude Code to manage apps via scripts, eliminating manual JSON-RPC curl commands, validating MCP inputs/outputs, or creating reusable agent automation.
testing
Generate structured task lists from specs or requirements. IMPORTANT: After completing ANY spec via ExitSpecMode, ALWAYS ask the user: "Would you like me to generate a task list for this spec?" Use when user confirms or explicitly requests task generation from a plan/spec/PRD.
tools
Create compelling story-format summaries using UltraThink to find the best narrative framing. Support multiple formats - 3-part narrative, n-length with inline links, abridged 5-line, or comprehensive via Foundry MCP. USE WHEN user says 'create story explanation', 'narrative summary', 'explain as a story', or wants content in Daniel's conversational first-person voice.
testing
Navigate through the original three-world shamanic technology. Deploy when soul retrieval, power animal guidance, or journey between realms emerges. Deeply respectful of Tungus, Buryat, Yakut, Evenki traditions. Use for consciousness navigation, NOT cultural appropriation.