Claude-Skills/skills/ai-multimodal/SKILL.md
Process and generate multimedia content using the Google Gemini API. Capabilities: analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), and generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
npx skillsauth add nordeim/prompt-engineering ai-multimodal
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
1. Environment variable: export GEMINI_API_KEY="your-key"
2. .env
3. .claude/.env
4. .claude/skills/.env
5. .claude/skills/ai-multimodal/.env
Get API key: https://aistudio.google.com/apikey
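The lookup order above could be implemented roughly as follows. This is a minimal sketch, not the skill's actual code: the function names `read_env_file` and `find_api_key` are hypothetical, the `.env` paths are assumed to be relative to the project root, and the parser handles only simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Lookup order from the list above; paths assumed relative to the project root.
ENV_FILES = [
    ".env",
    ".claude/.env",
    ".claude/skills/.env",
    ".claude/skills/ai-multimodal/.env",
]


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_api_key(root: Path = Path("."),
                 environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first GEMINI_API_KEY found, or None if nothing is set."""
    environ = os.environ if environ is None else environ
    key = environ.get("GEMINI_API_KEY")
    if key:
        return key  # the process environment wins over any file
    for rel in ENV_FILES:
        candidate = root / rel
        if candidate.is_file():
            key = read_env_file(candidate).get("GEMINI_API_KEY")
            if key:
                return key
    return None
```

Calling `find_api_key()` from the project root returns the key from the highest-priority source, so an exported variable overrides any `.env` file.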
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
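As an illustration of how these variables might select a backend, the sketch below turns them into keyword arguments for `genai.Client` from the `google-genai` SDK. The helper name `client_kwargs` is hypothetical, and treating `us-central1` as the default location is an assumption based on the example value above, not confirmed behavior of the skill's scripts.

```python
import os
from typing import Any, Dict, Mapping, Optional


def client_kwargs(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for genai.Client from environment variables."""
    environ = os.environ if environ is None else environ
    use_vertex = environ.get("GEMINI_USE_VERTEX", "").strip().lower() == "true"
    if use_vertex:
        project = environ.get("VERTEX_PROJECT_ID")
        if not project:
            raise ValueError("VERTEX_PROJECT_ID is required when GEMINI_USE_VERTEX=true")
        return {
            "vertexai": True,
            "project": project,
            # Assumed default; the section above marks VERTEX_LOCATION as optional.
            "location": environ.get("VERTEX_LOCATION", "us-central1"),
        }
    api_key = environ.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set")
    return {"api_key": api_key}


# Usage (requires `pip install google-genai`):
#   from google import genai
#   client = genai.Client(**client_kwargs())
```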
Install SDK:
pip install google-genai python-dotenv pillow
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
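When consuming the extracted JSON downstream, it can help to parse defensively. The sketch below assumes the saved output may be either bare JSON or JSON wrapped in a Markdown code fence, which is a common model habit but not something this document guarantees. `parse_model_json` is a hypothetical helper, not part of the skill.

```python
import json
import re

TICKS = "`" * 3  # built indirectly so the example nests cleanly inside a fenced block
FENCE = re.compile(rf"^{TICKS}(?:json)?\s*\n(.*?)\n{TICKS}$", re.DOTALL)


def parse_model_json(text: str):
    """Parse JSON text, removing one surrounding Markdown fence if present."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)  # raises json.JSONDecodeError on invalid JSON
```

A caller would read the file written by `--output` and pass its contents to `parse_model_json`, handling `json.JSONDecodeError` if the output is not valid JSON.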
Generate Image:
python scripts/gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
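To drive the invocations above from another Python program, one option is to assemble the argument list and hand it to `subprocess`. The sketch below uses only the flags shown in this document; `build_command` is a hypothetical helper, and it passes one file per invocation because how `--files` handles several paths is not specified here.

```python
import sys
from typing import List, Optional

SCRIPT = "scripts/gemini_batch_process.py"


def build_command(task: str,
                  file: Optional[str] = None,
                  prompt: Optional[str] = None,
                  output: Optional[str] = None,
                  model: Optional[str] = None,
                  fmt: Optional[str] = None,
                  aspect_ratio: Optional[str] = None) -> List[str]:
    """Assemble an argv list for gemini_batch_process.py."""
    cmd = [sys.executable, SCRIPT]
    if file is not None:
        cmd += ["--files", file]
    cmd += ["--task", task]
    optional = [
        ("--prompt", prompt),
        ("--output", output),
        ("--model", model),
        ("--format", fmt),
        ("--aspect-ratio", aspect_ratio),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


# Usage: transcribe several files, one invocation each.
#   import subprocess
#   for path in ["a.mp3", "b.mp3"]:
#       subprocess.run(build_command("transcribe", file=path,
#                                    model="gemini-2.5-flash"), check=True)
```

Passing a list rather than a shell string avoids quoting problems with prompts that contain spaces or quotes.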
Optimize Media:
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
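For intuition about what a target size implies, the arithmetic below estimates the video bitrate needed to fit a clip into a given size. This is a generic calculation, not a description of how `media_optimizer.py` works; the function name, the 128 kbps audio allowance, and the decimal megabyte (1 MB = 8000 kilobits) are all assumptions.

```python
def target_video_bitrate_kbps(target_mb: float,
                              duration_s: float,
                              audio_kbps: float = 128.0) -> float:
    """Estimate the video bitrate (kbps) that fits target_mb for a clip of duration_s."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    total_kbps = target_mb * 8000 / duration_s  # 1 MB = 8000 kilobits (decimal)
    video_kbps = total_kbps - audio_kbps        # reserve room for the audio track
    if video_kbps <= 0:
        raise ValueError("target size is too small for this duration")
    return video_kbps
```

For example, a 100 MB target for an 800-second clip gives 1000 kbps in total, leaving 872 kbps for video after the assumed audio allowance.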
Convert Documents to Markdown:
# Convert to Markdown
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
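A value such as `1-20` has to be turned into page indices before extraction. The sketch below shows one plausible interpretation, assuming pages are 1-based and the range is inclusive; `parse_page_range` is a hypothetical helper and may not match how `document_converter.py` parses `--pages`.

```python
def parse_page_range(spec: str) -> range:
    """Convert '1-20' (1-based, inclusive) or '5' into a range of 0-based indices."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(start - 1, end)
```

Under these assumptions, `1-20` selects 20 pages, with 0-based indices 0 through 19.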
For detailed implementation guidance, see:
references/audio-processing.md - Transcription, analysis, TTS
references/vision-understanding.md - Captioning, detection, OCR
references/video-analysis.md - Scene detection, temporal understanding
references/document-extraction.md - PDF processing, structured output
references/image-generation.md - Text-to-image, editing
Input Pricing:
Token Rates:
TTS Pricing:
gemini-2.5-flash for most tasks (best price/performance)
media_optimizer.py)
Free Tier:
YouTube Limits:
Storage Limits:
Common errors and solutions:
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
media_optimizer.py: Prepare media for Gemini API
document_converter.py: Convert documents to Markdown
Run any script with --help for detailed usage.