skills/image-vision/SKILL.md
Analyze images using LLM vision APIs (Anthropic Claude, OpenAI GPT-4, Google Gemini, Azure OpenAI). Use when tasks require: (1) Understanding image content, (2) Describing visual elements, (3) Answering questions about images, (4) Comparing images, (5) Extracting text from images (OCR). Provides ready-to-use scripts - no custom code needed for simple cases.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add microsoft/amplifier-bundle-skills image-vision
Analyze images using state-of-the-art LLM vision models. Use the provided scripts for most tasks - custom code only needed for advanced scenarios.
→ Read setup.md for one-time environment and API key setup
→ Use "Quick Start" canned scripts below
→ Read patterns.md for advanced patterns
→ Check setup.md for troubleshooting
ALWAYS use the wrapper scripts - they handle venv setup automatically:
# Simple analysis (auto-creates venv on first use)
./vision-analyze.sh <provider> <image_path> <prompt>
# Robust analysis (auto-fallback if provider times out)
./vision-analyze-robust.sh <image_path> <prompt> [timeout_seconds]
The wrapper scripts automatically:
Example usage:
# Analyze a UI screenshot (Anthropic Claude)
./vision-analyze.sh anthropic screenshot.png "Describe any UI bugs or issues you see"
# Extract text (Google Gemini - fastest)
./vision-analyze.sh gemini document.jpg "Extract all text from this image"
# Robust analysis with auto-fallback (tries Gemini → Anthropic → OpenAI)
./vision-analyze-robust.sh photo.png "Describe this image in detail"
# With custom timeout (default is 60 seconds)
./vision-analyze-robust.sh large-image.png "Analyze this" 120
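The fallback behavior described above (try Gemini, then Anthropic, then OpenAI, each with a time limit) can be pictured roughly as follows. This is an illustrative sketch, not the shipped `vision-analyze-robust.sh`; it assumes GNU coreutils `timeout` is on `PATH`, and `ANALYZE_CMD` and `analyze_with_fallback` are hypothetical names introduced here. In practice, keep using the wrapper script.

```shell
# Illustrative sketch only: NOT the shipped vision-analyze-robust.sh.
# Assumes GNU coreutils `timeout` is available on PATH.
# ANALYZE_CMD is a hypothetical override so the loop can be pointed at a stub.
ANALYZE_CMD="${ANALYZE_CMD:-./vision-analyze.sh}"

analyze_with_fallback() {
    image=$1
    prompt=$2
    limit=${3:-60}   # seconds per provider; 60 is the documented default
    for provider in gemini anthropic openai; do
        # `timeout` exits non-zero (124) if the call runs past the limit
        if result=$(timeout "$limit" "$ANALYZE_CMD" "$provider" "$image" "$prompt"); then
            printf '%s\n' "$result"
            return 0
        fi
        echo "provider $provider failed or timed out, trying next" >&2
    done
    return 1         # every provider failed
}
```

The function prints the first successful analysis to stdout and returns non-zero only when all three providers fail, which matches the exit-code contract the failure-handling guidance relies on.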
If you need to call the Python scripts directly, you MUST use the venv Python:
# ❌ WRONG - uses system Python, will fail
python examples/anthropic-vision.py image.png "prompt"
# ✅ CORRECT - uses venv Python
./.venv/bin/python examples/anthropic-vision.py image.png "prompt"
For agents: Always use the wrapper scripts to avoid setup issues.
| Provider | Model | Best For | Speed | Cost |
|----------|-------|----------|-------|------|
| Anthropic | claude-sonnet-4-5 | Latest, balanced quality/speed | Fast | $$ |
| Anthropic | claude-3-opus | Highest quality (older) | Slow | $$$ |
| Anthropic | claude-3-haiku | Fastest, simple tasks | Very Fast | $ |
| OpenAI | gpt-5 | Latest flagship model | Fast | $$$ |
| OpenAI | gpt-4.1 | High-volume production | Fast | $$ |
| Gemini | gemini-2.5-flash | Latest, excellent balance | Very Fast | $ |
| Gemini | gemini-2.5-pro | Large images, best quality | Medium | $$ |
| Azure | (deployment-based) | Enterprise, compliance | Varies | Varies |
Max sizes:
# UI/UX Analysis - High-level layout and spacing
./vision-analyze.sh anthropic app-screenshot.png \
"Analyze this UI for accessibility issues and suggest improvements"
# Bug Identification (use robust for auto-fallback)
./vision-analyze-robust.sh error-state.png \
"What's wrong with this interface? Describe any visual bugs."
# Content Moderation
./vision-analyze.sh openai user-upload.jpg \
"Does this image contain inappropriate content? Yes or no, and explain."
# Document Understanding (Gemini is fastest)
./vision-analyze.sh gemini invoice.png \
"Extract the total amount, date, and vendor name from this invoice"
# Design Review - Layout, color, hierarchy (not typography details)
./vision-analyze-robust.sh mockup.png \
"Provide design feedback on this mockup. Consider layout, color hierarchy, and spacing."
Vision models struggle with precise typography at typical screenshot resolutions:
❌ Unreliable for:
✅ Reliable for:
For Web UI bugs, use this hierarchy:
# 1. Vision for TRIAGE (identify area of concern)
./vision-analyze-robust.sh screenshot.png "Are there any visual inconsistencies in the navigation?"
# 2. Browser inspection for FACTS (if typography/font suspected)
# Use Playwright or DevTools to query computed CSS:
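# const styles = await page.evaluate(() => {
#   const element = document.querySelector("nav a"); // example selector, point it at the suspect element
#   return { fontFamily: getComputedStyle(element).fontFamily };
# });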
# 3. Code investigation for ROOT CAUSE
# grep -rF ".suspicious-class" src/
# 4. Vision for VERIFICATION (after fix applied)
./vision-analyze-robust.sh fixed.png "Is the navigation font now consistent?"
If vision gives contradictory results across 2+ attempts on similar screenshots:
This indicates the issue is too subtle for vision models to detect reliably.
Font/Typography (with caveats):
# Be explicit about what to look for
./vision-analyze.sh anthropic ui.png \
"Look at the navigation text. Do any items have decorative 'feet' at letter ends (serif font)
while others have clean straight edges (sans-serif)? Point out any font style differences."
# Note: Small fonts may be unreliable - verify with browser inspection
Alignment (relative observations):
# Ask for noticeable differences, not pixel precision
./vision-analyze.sh anthropic ui.png \
"Is the bullet (•) noticeably misaligned with the text baseline?
Describe its vertical position relative to the text."
Layout and Spacing:
# Vision is GOOD at this
./vision-analyze.sh anthropic ui.png \
"Compare the spacing between navigation sections. Is it consistent?"
All scripts output to stdout as plain text. The LLM's analysis is printed directly:
$ ./.venv/bin/python examples/anthropic-vision.py screenshot.png "What's in this image?"
This image shows a web application dashboard with a navigation bar at the top,
a sidebar on the left with menu items, and a main content area displaying...
For structured output, modify your prompt:
./.venv/bin/python examples/openai-vision.py data.png \
"Extract data as JSON with keys: title, date, amount"
Use the canned scripts for:
Write custom scripts when you need:
→ See patterns.md for custom script examples
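When a prompt asks for JSON, models sometimes wrap the reply in a Markdown code fence, which breaks a strict JSON parser. A minimal post-processing sketch is shown here; `strip_fences` is a hypothetical helper, not part of the skill, and it assumes the payload itself contains no fence lines.

```shell
# Hypothetical helper, not part of the skill: drop Markdown fence lines
# from stdin so only the payload between them remains.
strip_fences() {
    fence=$(printf '\140\140\140')     # three backticks, written as octal escapes
    sed "/^[[:space:]]*${fence}/d"
}
```

It can be placed between a wrapper script and whatever JSON tool is available, for example by piping the output of `./vision-analyze.sh openai data.png "..."` through `strip_fences` before parsing. Output that is already plain JSON passes through unchanged.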
| ❌ Don't | ✅ Do |
|----------|-------|
| Write custom script for simple analysis | Use canned scripts |
| Use low-quality compressed images | Use clear, high-res images |
| Ask vague questions | Be specific in prompts |
| Forget to set API keys | Set keys in environment variables |
| Mix up provider-specific model names | Check provider comparison table |
| Task | Command |
|------|---------|
| Analyze (single provider) | ./vision-analyze.sh anthropic img.png "prompt" |
| Analyze (auto-fallback) | ./vision-analyze-robust.sh img.png "prompt" |
| Extract text (OCR) | ./vision-analyze.sh gemini img.png "Extract all text" |
| Health check | ./health-check.sh |
| Compare images | See patterns.md for custom script |
| Batch process | See patterns.md for custom script |
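For small batches, a plain shell loop over the robust wrapper may be enough before reaching for a custom script. The sketch below is an assumption-laden illustration, not the approach from patterns.md: `analyze_dir` and `VISION_CMD` are hypothetical names, only `.png` files are considered, and images are processed one at a time.

```shell
# Sketch of a sequential batch run. VISION_CMD is a hypothetical override
# (defaults to the robust wrapper) so the loop can be exercised with a stub.
VISION_CMD="${VISION_CMD:-./vision-analyze-robust.sh}"

analyze_dir() {
    dir=$1
    prompt=$2
    failed=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue            # glob matched nothing
        if "$VISION_CMD" "$img" "$prompt" > "${img%.png}.txt"; then
            echo "ok: $img"
        else
            echo "FAILED: $img" >&2
            rm -f "${img%.png}.txt"          # do not leave partial output behind
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                      # non-zero status if any image failed
}
```

Each analysis lands in a `.txt` file next to its image, and a failed image leaves no output file, so a later step cannot mistake an error for a real analysis.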
READ THIS BEFORE USING THIS SKILL:
# For AI agents (recommended) - auto-fallback on timeout
~/.amplifier/skills/image-vision/vision-analyze-robust.sh <image_path> <prompt>
# Single provider (faster if you know which to use)
~/.amplifier/skills/image-vision/vision-analyze.sh <provider> <image_path> <prompt>
Examples:
# Robust analysis (tries multiple providers if timeout)
~/.amplifier/skills/image-vision/vision-analyze-robust.sh screenshot.png "Analyze this UI"
# Specific provider
~/.amplifier/skills/image-vision/vision-analyze.sh anthropic screenshot.png "Describe this"
# Correct usage pattern
OUTPUT=$(~/.amplifier/skills/image-vision/vision-analyze-robust.sh image.png "Analyze this" 2>&1)
EXIT_CODE=$?
if [ "$EXIT_CODE" -eq 0 ]; then
    echo "Vision analysis succeeded"
    # Now you can use $OUTPUT
else
    echo "ERROR: Vision analysis failed (exit code: $EXIT_CODE)"
    echo "Error details: $OUTPUT"
    # STOP HERE - do NOT proceed
    exit 1
fi
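The pattern above merges stderr into `$OUTPUT` with `2>&1`, which is handy for error reporting but also mixes any warnings into the analysis text on a successful run. A possible variant keeps the two streams apart. This is a sketch: `run_capture` is a hypothetical helper, and it assumes the scripts write diagnostics to stderr, which this document does not confirm.

```shell
# Sketch: capture stdout and stderr separately for any command.
# run_capture is a hypothetical helper, not part of the skill.
# Sets OUTPUT (stdout), ERRORS (stderr) and EXIT_CODE, and returns EXIT_CODE.
run_capture() {
    err_file=$(mktemp) || return 1
    OUTPUT=$("$@" 2>"$err_file")
    EXIT_CODE=$?
    ERRORS=$(cat "$err_file")
    rm -f "$err_file"
    return "$EXIT_CODE"
}
```

A call such as `run_capture ~/.amplifier/skills/image-vision/vision-analyze-robust.sh image.png "Analyze this"` would then leave the analysis in `$OUTPUT` and the diagnostics in `$ERRORS`. The rule stays the same: on a non-zero exit code, report the failure and stop.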
If vision analysis fails, you MUST:
✅ DO:
❌ NEVER:
Example of CORRECT failure handling:
Agent: I attempted to analyze the 3 screenshots using the image-vision skill:
- screenshot-1.png: ✗ Anthropic timed out (60s)
- screenshot-1.png: ✗ Gemini timed out (60s)
- screenshot-1.png: ✗ OpenAI failed (API error)
I have NOT successfully analyzed any of the screenshots. I cannot provide visual design
feedback without actually seeing the images.
Options:
1. Retry with different settings
2. Investigate why all providers are failing
3. Defer visual analysis until the issue is resolved
I will NOT write design analysis documents based on guesswork or context alone.
Vision API calls typically take 5-60 seconds:
The wrapper scripts handle timeouts with:
If still hitting timeouts:
For interactive use:
cd image-vision && uv venv
uv pip install anthropic openai google-generativeai
Set API keys: ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
For agents:
./health-check.sh
→ See setup.md for complete instructions
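Before a first run it can help to see which of the three provider keys are actually set. The following is a minimal sketch; `missing_keys` is a hypothetical helper and does not replace `./health-check.sh`, whose checks are not described here.

```shell
# Sketch: print the name of each provider key that is unset or empty.
# missing_keys is a hypothetical helper, not part of the skill.
missing_keys() {
    for name in ANTHROPIC_API_KEY OPENAI_API_KEY GOOGLE_API_KEY; do
        eval "value=\${$name:-}"             # indirect lookup of the variable named in $name
        [ -n "$value" ] || echo "$name"
    done
}
```

An empty result means all three keys are present. A single provider key may be enough if you only call `vision-analyze.sh` with that provider.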
setup.md: One-time environment setup, API keys, troubleshooting
patterns.md: Advanced patterns: batch processing, multi-turn, custom output