.claude/skills/vision-multimodal/SKILL.md
Vision and multimodal capabilities for Claude including image analysis, PDF processing, and document understanding. Activate for image input, base64 encoding, multiple images, and visual analysis.
npx skillsauth add markus41/claude vision-multimodalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Leverage Claude's vision capabilities for image analysis, document processing, and multimodal understanding.
| Format | Status | Best For | |--------|--------|----------| | JPEG | ✓ | Photos, natural scenes | | PNG | ✓ | Screenshots, UI, text | | GIF | ✓ | Animated (first frame) | | WebP | ✓ | Modern, compressed | | PDF | ✓ | Documents (via Files API) |
import anthropic
import base64
client = anthropic.Anthropic()
# Load and encode image
with open("image.jpg", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
}]
)
import httpx
# Fetch and encode from URL
image_url = "https://example.com/image.jpg"
response = httpx.get(image_url)
image_data = base64.standard_b64encode(response.content).decode("utf-8")
# Then use same pattern as above
# Compare multiple images (up to 100 per request)
messages = [{
"role": "user",
"content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image1}},
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image2}},
{"type": "text", "text": "Compare these two images and list the differences."}
]
}]
# Teach by example
messages = [
# Example 1
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Landscape\nElements: Mountains, lake, trees"},
# Example 2
{"role": "user", "content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Classify this image."}
]},
{"role": "assistant", "content": "Category: Portrait\nElements: Person, indoor, professional"},
# Target image
{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": target_image}},
{"type": "text", "text": "Classify this image."}
]}
]
# Using Files API (beta)
with open("document.pdf", "rb") as f:
pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": [
{
"type": "document",
"source": {
"type": "base64",
"media_type": "application/pdf",
"data": pdf_data
}
},
{"type": "text", "text": "Summarize this document."}
]
}]
)
prompt = """You have perfect vision and exceptional attention to detail,
making you an expert at analyzing technical diagrams.
Analyze this architecture diagram and identify:
1. All components
2. Data flow between components
3. Potential bottlenecks"""
prompt = """Before answering, analyze the image systematically:
<thinking>
1. What is the overall subject?
2. What are the key elements?
3. How do elements relate to each other?
4. What details stand out?
</thinking>
Then provide your answer based on this analysis."""
prompt = """Extract information from this receipt and return as JSON:
{
"vendor": "",
"date": "",
"items": [{"name": "", "price": 0}],
"total": 0
}"""
from PIL import Image
import io
def optimize_for_claude(image_path, max_dimension=1568):
"""Resize image to reduce token usage by 30-50%"""
with Image.open(image_path) as img:
# Calculate new dimensions
ratio = min(max_dimension / img.width, max_dimension / img.height)
if ratio < 1:
new_size = (int(img.width * ratio), int(img.height * ratio))
img = img.resize(new_size, Image.LANCZOS)
# Convert to bytes
buffer = io.BytesIO()
img.save(buffer, format="JPEG", quality=85)
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
prompt = """Extract all text from this image.
Preserve the original formatting and structure as much as possible.
If text is unclear, indicate with [unclear]."""
prompt = """Extract the table data from this image.
Return as a markdown table with proper headers and alignment."""
prompt = """Analyze this chart:
1. What type of chart is this?
2. What are the axes/labels?
3. What are the key data points?
4. What trends or patterns are visible?"""
development
Enhanced plan-authoring skill with Pre-Writing context gathering, task metadata, non-TDD templates, Red Flags, telemetry, and an automated plan linter. Use when you have a spec or requirements for a multi-step task, before touching code.
tools
Documentation intelligence engine with graph-based API docs, algorithm library, and drift detection
tools
Ultraplan cloud planning — kick off a plan in the cloud from your terminal, review and revise in the browser, then execute remotely or send back to CLI
tools
--- name: mcp description: Configure MCP servers for Claude Code — stdio vs HTTP, authentication, Tools/Resources/Prompts distinction, channels (CI webhook, mobile relay, Discord bridge, fakechat), and cost of always-loaded tools. Use this skill whenever adding an MCP server, debugging connection issues, choosing between MCP Tools vs Prompts vs Resources, installing channel servers, or managing .mcp.json. Triggers on: "MCP server", "mcp config", "add Obsidian MCP", "install context7", "channels"