dist/plugins/ai-provider-claude-vision/skills/ai-provider-claude-vision/SKILL.md
Image understanding and document analysis with Claude's multimodal capabilities -- image input formats, PDF processing, multi-image patterns, structured extraction, and token cost estimation
npx skillsauth add agents-inc/skills ai-provider-claude-vision
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Quick Guide: Use type: "image" content blocks for images (base64, URL, or file_id) and type: "document" content blocks for PDFs. Supported image formats: JPEG, PNG, GIF, WebP. Placing images before text in the content array improves results. Token cost formula: tokens = (width * height) / 750. Images are auto-resized if the long edge exceeds 1568px or the image would cost more than ~1600 tokens. PDFs use type: "document" with media_type: "application/pdf". No OCR library needed -- Claude reads text directly from images and PDFs.
<critical_requirements>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use type: "image" for images and type: "document" for PDFs -- they are different content block types)
(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)
(You MUST always provide max_tokens in every request -- it is required and has no default)
(You MUST iterate over response.content blocks -- never assume a single text block in the response)
(You MUST use named constants for max_tokens, token budgets, and pixel limits -- no magic numbers)
</critical_requirements>
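The requirement to iterate over response.content can be sketched as a small helper. This is a minimal illustration, not the SDK's own API: the TextBlock, OtherBlock, and ContentBlock types and the collectText name are local stand-ins introduced here, covering only the fields the helper reads.

```typescript
// Local stand-ins for response content blocks (assumption: only the fields
// used here; the SDK's real types carry more variants and fields).
type TextBlock = { type: "text"; text: string };
type OtherBlock = { type: string; [key: string]: unknown };
type ContentBlock = TextBlock | OtherBlock;

// Type guard: true only for blocks that actually carry a text string.
function isTextBlock(block: ContentBlock): block is TextBlock {
  return block.type === "text" && typeof (block as TextBlock).text === "string";
}

// Walk every block instead of reading content[0].text directly.
function collectText(blocks: readonly ContentBlock[]): string {
  return blocks
    .filter(isTextBlock)
    .map((block) => block.text)
    .join("\n");
}
```

Calling collectText(response.content) would then return the concatenated text and skip any non-text blocks without throwing.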
Auto-detection: Claude vision, image analysis, image input, base64 image, URL image, type image, type document, media_type image/jpeg, media_type image/png, image/webp, image/gif, application/pdf, PDF processing, document extraction, multimodal, multi-image, image comparison, chart analysis, screenshot analysis, image understanding, visual content, vision API
When to use:
Key patterns covered:
When NOT to use:
Claude's vision capabilities treat images and documents as first-class content blocks alongside text. There is no separate "vision API" -- you add image or document blocks to the same Messages API you already use for text.
Core principles:
Images and documents go in the messages array as content blocks, interleaved with text. They are not uploaded separately or referenced by URL-only.
Visual content first: just as putting documents first and the query last improves text prompts, Claude processes visual content better when it sees the image before the question.
Token cost: tokens = (width * height) / 750. Downsizing images before sending saves tokens without losing meaningful detail for most use cases.
When to use vision:
When NOT to use:
Read a local file, encode it to base64, and send it as a type: "image" content block. Place the image block before the text block.
// Image block first, text prompt second, iterate response content blocks
content: [
{
type: "image",
source: { type: "base64", media_type: "image/png", data: imageData },
},
{ type: "text", text: "Describe what you see in this image." },
];
Why good: Image before text improves results, explicit media_type, structured content blocks
// BAD: base64 as text string -- Claude cannot interpret raw base64
content: "What's in this image? " + imageData;
Why bad: The base64 data is passed as a text string instead of an image content block, and Claude cannot interpret raw base64 text as an image
See: examples/core.md for full runnable examples with base64, URL, and Files API
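As a rough sketch of the pattern above, the helper below builds an image-first content array from already-encoded base64 data and picks media_type from the file extension. The names mediaTypeForPath and buildImageFirstContent are hypothetical, and reading and encoding the file is assumed to happen elsewhere.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
};
type TextBlock = { type: "text"; text: string };

// Map a file extension to one of the four supported image media types.
function mediaTypeForPath(path: string): ImageMediaType {
  const extension = (path.split(".").pop() || "").toLowerCase();
  switch (extension) {
    case "jpg":
    case "jpeg":
      return "image/jpeg";
    case "png":
      return "image/png";
    case "gif":
      return "image/gif";
    case "webp":
      return "image/webp";
    default:
      throw new Error(`Unsupported image extension: ${extension}`);
  }
}

// Image block first, text prompt second.
function buildImageFirstContent(
  base64Data: string,
  path: string,
  prompt: string,
): Array<ImageBlock | TextBlock> {
  return [
    {
      type: "image",
      source: {
        type: "base64",
        media_type: mediaTypeForPath(path),
        data: base64Data,
      },
    },
    { type: "text", text: prompt },
  ];
}
```

The returned array can be passed as the content of a user message. Unsupported extensions fail early instead of producing a request the API would reject.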
Three source types for images. Choose based on where your image lives.
// URL source -- simplest, smallest payload
source: { type: "url", url: "https://example.com/chart.png" }
// Base64 source -- local files
source: { type: "base64", media_type: "image/jpeg", data: base64String }
// Files API source (beta) -- upload once, reuse across requests
source: { type: "file", file_id: "file_abc123" }
When to use: URL for hosted images, base64 for local files, Files API for multi-turn or repeated use
See: examples/core.md for full examples of each source type
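One way to keep the choice between the three source types in a single place is a small discriminated union. This is a sketch only: ImageLocation, its kind values, and toImageSource are names invented for illustration, while the emitted source shapes follow the three snippets above.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

// The three source shapes shown above.
type ImageSource =
  | { type: "url"; url: string }
  | { type: "base64"; media_type: ImageMediaType; data: string }
  | { type: "file"; file_id: string };

// Hypothetical description of where the image lives.
type ImageLocation =
  | { kind: "hosted"; url: string }
  | { kind: "local"; mediaType: ImageMediaType; base64Data: string }
  | { kind: "uploaded"; fileId: string };

// Pick the source type based on where the image lives.
function toImageSource(location: ImageLocation): ImageSource {
  switch (location.kind) {
    case "hosted":
      return { type: "url", url: location.url };
    case "local":
      return {
        type: "base64",
        media_type: location.mediaType,
        data: location.base64Data,
      };
    case "uploaded":
      return { type: "file", file_id: location.fileId };
  }
}
```

For example, toImageSource({ kind: "hosted", url: "https://example.com/chart.png" }) yields the URL source from the first snippet.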
PDFs use type: "document" -- different from type: "image". Confusing the two is the most common mistake.
// Correct: type "document" for PDFs
{ type: "document", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
// WRONG: type "image" for PDFs -- causes API errors
{ type: "image", source: { type: "base64", media_type: "application/pdf", data: pdfData } }
Why good: type: "document" enables dual processing (text extraction + page rendering)
Why bad: Using type: "image" for PDFs causes API errors. PDFs require type: "document".
See: examples/core.md for base64 and URL PDF examples, examples/extraction.md for PDF caching
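A guard like the following can make the image/document distinction hard to get wrong. It is a minimal sketch assuming base64 input only, and blockForFile and isImageMediaType are hypothetical helper names.

```typescript
const PDF_MEDIA_TYPE = "application/pdf";
const IMAGE_MEDIA_TYPES = [
  "image/jpeg",
  "image/png",
  "image/gif",
  "image/webp",
] as const;
type ImageMediaType = (typeof IMAGE_MEDIA_TYPES)[number];

type FileBlock =
  | {
      type: "image";
      source: { type: "base64"; media_type: ImageMediaType; data: string };
    }
  | {
      type: "document";
      source: { type: "base64"; media_type: "application/pdf"; data: string };
    };

function isImageMediaType(value: string): value is ImageMediaType {
  return (IMAGE_MEDIA_TYPES as readonly string[]).indexOf(value) !== -1;
}

// PDFs become "document" blocks, supported images become "image" blocks,
// and anything else is rejected before a request is built.
function blockForFile(mediaType: string, base64Data: string): FileBlock {
  if (mediaType === PDF_MEDIA_TYPE) {
    return {
      type: "document",
      source: { type: "base64", media_type: PDF_MEDIA_TYPE, data: base64Data },
    };
  }
  if (isImageMediaType(mediaType)) {
    return {
      type: "image",
      source: { type: "base64", media_type: mediaType, data: base64Data },
    };
  }
  throw new Error(`Unsupported media type: ${mediaType}`);
}
```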
Label images with text blocks so Claude can reference them clearly.
content: [
{ type: "text", text: "Image 1:" },
{
type: "image",
source: { type: "base64", media_type: "image/jpeg", data: image1 },
},
{ type: "text", text: "Image 2:" },
{
type: "image",
source: { type: "base64", media_type: "image/jpeg", data: image2 },
},
{
type: "text",
text: "Compare these two images and describe the differences.",
},
];
Why good: Labels let Claude reference specific images unambiguously
Why bad (without labels): Claude may confuse which image is which when no labels are provided
See: examples/core.md for full multi-image example
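For more than two images, the labeling can be generated rather than written by hand. The sketch below assumes base64 inputs, and the LabeledInput type and buildLabeledImageContent name are illustrative, not taken from the skill's example files.

```typescript
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: ImageMediaType; data: string };
};
type TextBlock = { type: "text"; text: string };
type ContentBlock = ImageBlock | TextBlock;

// Hypothetical input shape: one entry per image to compare.
type LabeledInput = { mediaType: ImageMediaType; base64Data: string };

// Emit "Image N:" before each image, then the question last.
function buildLabeledImageContent(
  images: readonly LabeledInput[],
  prompt: string,
): ContentBlock[] {
  const content: ContentBlock[] = [];
  images.forEach((image, index) => {
    content.push({ type: "text", text: `Image ${index + 1}:` });
    content.push({
      type: "image",
      source: {
        type: "base64",
        media_type: image.mediaType,
        data: image.base64Data,
      },
    });
  });
  content.push({ type: "text", text: prompt });
  return content;
}
```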
Token formula: tokens = (width * height) / 750. Auto-resize triggers at 1568px long edge or ~1.15 megapixels.
const TOKENS_PER_PIXEL_DIVISOR = 750;
const MAX_LONG_EDGE_PX = 1568;
const MAX_MEGAPIXELS = 1.15;
function estimateImageTokens(width: number, height: number): number {
let w = width,
h = height;
const longEdge = Math.max(w, h);
const mp = (w * h) / 1_000_000;
if (longEdge > MAX_LONG_EDGE_PX || mp > MAX_MEGAPIXELS) {
const scale = Math.min(
MAX_LONG_EDGE_PX / longEdge,
Math.sqrt(MAX_MEGAPIXELS / mp),
);
w = Math.round(width * scale);
h = Math.round(height * scale);
}
return Math.ceil((w * h) / TOKENS_PER_PIXEL_DIVISOR);
}
// 200x200: ~54 tokens | 1000x1000: ~1334 | 4000x3000: ~1534 with these constants (auto-resized; the API's own cap is roughly 1600 tokens)
Why good: Named constants, accounts for auto-resize, documents the formula
See: examples/core.md for full estimateImageTokens() utility and countTokens() usage, reference.md for the complete size/token/cost table
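As a worked example of the formula, the sketch below turns a token estimate into an approximate dollar cost. It ignores auto-resize for brevity. The price constant is an assumption: USD 3 per million input tokens is inferred from the per-image figures quoted in this skill, so check current pricing before relying on it. The function names are illustrative.

```typescript
const TOKENS_PER_PIXEL_DIVISOR = 750;
const TOKENS_PER_MILLION = 1_000_000;
// Assumption: inferred from this document's per-image cost figures.
const ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 3;

// tokens = (width * height) / 750, rounded up. No auto-resize handling here.
function estimateTokensWithoutResize(width: number, height: number): number {
  return Math.ceil((width * height) / TOKENS_PER_PIXEL_DIVISOR);
}

// Convert a token count into an approximate input cost in USD.
function estimateImageCostUsd(
  tokens: number,
  usdPerMillionTokens: number = ASSUMED_USD_PER_MILLION_INPUT_TOKENS,
): number {
  return (tokens / TOKENS_PER_MILLION) * usdPerMillionTokens;
}
```

For a 1000x1000 image this gives 1334 tokens and roughly $0.004 at the assumed rate, which matches the 1000x1000 figures quoted elsewhere in this skill.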
Combine vision with messages.parse() and Zod schemas for typed extraction.
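import Anthropic from "@anthropic-ai/sdk";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { z } from "zod";
// Named constant for the response budget (no magic numbers); 1024 follows this skill's structured-extraction guidance
const MAX_TOKENS = 1024;
// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();
// receiptImage is assumed to be a base64-encoded JPEG string loaded elsewhere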
const ReceiptData = z.object({
merchant: z.string(),
date: z.string(),
items: z.array(
z.object({ name: z.string(), quantity: z.number(), price: z.number() }),
),
total: z.number(),
currency: z.string(),
});
const response = await client.messages.parse({
model: "claude-sonnet-4-6",
max_tokens: MAX_TOKENS,
messages: [
{
role: "user",
content: [
{
type: "image",
source: {
type: "base64",
media_type: "image/jpeg",
data: receiptImage,
},
},
{
type: "text",
text: "Extract all receipt information from this image.",
},
],
},
],
output_config: { format: zodOutputFormat(ReceiptData) },
});
const receipt = response.parsed_output; // fully typed
Why good: Zod schema for type-safe extraction, messages.parse() for auto-validation, image before text
See: examples/extraction.md for receipt, chart, form, comparison, and multi-document extraction patterns
</patterns>
Image resolution vs token cost:
200x200 -> ~54 tokens ($0.00016/image at Sonnet 4.6 pricing)
1000x1000 -> ~1334 tokens ($0.004/image)
1092x1092 -> ~1590 tokens ($0.0048/image) -- max 1:1 without auto-resize
4000x3000 -> ~1590 tokens (auto-resized to fit 1568px long edge)
Upload once via the Files API and reuse the file_id
Use cache_control: { type: "ephemeral" } when asking multiple questions about the same document
Count tokens (client.messages.countTokens()) before expensive requests to estimate costs
<decision_framework>
Where is your image?
+-- Local file -> Base64 encode with readFileSync().toString("base64")
+-- Public URL -> Use type: "url" source (simplest, smallest payload)
+-- Already uploaded -> Use type: "file" source with file_id (Files API, beta)
+-- Multiple requests -> Upload once via Files API, reuse file_id
What type of file?
+-- JPEG, PNG, GIF, WebP -> type: "image"
+-- PDF -> type: "document" with media_type: "application/pdf"
+-- Other formats -> Convert to a supported format first
What kind of analysis?
+-- Brief description -> 256-512 max_tokens
+-- Detailed analysis -> 1024-2048 max_tokens
+-- Document summarization -> 2048-4096 max_tokens
+-- Structured extraction -> 1024 max_tokens (JSON output is compact)
</decision_framework>
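The max_tokens guidance in the last branch of the decision tree can be captured as named constants. This sketch picks the upper bound of each suggested range, which is a design choice made here rather than something the skill prescribes. AnalysisKind and maxTokensFor are hypothetical names.

```typescript
// Upper bounds of the ranges suggested in the decision framework.
const MAX_TOKENS_BRIEF_DESCRIPTION = 512;
const MAX_TOKENS_DETAILED_ANALYSIS = 2048;
const MAX_TOKENS_DOCUMENT_SUMMARIZATION = 4096;
const MAX_TOKENS_STRUCTURED_EXTRACTION = 1024;

type AnalysisKind =
  | "brief-description"
  | "detailed-analysis"
  | "document-summarization"
  | "structured-extraction";

// Look up the max_tokens budget for a kind of analysis.
function maxTokensFor(kind: AnalysisKind): number {
  switch (kind) {
    case "brief-description":
      return MAX_TOKENS_BRIEF_DESCRIPTION;
    case "detailed-analysis":
      return MAX_TOKENS_DETAILED_ANALYSIS;
    case "document-summarization":
      return MAX_TOKENS_DOCUMENT_SUMMARIZATION;
    case "structured-extraction":
      return MAX_TOKENS_STRUCTURED_EXTRACTION;
  }
}
```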
<red_flags>
High Priority Issues:
Using type: "image" for PDFs -- PDFs require type: "document" with media_type: "application/pdf"
Omitting max_tokens -- required on every request, no default
Medium Priority Issues:
Not using cache_control when asking multiple questions about the same PDF -- each request re-processes the full document
Common Mistakes:
Gotchas & Edge Cases:
The Files API is in beta (betas: ["files-api-2025-04-14"])
</red_flags>
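The cache_control issue flagged above can be avoided with a builder along these lines. It is a sketch under the assumption that the PDF is sent as base64. buildCachedPdfQuestion is a hypothetical name, and the block shape follows the cache_control: { type: "ephemeral" } form used in this skill.

```typescript
type DocumentBlock = {
  type: "document";
  source: { type: "base64"; media_type: "application/pdf"; data: string };
  cache_control?: { type: "ephemeral" };
};
type TextBlock = { type: "text"; text: string };

// Same document block (with cache_control) for every question, so only
// the trailing text block changes between requests.
function buildCachedPdfQuestion(
  pdfBase64: string,
  question: string,
): Array<DocumentBlock | TextBlock> {
  return [
    {
      type: "document",
      source: {
        type: "base64",
        media_type: "application/pdf",
        data: pdfBase64,
      },
      cache_control: { type: "ephemeral" },
    },
    { type: "text", text: question },
  ];
}
```

Because the document block is identical across calls, follow-up questions about the same PDF can reuse the cached prefix instead of re-processing the whole file.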
<critical_reminders>
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use type: "image" for images and type: "document" for PDFs -- they are different content block types)
(You MUST place images and documents BEFORE text in the content array -- Claude performs better with visual content first)
(You MUST always provide max_tokens in every request -- it is required and has no default)
(You MUST iterate over response.content blocks -- never assume a single text block in the response)
(You MUST use named constants for max_tokens, token budgets, and pixel limits -- no magic numbers)
Failure to follow these rules will produce API errors, degraded vision quality, unexpected token costs, or runtime crashes from untyped content blocks.
</critical_reminders>