vision-bench/SKILL.md
Score and compare images using vision LLMs as judges. YAML-defined criteria presets for 11 use cases (text-to-image, photorealism, document OCR, charts, UI, portrait, product, scientific, invoice, alt-text, artistic style). Supports OpenAI, Anthropic, Gemini, Mistral, and OpenRouter as judge providers. Keys auto-decrypted via SOPS + age.
npx skillsauth add glebis/claude-skills vision-bench
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Compare images by scoring them with one or more vision LLM judges against structured rubric criteria.
```bash
# Install dependencies
pip install pyyaml openai anthropic mistralai

# Score a single image
python bench.py image.png --criteria photorealism --judge gemini-2.5-flash

# Compare two AI-generated images
python bench.py img_a.png img_b.png \
  --criteria text_to_image \
  --prompt "a fox in a snowy forest" \
  --judge gpt-4o

# Multi-judge consensus
python bench.py img.png \
  --criteria portrait \
  --judges gpt-4o gemini-2.5-flash claude-opus-4-5-20251022

# OpenRouter models (any vision-capable model)
python bench.py img_a.png img_b.png \
  --criteria artistic_style \
  --judges "openrouter/meta-llama/llama-4-maverick" "openrouter/mistralai/pixtral-large-2411"

# List all presets
python bench.py --list-presets

# Save report to file
python bench.py img.png --criteria chart_analysis --save report.md
```
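The multi-judge run above implies some way of combining per-judge scores. The source does not say how bench.py does this, so the following is only a minimal sketch that assumes a plain per-criterion mean. The function name `consensus`, the criterion names, and the scores are all hypothetical.

```python
from statistics import mean


def consensus(scores_by_judge):
    """Average each criterion's score across judges.

    scores_by_judge maps judge name -> {criterion: score}.
    A criterion a judge did not score is simply left out of that judge's vote.
    """
    per_criterion = {}
    for scores in scores_by_judge.values():
        for criterion, score in scores.items():
            per_criterion.setdefault(criterion, []).append(score)
    return {criterion: mean(values) for criterion, values in per_criterion.items()}


# Hypothetical scores for one image; not real judge output.
example = {
    "gpt-4o": {"lighting": 8, "anatomy": 6},
    "gemini-2.5-flash": {"lighting": 7, "anatomy": 8},
    "claude-opus-4-5-20251022": {"lighting": 9, "anatomy": 7},
}
result = consensus(example)
print(result)
```

A mean is the simplest choice; a median or a weighted vote would be more robust to one outlier judge, and the real tool may use either.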
| Preset | Use Case |
|--------|----------|
| text_to_image | Compare AI image generators (Midjourney, DALL-E, Flux) |
| photorealism | How convincingly an image looks like a photo |
| artistic_style | Style consistency, composition, color harmony |
| portrait | AI-generated portrait quality and realism |
| product_photo | E-commerce product image quality |
| document_ocr | Document text extraction and layout understanding |
| chart_analysis | Chart and data visualization comprehension |
| invoice | Financial document field extraction accuracy |
| ui_screenshot | App/web screenshot understanding |
| scientific | Scientific/medical image accuracy |
| alt_text | Accessibility image description quality |
Custom criteria: pass any .yaml file as --criteria path/to/my.yaml.
| Prefix | Provider | Example |
|--------|----------|---------|
| gpt-, o1, o3, o4 | OpenAI | gpt-4o |
| claude- | Anthropic | claude-sonnet-4-5-20251022 |
| gemini- | Google Gemini | gemini-2.5-flash |
| pixtral-, mistral-, ministral- | Mistral | pixtral-12b-2409 |
| openrouter/ | OpenRouter (any model) | openrouter/meta-llama/llama-4-maverick |
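The prefix table above can be read as a simple routing rule from model name to provider. The sketch below shows one way to express it. `resolve_provider` and `PROVIDER_PREFIXES` are hypothetical names, and judge.py may implement the lookup differently.

```python
# Prefix -> provider pairs copied from the table above.
PROVIDER_PREFIXES = [
    ("openrouter/", "openrouter"),
    ("gpt-", "openai"),
    ("o1", "openai"),
    ("o3", "openai"),
    ("o4", "openai"),
    ("claude-", "anthropic"),
    ("gemini-", "gemini"),
    ("pixtral-", "mistral"),
    ("mistral-", "mistral"),
    ("ministral-", "mistral"),
]


def resolve_provider(model: str) -> str:
    """Return the provider whose prefix matches the model name."""
    for prefix, provider in PROVIDER_PREFIXES:
        if model.startswith(prefix):
            return provider
    raise ValueError(f"No provider known for model {model!r}")


print(resolve_provider("gemini-2.5-flash"))
print(resolve_provider("openrouter/mistralai/pixtral-large-2411"))
```

Because matching is by prefix on the full name, an `openrouter/`-qualified Mistral model is routed to OpenRouter rather than to the Mistral API.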
Keys are loaded from secrets.enc.yaml (SOPS + age encrypted) with fallback to environment variables.
Supported keys: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, OPENROUTER_API_KEY
To encrypt your own keys:
sops --config .sops.yaml --encrypt --input-type yaml --output-type yaml secrets.yaml > secrets.enc.yaml
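The load order described above (decrypt secrets.enc.yaml first, then fall back to environment variables) might look roughly like the sketch below. It assumes the `sops` CLI is on the PATH and that the encrypted file stores the keys as flat top-level entries; neither detail is confirmed by the source. `load_keys` is a hypothetical name and is not vault.py's actual API.

```python
import json
import os
import subprocess

SUPPORTED_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",
)


def load_keys(path="secrets.enc.yaml"):
    """Return API keys from a SOPS-encrypted file, falling back to the environment."""
    decrypted = {}
    try:
        # Ask sops for JSON output so the stdlib can parse it without a YAML library.
        result = subprocess.run(
            ["sops", "--decrypt", "--output-type", "json", path],
            capture_output=True,
            text=True,
            check=True,
        )
        parsed = json.loads(result.stdout)
        if isinstance(parsed, dict):
            decrypted = parsed
    except (OSError, subprocess.CalledProcessError, json.JSONDecodeError):
        # sops not installed, file missing, or decryption failed:
        # use environment variables only.
        pass

    keys = {}
    for name in SUPPORTED_KEYS:
        # Assumption: a value in the decrypted file wins over the environment.
        value = decrypted.get(name) or os.environ.get(name)
        if value:
            keys[name] = value
    return keys
```

Calling `load_keys()` on a machine without sops or without the encrypted file simply returns whatever supported keys are set in the environment.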
Output formats: --output markdown (default), --output json, or --output table.
- bench.py: CLI entry point
- judge.py: Multi-provider LLM judge logic
- report.py: Report generation
- vault.py: SOPS secrets decryption
- criteria/: 11 YAML preset files
- .sops.yaml: Age key config for encryption
- secrets.enc.yaml: Encrypted API keys