skills/clip-aware-embeddings/SKILL.md
Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching".
npx skillsauth add curiositech/windags-skills clip-aware-embeddings
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Smart image-text matching that knows when CLIP works and when to use alternatives.
| MCP | Purpose |
|-----|---------|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |
Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → DETR, Faster R-CNN ✗
├─ Fine-grained ID (celebrities, car models) → Specialized model ✗
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✗
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✗
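✅ Use for: image search, zero-shot classification over broad categories, and image-text similarity matching.
❌ Do NOT use for: counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries.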
pip install transformers pillow torch sentence-transformers --break-system-packages
Validation: Run python scripts/validate_setup.py
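import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)
# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
# L2-normalize so the dot product is a cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# Compute similarity: scale by CLIP's learned temperature, then softmax over the images
similarity = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=0)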
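The ranking step is easier to reason about on toy data. The sketch below is dependency-free and uses made-up 3-dimensional vectors in place of real CLIP embeddings. `l2_normalize`, `rank_images`, and the temperature of 100 are illustrative assumptions (100 is roughly the logit scale of the released CLIP checkpoints). It normalizes both sides so scores are cosine similarities, then applies a softmax over the images.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_images(image_embs, text_emb, temperature=100.0):
    """Return (index, probability) pairs sorted from best to worst match."""
    text = l2_normalize(text_emb)
    sims = [sum(a * b for a, b in zip(l2_normalize(img), text)) for img in image_embs]
    # Softmax over images, shifted by the max logit for numerical stability
    logits = [temperature * s for s in sims]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for embeddings (not real CLIP outputs).
# The third vector has a large norm but points away from the query.
images = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [3.0, 4.0, 0.0]]
query = [1.0, 0.0, 0.0]
ranking = rank_images(images, query)
```

With raw dot products the third vector would score highest purely because of its length. After normalization the first vector, which points almost exactly along the query, ranks first.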
❌ Wrong:
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
Why wrong: CLIP pools the whole image into a single embedding vector, so per-object information is lost. Counts derived from it are unreliable.
✓ Right:
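import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
image = Image.open("street.jpg")  # placeholder path
# Detect objects
with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))
# Convert raw outputs to labelled boxes; target size is (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=[image.size[::-1]], threshold=0.9
)[0]
# Filter for cars and count
labels = [model.config.id2label[i.item()] for i in results["labels"]]
count = labels.count("car")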
How to detect: If query contains "how many", "count", or numeric questions → Use object detection
❌ Wrong:
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
Why wrong: CLIP is trained on web image-caption pairs and separates coarse categories well, but fine-grained distinctions (faces, car models, flower species) require specialized models.
✓ Right:
# Use a fine-tuned face recognition model
from transformers import AutoImageProcessor, AutoModelForImageClassification
processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # Then fine-tune on celebrity dataset
)
# Or use dedicated face recognition: ArcFace, CosFace
How to detect: If query asks to distinguish between similar items in same category → Use specialized model
❌ Wrong:
# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog",
]
# Will give nearly identical scores
Why wrong: CLIP embeddings lose spatial topology. "Left" and "right" are treated as bag-of-words.
✓ Right:
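# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
# NOTE: `swig_model` and `SpatialRelationModel` are placeholders showing the
# interface only; substitute the wrapper for whichever spatial model you use
from swig_model import SpatialRelationModel
model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.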
How to detect: If query contains directional words (left, right, above, under, next to) → Use spatial model
❌ Wrong:
prompts = [
    "red car and blue truck",
    "blue car and red truck",
]
# CLIP often gives similar scores for both
Why wrong: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.
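To see why this prompt pair is hard, consider an idealized encoder that ignores word order entirely. That is a simplification: CLIP's text encoder is a transformer and does see order, it just tends to use it weakly. `bag_of_words` below is a hypothetical helper, not part of any library. Under that idealization both prompts reduce to the same multiset of words:

```python
from collections import Counter

def bag_of_words(prompt):
    """Order-free view of a prompt: a multiset of lowercase words."""
    return Counter(prompt.lower().split())

a = "red car and blue truck"
b = "blue car and red truck"

# The strings differ, but their word multisets are identical
same_text = a == b
same_bag = bag_of_words(a) == bag_of_words(b)
```

Here `same_text` is `False` while `same_bag` is `True`, so any model that leans on word content rather than word order has little signal to separate the two descriptions.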
✓ Right - Use PC-CLIP or DCSMs:
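# PC-CLIP: Fine-tuned for pairwise comparisons
# NOTE: the `pc_clip` module and checkpoint name are placeholders; check the
# PC-CLIP release for the actual loading code
from pc_clip import PCCLIPModel
model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)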
How to detect: If query has multiple objects with different attributes → Use compositional model
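The four "How to detect" rules above can be sketched as a small keyword router. This is a rough illustration, not the logic of `scripts/validate_clip_usage.py`: the function name `route_query`, the return labels, and every keyword list are assumptions, and keyword matching will misfire on some queries (for example "right" used in a non-spatial sense).

```python
import re

# Illustrative vocabularies (assumptions, far from exhaustive)
COUNT_PATTERN = re.compile(r"\b(how many|count|number of)\b")
SPATIAL_PATTERN = re.compile(r"\b(left|right|above|below|under|next to)\b")
FINE_GRAINED_PATTERN = re.compile(r"\b(who is|which model|species|breed)\b")
ATTRIBUTES = {"red", "blue", "green", "yellow", "black", "white", "large", "small"}

def route_query(query):
    """Suggest a model family for a query using simple keyword rules."""
    q = query.lower()
    if COUNT_PATTERN.search(q):
        return "object_detection"
    if SPATIAL_PATTERN.search(q):
        return "spatial_model"
    words = set(re.findall(r"[a-z]+", q))
    # Two or more attribute words joined by "and" hints at attribute binding
    if len(words & ATTRIBUTES) >= 2 and "and" in words:
        return "compositional_model"
    if FINE_GRAINED_PATTERN.search(q):
        return "specialized_model"
    return "clip"

routes = {q: route_query(q) for q in [
    "How many cars are in this image?",
    "cat to the left of dog",
    "red car and blue truck",
    "a beach at sunset",
]}
```

The rules are checked in a fixed order, so a query that matches several of them is routed by the first match. A production check would likely need a learned classifier or an LLM call rather than regular expressions.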
LLM Mistake: LLMs trained mostly on 2021-2023 data tend to suggest CLIP for every image-text task, because its limitations were not yet widely documented. This skill corrects that.
Before using CLIP, check if it's appropriate:
python scripts/validate_clip_usage.py \
--query "your query here" \
--check-all
Returns:
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
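Bare labels like these are commonly wrapped in a short caption template before text encoding, following the "a photo of a {label}" pattern from the original CLIP work. The sketch below shows that step plus a simple confidence cutoff. `build_prompts`, `pick_label`, the `min_score` value, and the example scores are all illustrative assumptions; in practice the scores would come from a real similarity computation.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Wrap bare labels in a caption-style template before text encoding."""
    return [template.format(label) for label in labels]

def pick_label(labels, scores, min_score=0.5):
    """Return the best-scoring label, or None when no score clears min_score."""
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best] if scores[best] >= min_score else None

queries = ["beach", "mountain", "city skyline"]
prompts = build_prompts(queries)

# Made-up probabilities for one image, standing in for real model output
confident = pick_label(queries, [0.82, 0.11, 0.07])
uncertain = pick_label(queries, [0.40, 0.35, 0.25])
```

Returning `None` for a flat score distribution gives the caller a hook to try a broader query or fall back to another model instead of trusting a weak match.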
# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
# Use specialized models
# See /references/fine_grained_models.md
# Use spatial relation models
# See /references/spatial_models.md
Check:
Validation:
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
Possible causes:
Solution: Try broader query or use alternative model
| Model | Best For | Avoid For |
|-------|----------|-----------|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |
CLIP models:
Inference time (single image, CPU):
- /references/clip_limitations.md - Detailed analysis of CLIP's failures
- /references/alternatives.md - When to use what model
- /references/compositional_reasoning.md - DCSMs and PC-CLIP deep dive
- /scripts/validate_clip_usage.py - Pre-flight validation tool
- /scripts/diagnose_clip_issue.py - Debug unexpected results

See CHANGELOG.md for version history.