Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

curiositech/clip-aware-embeddings

Name: clip-aware-embeddings
Author: curiositech

skills/clip-aware-embeddings/SKILL.md

npx skillsauth add curiositech/windags-skills clip-aware-embeddings


CLIP-Aware Image Embeddings

Smart image-text matching that knows when CLIP works and when to use alternatives.

MCP Integrations

| MCP | Purpose |
|-----|---------|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |

Quick Decision Tree

Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → DETR, Faster R-CNN ✗
├─ Fine-grained ID (celebrities, car models) → Specialized model ✗
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✗
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✗

When to Use This Skill

✅ Use for:

Semantic image search
Broad category classification
Image similarity matching
Zero-shot tasks on new categories

❌ Do NOT use for:

Counting objects in images
Fine-grained classification
Spatial understanding
Attribute binding
Negation handling

Installation

pip install transformers pillow torch sentence-transformers --break-system-packages

Validation: Run python scripts/validate_setup.py

Basic Usage

Image Search

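import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images (no gradients needed for inference)
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Search with text (padding=True lets you pass several queries at once)
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# L2-normalize so the dot product is a cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Compute similarity: shape [num_images, num_queries], higher is a better match
similarity = image_features @ text_features.T
best_image_index = similarity[:, 0].argmax().item()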
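To make the similarity step concrete, here is a minimal dependency-free sketch of cosine-similarity ranking. The toy 3-dimensional vectors stand in for real CLIP embeddings, and the helper names `cosine_similarity` and `rank_images` are illustrative assumptions, not part of this skill.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(image_embeddings, text_embedding, top_k=3):
    """Return (image_id, score) pairs sorted by descending similarity."""
    scored = [
        (image_id, cosine_similarity(vec, text_embedding))
        for image_id, vec in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real image embeddings (assumption).
toy_images = {
    "img0.jpg": [0.9, 0.1, 0.0],
    "img1.jpg": [0.1, 0.9, 0.0],
    "img2.jpg": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_images(toy_images, toy_query, top_k=2)
print(ranking)  # img0.jpg first, then img2.jpg
```

Ranking by cosine similarity only needs the relative order of scores, so no softmax or calibration is required for search.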

Common Anti-Patterns

Anti-Pattern 1: "CLIP for Everything"

❌ Wrong:

# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results

Why wrong: CLIP compares one global image embedding against one text embedding. Pooling the image into a single vector discards most per-object and positional detail, so CLIP is unreliable at counting.

✓ Right:

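import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
image = Image.open("street.jpg")  # placeholder path

# Detect objects
with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))

# Convert raw outputs into labeled detections above a confidence threshold
sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=sizes)[0]

# Filter for cars and count
count = sum(model.config.id2label[int(i)] == "car" for i in results["labels"])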

How to detect: If query contains "how many", "count", or numeric questions → Use object detection
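Object detectors typically return one label and one confidence score per detected box, so counting reduces to a filter over those pairs. The sketch below uses toy data. The `count_label` helper, the label map, and the scores are illustrative assumptions rather than output from a real model.

```python
def count_label(labels, scores, id2label, target, min_score=0.9):
    """Count detections of `target` whose confidence is at least `min_score`."""
    return sum(
        1
        for label_id, score in zip(labels, scores)
        if id2label[label_id] == target and score >= min_score
    )

# Toy detections for illustration only (assumption, not real model output).
toy_id2label = {1: "person", 3: "car", 8: "truck"}
toy_labels = [3, 3, 1, 8, 3]
toy_scores = [0.98, 0.95, 0.99, 0.91, 0.42]

print(count_label(toy_labels, toy_scores, toy_id2label, "car"))       # 2
print(count_label(toy_labels, toy_scores, toy_id2label, "car", 0.4))  # 3
```

The confidence threshold changes the count, so it is worth tuning on a few labeled images instead of trusting a default.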

Anti-Pattern 2: Fine-Grained Classification

❌ Wrong:

# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID

Why wrong: CLIP was trained on web-scale image-caption pairs, whose text rarely separates visually similar subclasses. Fine-grained tasks such as faces, car models, and flower species usually require specialized models.

✓ Right:

# Start from a pretrained backbone and fine-tune it on your fine-grained dataset
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # ImageNet weights; fine-tune on the celebrity dataset before use
)
# Or use dedicated face recognition: ArcFace, CosFace

How to detect: If query asks to distinguish between similar items in same category → Use specialized model

Anti-Pattern 3: Spatial Understanding

❌ Wrong:

# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog"
]
# Will give nearly identical scores

Why wrong: CLIP embeddings lose spatial topology, and the text encoder behaves much like a bag-of-words model, so swapping "left" and "right" barely changes the embedding.

✓ Right:

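# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
# Illustrative interface only: treat `swig_model` and `SpatialRelationModel`
# as placeholders and substitute the API of the model you actually use
from swig_model import SpatialRelationModel

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.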

How to detect: If query contains directional words (left, right, above, under, next to) → Use spatial model

Anti-Pattern 4: Attribute Binding

❌ Wrong:

prompts = [
    "red car and blue truck",
    "blue car and red truck"
]
# CLIP often gives similar scores for both

Why wrong: CLIP cannot bind attributes to objects. It sees "red, blue, car, truck" as a bag of concepts.
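As an analogy for the "bag of concepts" behaviour (not a model of CLIP's actual text encoder, which is a transformer), the sketch below shows that the two prompts are indistinguishable once word order is discarded. The `bag_of_words` helper is an illustrative assumption.

```python
from collections import Counter

def bag_of_words(prompt):
    """Order-insensitive word counts for a prompt."""
    return Counter(prompt.lower().split())

prompt_a = "red car and blue truck"
prompt_b = "blue car and red truck"

print(bag_of_words(prompt_a) == bag_of_words(prompt_b))  # True
print(prompt_a == prompt_b)                              # False
```

Any representation that mostly tracks which words are present, and not which attribute attaches to which object, will score these two prompts almost identically.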

✓ Right - Use PC-CLIP or DCSMs:

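# PC-CLIP: Fine-tuned for pairwise comparisons
# Illustrative interface only: treat the `pc_clip` module and the
# "pc-clip-vit-l" checkpoint name as placeholders for your actual setup
from pc_clip import PCCLIPModel

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)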

How to detect: If query has multiple objects with different attributes → Use compositional model

Evolution Timeline

2021: CLIP Released

Revolutionary: zero-shot, 400M image-text pairs
Widely adopted for everything
Limitations not yet understood

2022-2023: Limitations Discovered

Cannot count objects
Poor at fine-grained classification
Fails spatial reasoning
Can't bind attributes

2024: Alternatives Emerge

DCSMs: Preserve patch/token topology
PC-CLIP: Trained on pairwise comparisons
SpLiCE: Sparse interpretable embeddings

2025: Current Best Practices

Use CLIP for what it's good at
Task-specific models for limitations
Compositional models for complex queries

LLM Mistake: LLMs trained mostly on 2021-2023 data tend to suggest CLIP for everything, because its limitations were not yet widely documented. This skill corrects that.

Validation Script

Before using CLIP, check if it's appropriate:

python scripts/validate_clip_usage.py \
    --query "your query here" \
    --check-all

Returns:

✅ CLIP is appropriate
❌ Use alternative (with suggestion)
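The validation script itself is not shown here. As a rough idea of what such a check could look like, the sketch below combines the "How to detect" heuristics from the anti-patterns into one keyword-based router. Everything in it (the `check_query` name, the regular expressions, the small attribute vocabulary) is a hypothetical illustration, not the contents of `scripts/validate_clip_usage.py`.

```python
import re

COUNTING = re.compile(r"\b(how many|count|number of)\b", re.IGNORECASE)
SPATIAL = re.compile(
    r"\b(left of|right of|above|below|under|next to|behind|in front of)\b",
    re.IGNORECASE,
)
# Small illustrative attribute vocabulary; a real check would need far more.
ATTRIBUTES = {"red", "blue", "green", "yellow", "black", "white", "big", "small"}

def check_query(query):
    """Return (clip_ok, suggestion) for a free-text query."""
    if COUNTING.search(query):
        return False, "object detection (e.g. DETR)"
    if SPATIAL.search(query):
        return False, "spatial relation model"
    words = set(re.findall(r"[a-z]+", query.lower()))
    if len(words & ATTRIBUTES) >= 2:
        return False, "compositional model (e.g. PC-CLIP, DCSMs)"
    return True, "CLIP"

for q in [
    "How many cars are in this image?",
    "cat to the left of dog",
    "red car and blue truck",
    "a beach at sunset",
]:
    print(q, "->", check_query(q))
```

Keyword rules like these are cheap but brittle. They miss fine-grained identification entirely, which usually needs knowledge of the label set rather than the query wording.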

Task-Specific Guidance

Image Search (CLIP ✓)

# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts

Zero-Shot Classification (CLIP ✓)

# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
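A minimal sketch of how zero-shot classification turns similarity scores into a label. Each category is wrapped in a prompt, scored against the image, and the scores are passed through a softmax. The similarity values below are made-up toy numbers, and the prompt template and the temperature of 0.01 (standing in for CLIP's learned logit scale) are assumptions.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["indoor", "outdoor", "nature", "urban"]
prompts = [f"an image labeled {c}" for c in categories]

# Toy cosine similarities between one image and each prompt (assumption).
toy_similarities = [0.18, 0.27, 0.24, 0.20]

probs = softmax(toy_similarities, temperature=0.01)
predicted = categories[probs.index(max(probs))]
print(predicted)  # outdoor
```

Without the temperature, the softmax over raw cosine similarities would be nearly uniform, which is why the scaling step matters.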

Object Counting (CLIP ✗)

# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md

Fine-Grained Classification (CLIP ✗)

# Use specialized models
# See /references/fine_grained_models.md

Spatial Reasoning (CLIP ✗)

# Use spatial relation models
# See /references/spatial_models.md

Troubleshooting

Issue: CLIP gives unexpected results

Check:

Is this a counting task? → Use object detection
Fine-grained classification? → Use specialized model
Spatial query? → Use spatial model
Multiple objects with attributes? → Use compositional model

Validation:

python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"

Issue: Low similarity scores

Possible causes:

Query too specific (CLIP works better with broad concepts)
Fine-grained task (not CLIP's strength)
Need to adjust threshold

Solution: Try broader query or use alternative model
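On the threshold point: raw CLIP cosine similarities tend to sit in a narrow, low-looking band, so a fixed absolute cutoff can discard every result. One option, sketched here under assumed toy scores, is a cutoff relative to the best match. The `filter_by_margin` helper and the default `margin` are hypothetical.

```python
def filter_by_margin(scored, margin=0.05):
    """Keep results whose score is within `margin` of the best score."""
    if not scored:
        return []
    best = max(score for _, score in scored)
    return [(item, score) for item, score in scored if best - score <= margin]

# Toy (image, similarity) pairs for illustration (assumption).
toy_scores = [("img0.jpg", 0.29), ("img1.jpg", 0.27), ("img2.jpg", 0.18)]

print(filter_by_margin(toy_scores))
```

A relative cutoff always returns at least the top match, so it should be paired with a sanity floor when "no good result" is a valid answer.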

Model Selection Guide

| Model | Best For | Avoid For |
|-------|----------|-----------|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |

Performance Notes

CLIP models:

ViT-B/32: Fast, lower quality
ViT-L/14: Balanced (recommended)
ViT-g/14: Highest quality, slower

Inference time (single image, CPU; approximate and hardware-dependent):

ViT-B/32: ~100ms
ViT-L/14: ~300ms
ViT-g/14: ~1000ms
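Latency figures like these vary with hardware, so it is usually worth measuring on your own machine. The sketch below times any callable and reports the median. The `measure_latency_ms` helper and the stand-in workload are illustrative assumptions; in practice you would pass a function that embeds one image.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=2, runs=10):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in workload; replace with a call that embeds one image.
def fake_inference():
    return sum(i * i for i in range(10_000))

print(f"{measure_latency_ms(fake_inference):.2f} ms")
```

Warm-up runs matter because the first calls often include one-time costs such as lazy initialization.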


Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching".

tools

Updated Apr 4, 2026

$ install --global

skillsauth

npx skillsauth add curiositech/windags-skills clip-aware-embeddings

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 20, 2026, 8:39 AM · 61.5s · 1 file scanned

SKILL.md

license: Apache-2.0
name: clip-aware-embeddings
description: Semantic image-text matching with CLIP and alternatives. Use for image search, zero-shot classification, similarity matching. NOT for counting objects, fine-grained classification (celebrities, car models), spatial reasoning, or compositional queries. Activate on "CLIP", "embeddings", "image similarity", "semantic search", "zero-shot classification", "image-text matching".
allowed-tools: Read,Write,Edit,Bash
category: AI & Machine Learning
- skill: collage-layout-expert
  reason: Semantic image matching for layouts


Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai

Verified · Trusted · Community

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/redis-patterns-expert

development

Verified · Trusted · Community

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/react-server-components-boundary

tools

Verified · Trusted · Community

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks): leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/rate-limiting-strategy

development

Verified · Trusted · Community

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8 · SKILL.md · Updated Jul 19, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/clip-aware-embeddings ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
