skills/multimodal-embedding-generator/SKILL.md
Generate cross-modal embeddings with CLIP, SigLIP, and ImageBind for text-image-audio search. Activate on: multimodal search, text-to-image search, cross-modal embeddings, CLIP embeddings, visual search. NOT for: text-only embeddings (ai-engineer), image classification (computer-vision-pipeline).
Install this skill globally with one command: `npx skillsauth add curiositech/windags-skills multimodal-embedding-generator`. Works with Claude Code, Cursor, and Windsurf.
Generate unified embeddings across text, images, and audio using CLIP, SigLIP, and ImageBind for cross-modal retrieval and search.
Activate on: "multimodal search", "text-to-image search", "image-to-text retrieval", "cross-modal embeddings", "CLIP embeddings", "visual search engine", "SigLIP", "ImageBind", "find similar images by description"
NOT for: Text-only embedding and RAG (ai-engineer), image classification or object detection (computer-vision-pipeline), or image generation from text (image-generation-workflow-engine)
| Domain | Technologies | Notes |
|--------|-------------|-------|
| Text-Image | SigLIP, CLIP (ViT-L/14, ViT-bigG), OpenCLIP | SigLIP preferred for 2026: better zero-shot accuracy |
| 6-Modality | ImageBind (Meta) | Text, image, audio, depth, thermal, IMU |
| Local Inference | transformers, open_clip, torch | GPU or MPS (Apple Silicon) |
| API-Based | Voyage AI multimodal, Cohere embed-v4 | Managed, no GPU needed |
| Indexing | Pinecone, Qdrant, Weaviate, pgvector | Same vector DB for all modalities |
Text  ──→ [SigLIP Text Encoder]  ──┐
                                   ├──→ [Normalize] ──→ [Vector DB]
Image ──→ [SigLIP Vision Encoder]──┘         │               │
                                       L2 normalize    single index,
                                      to unit sphere   modality in metadata

Query (any modality) ──→ [Encode] ──→ [Vector DB Search] ──→ Results (any modality)
# SigLIP cross-modal embedding
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-large-patch16-384")
processor = AutoProcessor.from_pretrained("google/siglip-large-patch16-384")

def embed_image(image):
    """Return an L2-normalized embedding for a single image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    # squeeze(0) drops only the batch dimension
    return torch.nn.functional.normalize(emb, dim=-1).squeeze(0).numpy()

def embed_text(text: str):
    """Return an L2-normalized embedding for a single text string."""
    # SigLIP was trained with max-length padding, so pad the same way at inference
    inputs = processor(text=text, return_tensors="pt", padding="max_length")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(emb, dim=-1).squeeze(0).numpy()

# Same vector space: cosine similarity works across modalities
Modalities:
Text ───────┐
Image ──────┤
Audio ──────┤
Depth ──────┼──→ [ImageBind Encoder] ──→ [Shared 1024-dim Space] ──→ [Vector DB]
Thermal ────┤
IMU ────────┘
Use case: "Find the video clip that sounds like this audio sample"
Audio query → ImageBind → nearest neighbors → returns video/image/text matches
Document with images
  ├── Text chunks ──→ [Text Embedder] ──────────→ [Vector DB: text namespace]
  └── Figures/diagrams ──→ [SigLIP Vision] ──→ [Vector DB: image namespace]

Query ──→ [Text Embed] ──→ search text namespace ──┐
  └──→ [Vision Embed] ──→ search image namespace ──┼──→ [Rerank + Fuse] ──→ Answer
                                                   │
                                        reciprocal rank fusion
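The fusion step can be sketched as below. This is a minimal illustration of reciprocal rank fusion under stated assumptions: the function name `reciprocal_rank_fusion` is hypothetical, hits from both namespaces are assumed to be mapped to a shared parent ID (here, made-up page IDs) so they can be merged, and `k=60` is a commonly used constant rather than a value given in this skill.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several best-first ranked lists of IDs into one scored list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two namespaces, already mapped to page IDs.
text_hits = ["page-4", "page-2", "page-9"]   # from the text namespace
image_hits = ["page-2", "page-7"]            # from the image namespace

fused = reciprocal_rank_fusion([text_hits, image_hits])
for page_id, score in fused:
    print(f"{page_id}: {score:.5f}")
```

Because the method uses only ranks, it avoids having to calibrate raw similarity scores that come from different embedders.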