skills/embedding-generator/SKILL.md
Generate and manage text embeddings for semantic search, clustering, and similarity tasks
npx skillsauth add jmsktm/claude-settings Embedding Generator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The Embedding Generator skill helps you create, manage, and utilize text embeddings for semantic search, similarity matching, clustering, and classification tasks. It guides you through selecting appropriate embedding models, preprocessing text for optimal vectorization, and storing/querying embeddings efficiently.
Text embeddings transform words, sentences, or documents into dense numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling powerful AI applications like semantic search, recommendations, and content understanding.
This skill covers everything from choosing the right model (OpenAI, Cohere, sentence-transformers, etc.) to implementing production-ready embedding pipelines with proper batching, caching, and quality validation.
```python
# Example pipeline structure
def embedding_pipeline(texts):
    cleaned = preprocess(texts)
    chunks = chunk_if_needed(cleaned)
    batches = create_batches(chunks, batch_size=100)
    embeddings = []
    for batch in batches:
        result = model.embed(batch)
        embeddings.extend(result)
    return embeddings
```
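One way to make the pipeline structure above concrete, with the batching and caching this skill calls for, is sketched below. It is a minimal sketch under stated assumptions: `embed_batch` is a hypothetical stand-in for whatever model or API call you use, `fake_embed_batch` is a placeholder that does not produce real embeddings, and the in-memory dict cache keyed by a SHA-256 hash of the text is an illustrative choice, not something the skill prescribes.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def text_key(text: str) -> str:
    """Stable cache key derived from the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def create_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def embed_with_cache(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    cache: Dict[str, Vector],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts in batches, calling embed_batch only for uncached texts."""
    # Collect texts that are not cached yet, dropping duplicates in this call.
    missing: List[str] = []
    seen = set()
    for text in texts:
        key = text_key(text)
        if key not in cache and key not in seen:
            seen.add(key)
            missing.append(text)

    for batch in create_batches(missing, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise RuntimeError("embed_batch returned a wrong number of vectors")
        for text, vector in zip(batch, vectors):
            cache[text_key(text)] = vector

    # Return vectors in the same order as the input texts.
    return [cache[text_key(text)] for text in texts]


def fake_embed_batch(batch: List[str]) -> List[Vector]:
    """Hypothetical placeholder for a model call; NOT a real embedding."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]
```

Calling `embed_with_cache(texts, fake_embed_batch, cache)` twice with the same `cache` dict should trigger no model calls on the second pass. In production the dict would typically be swapped for a persistent store, and the cache key would also include the model name so that vectors from different models never mix.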
| Action | Command/Trigger |
|--------|-----------------|
| Generate embeddings | "Generate embeddings for these texts" |
| Choose model | "Which embedding model for [use case]" |
| Compare models | "Compare embedding models" |
| Optimize pipeline | "Speed up embedding generation" |
| Validate quality | "Check embedding quality" |
| Chunk documents | "How to chunk for embeddings" |
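- Match Model to Use Case: Query-document search works best with asymmetric models that embed short queries and longer documents differently; clustering and text-to-text similarity work best with symmetric models
- Chunk Intelligently: Texts longer than the model's input limit must be chunked, and where you split affects retrieval quality
- Batch for Efficiency: Each API call carries latency and cost overhead, so send many texts per request
- Cache Embeddings: Don't regenerate what you've already computed
- Normalize Vectors: Cosine similarity ignores vector length, but a plain dot product equals cosine similarity only for unit-length vectors, so normalize before storing when your index uses dot product
- Validate Quality: Spot-check embeddings on known similar and dissimilar pairs before production use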
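The normalization point can be checked numerically. The sketch below uses only the standard library and plain Python lists; the function names are illustrative, and a real pipeline would more likely do this with an array library or a model option that returns unit-length vectors.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]
raw_dot = dot(a, b)                               # depends on vector length
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # matches the cosine
cosine = cosine_similarity(a, b)
```

Here `raw_dot` is 24, while `unit_dot` and `cosine` both come out at about 0.96, which is why normalizing once at write time lets a vector store use the cheaper dot product at query time.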
Combine semantic and size-based chunking:
```python
def hybrid_chunk(text, max_tokens=512):
    # First: Split on semantic boundaries
    sections = split_on_headers_paragraphs(text)
    # Then: Split large sections on size
    chunks = []
    for section in sections:
        if token_count(section) > max_tokens:
            chunks.extend(split_with_overlap(section, max_tokens))
        else:
            chunks.append(section)
    return chunks
```
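The helpers used by `hybrid_chunk` are not defined in the skill, so the following is one possible set of implementations that makes it runnable. The assumptions are loud ones: tokens are approximated by whitespace-separated words (a real pipeline would use the embedding model's own tokenizer), semantic boundaries are blank lines and Markdown-style `#` headers, and the `overlap` parameter with its default of 50 tokens is an illustrative addition.

```python
import re
from typing import List


def token_count(text: str) -> int:
    """Approximate token count using whitespace-separated words."""
    return len(text.split())


def split_on_headers_paragraphs(text: str) -> List[str]:
    """Split on blank lines and before Markdown-style '#' header lines."""
    parts = re.split(r"\n\s*\n|\n(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]


def split_with_overlap(text: str, max_tokens: int, overlap: int = 50) -> List[str]:
    """Sliding window of max_tokens words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def hybrid_chunk(text: str, max_tokens: int = 512, overlap: int = 50) -> List[str]:
    """Split on semantic boundaries first, then by size where a section is too long."""
    chunks = []
    for section in split_on_headers_paragraphs(text):
        if token_count(section) > max_tokens:
            chunks.extend(split_with_overlap(section, max_tokens, overlap))
        else:
            chunks.append(section)
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in at least one chunk, at the cost of embedding some tokens twice.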
Generate multiple query embeddings for robust search:
Original: "machine learning frameworks"
Expanded: [
"machine learning frameworks",
"ML libraries and tools",
"deep learning software",
"AI development platforms"
]
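A sketch of how the expanded queries might be used at search time: embed every variant, score each document against all of them, and keep the best score per document. Taking the maximum is one fusion choice among several (averaging the query vectors is another) and is an assumption here. `toy_embed` is a hypothetical bag-of-words stand-in so the example runs without a model; it does not capture semantic meaning the way a real embedding model would.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

SparseVector = Dict[str, float]


def toy_embed(text: str) -> SparseVector:
    """Hypothetical stand-in for a model: L2-normalized bag of lowercase words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def similarity(a: SparseVector, b: SparseVector) -> float:
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def multi_query_search(
    queries: Sequence[str],
    documents: Sequence[str],
    embed: Callable[[str], SparseVector] = toy_embed,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank documents by their best similarity to any of the query variants."""
    if not queries:
        raise ValueError("at least one query is required")
    query_vecs = [embed(q) for q in queries]
    scored = []
    for doc in documents:
        doc_vec = embed(doc)
        best = max(similarity(q_vec, doc_vec) for q_vec in query_vecs)
        scored.append((doc, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With the toy embedding, a document titled "deep learning software guide" scores noticeably higher when "deep learning software" is among the query variants than when only "machine learning frameworks" is used, which is the effect query expansion is after.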
When storage or speed is critical:
- PCA: Fast, linear reduction
- UMAP: Preserves local structure
- Matryoshka embeddings: Models with variable-size outputs
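For the Matryoshka option, reduction amounts to keeping the leading dimensions and re-normalizing. The sketch below assumes the vectors come from a model trained to support this; truncating embeddings from a model that was not trained that way generally degrades quality, and PCA or UMAP fitted on a sample of your embeddings would be the alternative. The function name is illustrative.

```python
import math
from typing import List, Sequence


def truncate_and_renormalize(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale to unit L2 length.

    Only meaningful for Matryoshka-style embeddings, where the leading
    dimensions are trained to carry most of the information.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]
small = truncate_and_renormalize(full, 2)  # approximately [0.6, 0.8]
```

Re-normalizing after truncation keeps dot-product scores comparable to cosine similarity, and storage shrinks in proportion to the number of dimensions kept.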
For multilingual applications:
- Use multilingual models (mBERT, XLM-R, Cohere multilingual)
- Translate queries into the language the corpus was embedded in
- Align separately trained embedding spaces post hoc