jmsktm/Embedding Generator

Name: Embedding Generator
Author: jmsktm

skills/embedding-generator/SKILL.md

npx skillsauth add jmsktm/claude-settings Embedding Generator

Embedding Generator

The Embedding Generator skill helps you create, manage, and utilize text embeddings for semantic search, similarity matching, clustering, and classification tasks. It guides you through selecting appropriate embedding models, preprocessing text for optimal vectorization, and storing/querying embeddings efficiently.

Text embeddings transform words, sentences, or documents into dense numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling powerful AI applications like semantic search, recommendations, and content understanding.

This skill covers everything from choosing the right model (OpenAI, Cohere, sentence-transformers, etc.) to implementing production-ready embedding pipelines with proper batching, caching, and quality validation.

Core Workflows

Workflow 1: Generate Embeddings for Text Corpus

1. Analyze the text corpus:
   - Content type (documents, sentences, queries)
   - Average length and variation
   - Language(s) present
   - Domain specificity
2. Select embedding model:
   - Consider dimensionality vs performance tradeoff
   - Match model to content type
   - Evaluate cost and latency constraints
3. Preprocess text:
   - Clean and normalize
   - Chunk long documents appropriately
   - Handle special characters and formatting
4. Generate embeddings with batching
5. Validate quality with spot checks
6. Store in appropriate vector database
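
The preprocessing step of this workflow could look like the minimal sketch below. The function names `clean_text` and `preprocess` and the specific rules (NFKC normalization, dropping control and format characters, collapsing whitespace) are assumptions for illustration, not part of the skill. Dropping format characters may be too aggressive for scripts or emoji sequences that rely on zero-width joiners, so adjust the rules to your corpus.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize one text before embedding (illustrative rules only)."""
    # NFKC folds compatibility forms such as ligatures and no-break spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters (Unicode category C*) but keep
    # ordinary whitespace so word boundaries survive.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or unicodedata.category(ch)[0] != "C"
    )
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def preprocess(texts):
    """Clean every text and drop the ones that end up empty."""
    cleaned = (clean_text(t) for t in texts)
    return [t for t in cleaned if t]
```

Empty strings are dropped because many embedding APIs reject empty inputs; if you need to keep positions aligned with the source list, return placeholders instead.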

Workflow 2: Choose Embedding Model

1. Gather requirements:
   - Use case (search, clustering, classification)
   - Latency requirements
   - Cost constraints
   - Accuracy needs
2. Compare models:

   | Model | Dims | Speed | Quality | Cost |
   |-------|------|-------|---------|------|
   | OpenAI text-embedding-3-small | 1536 | Fast | Good | $$ |
   | OpenAI text-embedding-3-large | 3072 | Fast | Best | $$$ |
   | Cohere embed-english-v3 | 1024 | Fast | Great | $$ |
   | sentence-transformers | 384-768 | Varies | Good | Free |
   | Voyage AI | 1024 | Fast | Great | $$ |

3. Benchmark on representative samples
4. Document decision rationale
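
Benchmarking on representative samples can be as simple as measuring recall@k on a handful of labelled query/document pairs. The sketch below assumes each candidate model is wrapped in a callable `embed(list_of_texts) -> list_of_vectors`; `recall_at_k` and `toy_embed` are hypothetical names, and `toy_embed` is a word-count stand-in, not a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(embed, docs, labelled_queries, k=3):
    """Fraction of queries whose relevant document is in the top-k results.

    embed: callable mapping list[str] -> list[list[float]] (candidate model).
    labelled_queries: list of (query_text, index_of_relevant_doc).
    """
    doc_vecs = embed(docs)
    query_vecs = embed([q for q, _ in labelled_queries])
    hits = 0
    for q_vec, (_, relevant) in zip(query_vecs, labelled_queries):
        ranked = sorted(range(len(docs)),
                        key=lambda i: cosine(q_vec, doc_vecs[i]),
                        reverse=True)
        hits += relevant in ranked[:k]
    return hits / len(labelled_queries)


def toy_embed(texts):
    """Stand-in model: counts of a tiny fixed vocabulary (NOT a real embedder)."""
    vocab = ["cat", "dog", "stock", "market"]
    return [[t.lower().split().count(w) for w in vocab] for t in texts]


docs = ["the cat sat", "dog bites man", "stock market falls"]
queries = [("cat", 0), ("market news", 2)]
score = recall_at_k(toy_embed, docs, queries, k=1)
```

Running the same labelled set through each candidate model gives one comparable number per model, which is a convenient thing to record in the decision rationale.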

Workflow 3: Implement Embedding Pipeline

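1. Design pipeline architecture:
   - Input preprocessing
   - Batching strategy
   - Error handling
   - Caching layer
2. Implement core components:

# Example pipeline structure.
# preprocess, chunk_if_needed, create_batches, and model stand for your own
# components; they are not defined here.
def embedding_pipeline(texts, batch_size=100):
    cleaned = preprocess(texts)          # clean and normalize each text
    chunks = chunk_if_needed(cleaned)    # split texts that exceed the model limit
    batches = create_batches(chunks, batch_size=batch_size)
    embeddings = []
    for batch in batches:
        result = model.embed(batch)      # one model call per batch
        embeddings.extend(result)        # keeps vectors in input order
    return embeddings

3. Add monitoring and logging
4. Test with edge cases
5. Optimize for production scale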
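
A runnable variant of the batching and error-handling parts of this workflow might look like the sketch below. Everything here is an assumption for illustration: `embed_batch` is a callable you supply around your real client, `embed_with_retry` and `flaky_embed` are hypothetical names, and the broad `except Exception` should be narrowed to your client's rate-limit and timeout errors.

```python
import random
import time


def create_batches(items, batch_size):
    """Split a list into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def embed_with_retry(embed_batch, batch, max_attempts=5, base_delay=0.5,
                     sleep=time.sleep):
    """Call embed_batch(batch), retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(batch)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def embedding_pipeline(texts, embed_batch, batch_size=100, sleep=time.sleep):
    """Embed texts batch by batch, preserving input order."""
    embeddings = []
    for batch in create_batches(texts, batch_size):
        vectors = embed_with_retry(embed_batch, batch, sleep=sleep)
        if len(vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Usage with a fake embedder that fails once, then succeeds.
calls = {"n": 0}

def flaky_embed(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated rate limit")
    return [[float(len(t))] for t in batch]

vectors = embedding_pipeline(["a", "bb", "ccc"], flaky_embed, batch_size=2,
                             sleep=lambda seconds: None)  # skip real waiting
```

Passing `sleep` in as a parameter keeps the retry logic testable without real delays; in production the default `time.sleep` applies.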

Quick Reference

| Action | Command/Trigger |
|--------|-----------------|
| Generate embeddings | "Generate embeddings for these texts" |
| Choose model | "Which embedding model for [use case]" |
| Compare models | "Compare embedding models" |
| Optimize pipeline | "Speed up embedding generation" |
| Validate quality | "Check embedding quality" |
| Chunk documents | "How to chunk for embeddings" |

Best Practices

1. Match Model to Use Case: Query-document search needs asymmetric models; clustering needs symmetric ones
   - Search: Use models trained on query-passage pairs
   - Clustering: Use models with good sentence-level representations
2. Chunk Intelligently: Texts longer than the model's input limit must be chunked, and the chunking strategy matters
   - Preserve semantic units (paragraphs, sections)
   - Use overlapping chunks for continuity (10-20% overlap)
   - Keep chunk size within the model's sweet spot (typically 256-512 tokens)
3. Batch for Efficiency: Every API call adds network and rate-limit overhead, so send many texts per request
   - OpenAI: Up to 2048 texts per request
   - Use async/concurrent processing for speed
   - Implement exponential backoff for rate limits
4. Cache Embeddings: Don't regenerate what you've already computed
   - Hash text to create cache keys
   - Store embeddings with metadata
   - Invalidate the cache when the model changes
5. Normalize Vectors: Dot product equals cosine similarity only for unit-length vectors
   - Many models output unit-length vectors, but not all do
   - Verify or normalize explicitly for consistency
6. Validate Quality: Spot-check embeddings before production use
   - Test similarity between known-similar texts
   - Check that distances make semantic sense
   - Compare against baseline or ground truth
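
The caching and normalization practices above can be combined in a few lines. This is a minimal sketch under stated assumptions: the cache is a plain dict (a real system would use a persistent store), and `cache_key`, `embed_cached`, and `toy_embed` are hypothetical names. Including the model name in the key is one simple way to make a model change invalidate old entries.

```python
import hashlib
import math


def cache_key(text, model_name):
    """Key on model name plus text hash so a model change never reuses vectors."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_name}:{digest}"


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_cached(texts, embed_batch, model_name, cache):
    """Return unit-length vectors, calling embed_batch only for cache misses."""
    keys = [cache_key(t, model_name) for t in texts]
    # dict.fromkeys de-duplicates repeated texts while keeping their order.
    missing = list(dict.fromkeys(
        t for t, k in zip(texts, keys) if k not in cache))
    if missing:
        for text, vec in zip(missing, embed_batch(missing)):
            cache[cache_key(text, model_name)] = normalize(vec)
    return [cache[k] for k in keys]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Usage with a stand-in embedder that records what it was asked to embed.
calls = []

def toy_embed(batch):
    calls.append(list(batch))
    return [[float(len(t)), 1.0] for t in batch]

cache = {}
first = embed_cached(["hi", "hello", "hi"], toy_embed, "toy-v1", cache)
second = embed_cached(["hello"], toy_embed, "toy-v1", cache)  # served from cache
```

After both calls the embedder has been invoked once, with the two distinct texts only. Because every stored vector is unit length, `dot` can serve as the similarity function for spot checks.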

Advanced Techniques

Hybrid Chunking Strategy

Combine semantic and size-based chunking:

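def hybrid_chunk(text, max_tokens=512):
    # split_on_headers_paragraphs, token_count, and split_with_overlap are
    # placeholders for your own helpers; they are not defined here.
    # First: split on semantic boundaries (headers, paragraphs)
    sections = split_on_headers_paragraphs(text)

    # Then: split only the sections that exceed the token budget
    chunks = []
    for section in sections:
        if token_count(section) > max_tokens:
            chunks.extend(split_with_overlap(section, max_tokens))
        else:
            chunks.append(section)
    return chunks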
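
The helpers that `hybrid_chunk` relies on are not defined in the skill. One possible self-contained version is sketched below, with loud assumptions: tokens are approximated by whitespace-separated words (use your model's tokenizer for real limits), semantic boundaries are blank lines only, and `split_on_paragraphs` and the `overlap_ratio` parameter are hypothetical names.

```python
import re


def token_count(text):
    """Approximate token count by whitespace-separated words (an assumption)."""
    return len(text.split())


def split_on_paragraphs(text):
    """Split on blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_with_overlap(text, max_tokens, overlap_ratio=0.15):
    """Slide a max_tokens window over the words, overlapping neighbours."""
    words = text.split()
    overlap = int(max_tokens * overlap_ratio)
    step = max(1, max_tokens - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks


def hybrid_chunk(text, max_tokens=512, overlap_ratio=0.15):
    """Keep paragraphs whole when they fit; window the ones that do not."""
    chunks = []
    for section in split_on_paragraphs(text):
        if token_count(section) > max_tokens:
            chunks.extend(split_with_overlap(section, max_tokens, overlap_ratio))
        else:
            chunks.append(section)
    return chunks


# Usage: one short paragraph plus one 25-word paragraph, 10-token budget.
text = "short paragraph here\n\n" + " ".join(f"w{i}" for i in range(25))
chunks = hybrid_chunk(text, max_tokens=10, overlap_ratio=0.2)
```

With a 10-token budget and 20% overlap, each window starts 8 words after the previous one, so neighbouring chunks share two words.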

Query Expansion for Better Retrieval

Generate multiple query embeddings for robust search:

Original: "machine learning frameworks"
Expanded: [
  "machine learning frameworks",
  "ML libraries and tools",
  "deep learning software",
  "AI development platforms"
]
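
Once the expanded queries are embedded, their results have to be merged somehow. The sketch below scores each document by its best similarity across all query variants (max fusion). This is one choice among several, such as averaging the query vectors or fusing ranked lists, and `search_expanded` is a hypothetical name. The vectors are hand-made 2-D stand-ins for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_expanded(query_vecs, doc_vecs, top_k=3):
    """Rank documents by their best match across all query variants."""
    scores = [max(cosine(q, d) for q in query_vecs) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]


# Two query variants pointing in different directions, three documents.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [-1.0, 0.0]]

expanded = search_expanded(query_vecs, doc_vecs, top_k=2)
original_only = search_expanded(query_vecs[:1], doc_vecs, top_k=3)
```

In this toy data the second document scores poorly against the original query alone but is retrieved with a high score once the second variant is included, which is the effect query expansion aims for.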

Dimensionality Reduction

When storage or speed is critical:

- PCA: Fast, linear reduction
- UMAP: Preserves local structure
- Matryoshka embeddings: Models with variable-size outputs
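
For Matryoshka-style models, reducing dimensionality amounts to keeping the leading components and rescaling to unit length. The sketch below assumes the model was trained for this; on other models truncation discards information arbitrarily, and PCA is the safer choice. `truncate_embedding` is a hypothetical name. Some APIs, such as OpenAI's text-embedding-3 models with their `dimensions` request parameter, can return shortened vectors directly.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first dims components and rescale to unit length.

    Only meaningful for models trained so that leading dimensions carry
    the most information (Matryoshka-style).
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 12.0]
small = truncate_embedding(full, 2)  # [0.6, 0.8]
```

As a rough storage estimate, a 1536-dimension float32 vector takes 6144 bytes, while a 256-dimension one takes 1024 bytes.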

Cross-Lingual Embeddings

For multilingual applications:

- Use multilingual models (mBERT, XLM-R, Cohere multilingual)
- Translate queries into the language the index was embedded in
- Align embedding spaces post-hoc

Common Pitfalls to Avoid

- Using the wrong model type (asymmetric vs symmetric) for your use case
- Chunking in ways that break semantic meaning (mid-sentence, mid-paragraph)
- Not accounting for rate limits in production systems
- Storing embeddings without metadata needed for filtering
- Regenerating embeddings unnecessarily (implement caching)
- Mixing embeddings from different models in the same index
- Ignoring the impact of text preprocessing on embedding quality

Generate and manage text embeddings for semantic search, clustering, and similarity tasks

2 stars

data-ai

Updated Apr 6, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percent |
|---------|-------------|--------|---------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 6, 2026, 10:32 AM (13.9s, 1 file scanned)

SKILL.md

name: Embedding Generator
slug: embedding-generator
description: Generate and manage text embeddings for semantic search, clustering, and similarity tasks
category: ai-ml
complexity: intermediate
version: 1.0.0
author: ID8Labs

Install manually

# Clone the repo
git clone https://github.com/jmsktm/claude-settings.git

# Copy into Claude Code skills folder (global)
cp -r claude-settings/skills/embedding-generator ~/.claude/skills/

Repository: jmsktm/claude-settings (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT