skills/nlp-pipeline-builder/SKILL.md
Build natural language processing pipelines for text analysis and understanding
npx skillsauth add jmsktm/claude-settings NLP Pipeline Builder
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The NLP Pipeline Builder skill guides you through designing and implementing natural language processing pipelines that transform raw text into structured, actionable insights. From preprocessing to advanced analysis, this skill covers the full spectrum of NLP tasks and helps you choose the right approach for your specific needs.
Modern NLP offers multiple paradigms: rule-based approaches, classical ML, and deep learning/LLMs. This skill helps you navigate these options, building pipelines that balance accuracy, latency, cost, and maintainability. Whether you need real-time processing at scale or deep analysis of specific documents, this skill ensures your pipeline is fit for purpose.
From tokenization to semantic analysis, from single documents to streaming text, this skill helps you build robust NLP systems that handle real-world text with all its messiness and complexity.
Standard NLP Pipeline:
Text → Preprocessing → Tokenization → Feature Extraction → Task Model → Output
Example stages:
- Preprocessing: cleaning, normalization
- Linguistic: tokenization, POS, NER, parsing
- Semantic: embeddings, topic modeling
- Task-specific: classification, extraction, generation
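
The stage order above can be sketched as a chain of small functions. This is a minimal illustration, not the skill's own implementation: the function names, the regex tokenizer, and the keyword-based "task model" are all assumptions standing in for real components.

```python
import re
import unicodedata
from collections import Counter


def preprocess(text: str) -> str:
    """Preprocessing: Unicode normalization and whitespace cleanup."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def tokenize(text: str) -> list[str]:
    """Tokenization: lowercase word tokens via a simple regex (an assumption)."""
    return re.findall(r"\w+", text.lower())


def extract_features(tokens: list[str]) -> Counter:
    """Feature extraction: bag-of-words counts."""
    return Counter(tokens)


def task_model(features: Counter) -> str:
    """Task model: a toy keyword rule standing in for a trained classifier."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    score = sum(features[w] for w in positive) - sum(features[w] for w in negative)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run_pipeline(text: str) -> str:
    """Text -> Preprocessing -> Tokenization -> Feature Extraction -> Task Model -> Output."""
    return task_model(extract_features(tokenize(preprocess(text))))
```

Keeping each stage a plain function makes it easy to swap one out, for example replacing the keyword rule with a trained model, without touching the others.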
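
    import unicodedata

    def remove_control_characters(text):
        # One possible helper (the original leaves it undefined): drop Unicode
        # control characters (category "Cc") but keep whitespace such as \n and \t
        return "".join(
            ch for ch in text
            if ch.isspace() or unicodedata.category(ch) != "Cc"
        )

    def clean_text(text):
        # Normalize unicode
        text = unicodedata.normalize("NFKC", text)
        # Remove or replace problematic characters
        text = remove_control_characters(text)
        # Normalize whitespace
        text = " ".join(text.split())
        # Optionally: lowercase, remove punctuation, etc.
        # (depends on downstream tasks)
        return text
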

    # TextPreprocessor, load_tokenizer, load_model and ResultCache are
    # placeholders for your own components; they are not defined here.
    class NLPPipeline:
        def __init__(self, config):
            self.preprocessor = TextPreprocessor(config)
            self.tokenizer = load_tokenizer(config.tokenizer)
            self.models = {
                "ner": load_model(config.ner_model),
                "sentiment": load_model(config.sentiment_model),
                "classification": load_model(config.classifier),
            }
            self.cache = ResultCache() if config.use_cache else None

        def process(self, text, tasks=None):
            tasks = tasks or ["all"]
            # Preprocessing
            cleaned = self.preprocessor.clean(text)
            tokens = self.tokenizer.tokenize(cleaned)
            # Run requested analyses
            results = {"text": text, "tokens": tokens}
            for task, model in self.models.items():
                if task in tasks or "all" in tasks:
                    results[task] = model.predict(tokens)
            return results

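
The constructor above creates a `ResultCache`, but `process` never consults it. One way the cache could be wired in is sketched below. The `ResultCache` interface (`make_key`, `get`, `set`) and the `cached_process` wrapper are hypothetical; the source does not define them.

```python
import hashlib


class ResultCache:
    """Hypothetical in-memory cache keyed by a hash of the text and task list."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(text, tasks):
        # Sort tasks so ["ner", "sentiment"] and ["sentiment", "ner"] share a key
        payload = "\x00".join([text, *sorted(tasks)])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cached_process(process_fn, cache, text, tasks=None):
    """Wrap any process(text, tasks) callable with a cache lookup."""
    tasks = tasks or ["all"]
    if cache is None:
        return process_fn(text, tasks)
    key = cache.make_key(text, tasks)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = process_fn(text, tasks)
    cache.set(key, result)
    return result
```

Including the task list in the key matters: the same text analyzed for different tasks produces different results and should not share a cache entry.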
| Action | Command/Trigger |
|--------|-----------------|
| Design pipeline | "Design NLP pipeline for [task]" |
| Preprocess text | "How to preprocess [text type]" |
| Choose tokenizer | "Best tokenizer for [use case]" |
| Extract entities | "Extract entities from text" |
| Classify text | "Build text classifier" |
| Scale pipeline | "Scale NLP to [volume]" |

- Understand Your Text: Different text requires different treatment
- Preserve What Matters: Preprocessing shouldn't destroy information
- Handle Encoding Correctly: Unicode is tricky
- Batch for Efficiency: Model inference is expensive
- Fail Gracefully: Text is messy and unpredictable
- Version Your Pipeline: Reproducibility matters
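
Three of the practices above (batching, failing gracefully, and versioning) can be combined in one small driver loop. The sketch below is only an illustration: `PIPELINE_VERSION`, `process_all`, and the `predict_batch` callable are hypothetical names, and the retry-one-at-a-time fallback is one possible failure policy among several.

```python
from typing import Callable, Iterable, Iterator

PIPELINE_VERSION = "0.1.0"  # hypothetical tag, stored with every result


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield lists of up to `size` items."""
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def _record(text, output=None, error=None):
    return {"text": text, "output": output, "error": error,
            "version": PIPELINE_VERSION}


def process_all(texts: Iterable[str],
                predict_batch: Callable[[list[str]], list],
                batch_size: int = 32) -> list[dict]:
    """Run predict_batch over texts in batches, isolating failures per input."""
    results = []
    for batch in batched(texts, batch_size):
        try:
            outputs = predict_batch(batch)
            results.extend(_record(t, o) for t, o in zip(batch, outputs))
        except Exception:
            # Fall back to one-at-a-time so a single bad input
            # does not sink the whole batch
            for text in batch:
                try:
                    results.append(_record(text, predict_batch([text])[0]))
                except Exception as exc:
                    results.append(_record(text, error=str(exc)))
    return results
```

Recording the version next to each output makes it possible to tell later which pipeline configuration produced a given result.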
Chain extractors for complex information:

    class ExtractionPipeline:
        def __init__(self):
            self.ner = NERModel()
            self.relation = RelationExtractor()
            self.coreference = CoreferenceResolver()

        def extract(self, text):
            # Stage 1: Named Entity Recognition
            entities = self.ner.extract(text)
            # Stage 2: Coreference Resolution
            resolved = self.coreference.resolve(text, entities)
            # Stage 3: Relation Extraction
            relations = self.relation.extract(text, resolved)
            # Stage 4: Build knowledge graph
            graph = build_graph(resolved, relations)
            return {
                "entities": resolved,
                "relations": relations,
                "graph": graph,
            }

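
The `build_graph` helper used in Stage 4 is not defined in the source. A minimal sketch follows, assuming entities arrive as dicts with `id`, `text`, and `type` keys and relations as `(subject_id, predicate, object_id)` tuples. Both shapes are assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(entities, relations):
    """Build a simple adjacency-list graph from entities and relations."""
    nodes = {
        ent["id"]: {"text": ent["text"], "type": ent["type"]}
        for ent in entities
    }
    edges = defaultdict(list)
    for subj, predicate, obj in relations:
        # Skip relations that point at entities the earlier stages did not produce
        if subj in nodes and obj in nodes:
            edges[subj].append((predicate, obj))
    return {"nodes": nodes, "edges": dict(edges)}
```

Dropping relations with unknown endpoints keeps the graph consistent when an upstream stage misses an entity; logging them instead would be an equally reasonable choice.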
Use LLMs where they add value, classical where they don't:

    class HybridPipeline:
        def process(self, text):
            # Fast classical preprocessing
            cleaned = classical_clean(text)
            sentences = classical_sentence_split(cleaned)
            # Classical NER (fast, predictable)
            entities = classical_ner(sentences)
            # LLM for complex tasks (slower, more capable)
            sentiment = llm_sentiment(text)  # Nuanced sentiment
            summary = llm_summarize(text)    # Abstractive summary
            return {
                "sentences": sentences,
                "entities": entities,    # Classical
                "sentiment": sentiment,  # LLM
                "summary": summary,      # LLM
            }

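
A common variation on the hybrid idea is to try the cheap classical method first and call the LLM only when the classical result looks unreliable. The sketch below assumes a toy keyword lexicon and a caller-supplied `llm_fn` callable; the lexicon, the confidence formula, and the `0.6` threshold are illustrative assumptions, not tuned values.

```python
import re

POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def lexicon_sentiment(text):
    """Return (label, confidence) from simple keyword counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    if total == 0:
        return "neutral", 0.0
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    # Confidence is the margin between the two counts, from 0.0 to 1.0
    return label, abs(pos - neg) / total


def hybrid_sentiment(text, llm_fn, threshold=0.6):
    """Use the lexicon when it is confident, otherwise defer to llm_fn(text)."""
    label, confidence = lexicon_sentiment(text)
    if confidence >= threshold:
        return {"label": label, "source": "classical"}
    return {"label": llm_fn(text), "source": "llm"}
```

With this routing, clear-cut inputs never incur LLM latency or cost, while mixed or keyword-free text still gets the more capable model.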
Handle continuous text streams:

    import time

    class StreamingNLP:
        def __init__(self, pipeline, batch_size=32, timeout_ms=100):
            # pipeline must expose an async process_batch(list_of_texts) method
            self.pipeline = pipeline
            self.batch_size = batch_size
            self.timeout_ms = timeout_ms
            self.buffer = []
            self.last_process_time = time.monotonic()

        async def add(self, text):
            self.buffer.append(text)
            # Process if batch full or timeout
            # (the timeout is only checked when new text arrives)
            elapsed_ms = (time.monotonic() - self.last_process_time) * 1000
            if len(self.buffer) >= self.batch_size or elapsed_ms > self.timeout_ms:
                return await self.flush()
            return []  # nothing processed yet

        async def flush(self):
            if not self.buffer:
                return []
            batch = self.buffer
            self.buffer = []
            self.last_process_time = time.monotonic()
            # Batch process
            return await self.pipeline.process_batch(batch)

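
In the class above, the timeout is only evaluated inside `add`, so a partially filled buffer can sit indefinitely if no further text arrives. One way around this is a background timer task that flushes on its own. The sketch below is a standalone variant under that assumption; `TimedBatcher` and its `process_batch` callable are hypothetical names.

```python
import asyncio


class TimedBatcher:
    """Collect items and flush when the batch fills or the timeout elapses."""

    def __init__(self, process_batch, batch_size=32, timeout_ms=100):
        self.process_batch = process_batch  # async callable: list -> list
        self.batch_size = batch_size
        self.timeout_s = timeout_ms / 1000
        self.buffer = []
        self.results = []
        self._timer = None

    async def add(self, text):
        self.buffer.append(text)
        if len(self.buffer) >= self.batch_size:
            await self.flush()
        elif self._timer is None:
            # Start a timer when the first item of a new batch arrives
            self._timer = asyncio.create_task(self._flush_later())

    async def _flush_later(self):
        await asyncio.sleep(self.timeout_s)
        # Clear the handle first so flush() does not cancel this running task
        self._timer = None
        await self.flush()

    async def flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.results.extend(await self.process_batch(batch))


async def demo():
    async def upper(batch):
        return [t.upper() for t in batch]

    batcher = TimedBatcher(upper, batch_size=2, timeout_ms=20)
    await batcher.add("a")
    await batcher.add("b")   # batch is full, flushed immediately
    await batcher.add("c")   # partial batch, flushed by the timer
    await asyncio.sleep(0.1)
    return batcher.results
```

Running `asyncio.run(demo())` exercises both paths: the first two items flush because the batch fills, and the third flushes when the timer fires.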
Handle multilingual text:

    class MultilingualPipeline:
        def __init__(self):
            self.detector = LanguageDetector()
            self.pipelines = {
                "en": EnglishPipeline(),
                "es": SpanishPipeline(),
                "zh": ChinesePipeline(),
                "default": UniversalPipeline(),
            }

        def process(self, text):
            lang = self.detector.detect(text)
            pipeline = self.pipelines.get(lang, self.pipelines["default"])
            return {
                "language": lang,
                "results": pipeline.process(text),
            }

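
The routing logic above depends on a language detector the source does not define. The sketch below substitutes a deliberately crude character-based heuristic so the dispatch-with-fallback pattern can be run end to end. `detect_language` and `route` are hypothetical names, and the heuristic will misclassify plenty of real text (Japanese kanji come out as `zh`, French accents as `es`); a production pipeline would use a trained language identifier instead.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")
SPANISH_HINTS = re.compile(r"[ñ¿¡áéíóú]", re.IGNORECASE)


def detect_language(text):
    """Toy detector: returns "zh", "es", "en", or "und" (undetermined)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "und"
    # Mostly CJK ideographs: treat as Chinese
    if len(CJK.findall(text)) / len(letters) > 0.5:
        return "zh"
    if SPANISH_HINTS.search(text):
        return "es"
    if all(ch.isascii() for ch in letters):
        return "en"
    return "und"


def route(text, pipelines):
    """Dispatch to a language-specific handler, falling back to "default"."""
    lang = detect_language(text)
    handler = pipelines.get(lang, pipelines["default"])
    return {"language": lang, "results": handler(text)}
```

Returning an explicit "undetermined" code rather than guessing lets the `default` pipeline handle anything the detector cannot place.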