NLP Pipeline Builder

The NLP Pipeline Builder skill guides you through designing and implementing natural language processing pipelines that transform raw text into structured, actionable insights. From preprocessing to advanced analysis, this skill covers the full spectrum of NLP tasks and helps you choose the right approach for your specific needs.

Modern NLP offers multiple paradigms: rule-based approaches, classical ML, and deep learning/LLMs. This skill helps you navigate these options, building pipelines that balance accuracy, latency, cost, and maintainability. Whether you need real-time processing at scale or deep analysis of specific documents, this skill ensures your pipeline is fit for purpose.

From tokenization to semantic analysis, from single documents to streaming text, this skill helps you build robust NLP systems that handle real-world text with all its messiness and complexity.

Core Workflows

Workflow 1: Design NLP Pipeline Architecture

Define requirements:
- Input: What text? What format? What volume?
- Output: What information to extract?
- Constraints: Latency, accuracy, cost

Select pipeline stages:

Standard NLP Pipeline:
Text → Preprocessing → Tokenization → Feature Extraction → Task Model → Output

Example stages:
- Preprocessing: cleaning, normalization
- Linguistic: tokenization, POS, NER, parsing
- Semantic: embeddings, topic modeling
- Task-specific: classification, extraction, generation

Choose approach per stage: | Stage | Classical | Deep Learning | LLM | |-------|-----------|---------------|-----| | Tokenization | Regex, NLTK | SentencePiece | Model-specific | | NER | CRF, rules | BiLSTM-CRF, BERT | Prompt-based | | Classification | SVM, NB | CNN, BERT | Zero/few-shot | | Extraction | Regex, patterns | Seq2Seq | Prompt-based |
Design error handling and fallbacks
Document architecture

Workflow 2: Implement Text Preprocessing

Clean text:

def clean_text(text):
    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)

    # Remove or replace problematic characters
    text = remove_control_characters(text)

    # Normalize whitespace
    text = " ".join(text.split())

    # Optionally: lowercase, remove punctuation, etc.
    # (depends on downstream tasks)

    return text

Segment into units:
- Sentence splitting
- Paragraph detection
- Document structuring
Tokenize appropriately:
- Word tokenization for analysis
- Subword tokenization for models
- Language-specific considerations
Normalize for consistency:
- Case normalization
- Lemmatization/stemming
- Handling contractions, abbreviations

Workflow 3: Build Production NLP System

Set up processing infrastructure:

class NLPPipeline:
    def __init__(self, config):
        self.preprocessor = TextPreprocessor(config)
        self.tokenizer = load_tokenizer(config.tokenizer)
        self.models = {
            "ner": load_model(config.ner_model),
            "sentiment": load_model(config.sentiment_model),
            "classification": load_model(config.classifier)
        }
        self.cache = ResultCache() if config.use_cache else None

    def process(self, text, tasks=None):
        tasks = tasks or ["all"]

        # Preprocessing
        cleaned = self.preprocessor.clean(text)
        tokens = self.tokenizer.tokenize(cleaned)

        # Run requested analyses
        results = {"text": text, "tokens": tokens}
        for task, model in self.models.items():
            if task in tasks or "all" in tasks:
                results[task] = model.predict(tokens)

        return results

Implement batching for throughput
Add caching for repeated inputs
Set up monitoring and logging
Test with diverse inputs

Quick Reference

| Action | Command/Trigger | |--------|-----------------| | Design pipeline | "Design NLP pipeline for [task]" | | Preprocess text | "How to preprocess [text type]" | | Choose tokenizer | "Best tokenizer for [use case]" | | Extract entities | "Extract entities from text" | | Classify text | "Build text classifier" | | Scale pipeline | "Scale NLP to [volume]" |

Best Practices

Understand Your Text: Different text requires different treatment
- Social media: informal, abbreviations, emoji
- Legal/medical: domain terms, structure
- Multilingual: language detection, appropriate tools
Preserve What Matters: Preprocessing shouldn't destroy information
- Don't lowercase if case is meaningful
- Keep punctuation if it affects meaning
- Document all transformations
Handle Encoding Correctly: Unicode is tricky
- Always normalize (NFKC recommended)
- Handle encoding errors gracefully
- Test with diverse scripts and characters
Batch for Efficiency: Model inference is expensive
- Batch inputs for GPU utilization
- Balance batch size vs latency
- Use async processing where appropriate
Fail Gracefully: Text is messy and unpredictable
- Handle empty, too-long, or malformed inputs
- Provide sensible defaults for edge cases
- Log failures for analysis
Version Your Pipeline: Reproducibility matters
- Pin model versions
- Document preprocessing steps
- Track configuration changes

Advanced Techniques

Multi-Stage Extraction Pipeline

Chain extractors for complex information:

class ExtractionPipeline:
    def __init__(self):
        self.ner = NERModel()
        self.relation = RelationExtractor()
        self.coreference = CoreferenceResolver()

    def extract(self, text):
        # Stage 1: Named Entity Recognition
        entities = self.ner.extract(text)

        # Stage 2: Coreference Resolution
        resolved = self.coreference.resolve(text, entities)

        # Stage 3: Relation Extraction
        relations = self.relation.extract(text, resolved)

        # Stage 4: Build knowledge graph
        graph = build_graph(resolved, relations)

        return {
            "entities": resolved,
            "relations": relations,
            "graph": graph
        }

Hybrid Classical + LLM Pipeline

Use LLMs where they add value, classical where they don't:

class HybridPipeline:
    def process(self, text):
        # Fast classical preprocessing
        cleaned = classical_clean(text)
        sentences = classical_sentence_split(cleaned)

        # Classical NER (fast, predictable)
        entities = classical_ner(sentences)

        # LLM for complex tasks (slower, more capable)
        sentiment = llm_sentiment(text)  # Nuanced sentiment
        summary = llm_summarize(text)    # Abstractive summary

        return {
            "sentences": sentences,
            "entities": entities,  # Classical
            "sentiment": sentiment,  # LLM
            "summary": summary  # LLM
        }

Streaming Text Processing

Handle continuous text streams:

class StreamingNLP:
    def __init__(self, batch_size=32, timeout_ms=100):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.buffer = []
        self.last_process_time = time.time()

    async def add(self, text):
        self.buffer.append(text)

        # Process if batch full or timeout
        if len(self.buffer) >= self.batch_size:
            return await self.flush()
        elif (time.time() - self.last_process_time) * 1000 > self.timeout_ms:
            return await self.flush()

    async def flush(self):
        if not self.buffer:
            return []

        batch = self.buffer
        self.buffer = []
        self.last_process_time = time.time()

        # Batch process
        results = await self.pipeline.process_batch(batch)
        return results

Language Detection and Routing

Handle multilingual text:

class MultilingualPipeline:
    def __init__(self):
        self.detector = LanguageDetector()
        self.pipelines = {
            "en": EnglishPipeline(),
            "es": SpanishPipeline(),
            "zh": ChinesePipeline(),
            "default": UniversalPipeline()
        }

    def process(self, text):
        lang = self.detector.detect(text)
        pipeline = self.pipelines.get(lang, self.pipelines["default"])

        return {
            "language": lang,
            "results": pipeline.process(text)
        }

Common Pitfalls to Avoid

Over-preprocessing and destroying meaningful information
Ignoring Unicode normalization and encoding issues
Using word tokenizers for languages without spaces
Not handling edge cases (empty text, very long text)
Assuming English-only when users may send other languages
Running expensive models on every input when caching would help
Not batching model inference for throughput
Ignoring the latency impact of pipeline stages

NLP Pipeline Builder

From tokenization to semantic analysis, from single documents to streaming text, this skill helps you build robust NLP systems that handle real-world text with all its messiness and complexity.

Core Workflows

Workflow 1: Design NLP Pipeline Architecture

Define requirements:
- Input: What text? What format? What volume?
- Output: What information to extract?
- Constraints: Latency, accuracy, cost

Select pipeline stages:

Standard NLP Pipeline:
Text → Preprocessing → Tokenization → Feature Extraction → Task Model → Output

Example stages:
- Preprocessing: cleaning, normalization
- Linguistic: tokenization, POS, NER, parsing
- Semantic: embeddings, topic modeling
- Task-specific: classification, extraction, generation

Choose approach per stage: | Stage | Classical | Deep Learning | LLM | |-------|-----------|---------------|-----| | Tokenization | Regex, NLTK | SentencePiece | Model-specific | | NER | CRF, rules | BiLSTM-CRF, BERT | Prompt-based | | Classification | SVM, NB | CNN, BERT | Zero/few-shot | | Extraction | Regex, patterns | Seq2Seq | Prompt-based |
Design error handling and fallbacks
Document architecture

Workflow 2: Implement Text Preprocessing

Clean text:

def clean_text(text):
    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)

    # Remove or replace problematic characters
    text = remove_control_characters(text)

    # Normalize whitespace
    text = " ".join(text.split())

    # Optionally: lowercase, remove punctuation, etc.
    # (depends on downstream tasks)

    return text

Segment into units:
- Sentence splitting
- Paragraph detection
- Document structuring
Tokenize appropriately:
- Word tokenization for analysis
- Subword tokenization for models
- Language-specific considerations
Normalize for consistency:
- Case normalization
- Lemmatization/stemming
- Handling contractions, abbreviations

Workflow 3: Build Production NLP System

Set up processing infrastructure:

class NLPPipeline:
    def __init__(self, config):
        self.preprocessor = TextPreprocessor(config)
        self.tokenizer = load_tokenizer(config.tokenizer)
        self.models = {
            "ner": load_model(config.ner_model),
            "sentiment": load_model(config.sentiment_model),
            "classification": load_model(config.classifier)
        }
        self.cache = ResultCache() if config.use_cache else None

    def process(self, text, tasks=None):
        tasks = tasks or ["all"]

        # Preprocessing
        cleaned = self.preprocessor.clean(text)
        tokens = self.tokenizer.tokenize(cleaned)

        # Run requested analyses
        results = {"text": text, "tokens": tokens}
        for task, model in self.models.items():
            if task in tasks or "all" in tasks:
                results[task] = model.predict(tokens)

        return results

Implement batching for throughput
Add caching for repeated inputs
Set up monitoring and logging
Test with diverse inputs

Quick Reference

Best Practices

Understand Your Text: Different text requires different treatment
- Social media: informal, abbreviations, emoji
- Legal/medical: domain terms, structure
- Multilingual: language detection, appropriate tools
Preserve What Matters: Preprocessing shouldn't destroy information
- Don't lowercase if case is meaningful
- Keep punctuation if it affects meaning
- Document all transformations
Handle Encoding Correctly: Unicode is tricky
- Always normalize (NFKC recommended)
- Handle encoding errors gracefully
- Test with diverse scripts and characters
Batch for Efficiency: Model inference is expensive
- Batch inputs for GPU utilization
- Balance batch size vs latency
- Use async processing where appropriate
Fail Gracefully: Text is messy and unpredictable
- Handle empty, too-long, or malformed inputs
- Provide sensible defaults for edge cases
- Log failures for analysis
Version Your Pipeline: Reproducibility matters
- Pin model versions
- Document preprocessing steps
- Track configuration changes

Advanced Techniques

Multi-Stage Extraction Pipeline

Chain extractors for complex information:

class ExtractionPipeline:
    def __init__(self):
        self.ner = NERModel()
        self.relation = RelationExtractor()
        self.coreference = CoreferenceResolver()

    def extract(self, text):
        # Stage 1: Named Entity Recognition
        entities = self.ner.extract(text)

        # Stage 2: Coreference Resolution
        resolved = self.coreference.resolve(text, entities)

        # Stage 3: Relation Extraction
        relations = self.relation.extract(text, resolved)

        # Stage 4: Build knowledge graph
        graph = build_graph(resolved, relations)

        return {
            "entities": resolved,
            "relations": relations,
            "graph": graph
        }

Hybrid Classical + LLM Pipeline

Use LLMs where they add value, classical where they don't:

class HybridPipeline:
    def process(self, text):
        # Fast classical preprocessing
        cleaned = classical_clean(text)
        sentences = classical_sentence_split(cleaned)

        # Classical NER (fast, predictable)
        entities = classical_ner(sentences)

        # LLM for complex tasks (slower, more capable)
        sentiment = llm_sentiment(text)  # Nuanced sentiment
        summary = llm_summarize(text)    # Abstractive summary

        return {
            "sentences": sentences,
            "entities": entities,  # Classical
            "sentiment": sentiment,  # LLM
            "summary": summary  # LLM
        }

Streaming Text Processing

Handle continuous text streams:

class StreamingNLP:
    def __init__(self, batch_size=32, timeout_ms=100):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.buffer = []
        self.last_process_time = time.time()

    async def add(self, text):
        self.buffer.append(text)

        # Process if batch full or timeout
        if len(self.buffer) >= self.batch_size:
            return await self.flush()
        elif (time.time() - self.last_process_time) * 1000 > self.timeout_ms:
            return await self.flush()

    async def flush(self):
        if not self.buffer:
            return []

        batch = self.buffer
        self.buffer = []
        self.last_process_time = time.time()

        # Batch process
        results = await self.pipeline.process_batch(batch)
        return results

Language Detection and Routing

Handle multilingual text:

class MultilingualPipeline:
    def __init__(self):
        self.detector = LanguageDetector()
        self.pipelines = {
            "en": EnglishPipeline(),
            "es": SpanishPipeline(),
            "zh": ChinesePipeline(),
            "default": UniversalPipeline()
        }

    def process(self, text):
        lang = self.detector.detect(text)
        pipeline = self.pipelines.get(lang, self.pipelines["default"])

        return {
            "language": lang,
            "results": pipeline.process(text)
        }

Common Pitfalls to Avoid

Over-preprocessing and destroying meaningful information
Ignoring Unicode normalization and encoding issues
Using word tokenizers for languages without spaces
Not handling edge cases (empty text, very long text)
Assuming English-only when users may send other languages
Running expensive models on every input when caching would help
Not batching model inference for throughput
Ignoring the latency impact of pipeline stages

Adoption

jmsktm/NLP Pipeline Builder

$ install --global

Security Scan Results

SKILL.md

NLP Pipeline Builder

Core Workflows

Workflow 1: Design NLP Pipeline Architecture

Workflow 2: Implement Text Preprocessing

Workflow 3: Build Production NLP System

Quick Reference

Best Practices

Advanced Techniques

Multi-Stage Extraction Pipeline

Hybrid Classical + LLM Pipeline

Streaming Text Processing

Language Detection and Routing

Common Pitfalls to Avoid

Related Skills

jmsktm/YouTube Optimizer

jmsktm/Workshop Facilitator

jmsktm/Workflow Designer

jmsktm/Workflow Automator

jmsktm/NLP Pipeline Builder

$ install --global

Security Scan Results

SKILL.md

NLP Pipeline Builder

Core Workflows

Workflow 1: Design NLP Pipeline Architecture

Workflow 2: Implement Text Preprocessing

Workflow 3: Build Production NLP System

Quick Reference

Best Practices

Advanced Techniques

Multi-Stage Extraction Pipeline

Hybrid Classical + LLM Pipeline

Streaming Text Processing

Language Detection and Routing

Common Pitfalls to Avoid

Related Skills

jmsktm/YouTube Optimizer

jmsktm/Workshop Facilitator

jmsktm/Workflow Designer

jmsktm/Workflow Automator