Extract named entities from text with high accuracy and customization
The Entity Extractor skill guides you through implementing named entity recognition (NER) systems that identify and classify entities in text. From people and organizations to domain-specific entities like products, medical terms, or financial instruments, this skill covers extraction approaches from simple pattern matching to advanced neural models.
Entity extraction is a foundational NLP task that powers applications from search engines to knowledge graphs. Getting it right requires understanding your domain, choosing appropriate techniques, and handling the inherent ambiguity in natural language.
Whether you need to extract standard entity types, define custom entities for your domain, or build relation extraction on top of entity recognition, this skill helps you build an extraction pipeline that is accurate and maintainable.
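```python
import spacy


class EntityExtractor:
    def __init__(self, model="en_core_web_trf"):
        self.nlp = spacy.load(model)

    def extract(self, text):
        return self.extract_from_doc(self.nlp(text))

    def extract_from_doc(self, doc):
        """Convert a processed Doc into a list of entity dicts."""
        entities = []
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "type": ent.label_,
                "start": ent.start_char,
                "end": ent.end_char,
                # spaCy spans carry no built-in confidence attribute, so this
                # stays None unless a custom component provides one.
                "confidence": getattr(ent, "confidence", None)
            })
        return entities

    def extract_batch(self, texts):
        # nlp.pipe processes texts in batches instead of one call per text.
        docs = list(self.nlp.pipe(texts))
        return [self.extract_from_doc(doc) for doc in docs]
```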
```python
# Format for spaCy training: (text, {"entities": [(start, end, label), ...]})
# Offsets are character positions; `end` is exclusive.
TRAIN_DATA = [
    ("Apple released the new iPhone today.", {
        "entities": [(0, 5, "ORG"), (23, 29, "PRODUCT")]
    }),
    ("Dr. Smith prescribed metformin for diabetes.", {
        "entities": [(0, 9, "PERSON"), (21, 30, "DRUG"), (35, 43, "CONDITION")]
    })
]
```
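Character offsets in span annotations are easy to get wrong by one position. The following is a minimal sketch, in plain Python, of a sanity check that flags spans that fall out of range or begin or end on whitespace. The `check_offsets` helper and the `sample` data are illustrative assumptions, not part of spaCy.

```python
def check_offsets(train_data):
    """Return (text, start, end, label, substring) for suspicious spans.

    A span is flagged when it is out of range, empty, or starts/ends on
    whitespace, which usually indicates an off-by-one annotation error.
    """
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            span = text[start:end]
            out_of_range = not (0 <= start < end <= len(text))
            ragged = span != span.strip()
            if out_of_range or ragged:
                problems.append((text, start, end, label, span))
    return problems


sample = [
    ("Apple released the new iPhone today.", {
        "entities": [(0, 5, "ORG"), (23, 29, "PRODUCT")]
    }),
    # Deliberately shifted by one character to show what gets flagged.
    ("Apple released the new iPhone today.", {
        "entities": [(24, 30, "PRODUCT")]
    }),
]

for text, start, end, label, span in check_offsets(sample):
    print(f"{label} ({start}, {end}) selects {span!r}")
```

This check only catches errors that touch whitespace or the text boundary; a span that is shifted but still lands inside a word would need a stricter rule, such as comparing against token boundaries.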
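```python
# spaCy config for NER training
# Simplified sketch of the settings involved. spaCy v3 reads these from a
# config.cfg file (which `python -m spacy init config` can generate), and
# registered names there carry a version suffix, for example
# "spacy.TransitionBasedParser.v2".
config = {
    "training": {
        "optimizer": {"learn_rate": 0.001},
        "batch_size": {"@schedules": "compounding", "start": 4, "stop": 32}
    },
    "components": {
        "ner": {
            "factory": "ner",
            "model": {"@architectures": "spacy.TransitionBasedParser"}
        }
    }
}
```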
```bash
python -m spacy train config.cfg --output ./models --paths.train ./train.spacy --paths.dev ./dev.spacy
```
| Action | Command/Trigger |
|--------|-----------------|
| Extract entities | "Extract entities from [text]" |
| Choose NER model | "Best NER for [domain]" |
| Custom entities | "Train custom entity recognizer" |
| Evaluate NER | "Evaluate entity extraction quality" |
| Handle ambiguity | "Resolve ambiguous entities" |
| Entity linking | "Link entities to knowledge base" |
Start with Pre-trained: Don't train from scratch unnecessarily
Define Clear Guidelines: Entity boundaries are ambiguous
Handle Nested Entities: Some entities contain others
Normalize Extracted Entities: Raw text has variations
Evaluate Granularly: Aggregate metrics hide issues
Consider Context Window: Models have context limits
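To make the "Evaluate Granularly" point concrete, here is a minimal sketch that scores each entity type separately using exact span matching. It assumes entities are represented as `(start, end, label)` tuples for a single document; the `per_type_scores` helper and the sample spans are illustrative, not taken from any library.

```python
from collections import Counter


def per_type_scores(gold, predicted):
    """Compute exact-match precision, recall, and F1 per entity type.

    `gold` and `predicted` are iterables of (start, end, label) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    tp = Counter(label for _, _, label in gold & predicted)
    fp = Counter(label for _, _, label in predicted - gold)
    fn = Counter(label for _, _, label in gold - predicted)

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_denom = tp[label] + fp[label]
        r_denom = tp[label] + fn[label]
        precision = tp[label] / p_denom if p_denom else 0.0
        recall = tp[label] / r_denom if r_denom else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


gold = [(0, 5, "ORG"), (23, 29, "PRODUCT"), (40, 49, "PERSON")]
# The model found the right span for the product but labeled it ORG.
predicted = [(0, 5, "ORG"), (23, 29, "ORG"), (40, 49, "PERSON")]

for label, s in per_type_scores(gold, predicted).items():
    print(label, s)
```

In this example the mislabeled span lowers ORG precision and drives PRODUCT recall to zero, a pattern that a single aggregate F1 would blur. To score a whole corpus, one option is to add a document identifier to each tuple so spans from different texts do not collide.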
Use language models for flexible extraction:
```python
import json


def llm_extract_entities(text, entity_types):
    # `llm` is a placeholder for your LLM client; `complete` is assumed to
    # return the model's reply as a JSON string.
    type_lines = "\n".join(f"- {t}: {desc}" for t, desc in entity_types.items())
    prompt = f"""Extract named entities from the following text.

Text: "{text}"

Entity types to extract:
{type_lines}

Return a JSON object with an "entities" array:
{{"entities": [{{"text": "entity text", "type": "ENTITY_TYPE", "start": 0, "end": 10}}]}}

Only include entities that clearly match the specified types.
"""
    # JSON-object mode returns an object, so the prompt asks for an
    # "entities" key rather than a bare array.
    response = llm.complete(prompt, response_format={"type": "json_object"})
    return json.loads(response)["entities"]


# Example usage
text = "Apple released the new iPhone today."
entity_types = {
    "COMPANY": "Business organizations",
    "PRODUCT": "Commercial products or services",
    "PERSON": "Individual people's names"
}
entities = llm_extract_entities(text, entity_types)
```
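Character offsets returned by a language model may not match the source text, so it can help to recompute them before storing results. The sketch below assumes each entity is a dict with at least a `"text"` key and locates that string in the original text; `realign_offsets` is a hypothetical helper, and dropping entities that cannot be found is one possible policy among several.

```python
def realign_offsets(text, entities):
    """Recompute start/end by locating each entity string in `text`.

    Entities whose text does not occur in the source are dropped, since
    they cannot be grounded. Repeated mentions are matched left to right.
    """
    aligned = []
    search_from = {}  # entity text -> index to resume searching from
    for ent in entities:
        surface = ent["text"]
        if not surface:
            continue  # an empty string would match anywhere
        start = text.find(surface, search_from.get(surface, 0))
        if start == -1:
            continue  # not found in the source text: skip it
        end = start + len(surface)
        search_from[surface] = end
        aligned.append({**ent, "start": start, "end": end})
    return aligned


text = "Apple released the new iPhone today."
raw = [
    {"text": "Apple", "type": "COMPANY", "start": 0, "end": 5},
    # Offsets are off by one, as a model might return them.
    {"text": "iPhone", "type": "PRODUCT", "start": 24, "end": 30},
    # This string never appears in the text.
    {"text": "Samsung", "type": "COMPANY", "start": 40, "end": 47},
]
print(realign_offsets(text, raw))
```

Exact string search is the simplest option; if the model tends to paraphrase or change casing, a fuzzy match would be needed instead.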
Combine patterns with neural extraction:
```python
import spacy


class HybridExtractor:
    # load_pattern_rules, apply_patterns, and merge_entities are helper
    # functions you supply; they are not part of spaCy.
    def __init__(self):
        self.ml_extractor = spacy.load("en_core_web_trf")
        self.patterns = load_pattern_rules()

    def extract(self, text):
        # ML extraction
        ml_entities = self.ml_extractor(text).ents

        # Pattern-based extraction
        pattern_entities = apply_patterns(text, self.patterns)

        # Merge with priority rules
        merged = merge_entities(
            ml_entities,
            pattern_entities,
            priority="pattern"  # Patterns override ML when they overlap
        )
        return merged

    def add_pattern(self, pattern, entity_type):
        """Add domain-specific pattern."""
        self.patterns.append({
            "pattern": pattern,
            "type": entity_type
        })
```
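The helpers used by `HybridExtractor` are not defined in the text. As one possible reading, the sketch below implements `apply_patterns` with regular expressions and `merge_entities` by dropping lower-priority spans that overlap a higher-priority one. It assumes both entity lists are plain dicts with `start`, `end`, and `type` keys, so spaCy spans would need converting first; the ticket-ID pattern and sample data are invented for illustration.

```python
import re


def apply_patterns(text, patterns):
    """Run each regex pattern over `text` and return entity dicts."""
    found = []
    for rule in patterns:
        for match in re.finditer(rule["pattern"], text):
            found.append({
                "text": match.group(),
                "type": rule["type"],
                "start": match.start(),
                "end": match.end(),
            })
    return found


def overlaps(a, b):
    """True when two character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_entities(ml_entities, pattern_entities, priority="pattern"):
    """Merge two entity lists, dropping lower-priority overlapping spans."""
    if priority == "pattern":
        preferred, other = pattern_entities, ml_entities
    else:
        preferred, other = ml_entities, pattern_entities
    kept = list(preferred)
    kept += [e for e in other if not any(overlaps(e, p) for p in preferred)]
    return sorted(kept, key=lambda e: e["start"])


# Hypothetical domain pattern: ticket IDs such as "ENG-1042".
patterns = [{"pattern": r"\b[A-Z]{2,4}-\d{3,}\b", "type": "TICKET_ID"}]
text = "Alice closed ENG-1042 yesterday."
ml = [
    {"text": "Alice", "type": "PERSON", "start": 0, "end": 5},
    {"text": "ENG-1042", "type": "ORG", "start": 13, "end": 21},
]
merged = merge_entities(ml, apply_patterns(text, patterns))
print(merged)
```

Here the pattern match replaces the overlapping ML prediction while the non-overlapping `PERSON` entity is kept. Other merge policies, such as preferring the longer span, are equally reasonable depending on the domain.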
Connect extracted entities to knowledge bases:
```python
def link_entity(entity_text, entity_type, knowledge_base):
    """
    Link extracted entity to canonical entry in knowledge base.

    Assumes `knowledge_base.search` returns candidates with `id`, `name`,
    and `aliases` attributes, and that `compute_linking_score` and
    `LINKING_THRESHOLD` are defined elsewhere in your code.
    """
    # Generate candidates
    candidates = knowledge_base.search(
        query=entity_text,
        type_filter=entity_type,
        limit=10
    )
    if not candidates:
        return {"entity": entity_text, "linked": None}

    # Score candidates
    scored = []
    for candidate in candidates:
        score = compute_linking_score(
            entity_text,
            candidate.name,
            candidate.aliases
        )
        scored.append((candidate, score))

    # Select best match
    best = max(scored, key=lambda x: x[1])
    if best[1] > LINKING_THRESHOLD:
        return {
            "entity": entity_text,
            "linked": best[0].id,
            "canonical_name": best[0].name,
            "confidence": best[1]
        }
    else:
        return {"entity": entity_text, "linked": None}
```
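The scoring function is left undefined above. A minimal sketch, assuming plain string similarity is good enough as a first pass, could compare the mention against the canonical name and each alias using the standard library's `difflib`. This is a stand-in, not the author's scoring method; production linkers typically also use context and entity popularity.

```python
from difflib import SequenceMatcher


def compute_linking_score(entity_text, name, aliases=()):
    """Return the best string similarity in [0, 1] across name and aliases."""
    mention = entity_text.casefold().strip()
    best = 0.0
    for candidate in (name, *(aliases or ())):
        ratio = SequenceMatcher(
            None, mention, candidate.casefold().strip()
        ).ratio()
        best = max(best, ratio)
    return best


# An exact alias match scores 1.0; an unrelated entry scores much lower.
print(compute_linking_score(
    "IBM", "International Business Machines", ["IBM", "Big Blue"]
))
print(compute_linking_score("IBM", "Apple Inc.", ["Apple"]))
```

Because the score is the maximum over all surface forms, a knowledge base with good alias coverage will link abbreviations that would otherwise score poorly against the full canonical name.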
Extract relationships between entities:
```python
import json


def extract_relations(text, entities):
    """
    Given extracted entities, find relations between them.
    """
    # `entities` must be JSON-serializable (for example a list of dicts).
    prompt = f"""Given this text and extracted entities, identify relationships.

Text: "{text}"

Entities found:
{json.dumps(entities, indent=2)}

Identify relationships between entities. Return JSON:
[{{
  "subject": "entity text",
  "relation": "relationship type",
  "object": "entity text",
  "confidence": 0.9
}}]

Common relation types: WORKS_FOR, LOCATED_IN, FOUNDED, ACQUIRED, PARTNER_OF
"""
    # `llm` is a placeholder for your LLM client; `complete` is assumed to
    # return the model's reply as a JSON string.
    response = llm.complete(prompt)
    return json.loads(response)
```
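Relations produced by a language model can mention entities that were never extracted, so a validation pass is often worthwhile. The sketch below assumes the relation and entity dict shapes shown above; `filter_relations` and its `min_confidence` default of 0.5 are hypothetical choices for illustration.

```python
def filter_relations(relations, entities, min_confidence=0.5):
    """Keep relations that link two known entities with enough confidence."""
    known = {ent["text"] for ent in entities}
    valid = []
    for rel in relations:
        subject, obj = rel.get("subject"), rel.get("object")
        if subject not in known or obj not in known:
            continue  # references text that was never extracted
        if subject == obj:
            continue  # skip self-relations
        if rel.get("confidence", 0.0) < min_confidence:
            continue  # below the confidence cutoff
        valid.append(rel)
    return valid


entities = [
    {"text": "Alice", "type": "PERSON"},
    {"text": "Acme Corp", "type": "COMPANY"},
]
relations = [
    {"subject": "Alice", "relation": "WORKS_FOR",
     "object": "Acme Corp", "confidence": 0.9},
    # "Globex" was not among the extracted entities.
    {"subject": "Alice", "relation": "WORKS_FOR",
     "object": "Globex", "confidence": 0.8},
    # Known entities, but confidence is too low.
    {"subject": "Acme Corp", "relation": "PARTNER_OF",
     "object": "Alice", "confidence": 0.2},
]
print(filter_relations(relations, entities))
```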
Efficiently improve extraction with targeted labeling:
```python
def active_learning_sample(unlabeled_texts, model, n_samples=100):
    """
    Select texts that would be most valuable to label.
    """
    uncertainties = []
    for text in unlabeled_texts:
        doc = model(text)
        # Calculate uncertainty (various strategies)
        uncertainty = calculate_ner_uncertainty(doc)
        uncertainties.append((text, uncertainty))

    # Select most uncertain
    uncertainties.sort(key=lambda x: x[1], reverse=True)
    return [text for text, _ in uncertainties[:n_samples]]


def calculate_ner_uncertainty(doc):
    """
    Calculate uncertainty based on entity confidence scores.

    Assumes a custom `confidence` extension has been registered on spans
    (for example with `Span.set_extension`) and filled in by your pipeline;
    spaCy does not set one by default.
    """
    if not doc.ents:
        return 0.5  # No entities - medium uncertainty

    confidences = [
        ent._.confidence for ent in doc.ents
        if hasattr(ent._, "confidence") and ent._.confidence is not None
    ]
    if not confidences:
        return 0.5

    # High uncertainty = low confidence entities
    return 1 - min(confidences)
```
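The minimum-confidence rule above is one uncertainty strategy. Another common choice is entropy over per-token label distributions. The sketch below assumes your model can expose such a distribution for each token as a list of probabilities; how you obtain them depends on the model. The function names and sample distributions are illustrative.

```python
import math


def token_entropy(probabilities):
    """Shannon entropy (in bits) of one token's label distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def sequence_uncertainty(token_distributions):
    """Mean per-token entropy, normalized to [0, 1] by the maximum entropy."""
    if not token_distributions:
        return 0.0
    n_labels = len(token_distributions[0])
    if n_labels < 2:
        return 0.0  # a single label carries no uncertainty
    max_entropy = math.log2(n_labels)
    mean = sum(token_entropy(d) for d in token_distributions)
    mean /= len(token_distributions)
    return mean / max_entropy


# Each inner list is one token's probabilities over three labels.
confident = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
unsure = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]
print(sequence_uncertainty(confident) < sequence_uncertainty(unsure))
```

Averaging over tokens favors texts that are uncertain throughout; taking the maximum instead would surface texts with a single highly ambiguous token.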