skills/assessment-generative-named-entity/SKILL.md
Build generative NER systems using LLMs with optimal output formats and prompt engineering. Use when: 'extract entities from text', 'build a NER pipeline with an LLM', 'named entity recognition with generative models', 'format NER output as XML or bracketed', 'fine-tune a model for entity extraction', 'nested entity recognition'.
Install this skill globally with one command:
npx skillsauth add ndpvt-web/arxiv-claude-skills assessment-generative-named-entity
Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, implement, and optimize Named Entity Recognition (NER) pipelines that use large language models as generative entity extractors rather than traditional sequence labeling. Based on a systematic evaluation of eight open-source LLMs across four NER benchmarks, this skill teaches the precise prompt structures, output format choices, and fine-tuning configurations that achieve F1 scores competitive with encoder-based models (93.85 F1 on CoNLL2003) while supporting both flat and nested entities in a single unified approach.
Generative NER reformulates entity recognition from token-level classification into text generation. Instead of assigning a BIO tag to every token, the LLM receives an instruction containing: (1) a task description specifying the output format, (2) semantic definitions for each entity label, and (3) the input sentence. The model then generates the complete output -- either the original sentence with inline entity markup, or a structured JSON object listing extracted entities. This approach unifies flat NER (non-overlapping entities) and nested NER (entities within entities) under one framework.
The critical finding is that output format choice dominates performance. Inline bracketed ([Entity | LABEL]) and inline XML (<LABEL>Entity</LABEL>) formats achieve the highest F1 scores (91-94 on standard benchmarks), while offset-based JSON formats that require character positions collapse to under 40 F1. This is because inline formats preserve the original sentence structure, letting the model leverage its language modeling strengths, whereas offset computation requires precise counting that generative models handle poorly. For nested entities, XML is strictly superior since bracketed notation cannot represent overlapping spans.
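Because inline output preserves the sentence, character offsets do not have to be generated by the model; they can be recovered deterministically afterwards. The following is a minimal sketch of that idea, assuming the bracketed format with uppercase labels. The function name `bracketed_to_offsets` and the `BRACKET` pattern are illustrative, not part of the evaluated system.

```python
import re

# Assumed shape of one inline annotation: [Entity Text | LABEL]
BRACKET = re.compile(r"\[([^|\]]+?)\s*\|\s*([A-Z]+)\]")


def bracketed_to_offsets(output: str) -> tuple[str, list[tuple[int, int, str]]]:
    """Strip inline markup and return (plain_text, [(start, end, label), ...])."""
    plain_parts = []
    spans = []
    cursor = 0     # read position in the model output
    plain_len = 0  # length of the plain text rebuilt so far
    for match in BRACKET.finditer(output):
        before = output[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        text = match.group(1)
        spans.append((plain_len, plain_len + len(text), match.group(2)))
        plain_parts.append(text)
        plain_len += len(text)
        cursor = match.end()
    plain_parts.append(output[cursor:])
    return "".join(plain_parts), spans


plain, spans = bracketed_to_offsets("[EU | ORG] rejects [German | MISC] call.")
```

The offsets index into the reconstructed plain text, so they are only meaningful if that text matches the original input sentence.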
Parameter-efficient fine-tuning with LoRA (rank=256, alpha=512) on as few as 14K training examples produces models that match or exceed encoder-based baselines. Crucially, this fine-tuning preserves the model's general capabilities -- benchmarks like MMLU and TruthfulQA remain stable, and entity-heavy tasks like DROP actually improve by 25-45 F1 points.
Define the entity schema with semantic label descriptions. For each entity type, write a 1-2 sentence definition specifying what qualifies and what doesn't. Example: ORG: Collective entities such as companies, institutions, and governmental bodies. Excludes informal groups. These definitions go directly into the prompt and measurably improve extraction quality.
Select the output format based on entity structure:
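- Flat entities: inline bracketed, [Entity Text | LABEL]
- Nested entities: inline XML, <LABEL>Entity Text</LABEL>
- Structured downstream pipelines: category-grouped JSON, {"LABEL": ["entity1", "entity2"]}
Construct the instruction prompt with three components:
- A task description specifying the output format
- Semantic definitions for each entity label
- The input sentence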
For zero-shot/few-shot usage (no fine-tuning): Provide 2-3 annotated examples in the prompt demonstrating the exact output format. Ensure examples cover edge cases like multi-word entities and sentences with no entities.
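The prompt assembly described above can be sketched as a small helper. This is an illustration under assumptions: `build_prompt` and its argument names are hypothetical, and the layout simply mirrors the templates used in the examples of this document.

```python
def build_prompt(task: str, label_definitions: dict[str, str],
                 examples: list[tuple[str, str]], sentence: str) -> str:
    """Join task description, label definitions, few-shot pairs, and the input."""
    lines = [task.strip(), "", "Entity types:"]
    for label, definition in label_definitions.items():
        lines.append(f"- {label}: {definition}")
    lines.append("")
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    # The real input goes last, with an open "Output:" for the model to complete.
    lines.append(f"Input: {sentence}")
    lines.append("Output:")
    return "\n".join(lines)


TASK = ("Your task is to identify all named entities in the input sentence. "
        "Rewrite the sentence by enclosing each entity in brackets with its "
        "label: [Entity Text | LABEL].")
LABEL_DEFINITIONS = {
    "PER": "Named individuals, including real people and fictional characters.",
    "ORG": "Companies, institutions, governmental bodies, and other collective entities.",
}
EXAMPLES = [
    ("Peter Blackburn works at Reuters.",
     "[Peter Blackburn | PER] works at [Reuters | ORG]."),
    # A sentence with no entities is echoed unchanged.
    ("The match ended in a 1-1 draw.", "The match ended in a 1-1 draw."),
]
prompt = build_prompt(TASK, LABEL_DEFINITIONS, EXAMPLES,
                      "Angela Merkel visited Siemens.")
```

Keeping the label definitions in a dictionary makes it easy to revise a single definition without touching the rest of the prompt.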
For fine-tuning: Prepare training data as instruction-output pairs in JSONL format. Each entry contains the full instruction prompt as input and the formatted entity output as the target. Use LLaMA-Factory or similar frameworks with LoRA (r=256, alpha=512, lr=2e-5, 2 epochs, batch size 8).
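A minimal sketch of preparing instruction-output pairs as JSONL follows. It assumes gold annotations arrive as tokens plus non-overlapping token spans. The helper names are hypothetical, and the `instruction`/`input`/`output` keys are an assumed Alpaca-style layout; check your framework's dataset configuration for the field names it expects.

```python
import json


def render_bracketed(tokens: list[str], spans: list[tuple[int, int, str]]) -> str:
    """Render gold spans (start_token, end_token_exclusive, label) as inline brackets."""
    starts = {start: (end, label) for start, end, label in spans}
    pieces = []
    i = 0
    while i < len(tokens):
        if i in starts:
            end, label = starts[i]
            pieces.append(f"[{' '.join(tokens[i:end])} | {label}]")
            i = end
        else:
            pieces.append(tokens[i])
            i += 1
    # Joining on spaces is a simplification; original spacing is not restored.
    return " ".join(pieces)


def to_training_record(prompt: str, target: str) -> dict:
    # Key names are an assumption (Alpaca-style), not confirmed by the source.
    return {"instruction": prompt, "input": "", "output": target}


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
spans = [(0, 1, "ORG"), (2, 3, "MISC"), (6, 7, "MISC")]
target = render_bracketed(tokens, spans)
jsonl_text = to_jsonl([to_training_record("<full instruction prompt>", target)])
```

The resulting text can be written to a `.jsonl` file with a UTF-8 encoded `open(path, "w")`.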
Run inference and parse the generated output. Write a parser specific to your chosen format -- a regex for bracketed output (\[([^|\]]+)\|\s*(\w+)\]), a tag scanner or XML parser for XML tags, or json.loads for JSON formats. Handle malformed output gracefully.
Evaluate using entity-level micro-F1. An entity is correct only if both its text span and label match exactly. Compute precision, recall, and F1 per label and overall. Track error categories: wrong type, boundary errors, omitted mentions.
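Entity-level micro-F1 can be computed along these lines. This is a sketch rather than a reference scorer: it assumes each sentence's entities are given as `(text, label)` tuples, so two mentions with identical text and label in one sentence are counted by multiplicity rather than by position. If character offsets are available, `(start, end, label)` tuples give a stricter match.

```python
from collections import Counter


def micro_prf(gold_docs, pred_docs):
    """Return (precision, recall, f1) over parallel lists of per-sentence entity lists."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        # Multiset intersection: an entity counts only if text and label both match.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [[("EU", "ORG"), ("German", "MISC"), ("British", "MISC")]]
pred = [[("EU", "ORG"), ("German", "LOC")]]  # one wrong type, one omission
precision, recall, f1 = micro_prf(gold, pred)
```

Per-label scores follow by filtering both lists to a single label before calling the function.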
Iterate on label definitions for the worst-performing entity types. If MISC entities have low recall, expand the definition with more examples of what qualifies. If boundary errors dominate, add examples showing exact span boundaries in your few-shot demonstrations.
Example 1: Building a flat NER prompt for news text
User: "Extract named entities from news articles using an LLM. The entities are person, organization, location, and miscellaneous."
Approach:
Output prompt template:
Your task is to identify all named entities in the input sentence. Rewrite
the sentence by enclosing each entity in brackets with its label:
[Entity Text | LABEL].
Entity types:
- PER: Named individuals, including real people and fictional characters.
- ORG: Companies, institutions, governmental bodies, and other collective entities.
- LOC: Countries, cities, geographic features, and named structures.
- MISC: Nationalities, religions, languages, events, and other proper nouns
not covered by PER, ORG, or LOC.
If the sentence contains no entities, output it unchanged.
Input: EU rejects German call to boycott British lamb.
Output: [EU | ORG] rejects [German | MISC] call to boycott [British | MISC] lamb.
Input: Peter Blackburn works at Reuters in Brussels.
Output: [Peter Blackburn | PER] works at [Reuters | ORG] in [Brussels | LOC].
Input: {sentence}
Output:
Parsing code:
import re

def parse_bracketed_ner(text: str) -> list[dict]:
    """Parse inline bracketed NER output into structured entities."""
    entities = []
    for match in re.finditer(r'\[([^|\]]+)\|\s*([A-Z]+)\]', text):
        entities.append({
            "text": match.group(1).strip(),
            "label": match.group(2).strip(),
        })
    return entities

# Example:
output = "[EU | ORG] rejects [German | MISC] call to boycott [British | MISC] lamb."
print(parse_bracketed_ner(output))
# [{'text': 'EU', 'label': 'ORG'}, {'text': 'German', 'label': 'MISC'}, {'text': 'British', 'label': 'MISC'}]
Example 2: Nested NER for biomedical text using XML format
User: "I need to extract protein and DNA entities from biomedical abstracts. Some DNA mentions contain protein names inside them."
Approach:
Output prompt template:
Extract all biomedical named entities from the input sentence. Mark entities
using XML tags: <LABEL>entity text</LABEL>. Entities may be nested --
place inner entity tags inside outer entity tags.
Entity types:
- Protein: Named proteins, transcription factors, enzymes, and protein families.
- DNA: Named DNA sequences, genes, promoters, and binding sites.
- RNA: Named RNA molecules including mRNA, tRNA, and regulatory RNA.
- Cell_line: Specific cultured cell lines used in experiments.
- Cell_type: General categories of cells (e.g., T cells, macrophages).
Input: PU.1 binds to a myeloid PU.1 binding site in the M-CSF receptor promoter.
Output: <Protein>PU.1</Protein> binds to a <DNA>myeloid <Protein>PU.1</Protein> binding site</DNA> in the <DNA>M-CSF receptor promoter</DNA>.
Input: {sentence}
Output:
Parsing code:
import re
from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    label: str
    start: int
    end: int

LABELS = {"Protein", "DNA", "RNA", "Cell_line", "Cell_type"}

def parse_xml_ner(tagged: str) -> list[Entity]:
    """Parse XML-tagged NER output, handling nested entities."""
    entities = []
    tag_pattern = re.compile(r'<(/?)(' + '|'.join(LABELS) + r')>')
    # Strip tags to get plain text, tracking entity spans
    plain_chars = []
    tag_stack = []  # (label, start_in_plain)
    i = 0
    while i < len(tagged):
        m = tag_pattern.match(tagged, i)
        if m:
            is_close = m.group(1) == '/'
            label = m.group(2)
            if not is_close:
                tag_stack.append((label, len(plain_chars)))
            else:
                if tag_stack and tag_stack[-1][0] == label:
                    lbl, start = tag_stack.pop()
                    text = ''.join(plain_chars[start:])
                    entities.append(Entity(text=text, label=lbl,
                                           start=start, end=len(plain_chars)))
            i = m.end()
        else:
            plain_chars.append(tagged[i])
            i += 1
    return entities
Example 3: Category-grouped JSON for structured pipelines
User: "I want entity extraction that returns clean JSON I can feed into a database, not inline markup."
Approach:
Output prompt template:
Extract all named entities from the input sentence. Return a JSON object
where keys are entity type labels and values are lists of entity text strings.
Include only labels that have at least one entity. If no entities exist,
return {}.
Entity types: PER, ORG, LOC, MISC
Input: EU rejects German call to boycott British lamb.
Output: {"ORG": ["EU"], "MISC": ["German", "British"]}
Input: The match ended in a 1-1 draw.
Output: {}
Input: {sentence}
Output:
Note: Category-grouped JSON loses positional information, and models tend to list a repeated entity only once. If mention counts matter, instruct the model to include an entity once for every time it appears in the sentence.
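One way to consume the category-grouped format is to parse strictly first and, if that fails, retry on the first brace-delimited span in the output. The sketch below assumes the label set from the template above; `parse_grouped_json` is a hypothetical helper name.

```python
import json
import re
import warnings

LABELS = {"PER", "ORG", "LOC", "MISC"}


def parse_grouped_json(output: str, labels: set[str] = LABELS) -> dict[str, list[str]]:
    """Parse {"LABEL": [mentions]} output, tolerating surrounding chatter."""
    candidates = [output.strip()]
    # Fallback candidate: everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", output, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            # Keep only known labels whose value is a list; drop non-string items.
            return {
                label: [m for m in mentions if isinstance(m, str)]
                for label, mentions in data.items()
                if label in labels and isinstance(mentions, list)
            }
    warnings.warn("Could not parse model output as JSON; returning no entities.")
    return {}


clean = parse_grouped_json('{"ORG": ["EU"], "MISC": ["German", "British"]}')
chatty = parse_grouped_json('Here is the result: {"ORG": ["EU"]} Hope that helps.')
```

Unknown labels are dropped silently here; logging them instead may help when iterating on label definitions.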
Malformed output: The model may produce unclosed brackets, mismatched XML tags, or invalid JSON. Implement fallback parsing: attempt strict parsing first, then try regex-based extraction, then return an empty result with a warning.
Wrong entity types: The most common error (38% of mistakes). The model correctly identifies entity boundaries but assigns the wrong label. Mitigate by adding disambiguation notes to label definitions (e.g., "ORG does not include sports teams mentioned as locations").
Boundary errors: The model captures too much or too little of an entity span. Address by including boundary-edge examples in few-shot prompts (e.g., showing that "the United Nations" should extract as "United Nations" without the article).
Omitted entities: Especially common for rare entity types or entities in unusual syntactic positions. If recall is low for a specific type, add more few-shot examples featuring that type.
Hallucinated entities: The model invents entities not in the input text. For inline formats, verify every extracted entity string appears in the original input. For JSON formats, cross-reference extracted text against the source sentence.
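The hallucination checks above can be sketched as two small helpers. This is a sketch under assumptions: entities are dictionaries with a `text` key, as returned by the bracketed parser in Example 1, and both function names are hypothetical. The substring test is deliberately loose and case-sensitive.

```python
import re

# Assumed shape of one inline annotation: [Entity Text | LABEL]
BRACKET = re.compile(r"\[([^|\]]+?)\s*\|\s*[A-Z]+\]")


def filter_hallucinated(entities: list[dict], source: str) -> tuple[list[dict], list[dict]]:
    """Split entities into (kept, dropped) by whether their text occurs in the source."""
    kept, dropped = [], []
    for entity in entities:
        (kept if entity["text"] in source else dropped).append(entity)
    return kept, dropped


def sentence_preserved(output: str, source: str) -> bool:
    """For inline bracketed output: removing the markup should give back the input."""
    stripped = BRACKET.sub(lambda m: m.group(1), output)
    return stripped.strip() == source.strip()


source = "EU rejects German call to boycott British lamb."
entities = [{"text": "EU", "label": "ORG"}, {"text": "France", "label": "LOC"}]
kept, dropped = filter_hallucinated(entities, source)
intact = sentence_preserved(
    "[EU | ORG] rejects [German | MISC] call to boycott [British | MISC] lamb.",
    source,
)
```

A failed `sentence_preserved` check also flags cases where the model paraphrased or truncated the sentence, not only invented entities.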
Assessment of Generative Named Entity Recognition in the Era of Large Language Models -- Systematic evaluation of output formats, prompt structures, and LoRA fine-tuning for converting open-source LLMs into competitive NER systems. Code: github.com/szu-tera/LLMs4NER.