skills/knowledge-graph-construction/SKILL.md
Designs and builds knowledge graphs from unstructured or semi-structured data sources. Guides the user through data model selection (LPG, RDF, hypergraph, temporal), schema design, entity/relation extraction pipelines, and layered architecture construction. Use when designing knowledge graphs, choosing between LPG and RDF, planning entity extraction, designing graph schemas, aligning ontologies, building a KG for RAG, or when the user mentions knowledge graph construction.
npx skillsauth add lyndonkl/claude knowledge-graph-construction
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Copy this checklist and work through each step:
KG Construction Progress:
- [ ] Step 1: Identify data sources and domain scope
- [ ] Step 2: Select graph data model
- [ ] Step 3: Design schema and ontology
- [ ] Step 4: Configure extraction pipeline
- [ ] Step 5: Define layered architecture
- [ ] Step 6: Validate and quality-check the graph
Step 1: Identify data sources and domain scope
Catalog the input data: document types (papers, clinical notes, web pages, logs), volume, update frequency, and language. Define the domain boundary -- what entity types and relation types matter for the target use case. Determine whether the KG will serve RAG retrieval, reasoning/inference, analytics, or a combination. This scoping step prevents over-extraction and keeps the schema focused.
Step 2: Select graph data model
Choose the underlying data model using the Architecture Selection Guide. Key trade-offs: LPG for flexibility and rapid prototyping, RDF/OWL for standards-based interoperability and inference, Hypergraphs for complex N-ary relations, Temporal Graphs for time-evolving knowledge. Consider query language, tooling maturity, and vector integration needs. For detailed model comparisons, see Data Models Reference.
Step 3: Design schema and ontology
Define node types (entity classes), edge types (relation classes), and property schemas. Apply patterns from Schema Patterns: entity-relation for simple domains, event reification for N-ary relations, layered tiers for multi-source integration. Decide on controlled vocabularies, cardinality constraints, and whether to adopt or extend an existing ontology (e.g., Schema.org, UMLS, SNOMED). For methodology details, see Methodology Reference.
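The schema decisions in this step can be captured as data and checked mechanically. The following is a minimal sketch, assuming a plain in-memory schema; `NodeType`, `EdgeType`, `node_errors`, and `edge_errors` are hypothetical names introduced for illustration, and the Drug/Disease types are borrowed from the entity-relation example in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    label: str
    required: dict  # property name -> expected Python type


@dataclass(frozen=True)
class EdgeType:
    label: str
    source: str  # label of the allowed source node type
    target: str  # label of the allowed target node type


# Hypothetical schema; a real one would come out of the design work in this step.
NODE_TYPES = {
    "Drug": NodeType("Drug", {"name": str, "class": str}),
    "Disease": NodeType("Disease", {"name": str, "icd_code": str}),
}
EDGE_TYPES = {
    "TREATS": EdgeType("TREATS", source="Drug", target="Disease"),
}


def node_errors(label, props):
    """Return a list of schema violations for one node (empty if valid)."""
    node_type = NODE_TYPES.get(label)
    if node_type is None:
        return [f"unknown node type: {label}"]
    errors = []
    for name, expected in node_type.required.items():
        if name not in props:
            errors.append(f"{label}: missing property '{name}'")
        elif not isinstance(props[name], expected):
            errors.append(f"{label}.{name}: expected {expected.__name__}")
    return errors


def edge_errors(label, source_label, target_label):
    """Return a list of schema violations for one edge (empty if valid)."""
    edge_type = EDGE_TYPES.get(label)
    if edge_type is None:
        return [f"unknown edge type: {label}"]
    if (source_label, target_label) != (edge_type.source, edge_type.target):
        return [f"{label}: expected {edge_type.source} -> {edge_type.target}"]
    return []
```

Keeping the schema as data rather than as scattered `if` statements means the same declarations can later drive the post-processing validation in the extraction pipeline.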
Step 4: Configure extraction pipeline
Build the pipeline that populates the graph. Core components: LLM-assisted entity extraction with multi-round verification, relation extraction via prompt-based or dependency-parsing methods, entity normalization (synonym merging, ontology linking), and schema enforcement through post-processing validation. Use few-shot examples in prompts to improve extraction consistency. Include a second-pass LLM verification to catch missed entities. For full pipeline design, see Methodology Reference.
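The normalization stage is the easiest part of the pipeline to illustrate without an LLM call. Below is a rough sketch of synonym merging and deduplication over already-extracted triples. The `SYNONYMS` table and the function names are assumptions made for this example; a production pipeline would more likely resolve mentions against the target ontology than against a hand-written dictionary.

```python
import re

# Hypothetical synonym table: lower-cased surface form -> canonical name.
SYNONYMS = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}


def canonicalize(mention):
    """Lower-case, collapse whitespace, then map known synonyms to one name."""
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return SYNONYMS.get(key, key)


def normalize_triples(triples):
    """Canonicalize subjects and objects, dropping duplicates in first-seen order."""
    seen = set()
    result = []
    for subj, rel, obj in triples:
        triple = (canonicalize(subj), rel, canonicalize(obj))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


raw = [
    ("ASA", "TREATS", "Heart  attack"),
    ("Aspirin", "TREATS", "MI"),
    ("aspirin", "PREVENTS", "stroke"),
]
normalized = normalize_triples(raw)
```

Here the first two raw triples collapse into a single `("aspirin", "TREATS", "myocardial infarction")` fact, which is the effect synonym merging is meant to have before the graph is populated.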
Step 5: Define layered architecture
Structure the KG into tiers for maintainability and trust. A common pattern: Layer 1 (instance data) holds user-specific or case-specific entities and relations; Layer 2 (domain knowledge) holds curated facts from literature or domain experts; Layer 3 (canonical ontology) holds the formal schema and upper ontology. Add provenance and evidence layering so every fact traces back to its source document, extraction method, and confidence score. Temporal subgraphs capture time-indexed state for domains where knowledge evolves.
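One way to make the provenance requirement concrete is to attach a small record to every fact. This is a sketch under assumed names (`Layer`, `Provenance`, `Fact`); the field set mirrors the three items listed above (source document, extraction method, confidence score), and the example values are purely illustrative.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    INSTANCE = 1   # user-specific or case-specific data
    DOMAIN = 2     # curated domain knowledge
    ONTOLOGY = 3   # formal schema and upper ontology


@dataclass(frozen=True)
class Provenance:
    source_document: str
    extraction_method: str
    confidence: float

    def __post_init__(self):
        # Reject scores outside [0, 1] so downstream trust filters stay meaningful.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    layer: Layer
    provenance: Provenance


# Illustrative fact; the document path and method label are made up.
fact = Fact(
    subject="metformin",
    relation="TREATS",
    obj="type 2 diabetes",
    layer=Layer.DOMAIN,
    provenance=Provenance("reviews/example.pdf", "llm-two-pass", 0.9),
)
```

Making the records immutable (`frozen=True`) is a design choice for the sketch: a fact's provenance should not change after extraction, and a corrected fact is better modeled as a new record.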
Step 6: Validate and quality-check the graph
Run validation at multiple levels: schema conformance (do all nodes and edges match declared types?), coverage (are expected entity types populated?), consistency (no contradictory edges), and completeness (sample-based human review). Use a second LLM as a validator to fact-check extracted triples against source documents. Compute graph statistics (node degree distribution, connected components, orphan nodes) to identify extraction gaps. Quality criteria are defined in Quality Rubric.
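The graph statistics mentioned above need nothing more than an adjacency map. The sketch below assumes the graph fits in memory and treats edges as undirected for the purpose of counting components; `graph_stats` is a hypothetical helper, not part of any graph library.

```python
def graph_stats(nodes, edges):
    """Compute degree, orphan nodes, and connected-component count.

    nodes: iterable of node ids.
    edges: iterable of (src, dst) pairs, treated as undirected.
    Degree here counts distinct neighbors, so parallel edges count once.
    """
    adjacency = {n: set() for n in nodes}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    degree = {n: len(neighbors) for n, neighbors in adjacency.items()}
    orphans = sorted(n for n, d in degree.items() if d == 0)

    # Iterative depth-first search to count connected components.
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)

    return {"degree": degree, "orphans": orphans, "components": components}


stats = graph_stats(
    nodes=["a", "b", "c", "d", "e"],
    edges=[("a", "b"), ("b", "c")],
)
```

In this toy graph `d` and `e` are orphans and each forms its own component. On a real extraction run, a large orphan count for an entity type usually points at a relation-extraction gap rather than at genuinely isolated entities.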
| Model | Flexibility | Standardization | Reasoning | Vector Integration | Query Language | Best For |
|-------|-------------|-----------------|-----------|-------------------|----------------|----------|
| LPG | High | Low | Limited | Native (Neo4j) | Cypher, Gremlin | Rapid development, RAG pipelines |
| RDF/OWL | Medium | High | Full (OWL-DL) | Via extensions | SPARQL | Interoperability, ontology-heavy domains |
| Hypergraph | High | Low | Limited | Custom | Custom APIs | N-ary relations, multi-entity events |
| Temporal | Medium | Low | Time-based | Via extensions | Temporal Cypher | Evolving knowledge, episodic memory |
| Domain | Recommended Model | Rationale |
|--------|-------------------|-----------|
| Biomedical / Clinical | RDF/OWL | UMLS/SNOMED ontologies, reasoning needed |
| Enterprise / RAG | LPG | Fast iteration, vector search integration |
| Event-centric (news, logs) | Hypergraph or Temporal | Multi-participant events, time evolution |
| Legal / Compliance | RDF/OWL | Formal reasoning, provenance chains |
| Scientific Literature | LPG + Layered | Flexible extraction, layered trust |
The entity-relation pattern is the simplest: nodes represent entities and edges represent binary relations. Properties on nodes hold attributes; properties on edges hold relation metadata (confidence, source, timestamp).
(:Person {name, role}) -[:WORKS_AT {since}]-> (:Organization {name, type})
(:Drug {name, class}) -[:TREATS {efficacy}]-> (:Disease {name, icd_code})
Best for: domains with primarily binary relationships and moderate complexity.
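For prototyping before a graph database is chosen, the same pattern can be held in memory. The sketch below is an assumption-laden stand-in rather than a Neo4j or Gremlin client: `PropertyGraph` and its methods are hypothetical, and the node ids and property values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    edges: list = field(default_factory=list)  # (src id, relation, dst id, properties)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, src, relation, dst, **props):
        # Refuse dangling edges so every relation has two known endpoints.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((src, relation, dst, props))

    def neighbors(self, src, relation):
        """Ids of nodes reachable from src over edges of the given relation."""
        return [d for s, r, d, _ in self.edges if s == src and r == relation]


# Mirrors (:Person)-[:WORKS_AT {since}]->(:Organization) with illustrative values.
g = PropertyGraph()
g.add_node("p1", "Person", name="Ada", role="engineer")
g.add_node("o1", "Organization", name="Acme", type="company")
g.add_edge("p1", "WORKS_AT", "o1", since=2019)
```

Edge properties live in the fourth tuple slot, which is where confidence, source, and timestamp metadata would go.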
Event reification models N-ary relations and complex events as first-class nodes. An event node connects to all participants via typed role edges. This avoids the information loss that comes from forcing N-ary relations into binary edges.
(:ClinicalTrial {id, phase, start_date})
  -[:HAS_DRUG]-> (:Drug {name})
  -[:HAS_CONDITION]-> (:Disease {name})
  -[:HAS_OUTCOME]-> (:Outcome {measure, value})
  -[:CONDUCTED_BY]-> (:Organization {name})
Best for: events with multiple participants, clinical data, news events, financial transactions.
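The reification step itself is mechanical: one event node, plus one typed role edge per participant. A minimal sketch follows, assuming participants already have node ids; `reify_event` is a hypothetical helper, and the trial id and participant ids are placeholders.

```python
def reify_event(event_id, label, props, participants):
    """Turn one N-ary record into an event node plus typed role edges.

    participants: list of (role, participant_node_id) pairs.
    Returns (event_node, role_edges).
    """
    event_node = {"id": event_id, "label": label, "props": props}
    role_edges = [(event_id, role, pid) for role, pid in participants]
    return event_node, role_edges


# Placeholder ids and values, following the ClinicalTrial pattern above.
event_node, role_edges = reify_event(
    event_id="trial-001",
    label="ClinicalTrial",
    props={"phase": 2, "start_date": "2024-01-15"},
    participants=[
        ("HAS_DRUG", "drug-1"),
        ("HAS_CONDITION", "disease-1"),
        ("HAS_OUTCOME", "outcome-1"),
        ("CONDUCTED_BY", "org-1"),
    ],
)
```

Because every role edge starts at the event node, the whole trial can be retrieved in one hop. A set of pairwise drug-disease and drug-organization edges would lose the fact that these participants belong to the same trial.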
The layered-tier pattern separates the graph into trust-differentiated layers that can be queried independently or together.
Layer 3 (Canonical Ontology): Formal class hierarchy, relation definitions, constraints
Layer 2 (Domain Knowledge): Curated facts from literature, expert-validated
Layer 1 (Instance Data): Extracted from user documents, case-specific, lower confidence
Cross-layer edges link instances to domain concepts and domain concepts to ontology classes. Provenance metadata on every edge records: source document, extraction method, confidence score, and timestamp.
Best for: multi-source integration, RAG with trust scoring, enterprise knowledge management.
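Querying layers "independently or together" can be as simple as filtering on a layer tag and a confidence threshold. The sketch below assumes each edge carries its layer number and confidence inline; the edge list and the `query` function are illustrative only.

```python
# Each edge: (subject, relation, object, layer, confidence). Values are illustrative.
EDGES = [
    ("patient_17", "DIAGNOSED_WITH", "type 2 diabetes", 1, 0.72),
    ("metformin", "TREATS", "type 2 diabetes", 2, 0.95),
    ("type 2 diabetes", "IS_A", "Disease", 3, 1.0),
    ("patient_17", "TAKES", "metformin", 1, 0.55),
]


def query(edges, layers=(1, 2, 3), min_confidence=0.0):
    """Keep edges whose layer is selected and whose confidence meets the threshold."""
    return [e for e in edges if e[3] in layers and e[4] >= min_confidence]


curated_only = query(EDGES, layers=(2, 3))      # domain knowledge + ontology
trusted = query(EDGES, min_confidence=0.7)      # all layers, higher-trust facts
```

A RAG retriever could, for example, answer from `curated_only` first and fall back to lower-confidence instance data only when the curated layers have no match.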
KNOWLEDGE GRAPH CONSTRUCTION SPECIFICATION
============================================
Domain: [Target domain and scope]
Use Case: [RAG / Reasoning / Analytics / Hybrid]
Data Sources: [List of input data types and volumes]
Data Model: [LPG / RDF / Hypergraph / Temporal]
Query Language: [Cypher / SPARQL / Gremlin / Custom]
Storage Backend: [Neo4j / Amazon Neptune / Virtuoso / etc.]
Schema Definition:
  Node Types:
    1. [EntityType] - [description]
       Properties: [list with types]
    2. [EntityType] - [description]
       Properties: [list with types]
    3. [Continue for each node type...]
  Edge Types:
    1. [RelationType] (source -> target) - [description]
       Properties: [list with types]
    2. [Continue for each edge type...]
  Constraints:
    - [Cardinality, uniqueness, required properties]
Extraction Pipeline:
  1. Entity Extraction
     - Method: [LLM-assisted / NER / Hybrid]
     - Prompt template: [summary or reference]
     - Verification: [Multi-round / Second-LLM / Manual sample]
  2. Relation Extraction
     - Method: [Prompt-based / Dependency parsing / Hybrid]
     - Few-shot examples: [count and source]
  3. Normalization
     - Deduplication: [method]
     - Ontology linking: [target ontology]
     - Synonym resolution: [approach]
Layered Architecture:
  Layer 1 (Instance): [description of instance-level data]
  Layer 2 (Domain): [description of curated domain knowledge]
  Layer 3 (Ontology): [description of formal schema]
  Provenance: [How source/confidence/timestamp are tracked]
Validation Plan:
  - Schema conformance: [automated checks]
  - Coverage: [expected entity/relation counts]
  - Consistency: [contradiction detection method]
  - Human review: [sampling strategy]
Estimated Scale: [node count, edge count, properties per node]
Key Dependencies: [libraries, APIs, ontologies]
NEXT STEPS:
- Implement extraction pipeline on sample data
- Populate graph and run validation suite
- Iterate schema based on extraction results
- Integrate with downstream application (RAG, reasoning, etc.)