skills/graphrag-evaluation/SKILL.md
Evaluates GraphRAG systems across knowledge graph completeness, retrieval relevance, answer correctness, reasoning depth, and hallucination prevention. Provides structured evaluation frameworks, metric selection guidance, and testing protocols. Use when evaluating GraphRAG quality, benchmarking multi-step reasoning, measuring hallucination reduction, or when user mentions evaluate GraphRAG, quality metrics, answer correctness, test my GraphRAG, or measure RAG performance.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

`npx skillsauth add lyndonkl/claude graphrag-evaluation`
Copy this checklist and work through each step:
Define what aspects of your GraphRAG system you need to evaluate and why. Determine whether you are evaluating the full pipeline or specific components (KG construction, retrieval, generation). Clarify the use case context: domain, query complexity, expected reasoning depth.
See methodology.md for the full evaluation dimensions framework.
Choose metrics appropriate to your evaluation scope. Not every evaluation requires every metric. Match metrics to your system's maturity and the questions you need answered.
See the Metric Selection Guide below and methodology.md for detailed metric definitions.
Build test sets that cover your evaluation dimensions. Include single-hop factual queries, multi-hop reasoning queries, constraint satisfaction queries, temporal reasoning queries, comparative queries, and negative queries (questions the system should not answer).
See methodology.md for baseline comparison approaches and statistical significance testing.
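As a rough illustration of the test-set step above, the sketch below tags each query with one of the six query types and reports which types are still missing. The `EvalQuery` class, the category labels, and the `coverage_gaps` helper are hypothetical names chosen for this example; they do not come from methodology.md.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the six query types listed in the step above.
CATEGORIES = (
    "single_hop",
    "multi_hop",
    "constraint",
    "temporal",
    "comparative",
    "negative",
)


@dataclass(frozen=True)
class EvalQuery:
    question: str
    category: str
    # None marks a negative query: the system is expected to decline.
    expected_answer: Optional[str]


def coverage_gaps(queries: List[EvalQuery], min_per_category: int = 1) -> List[str]:
    """Return the categories that have fewer than min_per_category queries."""
    counts = Counter(q.category for q in queries)
    return [c for c in CATEGORIES if counts[c] < min_per_category]


queries = [
    EvalQuery("Who founded Acme?", "single_hop", "Jane Doe"),
    EvalQuery("What is the capital of Atlantis?", "negative", None),
]
print(coverage_gaps(queries))
```

Running a check like this before evaluation makes it harder to report an aggregate score that silently omits a whole query type.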
Evaluate how well your system handles multi-step reasoning. Verify that each reasoning step is grounded in retrieved KG evidence. Check for error propagation where an incorrect intermediate step leads to wrong conclusions.
See reasoning-patterns.md for chain validation, pattern matching, hypothesis verification, and causal reasoning evaluation.
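One way to make the grounding check concrete is to treat each reasoning step as a (subject, relation, object) triple and test it against the retrieved KG triples. This is a minimal sketch under that assumption; the triple representation and the function names `verify_chain` and `error_propagated` are illustrative and are not taken from reasoning-patterns.md. Exact triple matching is also a simplification, since real systems usually need entity and relation normalization first.

```python
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def verify_chain(
    steps: Iterable[Triple], kg_triples: Set[Triple]
) -> Tuple[float, Optional[int]]:
    """Return (fraction of grounded steps, index of first ungrounded step or None)."""
    steps = list(steps)
    if not steps:
        raise ValueError("reasoning chain is empty")
    grounded = [step in kg_triples for step in steps]
    score = sum(grounded) / len(grounded)
    first_failure = grounded.index(False) if False in grounded else None
    return score, first_failure


def error_propagated(first_failure: Optional[int], answer_correct: bool) -> bool:
    """Flag a chain where an ungrounded step coincides with a wrong final answer."""
    return first_failure is not None and not answer_correct


kg = {("Acme", "founded_by", "Jane Doe"), ("Jane Doe", "born_in", "Oslo")}
chain = [
    ("Acme", "founded_by", "Jane Doe"),
    ("Jane Doe", "born_in", "Bergen"),  # not in the KG: ungrounded step
    ("Bergen", "located_in", "Norway"),
]
score, first_failure = verify_chain(chain, kg)
print(score, first_failure, error_propagated(first_failure, answer_correct=False))
```

Recording the index of the first ungrounded step helps separate a single bad hop from the downstream steps that merely inherited its error.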
Quantify both intrinsic hallucination (contradicts retrieved evidence) and extrinsic hallucination (claims not supported by any retrieved source). Measure the KG grounding rate: what percentage of generated claims are traceable to knowledge graph entities and relations.
See methodology.md for hallucination detection approaches and comparison protocols.
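The three quantities above can be computed once each generated claim has been labeled. The sketch below assumes claims have already been judged (by a human or an LLM grader) as `supported`, `contradicted`, or `unsupported`, and separately flagged as traceable to the KG or not; the label names and the `hallucination_rates` function are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each claim is (label, kg_grounded). Assumed labels:
#   "supported"    - backed by retrieved evidence
#   "contradicted" - conflicts with retrieved evidence (intrinsic hallucination)
#   "unsupported"  - backed by no retrieved source (extrinsic hallucination)
Claim = Tuple[str, bool]


def hallucination_rates(claims: List[Claim]) -> Dict[str, float]:
    """Compute intrinsic rate, extrinsic rate, and KG grounding rate over claims."""
    total = len(claims)
    if total == 0:
        raise ValueError("no claims to evaluate")
    labels = Counter(label for label, _ in claims)
    return {
        "intrinsic_rate": labels["contradicted"] / total,
        "extrinsic_rate": labels["unsupported"] / total,
        "kg_grounding_rate": sum(1 for _, grounded in claims if grounded) / total,
    }


claims = [
    ("supported", True),
    ("supported", True),
    ("contradicted", False),
    ("unsupported", False),
]
print(hallucination_rates(claims))
```

Keeping the grounding flag separate from the support label matters because a claim can be supported by retrieved text without being traceable to a specific KG entity or relation.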
Run identical test sets against baseline systems: pure vector RAG, LLM-only (no retrieval), and alternative graph configurations. Use controlled ablation studies to isolate the contribution of each component.
See methodology.md for baseline comparison and ablation study design.
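Because every system answers the same test set, the per-query results are paired, and a paired test is appropriate. The sketch below uses an exact McNemar test on per-query correctness as one possible choice; methodology.md may prescribe a different test, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    system_correct: Sequence[bool], baseline_correct: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-query correctness.

    Returns (system_only_wins, baseline_only_wins, p_value).
    """
    if len(system_correct) != len(baseline_correct):
        raise ValueError("both systems must be scored on the same queries")
    pairs = list(zip(system_correct, baseline_correct))
    b = sum(1 for s, v in pairs if s and not v)  # system right, baseline wrong
    c = sum(1 for s, v in pairs if v and not s)  # baseline right, system wrong
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


graphrag = [True] * 8 + [False] * 2
vector_rag = [False] * 6 + [True] * 4
print(mcnemar_exact(graphrag, vector_rag))  # (6, 2, 0.2890625)
```

In this toy example GraphRAG wins on six queries and loses on two, yet the p-value is about 0.29, which shows why a ten-query test set is too small to support claims of improvement.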
Compile findings into the structured output template below. Include metric values, baseline comparisons, identified weaknesses, and prioritized recommendations.
See rubric_evaluation.json for the scoring rubric (minimum passing score: 3.0).
| Dimension | What It Measures | Key Metrics | Priority |
|---|---|---|---|
| KG Quality | Completeness and accuracy of the knowledge graph | Entity coverage, relation completeness, schema consistency | High |
| Retrieval Quality | Effectiveness of graph-based retrieval | Context recall (C-Rec), context precision, multi-hop coverage | High |
| Answer Correctness | Accuracy and completeness of generated answers | Factual accuracy, answer completeness, citation accuracy | Critical |
| Hallucination Rate | Frequency of unsupported or contradicted claims | Intrinsic hallucination rate, extrinsic hallucination rate, KG grounding rate | Critical |
| Reasoning Depth | Ability to perform multi-step reasoning correctly | Multi-hop accuracy, stepwise verification score, error propagation rate | Medium-High |
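For the Retrieval Quality row, a simple set-based reading of context recall and context precision can be sketched as follows. This assumes each query has a gold set of relevant KG item IDs; some evaluation frameworks define these metrics over claims or ranked lists instead, so treat the definitions and the `retrieval_scores` name as assumptions for illustration.

```python
from typing import Dict, Iterable, Set


def retrieval_scores(retrieved: Iterable[str], relevant: Set[str]) -> Dict[str, float]:
    """Set-based context recall and precision for one query."""
    retrieved_set = set(retrieved)
    if not relevant:
        raise ValueError("gold relevant set is empty")
    hits = len(retrieved_set & relevant)
    return {
        # Share of the gold items that the retriever found.
        "context_recall": hits / len(relevant),
        # Share of the retrieved items that are actually relevant.
        "context_precision": hits / len(retrieved_set) if retrieved_set else 0.0,
    }


scores = retrieval_scores(["e1", "e2", "e3", "e4"], {"e1", "e3", "e5"})
print(scores)
```

Averaging these per-query scores over the test set gives the values to report; reporting them per query category as well can reveal whether multi-hop queries are where retrieval falls short.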
Choose metrics based on your evaluation goals:
Quick Health Check (minimal effort):
Standard Evaluation (recommended):
Comprehensive Benchmark (production readiness):
# GraphRAG Evaluation Report
## 1. System Under Evaluation
- System name and version:
- Domain:
- KG size (entities/relations):
- Evaluation date:
## 2. Evaluation Scope
- Dimensions evaluated:
- Test set size and composition:
- Baseline systems:
## 3. KG Quality Results
- Entity coverage: ____%
- Relation completeness: ____%
- Schema consistency score: ____
- Notable gaps:
## 4. Retrieval Quality Results
- Context recall (C-Rec): ____
- Context precision: ____
- Multi-hop coverage: ____%
- Latency (p50/p95/p99): ____
## 5. Answer Correctness Results
- Factual accuracy: ____%
- Answer completeness: ____%
- Citation accuracy: ____%
## 6. Hallucination Analysis
- Intrinsic hallucination rate: ____%
- Extrinsic hallucination rate: ____%
- KG grounding rate: ____%
- Comparison with/without graph augmentation:
## 7. Reasoning Depth Results
- Single-hop accuracy: ____%
- Multi-hop accuracy: ____%
- Stepwise reasoning correctness: ____%
- Error propagation incidents: ____
## 8. Baseline Comparison
| Metric | GraphRAG | Pure Vector RAG | LLM Only |
|--------|----------|-----------------|----------|
| Answer correctness | | | |
| Hallucination rate | | | |
| Multi-hop accuracy | | | |
## 9. Statistical Significance
- Test used:
- Confidence level:
- Significant improvements:
- Non-significant differences:
## 10. Identified Weaknesses
1.
2.
3.
## 11. Recommendations
| Priority | Recommendation | Expected Impact | Effort |
|----------|---------------|-----------------|--------|
| | | | |
## 12. Rubric Score
- Metric Coverage: __ / 5
- Measurement Rigor: __ / 5
- Baseline Comparison: __ / 5
- Reasoning Depth: __ / 5
- Actionable Recommendations: __ / 5
- **Weighted Total: __ / 5.0** (minimum passing: 3.0)