skills/domain-specific-knowledge-graphs-rag-enhanced/SKILL.md
Build scope-matched knowledge graph RAG pipelines where retrieval precision beats breadth. Constructs domain-specific KGs from scientific literature, selects scope-aligned subgraphs for retrieval, and injects focused context into LLM prompts — avoiding the accuracy loss caused by union-based retrieval. Use when: "Build a knowledge graph RAG pipeline for medical questions", "Add domain-specific retrieval to my LLM app", "My RAG pipeline returns too much irrelevant context", "Help me scope my knowledge graph to match my query domain", "Design a biomedical QA system with knowledge graphs", "Reduce noise in my retrieval-augmented generation system"
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills domain-specific-knowledge-graphs-rag-enhanced
This skill teaches Claude to design and build RAG systems that use precision-first, scope-matched knowledge graphs instead of naive "retrieve everything" approaches. The core insight from Anuyah et al. (2026) is that narrowly scoped KGs aligned to the query domain consistently outperform broad graph unions, which introduce distractors that degrade accuracy — especially for smaller models. Claude applies this to help users construct domain-specific KGs from literature, select the right subgraph per query, and wire it into a RAG pipeline that actually improves LLM output rather than hurting it.
Scope-matched KG-RAG rejects the assumption that more retrieval context is better. The paper constructs three PubMed-derived knowledge graphs — G1 (Type 2 Diabetes), G2 (Alzheimer's Disease), and G3 (combined AD+T2DM) — and tests them across seven LLMs in six retrieval configurations (No-RAG, individual graphs, pairwise unions, full union). The decisive finding: when the KG's scope aligns with the probe's domain, accuracy improves consistently (e.g., Mixtral macro-F1 jumped from 0.80 to 0.89 with scope-matched G2). When scope is misaligned or graphs are unioned indiscriminately, distractors flood the context and accuracy drops — Llama-3.3-70B fell from 0.96 to 0.71 when given the wrong graph.
The KG construction uses CoDe-KG, a pipeline that applies co-reference resolution (resolving synonyms like "T2DM" and "type 2 diabetes"), syntactic decomposition (breaking complex sentences into atomic clauses), and relation extraction with source tagging (paper ID, sentence ID, clause ID for traceability). Edges are limited to causal relations (causes, because, etc.), and entity names are canonicalized. This produces clean, typed triples like (insulin_resistance, causes, neuronal_tau_phosphorylation).
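The canonicalization, causal-edge filtering, and source tagging described above can be illustrated with a small sketch. This is not the CoDe-KG implementation: the synonym table, the `CAUSAL_RELATIONS` set, and the names `SourcedTriple`, `canonical_name`, and `make_triple` are all assumptions made for illustration, and a real pipeline would derive synonyms from co-reference resolution rather than a hand-written dictionary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical synonym table standing in for co-reference resolution output.
SYNONYMS = {
    "t2dm": "type_2_diabetes",
    "type 2 diabetes": "type_2_diabetes",
}

# Assumed set of relation labels that count as causal edges.
CAUSAL_RELATIONS = {"causes", "because"}


@dataclass(frozen=True)
class SourcedTriple:
    head: str
    relation: str
    tail: str
    paper_id: str      # provenance: which paper
    sentence_id: int   # provenance: which sentence in that paper
    clause_id: int     # provenance: which atomic clause in that sentence


def canonical_name(mention: str) -> str:
    """Lowercase, collapse whitespace, map known synonyms, snake_case the rest."""
    key = " ".join(mention.lower().split())
    return SYNONYMS.get(key, key.replace(" ", "_"))


def make_triple(head: str, relation: str, tail: str,
                paper_id: str, sentence_id: int, clause_id: int) -> Optional[SourcedTriple]:
    """Return a canonicalized, source-tagged triple, or None for non-causal relations."""
    rel = relation.strip().lower()
    if rel not in CAUSAL_RELATIONS:
        return None
    return SourcedTriple(canonical_name(head), rel, canonical_name(tail),
                         paper_id, sentence_id, clause_id)
```

With this sketch, `make_triple("T2DM", "causes", "insulin resistance", "paper-1", 3, 1)` yields a triple whose head is `type_2_diabetes`, so both spellings of the disease land on the same node.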
A critical secondary finding: model size determines whether RAG helps at all. Larger models (70B+) often matched or exceeded KG-RAG performance on broad-domain probes using parametric knowledge alone. Smaller/mid-sized models (7B-32B) showed the clearest gains from well-scoped retrieval. Temperature had minimal impact — low temperatures (0.0-0.3) were consistently best. This means the first design decision isn't "what to retrieve" but "does this model even need retrieval for this domain?"
Define the query domain precisely. Identify the specific subdomain your system must answer questions about. Narrow beats broad — "Alzheimer's disease mechanisms" is better than "neuroscience." Write down the entity types (diseases, genes, drugs, biomarkers) and relation types (causes, treats, inhibits) you need.
Assess whether RAG is needed for your model. If using a 70B+ parameter model on well-studied topics, run a No-RAG baseline first. Only add KG-RAG if the baseline shows gaps. For 7B-32B models, scope-matched retrieval is almost always beneficial.
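A baseline check of this kind could be sketched as follows. The function names and the `min_gain` threshold are hypothetical; the sketch assumes you already have model predictions with and without retrieval for the same labeled questions, and it uses plain accuracy for simplicity.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def compare_baseline(no_rag_preds: list[str], kg_rag_preds: list[str],
                     gold: list[str], min_gain: float = 0.02) -> dict:
    """Report both accuracies and whether KG-RAG clears an assumed minimum gain."""
    base = accuracy(no_rag_preds, gold)
    rag = accuracy(kg_rag_preds, gold)
    return {"no_rag": base, "kg_rag": rag, "enable_rag": (rag - base) >= min_gain}
```

If `enable_rag` comes back `False` for a large model, that is consistent with the guidance above: the model's parametric knowledge may already cover the domain, and retrieval adds cost without benefit.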
Construct domain-scoped knowledge graphs from literature. For each subdomain, build a separate KG:
Store each KG separately — do not merge prematurely. Use a graph database (Neo4j) or in-memory structure (NetworkX, dictionary of adjacency lists). Keep G1, G2, ..., Gn as independent graphs so you can select the right one at query time.
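As a minimal in-memory sketch of keeping graphs independent, the class below holds one adjacency-list dictionary per scope and never merges them. `KGRegistry` and its method names are hypothetical; a graph database or NetworkX graph per scope would serve the same purpose.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (head, relation, tail)


class KGRegistry:
    """Holds one adjacency-list graph per scope; graphs are never merged."""

    def __init__(self) -> None:
        self._graphs = {}

    def add_triple(self, scope: str, head: str, relation: str, tail: str) -> None:
        """Index the triple under both its head and its tail within one scope."""
        graph = self._graphs.setdefault(scope, defaultdict(list))
        triple = (head, relation, tail)
        graph[head].append(triple)
        if tail != head:
            graph[tail].append(triple)

    def graph(self, scope: str) -> dict:
        """Return the adjacency list for a single scope."""
        return self._graphs[scope]

    def scopes(self) -> list[str]:
        """List the available scopes in sorted order."""
        return sorted(self._graphs)
```

Indexing each triple under both endpoints means a query entity finds its edges whether it appears as cause or effect, while the per-scope dictionaries keep, for example, a diabetes graph from leaking into Alzheimer's retrievals.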
Build a scope-matching retrieval layer. Given an incoming query, determine which KG(s) are scope-aligned before retrieval:
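One simple way to approximate this decision is entity coverage: pick the graph whose entity set covers the largest share of the query's entities, and fall back to no retrieval when coverage is too low. This is a sketch under assumptions, not the paper's method; `select_scope` and the `min_coverage` threshold are hypothetical, and the query entities are assumed to be canonicalized already.

```python
from typing import Optional


def select_scope(query_entities: list[str], graphs: dict[str, dict],
                 min_coverage: float = 0.5) -> Optional[str]:
    """Return the name of the best-covering graph, or None to skip retrieval.

    `graphs` maps a scope name to an adjacency-list dict keyed by entity name.
    """
    wanted = set(query_entities)
    if not wanted:
        return None
    best_name, best_coverage = None, 0.0
    for name in sorted(graphs):  # sorted for deterministic tie-breaking
        coverage = len(wanted & set(graphs[name])) / len(wanted)
        if coverage > best_coverage:
            best_name, best_coverage = name, coverage
    return best_name if best_coverage >= min_coverage else None
```

Returning `None` rather than the least-bad graph is deliberate: retrieving from a misaligned graph is the failure the scope-matching layer exists to prevent, so an unmatched query should run without retrieved context.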
Format retrieved triples as structured context. Inject the retrieved subgraph into the prompt as a clearly delimited block of factual statements. Use a format like:
Retrieved domain knowledge (causal relationships):
- insulin_resistance → causes → neuronal_tau_phosphorylation
- amyloid_beta_accumulation → causes → synaptic_dysfunction
Keep the context focused — 10-30 relevant triples outperform 200 loosely related ones.
Use a zero-shot instruction prompt that constrains the output format. For structured QA, use:
You are answering a domain-specific question. Use ONLY the retrieved
knowledge below and your training to answer. If the retrieved knowledge
conflicts with your prior understanding, prefer the retrieved knowledge.
For open-ended questions, instruct the model to cite which retrieved triples support its answer.
Set decoding temperature low (0.0-0.3). Higher temperatures rarely help in domain-specific KG-RAG and often introduce hallucinated reasoning chains.
Build diagnostic probes to evaluate your pipeline. Create targeted test questions at three difficulty levels:
Iterate on scope boundaries. If evaluation shows accuracy drops on certain question types, check whether the retrieved triples are scope-aligned. Common failure modes: directionality flips (A causes B vs. B causes A), chain ordering errors, and negation/exception misreads.
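To see which retrieval configuration helps or hurts, it can be useful to score every configuration on the same probes. The sketch below computes macro-F1 in plain Python; `macro_f1` and `compare_configs` are hypothetical names, and the configuration labels you pass in (for example a No-RAG run and a scope-matched run) are your own.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal length")
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def compare_configs(gold: list[str], preds_by_config: dict[str, list[str]]) -> dict[str, float]:
    """Score each retrieval configuration against the same gold labels."""
    return {name: macro_f1(gold, preds) for name, preds in preds_by_config.items()}
```

A configuration that scores below the No-RAG run on a given question type is a candidate for the scope check described above: inspect its retrieved triples before changing the model or the prompt.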
Example 1: Building a scoped medical QA pipeline
User: "I'm building a QA system for oncologists about drug interactions in breast cancer treatment. I have 5,000 PubMed abstracts. How should I set up the RAG pipeline?"
Approach:
(tamoxifen, inhibits, estrogen_receptor_alpha), (CYP2D6_polymorphism, reduces_efficacy_of, tamoxifen)
Output structure:
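```python
# Knowledge graph construction
from dataclasses import dataclass


@dataclass
class Triple:
    head: str
    relation: str
    tail: str
    source_paper: str
    source_sentence: int


def canonicalize(entity: str) -> str:
    """Placeholder normalizer (an assumption): lowercase and snake_case the mention."""
    return "_".join(entity.lower().split())


def relevance_score(triple: Triple, query_entities: list[str]) -> int:
    """Placeholder score (an assumption): how many triple endpoints are query entities."""
    wanted = {canonicalize(e) for e in query_entities}
    return (triple.head in wanted) + (triple.tail in wanted)


# Scoped retrieval function
def retrieve_scoped(query_entities: list[str], kg: dict[str, list[Triple]], top_k: int = 20) -> list[Triple]:
    """Retrieve triples indexed under the query entities, most relevant first."""
    matches = []
    for entity in query_entities:
        canonical = canonicalize(entity)
        if canonical in kg:
            matches.extend(kg[canonical])
    # Rank by relevance, highest score first, then keep the top_k triples
    return sorted(matches, key=lambda t: relevance_score(t, query_entities), reverse=True)[:top_k]


# Prompt injection
def build_prompt(question: str, triples: list[Triple]) -> str:
    """Format the retrieved triples as a delimited context block ahead of the question."""
    context = "\n".join(f"- {t.head} → {t.relation} → {t.tail}" for t in triples)
    return f"""Retrieved domain knowledge (causal relationships):
{context}

Based on the above knowledge, answer the following question.
Question: {question}
Answer:"""
```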
Example 2: Diagnosing why RAG is hurting accuracy
User: "I added a knowledge graph to my RAG pipeline but my LLM's accuracy dropped from 92% to 78%. What's going wrong?"
Approach:
Diagnostic checklist:
[ ] Is the KG scope narrower than or equal to the query domain?
[ ] Are retrieved triples directly relevant (inspect 20 random retrievals)?
[ ] Is the model large enough that No-RAG already performs well?
[ ] Are multiple KGs being unioned without filtering?
[ ] Are vague/generic triples ("it causes problems") being filtered out?
[ ] Is temperature set to 0.0-0.3?
Example 3: Deciding scope boundaries for multi-domain queries
User: "My users ask questions that span diabetes AND cardiovascular disease. Should I build one combined KG or two separate ones?"
Approach:
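insulin_resistance → increases_risk_of → atherosclerosis) without importing domain-specific noise

```python
import math


def cosine_sim(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def compute_kg_intersection(kg1: dict, kg2: dict, embed, similarity_threshold: float = 0.65) -> dict:
    """Find triples with entities present in both KGs; `embed` maps a name to a vector."""
    intersection = {}
    kg2_embeddings = {e: embed(e) for e in kg2}
    for entity, triples in kg1.items():
        entity_emb = embed(entity)
        for kg2_entity, kg2_emb in kg2_embeddings.items():
            if cosine_sim(entity_emb, kg2_emb) >= similarity_threshold:
                canonical = min(entity, kg2_entity)  # stand-in for a real pick_canonical rule
                # Extend rather than overwrite so several matches accumulate
                intersection.setdefault(canonical, []).extend(triples + kg2[kg2_entity])
    return intersection
```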
| Failure Mode | Cause | Fix |
|---|---|---|
| Accuracy drops after adding RAG | Scope mismatch between KG and query domain | Narrow the KG or add a scope classifier before retrieval |
| Directionality errors (A→B vs B→A) | Causal direction lost during extraction | Enforce directed edge validation; use multi-hop probes to test |
| Duplicate/conflicting triples | Entity synonyms not resolved | Apply co-reference resolution and canonical name normalization |
| Vague triples pollute context | Extraction captured pronouns/generic terms | Add rule-based filter: reject triples where head or tail is a pronoun, demonstrative, or <4 characters |
| Model ignores retrieved context | Prompt doesn't emphasize retrieved knowledge | Add explicit instruction: "Prefer the retrieved knowledge over your training data for this question" |
| Cross-domain queries return nothing | No intersection between domain KGs | Lower similarity threshold (try 0.50) or verify KGs actually share entities |
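The rule-based filter for vague triples mentioned in the table could look like the sketch below. The word list and the names `is_vague` and `filter_triples` are assumptions, and triples are represented as plain `(head, relation, tail)` tuples. Note that the 4-character minimum also rejects short but legitimate entity names, so the `min_length` parameter is left adjustable.

```python
# Assumed list of pronouns and demonstratives; extend it for your corpus.
VAGUE_TERMS = {
    "it", "its", "they", "them", "their", "he", "she",
    "this", "that", "these", "those",
}


def is_vague(term: str, min_length: int = 4) -> bool:
    """True if the term is a pronoun/demonstrative or shorter than min_length."""
    cleaned = term.strip().lower()
    return cleaned in VAGUE_TERMS or len(cleaned) < min_length


def filter_triples(triples: list[tuple[str, str, str]],
                   min_length: int = 4) -> list[tuple[str, str, str]]:
    """Keep only triples whose head and tail both pass the vagueness check."""
    return [
        t for t in triples
        if not is_vague(t[0], min_length) and not is_vague(t[2], min_length)
    ]
```

Running the filter at construction time rather than at query time keeps vague edges out of every downstream retrieval, which is cheaper than filtering each retrieved context.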
causes/because edges. Domains requiring hierarchical (is-a), compositional (part-of), or temporal relations need schema extensions.
Anuyah, S., Kaushik, M. M., Dai, H., Shiradkar, R., & Durresi, A. (2026). Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs. arXiv:2601.15429v1. https://arxiv.org/abs/2601.15429v1
Key takeaway: Scope-matched retrieval from narrow domain KGs consistently outperforms broad graph unions. Precision of retrieved context matters more than volume — and for large models, No-RAG may already be sufficient.