skills/rag-auditor/SKILL.md
Evaluates RAG pipeline quality across retrieval (precision, recall, MRR) and generation (groundedness, hallucination rate). Triggers on: "audit RAG pipeline", "RAG quality", "hallucination detection", "why is RAG failing", "grounding check". NOT for general architecture audits; use architecture-reviewer for those.
Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.
| File | Contents | Load When |
| ---------------------------------- | -------------------------------------------------------------------------- | ---------------------------- |
| references/retrieval-metrics.md | Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| references/generation-metrics.md | Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| references/failure-taxonomy.md | RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| references/diagnostic-queries.md | Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |
Document the RAG pipeline configuration:
Create a diverse set of test queries:
| Query Type             | Purpose                                     | Count |
| ---------------------- | ------------------------------------------- | ----- |
| Known-answer (factoid) | Measure retrieval + generation accuracy     | 10+   |
| Multi-hop              | Require combining info from multiple chunks | 5+    |
| Unanswerable           | Not in the corpus; should abstain           | 3+    |
| Ambiguous              | Multiple valid interpretations              | 3+    |
| Recent/updated         | Test freshness                              | 2+    |
For each query, document the expected answer and the source chunk(s).
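One way to keep the query set honest is to store each query with its type, expected answer, and source chunks, then check the set against the minimum counts in the table. The sketch below is illustrative only: the `EvalQuery` record, the type keys, and `coverage_gaps` are hypothetical names, not part of this skill.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Minimum counts per query type, copied from the table above.
# The snake_case keys are an assumption made for this sketch.
MIN_COUNTS = {
    "known_answer": 10,
    "multi_hop": 5,
    "unanswerable": 3,
    "ambiguous": 3,
    "recent": 2,
}


@dataclass
class EvalQuery:
    query: str
    query_type: str                      # one of the MIN_COUNTS keys
    expected_answer: Optional[str]       # None for unanswerable queries
    source_chunk_ids: list = field(default_factory=list)


def coverage_gaps(queries):
    """Return how many more queries each type needs to reach its minimum."""
    counts = Counter(q.query_type for q in queries)
    return {
        qtype: minimum - counts[qtype]
        for qtype, minimum in MIN_COUNTS.items()
        if counts[qtype] < minimum
    }
```

An empty result from `coverage_gaps` means every query type meets its minimum count.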
For each test query, measure:
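The retrieval metrics named in the overview (Precision@K, Recall@K, MRR) might be computed per query along these lines. This is a minimal sketch using the common textbook definitions; the function names are hypothetical, and `references/retrieval-metrics.md` remains the authority if its definitions differ.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top k."""
    if not relevant:
        # Unanswerable queries have no relevant chunks; exclude them
        # from averages rather than treating this 0.0 as a failure.
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Chunk IDs are assumed to be hashable values such as strings, with `retrieved` ordered from best to worst match.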
For each test query with retrieved context:
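Groundedness and hallucination rate can be aggregated once each response has been split into claims and every claim judged against the retrieved chunks. The sketch below assumes those judgments already exist (from a human reviewer or a separate judging step) and only shows the arithmetic. The `ClaimJudgment` record and both definitions are assumptions for illustration; `references/generation-metrics.md` may define the metrics differently.

```python
from dataclasses import dataclass


@dataclass
class ClaimJudgment:
    claim: str
    supported: bool  # True if the claim is traceable to a retrieved chunk


def groundedness(judgments):
    """Share of a response's claims that are supported by the context.

    Returns None when the response makes no claims (e.g. an abstention).
    """
    if not judgments:
        return None
    return sum(1 for j in judgments if j.supported) / len(judgments)


def hallucination_rate(responses):
    """Share of responses containing at least one unsupported claim.

    Assumed definition: some teams count unsupported claims instead
    of affected responses.
    """
    if not responses:
        return 0.0
    affected = sum(
        1 for judgments in responses
        if any(not j.supported for j in judgments)
    )
    return affected / len(responses)
```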
For every incorrect or low-quality response, classify the root cause:
| Failure Type         | Diagnosis                                          | Indicator                                |
| -------------------- | -------------------------------------------------- | ---------------------------------------- |
| Retrieval failure    | Relevant chunks not retrieved                      | Low Recall@K                             |
| Ranking failure      | Relevant chunk retrieved but ranked low            | Low MRR, high Recall                     |
| Chunk boundary issue | Answer split across chunk boundaries               | Partial matches in multiple chunks       |
| Embedding mismatch   | Query semantics don't match chunk embeddings       | Relevant chunk has low similarity score  |
| Generation failure   | Correct context but wrong answer                   | High retrieval scores, low groundedness  |
| Hallucination        | Model invents facts not in context                 | Claims not traceable to any chunk        |
| Over-abstention      | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
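A first-pass triage can apply the metric-based indicators from the table mechanically before anyone reads chunks by hand. The sketch below covers only the four failure types that per-query scores can separate; chunk boundary issues, embedding mismatch, and hallucination still need chunk-level inspection. The function name, check order, and both thresholds are hypothetical and should be tuned per pipeline.

```python
def classify_failure(recall, reciprocal_rank, groundedness, abstained,
                     rr_floor=0.5, grounded_floor=0.8):
    """Map per-query scores for a failed query to a likely failure type.

    rr_floor and grounded_floor are illustrative thresholds, not
    values prescribed by this skill.
    """
    if recall == 0.0:
        # No relevant chunk reached the generator at all.
        return "Retrieval failure"
    if abstained:
        # Relevant context was present, yet the model declined to answer.
        return "Over-abstention"
    if reciprocal_rank < rr_floor:
        # Relevant chunk was retrieved but sits low in the ranking.
        return "Ranking failure"
    if groundedness < grounded_floor:
        # Retrieval looks healthy; the answer is not supported by it.
        return "Generation failure"
    return "Unclassified"
```

Queries that come back `"Unclassified"` are the ones most worth reviewing manually against the remaining rows of the table.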
Based on failure analysis, recommend specific improvements:
| Failure Pattern       | Recommendation                                                 |
| --------------------- | -------------------------------------------------------------- |
| Chunk boundary issues | Increase overlap, try semantic chunking                        |
| Low Precision@K       | Reduce K, add reranking stage                                  |
| Low Recall@K          | Increase K, try hybrid search                                  |
| Embedding mismatch    | Try different embedding model, add query expansion             |
| Hallucination         | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention       | Soften abstention criteria in prompt                           |
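To turn a list of observed failure patterns into a priority-ordered recommendation list, one option is to count how many failures each recommendation addresses and place retrieval-side fixes ahead of generation-side ones. The sketch below copies the mapping from the table. The `prioritize` helper and the grouping of patterns into `RETRIEVAL_SIDE` are assumptions made for illustration.

```python
from collections import Counter

# Pattern -> recommendation, copied from the table above.
RECOMMENDATIONS = {
    "Chunk boundary issues": "Increase overlap, try semantic chunking",
    "Low Precision@K": "Reduce K, add reranking stage",
    "Low Recall@K": "Increase K, try hybrid search",
    "Embedding mismatch": "Try different embedding model, add query expansion",
    "Hallucination": "Strengthen grounding instruction in prompt, reduce temperature",
    "Over-abstention": "Soften abstention criteria in prompt",
}

# Assumed grouping: patterns fixed before the generator sees any context.
RETRIEVAL_SIDE = {
    "Chunk boundary issues",
    "Low Precision@K",
    "Low Recall@K",
    "Embedding mismatch",
}


def prioritize(patterns):
    """Return (recommendation, failures addressed), retrieval fixes first."""
    counts = Counter(p for p in patterns if p in RECOMMENDATIONS)
    ordered = sorted(
        counts.items(),
        key=lambda item: (item[0] not in RETRIEVAL_SIDE, -item[1]),
    )
    return [(RECOMMENDATIONS[pattern], n) for pattern, n in ordered]
```

Patterns not listed in the table are ignored rather than guessed at, so an unexpected label never produces a fabricated recommendation.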
## RAG Audit Report
### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |
### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}
### Retrieval Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |
### Generation Quality
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |
### Failure Analysis
| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |
### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}
### Sample Failures
#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}
| Problem                           | Resolution                                                                                              |
| --------------------------------- | ------------------------------------------------------------------------------------------------------- |
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions.                       |
| Pipeline access not available     | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review     | Sample-based evaluation. Select representative documents and generate queries from them.                |
| Multiple failure types co-exist   | Address retrieval failures first. Generation quality cannot exceed retrieval quality.                   |
Push back if: