mathews-tom/rag-auditor

Name: rag-auditor
Author: mathews-tom

skills/rag-auditor/SKILL.md

npx skillsauth add mathews-tom/armory rag-auditor

RAG Auditor

Systematic RAG pipeline evaluation across the full retrieval-generation chain: designs evaluation query sets, measures retrieval metrics (Precision@K, Recall@K, MRR), evaluates generation quality (groundedness, completeness, hallucination rate), diagnoses component-level failures, and recommends targeted improvements.

Reference Files

| File | Contents | Load When |
| --- | --- | --- |
| references/retrieval-metrics.md | Precision@K, Recall@K, MRR, NDCG definitions and calculation | Always |
| references/generation-metrics.md | Groundedness, completeness, hallucination detection methods | Generation evaluation needed |
| references/failure-taxonomy.md | RAG failure categories: retrieval, generation, chunking, embedding | Failure diagnosis needed |
| references/diagnostic-queries.md | Designing evaluation query sets, known-answer questions, difficulty levels | Evaluation setup |

Prerequisites

- Access to the RAG pipeline (or its outputs for post-hoc evaluation)
- A set of test queries with known-correct answers
- Understanding of the pipeline components (embedding model, retriever, generator)

Workflow

Phase 1: Pipeline Inventory

Document the RAG pipeline configuration:

- **Document source:** What documents are indexed? Format, count, size.
- **Chunking:** Strategy (fixed-size, semantic, paragraph), chunk size, overlap.
- **Embedding:** Model name and version, dimensionality.
- **Vector store:** Type (FAISS, Pinecone, Chroma, pgvector), index type.
- **Retrieval:** Method (similarity, hybrid, reranking), top-K parameter.
- **Generation:** Model, prompt template, context window usage.
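The inventory above can be captured as a small typed record so that every audit starts from the same fields. This is a minimal sketch: the class name `PipelineInventory`, its field names, and the example values are illustrative assumptions, not part of the skill.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PipelineInventory:
    """One record per audited pipeline; all field names are hypothetical."""
    document_format: str
    document_count: int
    chunk_strategy: str       # e.g. "fixed-size", "semantic", "paragraph"
    chunk_size_tokens: int
    chunk_overlap_pct: float
    embedding_model: str
    embedding_dims: int
    vector_store: str         # e.g. "FAISS", "Pinecone", "Chroma", "pgvector"
    retrieval_method: str     # e.g. "similarity", "hybrid", "reranking"
    top_k: int
    generation_model: str
    temperature: float

    def missing_fields(self) -> list[str]:
        """Return names of fields left empty, so gaps are flagged before evaluation."""
        return [name for name, value in asdict(self).items() if value in ("", None)]


# Placeholder values only; none of these describe a real pipeline.
inventory = PipelineInventory(
    document_format="markdown", document_count=1200,
    chunk_strategy="fixed-size", chunk_size_tokens=512, chunk_overlap_pct=10.0,
    embedding_model="", embedding_dims=768,
    vector_store="FAISS", retrieval_method="similarity", top_k=5,
    generation_model="example-llm", temperature=0.0,
)
print(inventory.missing_fields())  # ['embedding_model']
```

Flagging empty fields up front keeps later phases from silently evaluating a pipeline whose configuration is only partly known.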

Phase 2: Design Evaluation Queries

Create a diverse set of test queries:

| Query Type | Purpose | Count |
| --- | --- | --- |
| Known-answer (factoid) | Measure retrieval + generation accuracy | 10+ |
| Multi-hop | Require combining info from multiple chunks | 5+ |
| Unanswerable | Not in the corpus; the model should abstain | 3+ |
| Ambiguous | Multiple valid interpretations | 3+ |
| Recent/updated | Test freshness | 2+ |

For each query, document the expected answer and the source chunk(s).
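One way to keep the query set honest is to store each query with its expected answer and source chunks, then check the set against the minimum counts in the table. The sketch below assumes hypothetical names (`EvalQuery`, `coverage_gaps`) and short type labels chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Minimum counts taken from the query-type table above; the keys are shorthand labels.
MIN_COUNTS = {"known-answer": 10, "multi-hop": 5, "unanswerable": 3, "ambiguous": 3, "recent": 2}


@dataclass
class EvalQuery:
    text: str
    query_type: str
    expected_answer: Optional[str] = None          # None for unanswerable queries
    source_chunk_ids: list[str] = field(default_factory=list)


def coverage_gaps(queries: list[EvalQuery]) -> dict[str, int]:
    """Return how many more queries each type needs to reach its minimum."""
    counts = Counter(q.query_type for q in queries)
    return {t: need - counts[t] for t, need in MIN_COUNTS.items() if counts[t] < need}


# Toy set: ten factoid queries and three unanswerable ones.
queries = [EvalQuery(f"fact {i}?", "known-answer", "answer", ["c1"]) for i in range(10)]
queries += [EvalQuery(f"missing {i}?", "unanswerable") for i in range(3)]
print(coverage_gaps(queries))  # {'multi-hop': 5, 'ambiguous': 3, 'recent': 2}
```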

Phase 3: Evaluate Retrieval

For each test query, measure:

- **Precision@K:** Of the K retrieved chunks, how many are relevant?
- **Recall@K:** Of all relevant chunks in the corpus, how many were retrieved?
- **MRR (Mean Reciprocal Rank):** How high is the first relevant chunk ranked?
- **Chunk relevance:** Score each retrieved chunk: Relevant, Partially Relevant, Irrelevant.
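The three numeric metrics above can be computed from a ranked list of retrieved chunk IDs and a set of relevant chunk IDs. This is a minimal sketch using the standard definitions; the function names and the example IDs are assumptions, and Precision@K here divides by K even when fewer than K chunks come back.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Illustrative IDs only.
retrieved = ["c3", "c1", "c9", "c4", "c7"]
relevant = {"c1", "c4", "c8"}
print(precision_at_k(retrieved, relevant, 5))   # 0.4
print(reciprocal_rank(retrieved, relevant))     # 0.5
```

Keeping per-query scores before averaging makes it possible to trace a low aggregate back to the specific queries that caused it.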

Phase 4: Evaluate Generation

For each test query with retrieved context:

- **Groundedness:** Is every claim in the response supported by the retrieved context? Score: 0 (hallucinated) to 1 (fully grounded).
- **Completeness:** Does the response use all relevant information from the context? Score: 0 (ignored context) to 1 (complete).
- **Hallucination detection:** Identify specific claims not supported by context.
- **Abstention:** For unanswerable queries, does the model correctly say "I don't know"?
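Once each response has been split into claims and each claim labeled as supported or not (by a human reviewer or a judge model), the scores above reduce to simple ratios. The sketch below is one possible aggregation, not the skill's own method: `JudgedResponse` is a hypothetical record, and "hallucination rate" is assumed to mean the share of answered responses containing at least one unsupported claim. Completeness is left out because it needs a list of expected facts per query.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    answerable: bool               # False for "unanswerable" test queries
    abstained: bool                # True if the model declined to answer
    claims_supported: list[bool]   # one flag per claim, True if backed by retrieved context


def groundedness(r: JudgedResponse) -> float:
    """Share of claims supported by context: 0 (hallucinated) to 1 (fully grounded)."""
    if not r.claims_supported:
        return 1.0  # assumed convention: no claims means nothing is ungrounded
    return sum(r.claims_supported) / len(r.claims_supported)


def hallucination_rate(responses: list[JudgedResponse]) -> float:
    """Share of answered responses with at least one unsupported claim."""
    answered = [r for r in responses if not r.abstained]
    if not answered:
        return 0.0
    return sum(1 for r in answered if not all(r.claims_supported)) / len(answered)


def abstention_accuracy(responses: list[JudgedResponse]) -> float:
    """Share of unanswerable queries on which the model abstained."""
    unanswerable = [r for r in responses if not r.answerable]
    if not unanswerable:
        return 0.0
    return sum(1 for r in unanswerable if r.abstained) / len(unanswerable)


# Toy labels for illustration.
responses = [
    JudgedResponse(True, False, [True, True, False]),
    JudgedResponse(True, False, [True, True]),
    JudgedResponse(False, True, []),
    JudgedResponse(False, False, [False]),
]
print(abstention_accuracy(responses))  # 0.5
```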

Phase 5: Diagnose Failures

For every incorrect or low-quality response, classify the root cause:

| Failure Type | Diagnosis | Indicator |
| --- | --- | --- |
| Retrieval failure | Relevant chunks not retrieved | Low Recall@K |
| Ranking failure | Relevant chunk retrieved but ranked low | Low MRR, high Recall |
| Chunk boundary issue | Answer split across chunk boundaries | Partial matches in multiple chunks |
| Embedding mismatch | Query semantics don't match chunk embeddings | Relevant chunk has low similarity score |
| Generation failure | Correct context but wrong answer | High retrieval scores, low groundedness |
| Hallucination | Model invents facts not in context | Claims not traceable to any chunk |
| Over-abstention | Model refuses to answer when context is sufficient | Unanswered with relevant context present |
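A first-pass triage over the indicators in this table could look like the sketch below. It is a rough heuristic under stated assumptions: the function name, the thresholds (`low`, `rank_cutoff`), and the check order are all illustrative. It folds hallucination into generation failure because both show up as low groundedness, and it does not attempt chunk boundary or embedding mismatch diagnosis, which need chunk-level inspection.

```python
from typing import Optional


def classify_failure(recall_at_k: float, first_relevant_rank: Optional[int],
                     groundedness: float, abstained: bool,
                     low: float = 0.5, rank_cutoff: int = 3) -> str:
    """Map per-query signals to a failure type; thresholds are illustrative assumptions."""
    if first_relevant_rank is None or recall_at_k < low:
        return "Retrieval failure"      # relevant chunks were not retrieved
    if abstained:
        return "Over-abstention"        # refused although relevant context was present
    if first_relevant_rank > rank_cutoff:
        return "Ranking failure"        # retrieved, but ranked low
    if groundedness < low:
        return "Generation failure"     # good context, unsupported answer
    return "No failure detected"


print(classify_failure(1.0, 5, 0.9, False))  # Ranking failure
```

Retrieval is checked first so that a generation problem is only reported when the model actually had the right context to work with.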

Phase 6: Recommendations

Based on failure analysis, recommend specific improvements:

| Failure Pattern | Recommendation |
| --- | --- |
| Chunk boundary issues | Increase overlap, try semantic chunking |
| Low Precision@K | Reduce K, add reranking stage |
| Low Recall@K | Increase K, try hybrid search |
| Embedding mismatch | Try different embedding model, add query expansion |
| Hallucination | Strengthen grounding instruction in prompt, reduce temperature |
| Over-abstention | Soften abstention criteria in prompt |
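To turn the per-query diagnoses into a prioritized list, one simple approach is to count how many failures each pattern explains and order the recommendations by that count. The mapping below copies the table; the `prioritize` helper and the example labels are hypothetical.

```python
from collections import Counter

# Pattern-to-recommendation mapping, copied from the table above.
RECOMMENDATIONS = {
    "Chunk boundary issues": "Increase overlap, try semantic chunking",
    "Low Precision@K": "Reduce K, add reranking stage",
    "Low Recall@K": "Increase K, try hybrid search",
    "Embedding mismatch": "Try different embedding model, add query expansion",
    "Hallucination": "Strengthen grounding instruction in prompt, reduce temperature",
    "Over-abstention": "Soften abstention criteria in prompt",
}


def prioritize(failure_patterns: list[str]) -> list[tuple[str, int]]:
    """Return (recommendation, failures addressed) pairs, most frequent pattern first."""
    counts = Counter(p for p in failure_patterns if p in RECOMMENDATIONS)
    return [(RECOMMENDATIONS[p], n) for p, n in counts.most_common()]


# One label per failed query; the labels are illustrative.
observed = ["Hallucination", "Low Recall@K", "Hallucination",
            "Over-abstention", "Low Recall@K", "Hallucination"]
for recommendation, n in prioritize(observed):
    print(f"{recommendation} (addresses {n} failures)")
```

Ordering by frequency is only a starting point; an auditor may still want to move retrieval fixes ahead of generation fixes when both appear.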

Output Format

## RAG Audit Report

### Pipeline Configuration
| Component | Value |
|-----------|-------|
| Documents | {N} ({format}) |
| Chunking | {strategy}, {size} tokens, {overlap}% overlap |
| Embedding | {model} ({dimensions}d) |
| Retrieval | {method}, K={N} |
| Generation | {model}, temperature={T} |

### Evaluation Dataset
- **Total queries:** {N}
- **Known-answer:** {N}
- **Multi-hop:** {N}
- **Unanswerable:** {N}

### Retrieval Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Precision@{K} | {score} | {target} | {Pass/Fail} |
| Recall@{K} | {score} | {target} | {Pass/Fail} |
| MRR | {score} | {target} | {Pass/Fail} |

### Generation Quality

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Groundedness | {score} | {target} | {Pass/Fail} |
| Completeness | {score} | {target} | {Pass/Fail} |
| Hallucination rate | {score} | {target} | {Pass/Fail} |
| Abstention accuracy | {score} | {target} | {Pass/Fail} |

### Failure Analysis

| # | Query | Failure Type | Root Cause | Recommendation |
|---|-------|-------------|------------|----------------|
| 1 | {query} | {type} | {cause} | {fix} |

### Recommendations (Priority Order)
1. **{Recommendation}** — addresses {N} failures, expected impact: {description}
2. **{Recommendation}** — addresses {N} failures, expected impact: {description}

### Sample Failures

#### Query: "{query}"
- **Expected:** {answer}
- **Retrieved chunks:** {chunk summaries with relevance scores}
- **Generated:** {response}
- **Issue:** {diagnosis}
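The metric rows in this template can be filled in mechanically once scores and targets are known. The helper below is a small sketch; `metric_row` is a hypothetical name, and the scores and targets in the example are placeholders rather than recommended thresholds. Hallucination rate is treated as lower-is-better.

```python
def metric_row(name: str, score: float, target: float, higher_is_better: bool = True) -> str:
    """Render one markdown table row with a Pass/Fail status against the target."""
    passed = score >= target if higher_is_better else score <= target
    status = "Pass" if passed else "Fail"
    return f"| {name} | {score:.2f} | {target:.2f} | {status} |"


print(metric_row("Precision@5", 0.3, 0.8))
# | Precision@5 | 0.30 | 0.80 | Fail |
print(metric_row("Hallucination rate", 0.05, 0.10, higher_is_better=False))
# | Hallucination rate | 0.05 | 0.10 | Pass |
```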

Calibration Rules

- **Component isolation.** Evaluate retrieval and generation independently. A great retriever with a bad generator looks like retrieval failure if you only check end output.
- **Known answers first.** Start with factoid questions where the correct answer is unambiguous. Multi-hop and ambiguous queries are harder to evaluate.
- **Quantify, don't qualify.** "Retrieval is bad" is not a finding. "Precision@5 is 0.3 (target: 0.8) with 70% of failures due to chunk boundary splits" is actionable.
- **Sample failures deeply.** Aggregate metrics identify WHERE the problem is. Individual failure analysis identifies WHY.

Error Handling

| Problem | Resolution |
| --- | --- |
| No known-answer queries available | Help design them from the document corpus. Pick 10 facts and formulate questions. |
| Pipeline access not available | Work from recorded inputs/outputs. Post-hoc evaluation is possible with query-context-response triples. |
| Corpus is too large to review | Sample-based evaluation. Select representative documents and generate queries from them. |
| Multiple failure types co-exist | Address retrieval failures first. Generation quality cannot exceed retrieval quality. |

When NOT to Audit

Push back if:

- The pipeline hasn't been built yet: design it first, audit after.
- The corpus has fewer than 10 documents: too small for meaningful retrieval evaluation.
- The user wants to compare embedding models: that's a benchmark task, not an audit.

Evaluates RAG pipeline quality across retrieval (precision, recall, MRR) and generation (groundedness, hallucination rate). Triggers on: "audit RAG pipeline", "RAG quality", "hallucination detection", "why is RAG failing", "grounding check". NOT for general architecture audits, use architecture-reviewer.

221 stars

testing

Updated May 4, 2026

npx skillsauth add mathews-tom/armory rag-auditor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 4, 2026, 7:08 AM (177.8s, 6 files scanned)

SKILL.md

name: rag-auditor
description: Evaluates RAG pipeline quality across retrieval (precision, recall, MRR) and generation (groundedness, hallucination rate). Triggers on: "audit RAG pipeline", "RAG quality", "hallucination detection", "why is RAG failing", "grounding check". NOT for general architecture audits, use architecture-reviewer.
version: 1.1.1
category: review
tags: [rag, retrieval, hallucination, grounding]
difficulty: advanced
phase: build

Install manually

# Clone the repo
git clone https://github.com/mathews-tom/armory.git

# Copy into Claude Code skills folder (global)
cp -r armory/skills/rag-auditor ~/.claude/skills/

Repository

mathews-tom/armory

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT