skills/council/oracle/ai-evaluation/SKILL.md
Use when designing an evaluation framework for AI/LLM features. Covers golden dataset creation, automated scoring rubrics, hallucination detection, regression testing infrastructure, and production monitoring. Do not use for prompt design (use prompt-engineering) or RAG pipeline architecture (use rag-architecture).
npx skillsauth add dtsong/my-claude-setup ai-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Design an evaluation framework for AI/LLM features, including golden dataset creation, automated scoring rubrics, hallucination detection, and regression testing infrastructure.
Reads feature specifications, evaluation requirements, and existing infrastructure details for framework design. Does not execute model inference, create production datasets, or access live model endpoints directly.
No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.
Identify what "good" means for this feature:
Create a high-quality evaluation dataset:
Create scoring rubrics that can run without human review:
Build specific checks for fabricated content:
Build a CI/CD-compatible evaluation pipeline:
Plan ongoing quality monitoring:
Compaction resilience: If context was lost during a long session, re-read the Inputs section to reconstruct what system is being analyzed, check the Progress Checklist for completed steps, then resume from the earliest incomplete step.
# AI Evaluation Framework
## Evaluation Dimensions
| Dimension | Weight | Scoring Method | Threshold |
|-----------|--------|---------------|-----------|
| Correctness | 40% | LLM-as-judge (1-5) | ≥ 4.0 |
| Faithfulness | 30% | Reference validation | ≥ 95% |
| Relevance | 20% | Semantic similarity | ≥ 0.85 |
| Format compliance | 10% | Rule-based | 100% |
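
One way to combine the four dimensions into a single composite is a weighted sum over normalized scores. The sketch below is illustrative only. It assumes every dimension is first rescaled to 0-1 (for example, a judge score of 4.0 out of 5 becomes 0.8), which the table does not specify. The names `Dimension`, `composite_score`, and `failing_dimensions` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float     # share of the composite; weights sum to 1.0
    threshold: float  # minimum acceptable value on the normalized 0-1 scale


# Weights and thresholds come from the table above.
# Rescaling everything to 0-1 is an assumption made for this sketch.
DIMENSIONS = [
    Dimension("correctness", 0.40, 4.0 / 5.0),
    Dimension("faithfulness", 0.30, 0.95),
    Dimension("relevance", 0.20, 0.85),
    Dimension("format_compliance", 0.10, 1.0),
]


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of normalized per-dimension scores."""
    return sum(d.weight * scores[d.name] for d in DIMENSIONS)


def failing_dimensions(scores: dict[str, float]) -> list[str]:
    """Names of dimensions that fall below their own threshold."""
    return [d.name for d in DIMENSIONS if scores[d.name] < d.threshold]
```

Checking per-dimension thresholds separately from the composite keeps a strong correctness score from masking a faithfulness failure.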
## Golden Dataset
**Size:** [N examples]
**Distribution:**
| Category | Count | Description |
|----------|-------|-------------|
| Common cases | N | [Description] |
| Edge cases | N | [Description] |
| Adversarial | N | [Description] |
**Storage:** [Location in repo]
**Versioning:** [Approach]
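
A golden dataset stored in the repo could be kept as JSON Lines, one example per line, and validated before each run. The following is a minimal sketch under that assumption. The field names (`id`, `input`, `expected`, `category`) and the category labels are hypothetical, chosen to mirror the distribution table above.

```python
import json
from collections import Counter

# Hypothetical schema: these field names and labels are assumptions.
REQUIRED_FIELDS = {"id", "input", "expected", "category"}
CATEGORIES = {"common", "edge", "adversarial"}


def load_examples(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text and reject malformed examples."""
    examples = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if record["category"] not in CATEGORIES:
            raise ValueError(f"line {line_no}: unknown category {record['category']!r}")
        examples.append(record)
    return examples


def distribution(examples: list[dict]) -> Counter:
    """Count examples per category to compare against the planned distribution."""
    return Counter(example["category"] for example in examples)
```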
## Scoring Rubric
### Correctness (LLM-as-Judge)
| Score | Criteria |
|-------|---------|
| 5 | Perfect — matches expected output in all aspects |
| 4 | Good — minor differences that don't affect usefulness |
| 3 | Acceptable — correct core answer with some issues |
| 2 | Poor — partially correct but missing key information |
| 1 | Wrong — incorrect or misleading answer |
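
To run the rubric without human review, it can be embedded in a judge prompt and the numeric score parsed from the reply. This sketch assumes the judge is instructed to answer with a `Score: <1-5>` line. That reply format, the prompt wording, and the function names are all hypothetical, and the call to the judge model itself is left out.

```python
import re

# Rubric text taken from the table above.
RUBRIC = {
    5: "Perfect: matches expected output in all aspects",
    4: "Good: minor differences that don't affect usefulness",
    3: "Acceptable: correct core answer with some issues",
    2: "Poor: partially correct but missing key information",
    1: "Wrong: incorrect or misleading answer",
}

# Assumed reply format: a line such as "Score: 4".
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b")


def build_judge_prompt(question: str, expected: str, actual: str) -> str:
    """Assemble a judge prompt that embeds the rubric verbatim."""
    rubric_lines = "\n".join(f"{score} = {text}" for score, text in RUBRIC.items())
    return (
        "Rate the actual answer against the expected answer.\n"
        f"Rubric:\n{rubric_lines}\n\n"
        f"Question: {question}\nExpected: {expected}\nActual: {actual}\n"
        "Reply with 'Score: <1-5>' on the first line, then a one-sentence reason."
    )


def parse_judge_score(reply: str) -> int:
    """Extract the 1-5 score; fail loudly rather than guess on a malformed reply."""
    match = SCORE_PATTERN.search(reply)
    if match is None:
        raise ValueError("judge reply did not contain a valid 'Score: <1-5>' line")
    return int(match.group(1))
```

Raising on an unparseable reply, instead of defaulting to a middle score, keeps judge failures visible in the evaluation report.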
## Hallucination Detection
| Check | Method | Severity |
|-------|--------|----------|
| Reference validation | [Approach] | Critical |
| Entity verification | [Approach] | High |
| Contradiction detection | [Approach] | High |
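
The table leaves each approach open. As one possible reading of reference validation, the sketch below flags citations in a model output that do not correspond to any source actually supplied to the model. It assumes outputs cite sources with bracketed IDs such as `[doc-1]`. That citation format and the function name are hypothetical.

```python
import re

# Assumed citation format: bracketed source IDs like [doc-1].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def unsupported_citations(output: str, known_ids: set[str]) -> list[str]:
    """Return cited source IDs that were never provided as references."""
    cited = set(CITATION_PATTERN.findall(output))
    return sorted(cited - known_ids)
```

For example, `unsupported_citations("Claim A [doc-1]. Claim B [doc-9].", {"doc-1", "doc-2"})` returns `["doc-9"]`. A check like this only catches fabricated references; it does not verify that a cited source supports the claim.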
## Regression Testing Pipeline
[Code change] → [Smoke test (20 examples)] → [Pass?] → [Merge]
[Release] → [Full eval (200+ examples)] → [Pass threshold?] → [Deploy]
**Threshold:** Composite score ≥ [X] to pass
**Reporting:** [Where results are published]
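
The pass/fail decision in either stage can be reduced to a small gate function that a CI job calls after scoring. This is a sketch, not a prescribed implementation. The stage sizes come from the pipeline above, the threshold is deliberately a required parameter because the template leaves it as `[X]`, and the name `gate` is hypothetical.

```python
from statistics import mean

SMOKE_MIN_EXAMPLES = 20   # smoke test size from the pipeline above
FULL_MIN_EXAMPLES = 200   # full eval lower bound from the pipeline above


def gate(scores: list[float], threshold: float, min_examples: int) -> tuple[bool, float]:
    """Return (passed, mean composite score) for one pipeline stage.

    Raises ValueError when too few examples were scored, so a partially
    failed eval run cannot pass by accident.
    """
    if len(scores) < min_examples:
        raise ValueError(f"expected at least {min_examples} scores, got {len(scores)}")
    average = mean(scores)
    return average >= threshold, average
```

A CI wrapper might call `gate(scores, threshold, SMOKE_MIN_EXAMPLES)` before merge and exit non-zero when the first element is `False`.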
## Production Monitoring
| Metric | Sample Rate | Alert Threshold |
|--------|------------|-----------------|
| Composite score | 5% of requests | < [X] |
| Hallucination rate | 5% of requests | > [X%] |
| User satisfaction | All feedback | < [X] thumbs-up rate |
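
Sampling 5% of requests can be done deterministically by hashing a request identifier, so the same request is always either in or out of the sample. The sketch below assumes each request carries a string ID. The function name and the choice of SHA-256 are illustrative, not part of the template.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Decide deterministically whether a request falls in the evaluation sample."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hash-based sampling avoids shared random state across service instances and makes it possible to reproduce which requests were scored.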