dtsong/ai-evaluation

Name: ai-evaluation
Author: dtsong

skills/council/oracle/ai-evaluation/SKILL.md

npx skillsauth add dtsong/my-claude-setup ai-evaluation

AI Evaluation

Purpose

Design an evaluation framework for AI/LLM features, including golden dataset creation, automated scoring rubrics, hallucination detection, regression testing infrastructure, and production monitoring.

Scope Constraints

Reads feature specifications, evaluation requirements, and existing infrastructure details for framework design. Does not execute model inference, create production datasets, or access live model endpoints directly.

Inputs

AI feature being evaluated (what it does, expected behavior)
Input data examples and edge cases
Quality requirements (accuracy thresholds, hallucination tolerance)
Existing evaluation infrastructure (if any)
Production monitoring requirements

Input Sanitization

No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.

Procedure

Progress Checklist

[ ] Step 1: Define evaluation dimensions
[ ] Step 2: Build golden dataset
[ ] Step 3: Design automated scoring
[ ] Step 4: Design hallucination detection
[ ] Step 5: Design regression testing
[ ] Step 6: Design production monitoring

Step 1: Define Evaluation Dimensions

Identify what "good" means for this feature:

Correctness: Does the output match the expected answer?
Faithfulness: Does the output only use information from the provided context?
Relevance: Does the output answer the actual question asked?
Completeness: Does the output cover all aspects of the question?
Format compliance: Does the output match the expected structure?
Safety: Does the output avoid harmful, biased, or inappropriate content?

Step 2: Build Golden Dataset

Create a high-quality evaluation dataset:

Size: Minimum 50 examples, ideally 200+ so that score differences between runs are less likely to be sampling noise
Distribution: Cover common cases (60%), edge cases (25%), adversarial cases (15%)
Labeling: Each example has input, expected output, and scoring criteria
Source: Real user queries (anonymized) + synthetically generated edge cases
Versioning: Dataset is version-controlled alongside the code
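
The dataset requirements above can be checked mechanically before any evaluation runs. The following is a minimal sketch, not part of the skill itself: the `GoldenExample` record, the category labels, and the 5-point tolerance on the 60/25/15 mix are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Target mix from Step 2. The category label strings are hypothetical names.
TARGET_MIX = {"common": 0.60, "edge": 0.25, "adversarial": 0.15}


@dataclass(frozen=True)
class GoldenExample:
    input_text: str
    expected_output: str
    scoring_criteria: str
    category: str  # expected to be one of TARGET_MIX's keys


def check_dataset(examples, min_size=50, tolerance=0.05):
    """Return a list of human-readable problems; an empty list means the dataset passes."""
    problems = []
    if len(examples) < min_size:
        problems.append(f"only {len(examples)} examples, need at least {min_size}")
    counts = Counter(ex.category for ex in examples)
    unknown = set(counts) - set(TARGET_MIX)
    if unknown:
        problems.append(f"unknown categories: {sorted(unknown)}")
    total = max(len(examples), 1)  # avoid dividing by zero on an empty dataset
    for category, target in TARGET_MIX.items():
        share = counts.get(category, 0) / total
        if abs(share - target) > tolerance:
            problems.append(f"{category}: {share:.0%} of dataset, target {target:.0%}")
    return problems
```

Running a check like this in CI alongside the versioned dataset file would catch a dataset that drifts toward only common cases as examples are added.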

Step 3: Design Automated Scoring

Create scoring rubrics that can run without human review:

Exact match: For classification, extraction, or structured output (score: 0 or 1)
Semantic similarity: Embedding-based comparison of generated vs expected (score: 0-1)
LLM-as-judge: Use a stronger model to evaluate the output (score: 1-5 rubric)
Rule-based checks: Required fields present, format valid, no PII leaked
Composite score: Weighted combination of individual dimensions
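
To make the composite score concrete, here is a small sketch of how the individual scores might be combined. It assumes every dimension is first put on a 0-1 scale; the linear mapping of the 1-5 judge score and the case-insensitive exact match are illustrative choices, not something the skill prescribes.

```python
def exact_match(generated: str, expected: str) -> float:
    """1.0 when the whitespace-trimmed, lowercased strings are identical, else 0.0."""
    return float(generated.strip().lower() == expected.strip().lower())


def normalize_judge(score: int) -> float:
    """Map a 1-5 rubric score onto 0-1 so it can be combined with other scores."""
    if not 1 <= score <= 5:
        raise ValueError(f"judge score must be in 1..5, got {score}")
    return (score - 1) / 4


def composite_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, each already on a 0-1 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in weights) / total_weight
```

With weights of 0.4, 0.3, 0.2, and 0.1 for correctness, faithfulness, relevance, and format compliance, a judge score of 4 (0.75 after normalization) combined with scores of 1.0, 0.9, and 1.0 gives a composite of 0.88.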

Step 4: Design Hallucination Detection

Build specific checks for fabricated content:

Reference validation: Every cited fact must trace back to a source document
Entity verification: Named entities (people, dates, numbers) must appear in context
Confidence calibration: When the model says "I'm not sure," is it actually uncertain?
Contradiction detection: Does the output contradict the provided context?
Fabrication patterns: Common hallucination patterns to flag (fake URLs, invented citations)
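
Two of the checks above, entity verification for numbers and flagging of invented URLs, can be approximated with plain string matching. This is a rough sketch under that assumption; the function names and regular expressions are hypothetical, and verifying people or organizations would need a proper entity extractor.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def unsupported_numbers(output: str, context: str) -> list:
    """Numbers that appear in the output but nowhere in the context."""
    known = set(NUMBER_RE.findall(context))
    return [n for n in NUMBER_RE.findall(output) if n not in known]


def unsupported_urls(output: str, context: str) -> list:
    """URLs in the output that the context never mentions (possible fabrications)."""
    known = {u.rstrip(".,;") for u in URL_RE.findall(context)}
    found = (u.rstrip(".,;") for u in URL_RE.findall(output))
    return [u for u in found if u not in known]
```

Literal matching will miss paraphrases such as "twelve percent" versus "12%", so results from checks like these are better treated as flags for review than as a final verdict.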

Step 5: Design Regression Testing

Build a CI/CD-compatible evaluation pipeline:

Trigger: Run on prompt changes, model upgrades, or code changes affecting AI features
Threshold enforcement: Fail the build if eval score drops below threshold
Comparison reporting: Show score delta vs previous version, highlight regressions
Fast vs full: Quick smoke test (20 examples) for every commit, full eval (200+) for releases
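
The threshold enforcement and comparison reporting described above might look like the following gate. It is a sketch only: the `max_drop` parameter (failing on a large drop even when the score stays above the threshold) is an added assumption, and scores are assumed to be on a 0-1 scale.

```python
def regression_gate(current: dict, baseline: dict, threshold: float, max_drop: float = 0.02):
    """Compare per-dimension scores with the previous run and decide pass/fail.

    Returns (passed, report_lines). Dimensions missing from the baseline
    are reported with a delta of zero.
    """
    report, passed = [], True
    for name in sorted(current):
        score = current[name]
        delta = score - baseline.get(name, score)
        status = "ok"
        if score < threshold:
            status, passed = "BELOW THRESHOLD", False
        elif delta < -max_drop:
            status, passed = "REGRESSION", False
        report.append(f"{name}: {score:.3f} ({delta:+.3f}) {status}")
    return passed, report
```

In a CI job, the caller would print the report lines and exit with a nonzero status when `passed` is false, which is what fails the build. The same function can serve both the 20-example smoke test and the full evaluation, with only the input scores differing.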

Step 6: Design Production Monitoring

Plan ongoing quality monitoring:

Sampling: Evaluate X% of production requests against automated scoring
Feedback loop: User thumbs-up/down, explicit corrections
Drift detection: Score distribution shift over time (model degradation, data drift)
Alerting: Score drops below threshold, hallucination rate spikes, latency increases
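
Sampling and drift detection can be sketched with the standard library alone. The example below assumes each request has a stable string ID and uses a simple comparison of mean scores as the drift signal; both function names and the `max_drop` default are hypothetical, and a production system might prefer a proper statistical test.

```python
import hashlib
from statistics import mean


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def drift_alert(baseline_scores, recent_scores, max_drop: float = 0.05) -> bool:
    """True when the recent mean score fell more than `max_drop` below the baseline mean."""
    if not baseline_scores or not recent_scores:
        return False  # not enough data to compare
    return mean(baseline_scores) - mean(recent_scores) > max_drop
```

Hashing the request ID rather than drawing a random number means the same request is always either in or out of the sample, which makes sampled evaluations reproducible when a request is replayed.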

Compaction resilience: If context was lost during a long session, re-read the Inputs section to reconstruct what system is being analyzed, check the Progress Checklist for completed steps, then resume from the earliest incomplete step.

Output Format

# AI Evaluation Framework

## Evaluation Dimensions
| Dimension | Weight | Scoring Method | Threshold |
|-----------|--------|---------------|-----------|
| Correctness | 40% | LLM-as-judge (1-5) | ≥ 4.0 |
| Faithfulness | 30% | Reference validation | ≥ 95% |
| Relevance | 20% | Semantic similarity | ≥ 0.85 |
| Format compliance | 10% | Rule-based | 100% |

## Golden Dataset
**Size:** [N examples]
**Distribution:**
| Category | Count | Description |
|----------|-------|-------------|
| Common cases | N | [Description] |
| Edge cases | N | [Description] |
| Adversarial | N | [Description] |

**Storage:** [Location in repo]
**Versioning:** [Approach]

## Scoring Rubric
### Correctness (LLM-as-Judge)
| Score | Criteria |
|-------|---------|
| 5 | Perfect — matches expected output in all aspects |
| 4 | Good — minor differences that don't affect usefulness |
| 3 | Acceptable — correct core answer with some issues |
| 2 | Poor — partially correct but missing key information |
| 1 | Wrong — incorrect or misleading answer |

## Hallucination Detection
| Check | Method | Severity |
|-------|--------|----------|
| Reference validation | [Approach] | Critical |
| Entity verification | [Approach] | High |
| Contradiction detection | [Approach] | High |

## Regression Testing Pipeline

[Code change] → [Smoke test (20 examples)] → [Pass?] → [Merge]
[Release] → [Full eval (200+ examples)] → [Pass threshold?] → [Deploy]


**Threshold:** Composite score ≥ [X] to pass
**Reporting:** [Where results are published]

## Production Monitoring
| Metric | Sample Rate | Alert Threshold |
|--------|------------|-----------------|
| Composite score | 5% of requests | < [X] |
| Hallucination rate | 5% of requests | > [X%] |
| User satisfaction | All feedback | < [X] thumbs-up rate |

Handoff

Hand off to prompt-engineering if evaluation results reveal prompt design deficiencies requiring structured redesign.
Hand off to rag-architecture if evaluation findings indicate retrieval quality or chunking strategy issues.

Quality Checks

[ ] Golden dataset has at least 50 examples covering common, edge, and adversarial cases
[ ] Scoring rubric has clear criteria for each score level (not subjective)
[ ] Hallucination detection checks references against source documents
[ ] Regression testing is automated and blocks deploys on score drops
[ ] Production monitoring includes both automated scoring and user feedback
[ ] Eval dataset is version-controlled alongside the code

Evolution Notes

dtsong/ai-evaluation

skills/council/oracle/ai-evaluation/SKILL.md

Use when designing an evaluation framework for AI/LLM features. Covers golden dataset creation, automated scoring rubrics, hallucination detection, regression testing infrastructure, and production monitoring. Do not use for prompt design (use prompt-engineering) or RAG pipeline architecture (use rag-architecture).

5 stars

development

Updated Jul 15, 2026

npx skillsauth add dtsong/my-claude-setup ai-evaluation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Jul 15, 2026, 4:14 AM (107.4s, 1 file scanned)

SKILL.md

name: ai-evaluation
department: oracle
description: Use when designing an evaluation framework for AI/LLM features. Covers golden dataset creation, automated scoring rubrics, hallucination detection, regression testing infrastructure, and production monitoring. Do not use for prompt design (use prompt-engineering) or RAG pipeline architecture (use rag-architecture).
version: 1

Related Skills

dtsong/enterprise-search-strategy

development | Verified, Trusted, Community

Use when the council needs to surface organizational knowledge buried across multiple internal sources (wikis, design docs, ADRs, past tickets, postmortems, chat archives, code repos). Plans where to look, what to cross-reference, and how to synthesize findings into evidence the council can act on. Do not use for external market research (use competitive-analysis), library evaluation (use library-evaluation), or technology trend assessment (use technology-radar).

5 | SKILL.md | Updated Jun 23, 2026

dtsong/docx-to-pdf

testing | Verified, Trusted, Community

Use to convert a Word .docx file to PDF and/or verify its page count. Triggers on: converting docx to pdf, rendering a document, checking how many pages a docx produces, or asserting a page-count constraint (e.g. a resume must stay 2 pages). Wraps LibreOffice headless conversion.

5 | SKILL.md | Updated Jun 11, 2026

dtsong/web-security-hardening

development | Verified, Trusted, Community

Security audit checklist for web applications. Use when reviewing, auditing, or hardening a web app's security posture. Covers rate limiting, auth headers, IP blocking, CORS, security middleware, input validation, file upload limits, ORM usage, and password hashing. Triggers on requests like "review security", "harden this app", "security audit", "check for vulnerabilities", or when building/reviewing API endpoints.

5 | SKILL.md | Updated Apr 28, 2026

dtsong/prompt-wizard

development | Verified, Trusted, Community

Interactive wizard to craft effective prompts using Claude Code best practices

5 | SKILL.md | Updated Apr 28, 2026

Install manually

# Clone the repo
git clone https://github.com/dtsong/my-claude-setup.git

# Make sure the global Claude Code skills folder exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r my-claude-setup/skills/council/oracle/ai-evaluation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

dtsong/my-claude-setup

5 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT