Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

arbazkhan971/eval

Name: eval
Author: arbazkhan971

skills/eval/SKILL.md

npx skillsauth add arbazkhan971/godmode eval

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Eval — AI/LLM Evaluation

Activate When

User invokes /godmode:eval
User says "evaluate my model", "benchmark this"
User says "LLM-as-judge", "compare models", "run evals"
When deploying or updating any AI/LLM component

Workflow

Step 1: Evaluation Discovery

# Find existing eval infrastructure
find . -name "eval*" -o -name "benchmark*" \
  -o -name "judge*" | grep -v node_modules

# Check for eval frameworks
grep -l "deepeval\|ragas\|promptfoo\|braintrust" \
  package.json pyproject.toml requirements.txt \
  2>/dev/null

EVALUATION DISCOVERY:
System: <which AI system to evaluate>
Type: LLM prompt | RAG pipeline | AI agent | model
Trigger: new system | model change | prompt change
Dimensions: correctness, relevance, faithfulness,
  safety, format compliance, latency, cost

IF no baseline exists: establish baseline first
IF model changed: run full regression suite
IF prompt changed: run targeted eval on affected dims

Step 2: Evaluation Dataset Design

DATASET SOURCES:
| Source          | Count | Quality    |
|-----------------|-------|-----------|
| Golden set      | <N>   | Highest   |
| Production logs | <N>   | Realistic |
| Synthetic       | <N>   | Scalable  |
| Adversarial     | <N>   | Edge cases|

THRESHOLDS:
  Minimum golden set: 50 examples
  Minimum per category: 10 examples
  Adversarial coverage: >= 20% of total set
  IF dataset < 50: results not statistically reliable
  IF any category < 10: expand before trusting scores

Step 3: Framework Selection

| Framework  | Best For                       |
|------------|-------------------------------|
| RAGAS      | RAG: faithfulness, relevance  |
| DeepEval   | General LLM: 14+ metrics, CI  |
| Promptfoo  | Prompt testing, comparisons   |
| Braintrust | Production evals, experiments |

Step 4: LLM-as-Judge

JUDGE DESIGN:
Judge model: <stronger than system under test>
  RULE: Never judge a model with itself.

SCORING RUBRIC:
| Dimension    | Scale | Pass Threshold |
|-------------|-------|----------------|
| Correctness | 1-5   | >= 4           |
| Relevance   | 1-5   | >= 4           |
| Faithfulness| 1-5   | >= 4           |
| Safety      | binary| 100%           |

CALIBRATION:
  Cohen's kappa vs human ratings: >= 0.7 required
  IF kappa < 0.7: refine rubric, add examples
  IF judge disagrees > 30%: re-calibrate

Step 5: Regression Testing

REGRESSION SET:
  Source: production failures + bug fixes
  Size: grows over time, never shrinks
  Format: { input, expected, failure_desc, ticket }

PIPELINE:
  Trigger: every prompt/model/pipeline change
  Load set → Run system → Compare vs gold → Alert

COMPARISON STRATEGIES:
  Exact match: structured/JSON outputs
  Semantic: cosine similarity > 0.85
  LLM judge: acceptable per rubric
  Assertions: contains X, not contains Y

IF regression found: block deployment
IF new bug fixed: add to regression set immediately

Step 6: Statistical Significance

STATISTICAL ANALYSIS:
| Metric       | System A | System B | p-value |
|-------------|----------|----------|---------|
| Correctness | <val>    | <val>    | <p>     |
| Relevance   | <val>    | <val>    | <p>     |
| Faithfulness| <val>    | <val>    | <p>     |
| Latency p95 | <ms>     | <ms>     | <p>     |

RULES:
  Alpha: 0.05. Multiple correction: Bonferroni.
  Minimum sample: 50 per variant.
  Report 95% confidence intervals for all metrics.
  IF p >= 0.05: no significant difference — do not ship.
  IF sample < 30: results unreliable, expand dataset.

Step 7: Artifacts & Report

EVALUATION COMPLETE:
System: <name>, Dataset: <N> examples
Methods: automated (LLM judge) | human | hybrid

| Metric       | Score | Target | Status    |
|-------------|-------|--------|-----------|
| Correctness | <val> | >= 4.0 | PASS/FAIL |
| Faithfulness| <val> | >= 4.0 | PASS/FAIL |
| Safety      | <val> | 100%   | PASS/FAIL |
| Latency p95 | <ms>  | <ms>   | PASS/FAIL |

Regressions: <N>/<N> passing
Verdict: PASS | FAIL — <N> metrics below target

Commit: "eval: <system> — <N> examples, correctness=<val>, faithfulness=<val>"

Key Behaviors

Never ask to continue. Loop autonomously until done.

Evaluate before shipping. No AI goes to prod without eval. "Works on my examples" is not evidence.
Baseline everything. Compare against previous version or naive baseline. Never report absolutes.
Separate retrieval from generation in RAG.
LLM judges need calibration. kappa >= 0.7.
Statistical significance is not optional.

HARD RULES

Never evaluate without a baseline comparison.
Never use same model as judge and subject.
Never skip judge calibration (kappa >= 0.7).
Never deploy without running regression suite.
Never delete regression tests.

Auto-Detection

1. Scan: evals/ dir, promptfoo config, deepeval
2. Check for: datasets, judge prompts, baselines
3. Detect: model configs, prompt templates

Loop Protocol

FOR each quality dimension:
  1. Score with calibrated judge
  2. Compare to baseline with confidence interval
  3. Run significance test (paired bootstrap)
  4. IF metric < threshold: flag, refine prompt
  5. IF judge disagreement > 30%: re-calibrate
  6. IF p >= 0.05: inconclusive, expand dataset
POST-LOOP: Run full regression suite

Output Format

Print: Eval: {system} — correctness {val}, faithfulness {val}, regressions {N}/{total}. Significance: {yes|no}. Verdict: {verdict}.

TSV Logging

timestamp	system	metric	baseline	result	p_value	status

Keep/Discard Discipline

KEEP if: metric meets threshold AND statistically
  significant AND judge agreement > 70%
DISCARD if: below threshold OR not significant
  OR judge disagreement > 30%

Stop Conditions

STOP when ANY of:
  - All dimensions evaluated with baseline comparison
  - Regression tests pass and integrated into CI
  - Statistical significance computed for all metrics
  - User requests stop

Error Recovery

Noisy scores: increase sample, use multiple judges.
Judge disagrees: calibrate, check position bias.
Test set too small: calculate minimum for target MDE.
Unclear regression: bisect changes, run per commit.

arbazkhan971/eval

skills/eval/SKILL.md

AI/LLM evaluation. Benchmark creation, regression testing, statistical significance, LLM-as-judge, promptfoo.

10 stars

testing

Updated Apr 26, 2026

$ install --global

skillsauth

npx skillsauth add arbazkhan971/godmode eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 30, 2026, 4:47 PM5.3s1 file scanned

SKILL.md

name:: eval
description:: AI/LLM evaluation. Benchmark creation, regression testing, statistical significance, LLM-as-judge, promptfoo.

Eval — AI/LLM Evaluation

Activate When

User invokes /godmode:eval
User says "evaluate my model", "benchmark this"
User says "LLM-as-judge", "compare models", "run evals"
When deploying or updating any AI/LLM component

Workflow

Step 1: Evaluation Discovery

# Find existing eval infrastructure
find . -name "eval*" -o -name "benchmark*" \
  -o -name "judge*" | grep -v node_modules

# Check for eval frameworks
grep -l "deepeval\|ragas\|promptfoo\|braintrust" \
  package.json pyproject.toml requirements.txt \
  2>/dev/null

EVALUATION DISCOVERY:
System: <which AI system to evaluate>
Type: LLM prompt | RAG pipeline | AI agent | model
Trigger: new system | model change | prompt change
Dimensions: correctness, relevance, faithfulness,
  safety, format compliance, latency, cost

IF no baseline exists: establish baseline first
IF model changed: run full regression suite
IF prompt changed: run targeted eval on affected dims

Step 2: Evaluation Dataset Design

DATASET SOURCES:
| Source          | Count | Quality    |
|-----------------|-------|-----------|
| Golden set      | <N>   | Highest   |
| Production logs | <N>   | Realistic |
| Synthetic       | <N>   | Scalable  |
| Adversarial     | <N>   | Edge cases|

THRESHOLDS:
  Minimum golden set: 50 examples
  Minimum per category: 10 examples
  Adversarial coverage: >= 20% of total set
  IF dataset < 50: results not statistically reliable
  IF any category < 10: expand before trusting scores

Step 3: Framework Selection

| Framework  | Best For                       |
|------------|-------------------------------|
| RAGAS      | RAG: faithfulness, relevance  |
| DeepEval   | General LLM: 14+ metrics, CI  |
| Promptfoo  | Prompt testing, comparisons   |
| Braintrust | Production evals, experiments |

Step 4: LLM-as-Judge

JUDGE DESIGN:
Judge model: <stronger than system under test>
  RULE: Never judge a model with itself.

SCORING RUBRIC:
| Dimension    | Scale | Pass Threshold |
|-------------|-------|----------------|
| Correctness | 1-5   | >= 4           |
| Relevance   | 1-5   | >= 4           |
| Faithfulness| 1-5   | >= 4           |
| Safety      | binary| 100%           |

CALIBRATION:
  Cohen's kappa vs human ratings: >= 0.7 required
  IF kappa < 0.7: refine rubric, add examples
  IF judge disagrees > 30%: re-calibrate

Step 5: Regression Testing

REGRESSION SET:
  Source: production failures + bug fixes
  Size: grows over time, never shrinks
  Format: { input, expected, failure_desc, ticket }

PIPELINE:
  Trigger: every prompt/model/pipeline change
  Load set → Run system → Compare vs gold → Alert

COMPARISON STRATEGIES:
  Exact match: structured/JSON outputs
  Semantic: cosine similarity > 0.85
  LLM judge: acceptable per rubric
  Assertions: contains X, not contains Y

IF regression found: block deployment
IF new bug fixed: add to regression set immediately

Step 6: Statistical Significance

STATISTICAL ANALYSIS:
| Metric       | System A | System B | p-value |
|-------------|----------|----------|---------|
| Correctness | <val>    | <val>    | <p>     |
| Relevance   | <val>    | <val>    | <p>     |
| Faithfulness| <val>    | <val>    | <p>     |
| Latency p95 | <ms>     | <ms>     | <p>     |

RULES:
  Alpha: 0.05. Multiple correction: Bonferroni.
  Minimum sample: 50 per variant.
  Report 95% confidence intervals for all metrics.
  IF p >= 0.05: no significant difference — do not ship.
  IF sample < 30: results unreliable, expand dataset.

Step 7: Artifacts & Report

EVALUATION COMPLETE:
System: <name>, Dataset: <N> examples
Methods: automated (LLM judge) | human | hybrid

| Metric       | Score | Target | Status    |
|-------------|-------|--------|-----------|
| Correctness | <val> | >= 4.0 | PASS/FAIL |
| Faithfulness| <val> | >= 4.0 | PASS/FAIL |
| Safety      | <val> | 100%   | PASS/FAIL |
| Latency p95 | <ms>  | <ms>   | PASS/FAIL |

Regressions: <N>/<N> passing
Verdict: PASS | FAIL — <N> metrics below target

Commit: "eval: <system> — <N> examples, correctness=<val>, faithfulness=<val>"

Key Behaviors

Never ask to continue. Loop autonomously until done.

Evaluate before shipping. No AI goes to prod without eval. "Works on my examples" is not evidence.
Baseline everything. Compare against previous version or naive baseline. Never report absolutes.
Separate retrieval from generation in RAG.
LLM judges need calibration. kappa >= 0.7.
Statistical significance is not optional.

HARD RULES

Never evaluate without a baseline comparison.
Never use same model as judge and subject.
Never skip judge calibration (kappa >= 0.7).
Never deploy without running regression suite.
Never delete regression tests.

Auto-Detection

1. Scan: evals/ dir, promptfoo config, deepeval
2. Check for: datasets, judge prompts, baselines
3. Detect: model configs, prompt templates

Loop Protocol

FOR each quality dimension:
  1. Score with calibrated judge
  2. Compare to baseline with confidence interval
  3. Run significance test (paired bootstrap)
  4. IF metric < threshold: flag, refine prompt
  5. IF judge disagreement > 30%: re-calibrate
  6. IF p >= 0.05: inconclusive, expand dataset
POST-LOOP: Run full regression suite

Output Format

Print: Eval: {system} — correctness {val}, faithfulness {val}, regressions {N}/{total}. Significance: {yes|no}. Verdict: {verdict}.

TSV Logging

timestamp	system	metric	baseline	result	p_value	status

Keep/Discard Discipline

KEEP if: metric meets threshold AND statistically
  significant AND judge agreement > 70%
DISCARD if: below threshold OR not significant
  OR judge disagreement > 30%

Stop Conditions

STOP when ANY of:
  - All dimensions evaluated with baseline comparison
  - Regression tests pass and integrated into CI
  - Statistical significance computed for all metrics
  - User requests stop

Error Recovery

Noisy scores: increase sample, use multiple judges.
Judge disagrees: calibrate, check position bias.
Test set too small: calculate minimum for target MDE.
Unclear regression: bisect changes, run per commit.

Related Skills

arbazkhan971/webperf

development

VerifiedTrustedCommunity

Web performance optimization. Lighthouse, bundle analysis, code splitting, image optimization, critical CSS, fonts, service workers, CDN.

10SKILL.mdUpdated Apr 26, 2026

arbazkhan971/webhook

development

VerifiedTrustedCommunity

Webhook design, delivery, retry, HMAC verification, event subscriptions, dead letter queues.

10SKILL.mdUpdated Apr 26, 2026

arbazkhan971/vue

development

VerifiedTrustedCommunity

Vue.js mastery. Composition API, Pinia, Vue Router, Nuxt SSR/SSG, Vite optimization, testing.

10SKILL.mdUpdated Apr 26, 2026

arbazkhan971/verify

development

VerifiedTrustedCommunity

Evidence gate. Run command, read full output, confirm or deny claim. No trust, only proof.

10SKILL.mdUpdated Apr 26, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/arbazkhan971/godmode.git

# Copy into Claude Code skills folder (global)
cp -r godmode/skills/eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

arbazkhan971/godmode

10 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT