skills/ai-harness/SKILL.md
Use when building evaluation infrastructure for AI systems — test harnesses, CI pipelines for AI, automated regression detection, golden datasets, and continuous quality measurement.
Install this skill globally with one command: `npx skillsauth add kienbui1995/magic-powers ai-harness`. Works with Claude Code, Cursor, and Windsurf.
Three layers for comprehensive coverage:
Layer 1: Deterministic checks (fast, cheap)
- Output schema validation (JSON structure, required fields)
- Length constraints (too short/long)
- Keyword presence/absence
- PII detection in outputs
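The Layer 1 checks can be sketched with the Python standard library alone. The function name `run_deterministic_checks`, the `answer` field, the length limits, the banned phrase, and both regular expressions are illustrative assumptions; real PII detection usually needs a dedicated tool rather than two patterns.

```python
import json
import re

# Hypothetical patterns: a real PII detector should cover far more than this.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def run_deterministic_checks(raw_output, required_fields=("answer",),
                             min_len=1, max_len=2000,
                             banned_keywords=("as an ai language model",)):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    # Schema validation: the output must be a JSON object with the required fields.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    for field in required_fields:
        if field not in data:
            failures.append(f"missing field: {field}")

    text = str(data.get("answer", ""))

    # Length constraints.
    if not (min_len <= len(text) <= max_len):
        failures.append(f"length {len(text)} outside [{min_len}, {max_len}]")

    # Keyword absence.
    lowered = text.lower()
    for keyword in banned_keywords:
        if keyword in lowered:
            failures.append(f"banned keyword: {keyword}")

    # PII detection.
    for name, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            failures.append(f"possible PII: {name}")

    return failures
```

Because these checks need no model call, they can run on every output before any more expensive layer is invoked.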
Layer 2: Model-as-judge (flexible, moderate cost)
- Use GPT-4/Claude to score outputs on rubric
- Criteria: accuracy, helpfulness, safety, hallucination
- Score 1-5 with reasoning
- Compare to baseline (previous prompt/model)
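A minimal sketch of the judge loop, assuming the judge model is asked to reply with a small JSON object. `call_model` is a hypothetical stand-in for whatever GPT-4 or Claude client the project uses (any function that maps a prompt string to a reply string); the prompt wording and the helper names are assumptions as well.

```python
import json

RUBRIC = "accuracy, helpfulness, safety, hallucination"


def build_judge_prompt(question, answer):
    """Assemble the grading prompt; the wording here is illustrative."""
    return (
        "You are grading an AI answer.\n"
        f"Criteria: {RUBRIC}.\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reasoning": "<one sentence>"}\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )


def judge(question, answer, call_model):
    """Score one answer from 1 to 5 and keep the judge's reasoning."""
    reply = call_model(build_judge_prompt(question, answer))
    parsed = json.loads(reply)
    score = int(parsed["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"score": score, "reasoning": parsed.get("reasoning", "")}


def compare_to_baseline(baseline_scores, candidate_scores):
    """Mean judge score of the candidate minus the mean score of the baseline."""
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean - baseline_mean
```

Passing the model call in as a parameter keeps the scoring logic testable with a stub, so the harness itself can be unit-tested without API keys.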
Layer 3: Human review (ground truth, expensive)
- Sample 5-10% of outputs weekly
- Focus on borderline model-as-judge scores (2-3 out of 5)
- Use disagreements to improve rubric
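One way to pick the weekly review batch is sketched below: spend the sampling budget on borderline judge scores first, then fill any remainder with a random draw from the other outputs. The record layout (a dict with a `judge_score` key) and the function name are assumptions for illustration.

```python
import math
import random


def select_for_human_review(records, sample_rate=0.05, borderline=(2, 3), seed=0):
    """Choose about sample_rate of the records, preferring borderline judge scores."""
    rng = random.Random(seed)  # fixed seed so the weekly batch is reproducible
    budget = math.ceil(len(records) * sample_rate)
    low, high = borderline

    flagged = [r for r in records if low <= r["judge_score"] <= high]
    rest = [r for r in records if not (low <= r["judge_score"] <= high)]

    # Borderline cases first, then top up with a random sample of everything else.
    picked = rng.sample(flagged, min(budget, len(flagged)))
    remaining = budget - len(picked)
    picked += rng.sample(rest, min(remaining, len(rest)))
    return picked
```

Keeping a few non-borderline items in the batch, when the budget allows, gives reviewers a chance to catch cases where the judge was confidently wrong.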
# .github/workflows/ai-eval.yml
name: AI Eval
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval harness
        run: python eval/run_harness.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check pass rate
        run: python eval/check_thresholds.py --min-pass-rate 0.85
      - name: Upload results
        if: always()  # keep results even when the threshold check fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval/results/
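The workflow calls `eval/check_thresholds.py` without showing it. A minimal sketch of what such a script might contain follows, assuming the harness writes JSON files under `eval/results/`, each holding a list of objects with a boolean `passed` field. That file layout and the `--results-dir` flag are assumptions, not something the workflow specifies.

```python
import argparse
import json
from pathlib import Path


def pass_rate(results_dir):
    """Fraction of passed cases across all *.json result files in a directory."""
    cases = []
    for path in sorted(Path(results_dir).glob("*.json")):
        cases.extend(json.loads(path.read_text()))
    if not cases:
        raise ValueError(f"no eval results found in {results_dir}")
    return sum(1 for case in cases if case["passed"]) / len(cases)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI when the eval pass rate is too low.")
    parser.add_argument("--min-pass-rate", type=float, required=True)
    parser.add_argument("--results-dir", default="eval/results")
    args = parser.parse_args(argv)

    rate = pass_rate(args.results_dir)
    print(f"pass rate: {rate:.3f} (minimum {args.min_pass_rate:.3f})")
    # Returning non-zero lets the caller fail the CI step.
    return 0 if rate >= args.min_pass_rate else 1
```

To use it as a script, end the file with `sys.exit(main())` under an `if __name__ == "__main__":` guard so the return value becomes the process exit code.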
# A/B test two prompts against the golden dataset
results = {
    "prompt_v1": run_eval(dataset, prompt_v1),
    "prompt_v2": run_eval(dataset, prompt_v2),
}
# Compare: accuracy, cost, latency, safety
winner = select_winner(results, primary_metric="accuracy", cost_constraint=1.2)
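The snippet leaves `select_winner` undefined. One plausible reading, sketched here, is that each variant's result is a dict of metrics and that `cost_constraint=1.2` caps cost at 1.2 times the baseline's. Both the result layout and that interpretation of `cost_constraint` are assumptions, as is treating the first entry as the baseline.

```python
def select_winner(results, primary_metric="accuracy", cost_constraint=1.2, baseline=None):
    """Return the variant with the best primary metric among those within the cost cap.

    Assumption: cost_constraint is the maximum allowed cost relative to the baseline.
    """
    baseline = baseline or next(iter(results))  # default: first variant is the baseline
    max_cost = results[baseline]["cost"] * cost_constraint
    eligible = {name: m for name, m in results.items() if m["cost"] <= max_cost}
    return max(eligible, key=lambda name: eligible[name][primary_metric])


# Illustrative numbers only, not measured results.
example_results = {
    "prompt_v1": {"accuracy": 0.80, "cost": 1.0},
    "prompt_v2": {"accuracy": 0.86, "cost": 1.1},
    "prompt_v3": {"accuracy": 0.90, "cost": 2.0},  # most accurate, but over the cost cap
}
winner = select_winner(example_results)
```

With these made-up numbers `prompt_v3` is excluded for costing twice the baseline, so the more accurate of the two affordable variants is chosen.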
| Tool | Use case |
|------|----------|
| LangSmith | Tracing + eval for LangChain apps |
| Braintrust | Eval platform, dataset management, CI integration |
| PromptFoo | Open-source prompt testing, CLI-first |
| Weights & Biases | Experiment tracking, model comparison |
| Ragas | RAG-specific evaluation (faithfulness, answer relevance) |
| pytest + custom | Lightweight eval for simple use cases |
Avoid declaring "improvement" without statistical evidence:
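from collections import namedtuple
import numpy as np
from scipy import stats

StatResult = namedtuple("StatResult", "significant p_value effect_size practical_significant recommendation")

def is_statistically_significant(baseline_scores, new_scores, alpha=0.05):
    """Two-sample t-test for eval score comparison"""
    t_stat, p_value = stats.ttest_ind(baseline_scores, new_scores)
    # Cohen's d: mean difference divided by the pooled standard deviation
    n1, n2 = len(baseline_scores), len(new_scores)
    pooled_var = ((n1 - 1) * np.var(baseline_scores, ddof=1) + (n2 - 1) * np.var(new_scores, ddof=1)) / (n1 + n2 - 2)
    effect_size = (np.mean(new_scores) - np.mean(baseline_scores)) / np.sqrt(pooled_var)
    return StatResult(
        significant=p_value < alpha,
        p_value=p_value,
        effect_size=effect_size,  # Cohen's d
        practical_significant=abs(effect_size) > 0.2,  # small effect
        recommendation="ship" if (p_value < alpha and effect_size > 0.2) else "no change",
    )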
# Minimum sample sizes for reliable conclusions
# (two-sided two-sample t-test, alpha = 0.05, power = 0.80):
# Small effect (d=0.2): n ≥ 393 per group
# Medium effect (d=0.5): n ≥ 64 per group
# Large effect (d=0.8): n ≥ 26 per group
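Sample-size figures of this kind come from a power analysis. The sketch below uses the textbook normal approximation for a two-sided, two-sample comparison, with 80% power and alpha = 0.05 as assumed defaults; the function name is illustrative. Exact t-based calculations can come out a sample or so higher, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist


def min_samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a given Cohen's d (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Running this before an experiment shows whether the golden dataset is large enough to detect the size of improvement that matters.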
Key rules:
Golden datasets degrade — keep them fresh:
from datetime import date

# Dataset, TestCase, Failure, and DatasetHealthReport are project-defined types.
class EvalDatasetManager:
    def review_test_cases(self, dataset: "Dataset", threshold_days=30):
        stale = [tc for tc in dataset if tc.last_reviewed_days > threshold_days]
        low_signal = [tc for tc in dataset if tc.pass_rate in (0.0, 1.0)]
        # Pass rate 0% = always fails (broken test or impossible), 100% = always passes (too easy)
        return DatasetHealthReport(
            stale_count=len(stale),
            low_signal_count=len(low_signal),
            action="review_and_update" if len(stale) + len(low_signal) > len(dataset) * 0.2 else "ok",
        )

    def add_from_production_failures(self, dataset: "Dataset", prod_failures: "list[Failure]"):
        """Convert production failures into eval test cases"""
        for failure in prod_failures:
            if failure.confirmed_bug:  # human verified
                dataset.add(TestCase(
                    input=failure.input,
                    expected=failure.expected_output,
                    source="production_failure",
                    added_date=date.today(),
                ))
Dataset health signals:
More than 20% of tests always pass → too easy, add harder cases
More than 10% of tests never pass → broken tests or capability gap, investigate
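The health signals above can be turned into a small check. This sketch treats the 20% and 10% figures as thresholds, which is an assumption about how they are meant to be read. The function name and the input format (one historical pass rate per test case) are also illustrative.

```python
def dataset_health(pass_rates, always_pass_limit=0.20, never_pass_limit=0.10):
    """Flag a dataset whose share of always-passing or never-passing tests is too high.

    pass_rates: one historical pass rate (0.0 to 1.0) per test case.
    """
    if not pass_rates:
        raise ValueError("dataset has no test cases")
    total = len(pass_rates)
    always = sum(1 for rate in pass_rates if rate == 1.0) / total
    never = sum(1 for rate in pass_rates if rate == 0.0) / total

    signals = []
    if always > always_pass_limit:
        signals.append("too easy: add harder cases")
    if never > never_pass_limit:
        signals.append("investigate: broken tests or capability gap")
    return {"always_pass": always, "never_pass": never, "signals": signals}
```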
Related skills:
- llm-evaluation (frameworks) and llm-observability (production monitoring)
- ai-safety-guardrails to add safety checks as an eval layer