skills/phoenix-evals/SKILL.md
Build and run evaluators for AI/LLM applications using Phoenix.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
`npx skillsauth add williamlimasilva/.copilot phoenix-evals`
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
| Task | Files |
| ---- | ----- |
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
| Sample traces for review | observe-sampling-python, observe-sampling-typescript |
| Analyze errors | error-analysis, error-analysis-multi-turn, axial-coding |
| RAG evals | evaluators-rag |
| Avoid common mistakes | common-mistakes-python, fundamentals-anti-patterns |
| Production | production-overview, production-guardrails, production-continuous |
Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}
RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
Production: production-overview → production-guardrails → production-continuous
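As an illustration of the "code evaluator for retrieval" step in the RAG workflow above, here is a minimal sketch of a deterministic check in plain Python. It does not use any Phoenix API; the function names `retrieval_hit` and `precision_at_k`, and the idea of comparing retrieved document IDs against a hand-labeled set of relevant IDs, are assumptions made for this example.

```python
from typing import Sequence, Set


def retrieval_hit(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> bool:
    """Binary pass/fail: did at least one relevant document appear in the top k?"""
    if k <= 0:
        raise ValueError("k must be positive")
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical example: three retrieved documents, two of them labeled relevant.
retrieved = ["doc-a", "doc-b", "doc-c"]
relevant = {"doc-a", "doc-c"}

hit = retrieval_hit(retrieved, relevant, k=2)
p_at_2 = precision_at_k(retrieved, relevant, k=2)
```

A check like this is cheap and reproducible, which is why it would typically run before any LLM-based faithfulness judge in the same workflow.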
| Prefix | Description |
| ------ | ----------- |
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
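The "Validate judges" principle can be sketched as a comparison of binary judge verdicts against human labels. This is a plain-Python illustration, not a Phoenix API; the helper names `judge_agreement` and `judge_is_trustworthy` are hypothetical, and the sketch assumes that `True` means "pass" and that "pass" is treated as the positive class.

```python
from typing import Dict, Sequence


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compute true positive rate and true negative rate of a judge vs. human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    fn = sum(1 for h, j in pairs if h and not j)      # human pass, judge fail
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # human fail, judge pass
    # A rate is undefined (NaN) when the human labels contain no examples of that class.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"tpr": tpr, "tnr": tnr}


def judge_is_trustworthy(metrics: Dict[str, float], threshold: float = 0.8) -> bool:
    """Apply the >80% TPR/TNR rule; NaN rates fail the check."""
    return metrics["tpr"] > threshold and metrics["tnr"] > threshold


# Hypothetical labels: 5 human passes, 5 human fails; the judge misses one pass.
human_labels = [True] * 5 + [False] * 5
judge_labels = [True, True, True, True, False] + [False] * 5

metrics = judge_agreement(human_labels, judge_labels)
trusted = judge_is_trustworthy(metrics)
```

Here the judge reaches a TPR of 0.8 and a TNR of 1.0, so it does not clear a strict ">80%" bar on both rates. Reporting the two rates separately, rather than a single accuracy number, shows which kind of mistake the judge tends to make.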