bundles/ai-agents/skills/advanced-evaluation/SKILL.md
Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.
npx skillsauth add shipshitdev/library advanced-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
LLM-as-a-Judge techniques for evaluating AI outputs. Not a single technique but a family of approaches - choosing the right one and mitigating biases is the core competency.
Direct Scoring: Single LLM rates one response on a defined scale.
Pairwise Comparison: LLM compares two responses and selects better one.
| Bias | Description | Mitigation | |------|-------------|------------| | Position | First-position preference | Swap positions, check consistency | | Length | Longer = higher scores | Explicit prompting, length-normalized scoring | | Self-Enhancement | Models rate own outputs higher | Use different model for evaluation | | Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics | | Authority | Confident tone rated higher | Require evidence citation |
Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)
Works with:
For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md
See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md
development
Create an isolated git worktree from the correct base branch and check it out into a clean, gitignored directory. Use when the user asks to make a worktree, spin up a parallel/isolated workspace, work on something without disturbing the current checkout, branch off the current work, or run multiple agents on the same repo at once. Picks the base branch smartly — the current feature branch when you are on one, otherwise the develop integration branch — so worktrees continue your in-progress work by default instead of forking from the wrong place.
development
Verify a release was fully promoted through develop, staging, and master/main, then prune merged local and remote branches and stale git worktrees. Squash-merge aware — uses GitHub PR merge state as the merge oracle, not commit ancestry. Use when the user asks to clean up branches after a deploy, prune worktrees, remove merged branches, tidy up after promoting develop to staging to master, or confirm nothing stale was left behind before pruning.
development
Structured "done coding, now what?" workflow: verify tests pass, detect the repository environment (normal repo vs worktree, named branch vs detached HEAD), present exactly the right merge / PR / keep / discard options, and execute the chosen path including safe worktree cleanup. Use when implementation is complete and the branch needs to be integrated, published, or abandoned.
tools
Capture a client or stakeholder feature request, turn it into a planner-ready PRD epic with scoped sub-issues, check for duplicate work, and place approved issues on a GitHub Projects kanban. Use when a user invokes feature intake, asks to turn a rough client requirement into GitHub issues, or wants an idea written as a PRD and pushed to a board.