skills/evaluation-rubrics/SKILL.md
Designs structured scoring tools with explicit criteria, performance scales, and descriptors for consistent, transparent quality assessment. Use when need quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, reduce subjective bias, or when user mentions rubric, scoring criteria, quality standards, evaluation framework, inter-rater reliability, or grading/assessing work.
npx skillsauth add lyndonkl/claude evaluation-rubricsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Scenario: Evaluating technical blog posts (1-5 scale)
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) | |-----------|----------|--------------|---------------| | Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous | | Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized | | Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable | | Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring: Post A [4, 5, 3, 2] = 3.5 avg. Post B [5, 4, 5, 4] = 4.5 avg. Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2)."
Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write observable description of what that performance looks like. See resources/template.md for writing guidelines.
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
Pattern 1: Analytic Rubric (Most Common)
Pattern 2: Holistic Rubric
Pattern 3: Single-Point Rubric
Pattern 4: Checklist (Binary)
Pattern 5: Standards-Based Rubric
Criteria should be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Test: Can two independent reviewers score this criterion consistently?
Descriptors should distinguish levels clearly: Each level needs concrete differences from adjacent levels. Avoid "5=very good, 4=good, 3=okay". Better: "5=zero bugs, meets all requirements, 4=1-2 minor bugs, meets 90% requirements."
Use appropriate scale granularity: 1-3 is too coarse, 1-10 is too fine. Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows neutral middle). Match granularity to actual observable differences.
Balance comprehensiveness with simplicity: Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
Calibrate for inter-rater reliability: Have multiple reviewers score same work, measure agreement (Kappa, ICC). If <70% agreement, refine descriptors.
Provide examples at each level: Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
Share rubric before evaluation: If evaluatees see the rubric only after being scored, it is grading not guidance. Share upfront so people know expectations and can self-assess.
Weight criteria appropriately: If "Security" matters more than "Code style", weight it (Security x3, Style x1). Or use thresholds (score >=4 on Security to pass, regardless of other scores).
Common pitfalls:
Key resources:
Scale Selection Guide:
| Scale | Use When | Pros | Cons | |-------|----------|------|------| | 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback | | 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary | | 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) | | 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels | | Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate | | Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
Inter-Rater Reliability Benchmarks:
Typical Rubric Development Time:
When to escalate beyond rubrics:
Inputs required:
Outputs produced:
evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notestesting
--- name: advisory-edit description: A strict advisory-only editing discipline for a writer who dictates ("speaks out") essays and wants help WITHOUT having their voice changed. The editor directs structure, flags grammar, and suggests strategic language — but never modifies the writer's text unless the writer explicitly says "apply" / "make that change" / "rewrite this." Produces a line-referenced, suggestion-only critique where every item is marked the writer's call. Four passes: structural, l
testing
Provides the house style for analyst-grade strategist writing — third-person register with sparing first-person, no em dashes, no "not X, not Y, not Z" negation cascades, numbered footnote citations rather than inline source parentheticals, specific opinion-signaling phrases, and topic-forward paragraph structure modeled on voice patterns observed in Damodaran's Musings on Markets and Thompson's Stratechery. Use when consolidating working notes into a finished long-form strategist or analyst report that must read as written by a senior human analyst rather than an AI assistant.
testing
Renders a markdown report to a PDF using pandoc with xelatex (11pt serif body, 1-inch margins, numbered footnotes, formal heading hierarchy). Requires a one-time install of pandoc and a LaTeX engine on the user's machine — basictex on macOS or texlive-xetex on Linux. Does not attempt automatic install. Fails loudly with the exact install commands if pandoc or xelatex is missing on the user's PATH. Use when producing a finished strategist or analyst report PDF from a polished markdown source.
testing
Produces step-by-step computational walkthroughs of vector and matrix operations as a sequence of numbered "frames", showing the explicit state at each step. The text-equivalent of a 3Blue1Brown animation — each frame shows what changed and why, so the learner can re-trace the operation by hand. Use when the learner needs to *see* a computation unfold (eigenvalue computation, attention with 3 tokens, gradient descent step, SVD on a 2×2, layer norm on a 3-vector, softmax of a small input), when an explanation has been given but the learner needs to ground it in a worked example, or when introducing an operation that's intimidating in symbol form but trivial in pencil-and-paper form.