plugins/utopia-azraq-engagement/skills/evaluation-rubrics/SKILL.md
Designs structured scoring tools with explicit criteria, performance scales, and descriptors for consistent, transparent quality assessment. Use when you need quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, or reduce subjective bias, or when the user mentions rubric, scoring criteria, quality standards, evaluation framework, inter-rater reliability, or grading/assessing work.
Install this skill globally with one command: `npx skillsauth add The-Utopia-Studio/skills evaluation-rubrics`. Works with Claude Code, Cursor, and Windsurf.
Scenario: Evaluating technical blog posts (1-5 scale)
| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|-----------|----------|--------------|---------------|
| Technical Accuracy | Multiple factual errors, misleading | Mostly correct, minor inaccuracies | Fully accurate, technically rigorous |
| Clarity | Confusing, jargon-heavy, poor structure | Clear to experts, some structure | Accessible to target audience, well-organized |
| Practical Value | No actionable guidance, theoretical only | Some examples, limited applicability | Concrete examples, immediately applicable |
| Originality | Rehashes common knowledge, no new insight | Some fresh perspective, builds on existing | Novel approach, advances understanding |
Scoring: Post A [4, 5, 3, 2] = 3.5 avg. Post B [5, 4, 5, 4] = 4.5 avg. Feedback for Post A: "Strong clarity (5) and good accuracy (4), but needs more practical examples (3) and offers less original insight (2)."
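The averages above can be reproduced with a short sketch. The helper names `average_score` and `weakest_criteria`, and the cutoff of 3 used to pick feedback targets, are illustrative assumptions rather than part of the skill.

```python
from statistics import mean

# Criteria in the same order as the rubric table above.
CRITERIA = ["Technical Accuracy", "Clarity", "Practical Value", "Originality"]

# Scores taken from the worked example in the text.
scores = {
    "Post A": [4, 5, 3, 2],
    "Post B": [5, 4, 5, 4],
}


def average_score(values):
    """Unweighted mean of the per-criterion scores."""
    return mean(values)


def weakest_criteria(values, criteria=CRITERIA, cutoff=3):
    """Criteria scoring at or below the cutoff (hypothetical feedback rule)."""
    return [name for name, value in zip(criteria, values) if value <= cutoff]


averages = {post: average_score(values) for post, values in scores.items()}

for post, values in scores.items():
    print(post, averages[post], weakest_criteria(values))
```

Listing the low-scoring criteria next to the average keeps the feedback tied to specific rubric rows instead of a single number.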
Copy this checklist and track your progress:
Rubric Development Progress:
- [ ] Step 1: Define purpose and scope
- [ ] Step 2: Identify evaluation criteria
- [ ] Step 3: Design the scale
- [ ] Step 4: Write performance descriptors
- [ ] Step 5: Test and calibrate
- [ ] Step 6: Use and iterate
Step 1: Define purpose and scope
Clarify what you're evaluating, who evaluates, who uses results, what decisions depend on scores. See resources/template.md for scoping questions.
Step 2: Identify evaluation criteria
Brainstorm quality dimensions, prioritize most important/observable, balance coverage vs. simplicity (4-8 criteria typical). See resources/template.md for brainstorming framework.
Step 3: Design the scale
Choose number of levels (1-5, 1-4, 1-10), scale type (numeric, qualitative), anchors (what does each level mean?). See resources/methodology.md for scale selection guidance.
Step 4: Write performance descriptors
For each criterion × level, write an observable description of what that performance looks like. See resources/template.md for writing guidelines.
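One way to keep the criterion × level grid complete is to store descriptors in a nested mapping and check for empty cells. This is a minimal sketch: the dictionary layout and the `missing_descriptors` helper are assumptions for illustration, and the descriptors are taken from the blog post example earlier in this document.

```python
# Levels that must have a descriptor for every criterion (from the example table).
LEVELS = (1, 3, 5)

# Hypothetical in-memory layout: criterion -> level -> descriptor text.
rubric = {
    "Technical Accuracy": {
        1: "Multiple factual errors, misleading",
        3: "Mostly correct, minor inaccuracies",
        5: "Fully accurate, technically rigorous",
    },
    "Clarity": {
        1: "Confusing, jargon-heavy, poor structure",
        3: "Clear to experts, some structure",
        5: "Accessible to target audience, well-organized",
    },
}


def missing_descriptors(rubric, levels):
    """Return (criterion, level) pairs that have no descriptor text."""
    return [
        (criterion, level)
        for criterion, descriptors in rubric.items()
        for level in levels
        if not descriptors.get(level, "").strip()
    ]


# A draft with a gap, to show what the check reports.
draft = {"Clarity": {1: "Confusing, jargon-heavy, poor structure", 5: "Accessible"}}

print(missing_descriptors(rubric, LEVELS))
print(missing_descriptors(draft, LEVELS))
```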
Step 5: Test and calibrate
Have multiple reviewers score sample work, compare scores, discuss discrepancies, refine rubric. See resources/methodology.md for inter-rater reliability testing.
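Comparing scores between two reviewers can be sketched as follows. This computes simple percent agreement and unweighted Cohen's kappa for two raters, treating each score as a category; the sample scores and function names are made up for illustration, and resources/methodology.md may prescribe a different statistic (for ordinal scales, a weighted kappa or ICC is often preferred).

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # Both raters used a single identical category throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two reviewers on ten work samples (1-5 scale).
reviewer_1 = [3, 4, 4, 5, 2, 3, 4, 5, 3, 4]
reviewer_2 = [3, 4, 3, 5, 2, 3, 4, 4, 3, 4]

agreement = percent_agreement(reviewer_1, reviewer_2)
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Kappa comes out lower than raw agreement because it discounts matches that would be expected by chance, so the two numbers should not be compared against the same threshold.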
Step 6: Use and iterate
Apply rubric, collect feedback from evaluators and evaluatees, revise criteria/descriptors as needed. Validate using resources/evaluators/rubric_evaluation_rubrics.json. Minimum standard: Average score ≥ 3.5.
Pattern 1: Analytic Rubric (Most Common)
Pattern 2: Holistic Rubric
Pattern 3: Single-Point Rubric
Pattern 4: Checklist (Binary)
Pattern 5: Standards-Based Rubric
Criteria should be observable and measurable: Not "good attitude" (subjective), but "arrives on time, volunteers for tasks, helps teammates" (observable). Test: Can two independent reviewers score this criterion consistently?
Descriptors should distinguish levels clearly: Each level needs concrete differences from adjacent levels. Avoid "5=very good, 4=good, 3=okay". Better: "5=zero bugs, meets all requirements; 4=1-2 minor bugs, meets 90% of requirements."
Use appropriate scale granularity: For most rubrics, 1-3 is too coarse and 1-10 is too fine. Sweet spot: 1-4 (forced choice, no middle) or 1-5 (allows neutral middle). Match granularity to actual observable differences.
Balance comprehensiveness with simplicity: Aim for 4-8 criteria covering essential quality dimensions. If >10 criteria, consider grouping or prioritizing.
Calibrate for inter-rater reliability: Have multiple reviewers score the same work and measure agreement (e.g., Kappa, ICC). If agreement is below 70%, refine the descriptors.
Provide examples at each level: Include concrete examples of work at each level (anchor papers, reference designs, code samples) to calibrate reviewers.
Share rubric before evaluation: If evaluatees see the rubric only after being scored, it is grading, not guidance. Share it upfront so people know expectations and can self-assess.
Weight criteria appropriately: If "Security" matters more than "Code style", weight it (Security ×3, Style ×1). Or use thresholds (score ≥ 4 on Security to pass, regardless of other scores).
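Weighting and thresholds can be combined in one scoring function. The sketch below is an assumption-laden illustration: the function names, the sample scores, and the overall pass mark of 3.5 are hypothetical, while the Security ×3 / Style ×1 weights and the Security ≥ 4 gate come from the guideline above.

```python
def weighted_average(scores, weights):
    """Weighted mean of per-criterion scores."""
    total_weight = sum(weights[criterion] for criterion in scores)
    weighted_sum = sum(scores[criterion] * weights[criterion] for criterion in scores)
    return weighted_sum / total_weight


def passes(scores, weights, thresholds, min_average=3.5):
    """Pass only if every gated criterion meets its minimum and the
    weighted average reaches min_average (illustrative default)."""
    for criterion, minimum in thresholds.items():
        if scores[criterion] < minimum:
            return False
    return weighted_average(scores, weights) >= min_average


weights = {"Security": 3, "Code style": 1}
thresholds = {"Security": 4}

# Hypothetical submissions with the same weighted average of 3.5.
submission_a = {"Security": 3, "Code style": 5}  # fails the Security gate
submission_b = {"Security": 4, "Code style": 2}  # clears the gate

for name, scores in [("A", submission_a), ("B", submission_b)]:
    print(name, weighted_average(scores, weights), passes(scores, weights, thresholds))
```

Both submissions average 3.5, but only the second passes, which shows why a threshold on a critical criterion can matter more than the aggregate.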
Common pitfalls:
Key resources:
Scale Selection Guide:
| Scale | Use When | Pros | Cons |
|-------|----------|------|------|
| 1-3 | Need quick categorization, clear tiers | Fast, forces clear decision | Too coarse, less feedback |
| 1-4 | Want forced choice (no middle) | Avoids central tendency, clear differentiation | No neutral option, feels binary |
| 1-5 | General purpose, most common | Allows neutral, familiar, good granularity | Central tendency bias (everyone gets 3) |
| 1-10 | Need fine gradations, large sample | Maximum differentiation, statistical analysis | False precision, hard to distinguish adjacent levels |
| Qualitative (Novice/Proficient/Expert) | Educational, skill development | Intuitive, growth-oriented | Less quantitative, harder to aggregate |
| Binary (Yes/No, Pass/Fail) | Compliance, gatekeeping | Objective, simple | No gradations, misses quality differences |
Criteria Types:
Inter-Rater Reliability Benchmarks:
Typical Rubric Development Time:
When to escalate beyond rubrics:
Inputs required:
Outputs produced:
evaluation-rubrics.md: Purpose, criteria definitions, scale with descriptors, usage instructions, weighting/thresholds, calibration notes