plugins/leyline/skills/evaluation-framework/SKILL.md
Provides weighted scoring, rubrics, and decision-threshold patterns. Use when designing quality gates, evaluation systems, or decision frameworks.
npx skillsauth add athola/claude-night-market evaluation-frameworkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with consistent scoring methodology.
This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.
criteria:
- name: criterion_name
weight: 0.30 # 30% of total score
description: What this measures
scoring_guide:
90-100: Exceptional
70-89: Strong
50-69: Acceptable
30-49: Weak
0-29: Poor
Verification: Run the command with --help flag to verify availability.
scores = {
"criterion_1": 85, # Out of 100
"criterion_2": 92,
"criterion_3": 78,
}
Verification: Run the command with --help flag to verify availability.
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 85.5
Verification: Run the command with --help flag to verify availability.
thresholds:
80-100: Accept with priority
60-79: Accept with conditions
40-59: Review required
20-39: Reject with feedback
0-19: Reject
Verification: Run the command with --help flag to verify availability.
criteria:
correctness: {weight: 0.40, description: Does code work as intended?}
maintainability: {weight: 0.25, description: Is it readable?}
performance: {weight: 0.20, description: Meets performance needs?}
testing: {weight: 0.15, description: Tests detailed?}
thresholds:
85-100: Approve immediately
70-84: Approve with minor feedback
50-69: Request changes
0-49: Reject, major issues
Verification: Run pytest -v to verify tests pass.
**Verification:** Run the command with `--help` flag to verify availability.
1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range
Verification: Run the command with --help flag to verify availability.
Quality Gates: Code review, PR approval, release readiness Content Evaluation: Document quality, knowledge intake, skill assessment Resource Allocation: Backlog prioritization, investment decisions, triage
# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]
Verification: Run the command with --help flag to verify availability.
Then customize the framework for your domain:
modules/scoring-patterns.md for detailed methodologymodules/decision-thresholds.md for threshold designresearch
Generate diverse solution candidates with category-spanning ideation methods and rotation. Use when stuck on a design or fighting repetitive LLM output.
tools
--- name: validate-pr description: Use when you need a diff-derived test plan for a PR: reads the diff, groups changes by area, runs targeted verifications, and proves revert-tests are genuine guards, not dead assertions. alwaysApply: false category: validation tags: - pr - validation - test-plan - diff - revert-test - evidence tools: [] usage_patterns: - diff-derived-test-plan - revert-test-quality-check - evidence-capture complexity: intermediate model_hint: standard estimated_tokens: 650
development
Contract for the project decision journal (tradeoffs and lessons-learned logs). Use when recording a decision, tradeoff, or lesson, or building a consumer hook.
development
Ramps implementation ambition a notch only after the prior increment is understood. Use when building a feature you must understand, not just ship.