Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

athola/evaluation-framework

Name: evaluation-framework
Author: athola

plugins/leyline/skills/evaluation-framework/SKILL.md

npx skillsauth add athola/claude-night-market evaluation-framework


Overview
When to Use
Core Pattern
1. Define Criteria
2. Score Each Criterion
3. Calculate Weighted Total
4. Apply Decision Thresholds
Quick Start
Define Your Evaluation
Example: Code Review Evaluation
Evaluation Workflow
Common Use Cases
Integration Pattern
Detailed Resources
Exit Criteria

Evaluation Framework

Overview

A generic framework for weighted scoring and threshold-based decision making. It provides reusable patterns for evaluating any artifact against configurable criteria with a consistent scoring methodology.

This framework abstracts the common pattern of: define criteria → assign weights → score against criteria → apply thresholds → make decisions.

When To Use

Implementing quality gates or evaluation rubrics
Building scoring systems for artifacts, proposals, or submissions
Applying a consistent evaluation methodology across different domains
Automating decisions based on score thresholds
Creating assessment tools with weighted criteria

When NOT To Use

Simple pass/fail checks that do not need scoring

Core Pattern

1. Define Criteria

criteria:
  - name: criterion_name
    weight: 0.30          # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor
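As one way to work with this structure in code, the sketch below models a single criterion and looks up the scoring-guide label for a score. The `Criterion` class and its `label_for` method are hypothetical names, not part of the skill; the bands mirror the guide above, keyed by their lower bound.

```python
from dataclasses import dataclass, field

# Scoring-guide bands from the YAML above, keyed by lower bound (highest first).
DEFAULT_GUIDE = [
    (90, "Exceptional"),
    (70, "Strong"),
    (50, "Acceptable"),
    (30, "Weak"),
    (0, "Poor"),
]


@dataclass
class Criterion:
    """Hypothetical container for one entry of the `criteria` list."""

    name: str
    weight: float
    description: str
    scoring_guide: list = field(default_factory=lambda: list(DEFAULT_GUIDE))

    def label_for(self, score: float) -> str:
        """Return the guide label for a 0-100 score."""
        if not 0 <= score <= 100:
            raise ValueError(f"score must be in 0-100, got {score}")
        for lower_bound, label in self.scoring_guide:
            if score >= lower_bound:
                return label
        raise ValueError("scoring guide has no band covering this score")


criterion = Criterion("criterion_name", 0.30, "What this measures")
print(criterion.label_for(85))  # Strong
```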


2. Score Each Criterion

scores = {
    "criterion_1": 85,  # Out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}


3. Calculate Weighted Total

weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}  # must sum to 1.0
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 85.7
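The calculation can also be wrapped in a small function that checks its inputs before summing. This is a sketch under stated assumptions: `weighted_total` is a hypothetical helper, and the weights are the ones implied by the example comment (0.30, 0.40, 0.30).

```python
import math


def weighted_total(scores: dict, weights: dict) -> float:
    """Sum score * weight over all criteria; weights must sum to 1.0."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weights[name] for name in scores)


scores = {"criterion_1": 85, "criterion_2": 92, "criterion_3": 78}
weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}

total = weighted_total(scores, weights)
print(round(total, 1))  # 85.7, i.e. 25.5 + 36.8 + 23.4
```

Comparing the weight sum with `math.isclose` rather than `==` avoids false failures from floating-point rounding.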


4. Apply Decision Thresholds

thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject
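Weighted totals are usually fractional, so a total such as 79.5 falls between the 60-79 and 80-100 bands as written. The sketch below assumes each band runs from its lower bound up to the next band's lower bound; that reading, and the `decide` helper itself, are assumptions rather than something the skill specifies.

```python
# Decision bands from the thresholds above, keyed by lower bound (highest first).
THRESHOLDS = [
    (80, "Accept with priority"),
    (60, "Accept with conditions"),
    (40, "Review required"),
    (20, "Reject with feedback"),
    (0, "Reject"),
]


def decide(total: float, thresholds=THRESHOLDS) -> str:
    """Return the decision for a weighted total in the 0-100 range."""
    if not 0 <= total <= 100:
        raise ValueError(f"total must be in 0-100, got {total}")
    for lower_bound, decision in thresholds:
        if total >= lower_bound:
            return decision
    raise ValueError("no threshold covers this total")


print(decide(85.7))  # Accept with priority
print(decide(79.5))  # Accept with conditions
```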


Quick Start

Define Your Evaluation

Identify criteria: What aspects matter for your domain?
Assign weights: Which criteria are most important? (sum to 1.0)
Create scoring guides: What does each score range mean?
Set thresholds: What total scores trigger which decisions?
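If it is easier to rate importance on an informal scale first, the ratings can be normalized into weights that sum to 1.0. A minimal sketch; `normalize_weights`, the criterion names, and the 1-5 ratings are illustrative assumptions, not part of the framework.

```python
def normalize_weights(importance: dict) -> dict:
    """Scale positive importance ratings so the resulting weights sum to 1.0."""
    if not importance:
        raise ValueError("at least one criterion is required")
    if any(value <= 0 for value in importance.values()):
        raise ValueError("importance ratings must be positive")
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}


# Hypothetical 1-5 importance ratings for three criteria.
weights = normalize_weights({"clarity": 5, "accuracy": 3, "brevity": 2})
print(weights)  # {'clarity': 0.5, 'accuracy': 0.3, 'brevity': 0.2}
```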

Example: Code Review Evaluation

criteria:
  correctness: {weight: 0.40, description: "Does the code work as intended?"}
  maintainability: {weight: 0.25, description: "Is the code readable?"}
  performance: {weight: 0.20, description: "Does it meet performance needs?"}
  testing: {weight: 0.15, description: "Are the tests thorough?"}

thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues
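To see the code review configuration in action, the sketch below scores one hypothetical pull request. The weights and thresholds come from the example above; the `review_decision` helper and the four scores are made up for illustration, and each band is assumed to start at its lower bound.

```python
# Weights and thresholds copied from the code review example above.
WEIGHTS = {
    "correctness": 0.40,
    "maintainability": 0.25,
    "performance": 0.20,
    "testing": 0.15,
}
THRESHOLDS = [
    (85, "Approve immediately"),
    (70, "Approve with minor feedback"),
    (50, "Request changes"),
    (0, "Reject, major issues"),
]


def review_decision(scores: dict) -> tuple:
    """Return (weighted total, decision) for a dict of 0-100 criterion scores."""
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    for lower_bound, decision in THRESHOLDS:
        if total >= lower_bound:
            return total, decision
    raise ValueError("total is below every threshold")


# Hypothetical scores for one pull request.
total, decision = review_decision(
    {"correctness": 90, "maintainability": 70, "performance": 80, "testing": 60}
)
print(round(total, 1), decision)  # 78.5 Approve with minor feedback
```

Here 36.0 + 17.5 + 16.0 + 9.0 = 78.5, which lands in the 70-84 band.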


Evaluation Workflow

1. Review artifact against each criterion
2. Assign 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare total to thresholds
5. Take action based on threshold range


Common Use Cases

Quality Gates: Code review, PR approval, release readiness
Content Evaluation: Document quality, knowledge intake, skill assessment
Resource Allocation: Backlog prioritization, investment decisions, triage

Integration Pattern

# In your skill's frontmatter
dependencies: [leyline:evaluation-framework]


Then customize the framework for your domain:

Define domain-specific criteria
Set appropriate weights for your context
Establish meaningful thresholds
Document what each score range means

Detailed Resources

Scoring Patterns: See modules/scoring-patterns.md for detailed methodology
Decision Thresholds: See modules/decision-thresholds.md for threshold design

Exit Criteria

[ ] Criteria defined with clear descriptions
[ ] Weights assigned and sum to 1.0
[ ] Scoring guides documented for each criterion
[ ] Thresholds mapped to specific actions
[ ] Evaluation process documented and reproducible
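Several of these items can be checked mechanically. The sketch below is one possible check, assuming criteria are held as dictionaries with the field names used in the Core Pattern and thresholds as (lower bound, action) pairs; `check_definition` is a hypothetical helper. The last item, documenting the process, still needs a human review.

```python
import math


def check_definition(criteria: list, thresholds: list) -> list:
    """Return a list of problems; an empty list means the mechanical checks pass."""
    problems = []
    for criterion in criteria:
        if not criterion.get("description"):
            problems.append(f"{criterion['name']}: missing description")
        if not criterion.get("scoring_guide"):
            problems.append(f"{criterion['name']}: missing scoring guide")
    if not math.isclose(sum(c["weight"] for c in criteria), 1.0, abs_tol=1e-9):
        problems.append("weights do not sum to 1.0")
    if not thresholds or min(lower for lower, _ in thresholds) != 0:
        problems.append("thresholds do not start at 0")
    if any(not action for _, action in thresholds):
        problems.append("a threshold has no action")
    return problems


# Deliberately incomplete definition to show the kind of problems reported.
criteria = [
    {"name": "correctness", "weight": 0.6, "description": "Does it work?",
     "scoring_guide": {"90-100": "Exceptional"}},
    {"name": "testing", "weight": 0.3, "description": "", "scoring_guide": {}},
]
thresholds = [(70, "Approve"), (0, "Reject")]

for problem in check_definition(criteria, thresholds):
    print(problem)
```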


Provides weighted scoring, rubrics, and decision-threshold patterns. Use when designing quality gates, evaluation systems, or decision frameworks.

Stars: 317
Category: development
Updated: Jun 28, 2026

Install

npx skillsauth add athola/claude-night-market evaluation-framework

Installs this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (Container and dependency vulnerability scanner): Clean, 95%
Semgrep (Static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (Open source security scanning): Skipped, 50%
Socket.dev (Supply chain security analysis): Skipped, 50%
VirusTotal (Multi-engine malware detection): Skipped, 50%
CrowdStrike (Advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Jun 28, 2026, 4:04 AM (136.7s, 8 files scanned)

SKILL.md

name: evaluation-framework
description: Provides weighted scoring, rubrics, and decision-threshold patterns. Use when designing quality gates, evaluation systems, or decision frameworks.
alwaysApply: false
category: infrastructure
dependencies: []
complexity: beginner
model_hint: fast
estimated_tokens: 550
progressive_loading: true


Related Skills

athola/architecture-paradigm-domain-driven (data-ai)
Models a business in its own language. Use when the domain has real business rules to capture.
323 · SKILL.md · Updated Jul 15, 2026

athola/ideate (research)
Generate diverse solution candidates with category-spanning ideation methods and rotation. Use when stuck on a design or fighting repetitive LLM output.
323 · SKILL.md · Updated Jun 8, 2026

athola/validate-pr (development)
Generates and self-executes a diff-derived test plan for a PR. Use when validating PR changes before merge. Do not use for code review; use sanctum:pr-review.
323 · SKILL.md · Updated Jun 8, 2026

athola/graduated-implementation (development)
Ramps implementation ambition a notch only after the prior increment is understood. Use when building a feature you must understand, not just ship.
323 · SKILL.md · Updated Jun 8, 2026


Install manually


# Clone the repo
git clone https://github.com/athola/claude-night-market.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r claude-night-market/plugins/leyline/skills/evaluation-framework ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: athola/claude-night-market (317 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
