Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

oimiragieo/pipeline-evaluator

Name: pipeline-evaluator
Author: oimiragieo

.claude/skills/pipeline-evaluator/SKILL.md

npx skillsauth add oimiragieo/agent-studio pipeline-evaluator

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Pipeline Evaluator

Purpose

Score a completed agent pipeline across 5 dimensions and produce a structured evaluation report with a composite verdict (EXCELLENT/GOOD/ACCEPTABLE/NEEDS_IMPROVEMENT) and ranked recommendations.

When to Invoke

Skill({ skill: 'pipeline-evaluator' })

Invoke when:

A multi-task pipeline has completed and quality measurement is needed
Router requests post-pipeline quality gate
Reflection agent needs structured metrics for the rubric
Continuous improvement loop requires baseline data

Workflow

Step 1: Gather Pipeline Data

Call TaskList() to retrieve all tasks associated with the pipeline.
For each task, call TaskGet({ taskId }) to fetch full metadata including:
- status (completed/failed/blocked/cancelled)
- metadata.summary
- metadata.filesModified
- metadata.deviations (array)
- metadata.testResult (PASS/FAIL + counts)
- metadata.completedAt vs task createdAt (for time efficiency)
Read any plan file referenced in task metadata to cross-check task completion against planned scope.

Step 2: Score 5 Dimensions

Calculate each dimension score (0–100 unless noted):

D1: Task Completion Rate (weight: 30%)

completionRate = (completedTasks / totalTasks) * 100

Count tasks with status: "completed" as completed.
Tasks with status: "failed" or status: "cancelled" count against.
Blocked tasks that resolved and completed count as completed.

Score mapping:

100% → 100
90–99% → 85
75–89% → 70
60–74% → 50
<60% → 20

D2: Deviation Count (weight: 20%)

Count total deviations logged across all task metadata deviations[] arrays.

Score mapping (inverse — fewer is better):

0 deviations → 100
1–2 deviations → 85
3–5 deviations → 65
6–10 deviations → 40
10 deviations → 10

D3: Test Pass Rate (weight: 25%)

Parse metadata.testResult fields. Extract pass counts and fail counts from strings like "PASS 42/42" or "FAIL 38/42".

testPassRate = (totalPassed / totalTests) * 100

If no test results reported: use 50 as neutral score.

D4: Time Efficiency (weight: 10%)

Compare actual pipeline duration vs estimated duration (if available in plan file).

efficiency = min(estimatedDuration / actualDuration, 1.0) * 100

If no estimate available: score 70 (neutral).

D5: Quality Score (weight: 15%)

Derived from code quality signals in task metadata:

Each task that ran pnpm lint:fix with zero errors: +10 points base
Each task that ran pnpm format with no changes: +5 points base
Deduct 5 points per task that reported lint errors
Deduct 10 points per task with unresolved TODOs in modified files

Cap at 100. Default to 60 if no quality signals present.

Step 3: Calculate Composite Score

composite = (D1 * 0.30) + (D2 * 0.20) + (D3 * 0.25) + (D4 * 0.10) + (D5 * 0.15)

Step 4: Determine Verdict

| Composite Score | Verdict | | --------------- | ----------------- | | > 90 | EXCELLENT | | > 75 | GOOD | | > 60 | ACCEPTABLE | | ≤ 60 | NEEDS_IMPROVEMENT |

Step 5: Generate Recommendations

For each dimension scoring below 80, generate a specific recommendation:

Task Completion Rate < 80: "Investigate failed/cancelled tasks — review blocker patterns and add prerequisite checks to planner prompts."
Deviation Count > 5: "High deviation count suggests scope creep. Add DR-3 architectural escalation gates and tighten task scoping in planner."
Test Pass Rate < 80: "Failing tests indicate implementation gaps. Enforce TDD red-green cycle and add pre-completion test gate."
Time Efficiency < 70: "Pipeline ran significantly over estimate. Profile slowest tasks and consider parallelization or task decomposition."
Quality Score < 70: "Lint/format issues present. Add blocking lint gate to pre-completion validation hook."

Sort recommendations by severity (lowest score dimension first).

Step 6: Write Evaluation Report

Write the structured evaluation to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.md:

<!-- Agent: pipeline-evaluator | Task: #{taskId} | Session: {date} -->

# Pipeline Evaluation: {pipelineId}

**Evaluated At**: {ISO timestamp}
**Verdict**: {VERDICT}
**Composite Score**: {score}/100

## Dimension Scores

| Dimension            | Score | Weight | Weighted   |
| -------------------- | ----- | ------ | ---------- |
| Task Completion Rate | {D1}  | 30%    | {D1\*0.3}  |
| Deviation Count      | {D2}  | 20%    | {D2\*0.2}  |
| Test Pass Rate       | {D3}  | 25%    | {D3\*0.25} |
| Time Efficiency      | {D4}  | 10%    | {D4\*0.1}  |
| Quality Score        | {D5}  | 15%    | {D5\*0.15} |

## Recommendations

{ranked recommendations list}

Also write the machine-readable JSON to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.json conforming to pipeline-evaluation.schema.json.

Scoring Reference

| Verdict | Composite | Meaning | | ----------------- | --------- | ------------------------------------ | | EXCELLENT | > 90 | Pipeline executed near-perfectly | | GOOD | > 75 | Minor issues; no systemic problems | | ACCEPTABLE | > 60 | Notable gaps but goals met | | NEEDS_IMPROVEMENT | ≤ 60 | Systemic issues require intervention |

Anti-Patterns

Never score a pipeline before all tasks have reached a terminal state (completed/failed/cancelled).
Never use only task count as a proxy for quality — always check test pass rate.
Never omit the JSON report — machine-readable output is required for trend tracking.
Never assign EXCELLENT without verifying test pass rate > 90%.

Related Skills

reflection — Uses pipeline evaluation scores in the rubric
tdd — Informs the Test Pass Rate dimension
verification-before-completion — Pre-completion gates that feed quality signals

oimiragieo/pipeline-evaluator

.claude/skills/pipeline-evaluator/SKILL.md

Evaluates completed agent pipelines across 5 scoring dimensions and produces a composite verdict with actionable recommendations

23 stars

devops

Updated Apr 7, 2026

$ install --global

skillsauth

npx skillsauth add oimiragieo/agent-studio pipeline-evaluator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 7, 2026, 8:25 PM14.7s10 files scanned

SKILL.md

name:: pipeline-evaluator
description:: Evaluates completed agent pipelines across 5 scoring dimensions and produces a composite verdict with actionable recommendations
version:: 1.0.0

Pipeline Evaluator

Purpose

Score a completed agent pipeline across 5 dimensions and produce a structured evaluation report with a composite verdict (EXCELLENT/GOOD/ACCEPTABLE/NEEDS_IMPROVEMENT) and ranked recommendations.

When to Invoke

Skill({ skill: 'pipeline-evaluator' })

Invoke when:

A multi-task pipeline has completed and quality measurement is needed
Router requests post-pipeline quality gate
Reflection agent needs structured metrics for the rubric
Continuous improvement loop requires baseline data

Workflow

Step 1: Gather Pipeline Data

Call TaskList() to retrieve all tasks associated with the pipeline.
For each task, call TaskGet({ taskId }) to fetch full metadata including:
- status (completed/failed/blocked/cancelled)
- metadata.summary
- metadata.filesModified
- metadata.deviations (array)
- metadata.testResult (PASS/FAIL + counts)
- metadata.completedAt vs task createdAt (for time efficiency)
Read any plan file referenced in task metadata to cross-check task completion against planned scope.

Step 2: Score 5 Dimensions

Calculate each dimension score (0–100 unless noted):

D1: Task Completion Rate (weight: 30%)

completionRate = (completedTasks / totalTasks) * 100

Count tasks with status: "completed" as completed.
Tasks with status: "failed" or status: "cancelled" count against.
Blocked tasks that resolved and completed count as completed.

Score mapping:

100% → 100
90–99% → 85
75–89% → 70
60–74% → 50
<60% → 20

D2: Deviation Count (weight: 20%)

Count total deviations logged across all task metadata deviations[] arrays.

Score mapping (inverse — fewer is better):

0 deviations → 100
1–2 deviations → 85
3–5 deviations → 65
6–10 deviations → 40
10 deviations → 10

D3: Test Pass Rate (weight: 25%)

Parse metadata.testResult fields. Extract pass counts and fail counts from strings like "PASS 42/42" or "FAIL 38/42".

testPassRate = (totalPassed / totalTests) * 100

If no test results reported: use 50 as neutral score.

D4: Time Efficiency (weight: 10%)

Compare actual pipeline duration vs estimated duration (if available in plan file).

efficiency = min(estimatedDuration / actualDuration, 1.0) * 100

If no estimate available: score 70 (neutral).

D5: Quality Score (weight: 15%)

Derived from code quality signals in task metadata:

Each task that ran pnpm lint:fix with zero errors: +10 points base
Each task that ran pnpm format with no changes: +5 points base
Deduct 5 points per task that reported lint errors
Deduct 10 points per task with unresolved TODOs in modified files

Cap at 100. Default to 60 if no quality signals present.

Step 3: Calculate Composite Score

composite = (D1 * 0.30) + (D2 * 0.20) + (D3 * 0.25) + (D4 * 0.10) + (D5 * 0.15)

Step 4: Determine Verdict

| Composite Score | Verdict | | --------------- | ----------------- | | > 90 | EXCELLENT | | > 75 | GOOD | | > 60 | ACCEPTABLE | | ≤ 60 | NEEDS_IMPROVEMENT |

Step 5: Generate Recommendations

For each dimension scoring below 80, generate a specific recommendation:

Task Completion Rate < 80: "Investigate failed/cancelled tasks — review blocker patterns and add prerequisite checks to planner prompts."
Deviation Count > 5: "High deviation count suggests scope creep. Add DR-3 architectural escalation gates and tighten task scoping in planner."
Test Pass Rate < 80: "Failing tests indicate implementation gaps. Enforce TDD red-green cycle and add pre-completion test gate."
Time Efficiency < 70: "Pipeline ran significantly over estimate. Profile slowest tasks and consider parallelization or task decomposition."
Quality Score < 70: "Lint/format issues present. Add blocking lint gate to pre-completion validation hook."

Sort recommendations by severity (lowest score dimension first).

Step 6: Write Evaluation Report

Write the structured evaluation to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.md:

<!-- Agent: pipeline-evaluator | Task: #{taskId} | Session: {date} -->

# Pipeline Evaluation: {pipelineId}

**Evaluated At**: {ISO timestamp}
**Verdict**: {VERDICT}
**Composite Score**: {score}/100

## Dimension Scores

| Dimension            | Score | Weight | Weighted   |
| -------------------- | ----- | ------ | ---------- |
| Task Completion Rate | {D1}  | 30%    | {D1\*0.3}  |
| Deviation Count      | {D2}  | 20%    | {D2\*0.2}  |
| Test Pass Rate       | {D3}  | 25%    | {D3\*0.25} |
| Time Efficiency      | {D4}  | 10%    | {D4\*0.1}  |
| Quality Score        | {D5}  | 15%    | {D5\*0.15} |

## Recommendations

{ranked recommendations list}

Also write the machine-readable JSON to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.json conforming to pipeline-evaluation.schema.json.

Scoring Reference

Anti-Patterns

Never score a pipeline before all tasks have reached a terminal state (completed/failed/cancelled).
Never use only task count as a proxy for quality — always check test pass rate.
Never omit the JSON report — machine-readable output is required for trend tracking.
Never assign EXCELLENT without verifying test pass rate > 90%.

Related Skills

reflection — Uses pipeline evaluation scores in the rubric
tdd — Informs the Test Pass Rate dimension
verification-before-completion — Pre-completion gates that feed quality signals

Related Skills

oimiragieo/neurokit2

tools

VerifiedTrustedCommunity

Comprehensive biosignal processing toolkit for analyzing physiological data including ECG, EEG, EDA, RSP, PPG, EMG, and EOG signals. Use this skill when processing cardiovascular signals, brain activity, electrodermal responses, respiratory patterns, muscle activity, or eye movements. Applicable for heart rate variability analysis, event-related potentials, complexity measures, autonomic nervous system assessment, psychophysiology research, and multi-modal physiological signal integration.

24SKILL.mdUpdated Apr 15, 2026

oimiragieo/networkx

tools

VerifiedTrustedCommunity

Comprehensive toolkit for creating, analyzing, and visualizing complex networks and graphs in Python. Use when working with network/graph data structures, analyzing relationships between entities, computing graph algorithms (shortest paths, centrality, clustering), detecting communities, generating synthetic networks, or visualizing network topologies. Applicable to social networks, biological networks, transportation systems, citation networks, and any domain involving pairwise relationships.

24SKILL.mdUpdated Apr 15, 2026

oimiragieo/molfeat

data-ai

VerifiedTrustedCommunity

Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML.

24SKILL.mdUpdated Apr 15, 2026

oimiragieo/modal

development

VerifiedTrustedCommunity

Run Python code in the cloud with serverless containers, GPUs, and autoscaling. Use when deploying ML models, running batch processing jobs, scheduling compute-intensive tasks, or serving APIs that require GPU acceleration or dynamic scaling.

24SKILL.mdUpdated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/oimiragieo/agent-studio.git

# Copy into Claude Code skills folder (global)
cp -r agent-studio/.claude/skills/pipeline-evaluator ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

oimiragieo/agent-studio

23 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT