.claude/skills/pipeline-evaluator/SKILL.md
Evaluates completed agent pipelines across 5 scoring dimensions and produces a composite verdict with actionable recommendations
npx skillsauth add oimiragieo/agent-studio pipeline-evaluator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Score a completed agent pipeline across 5 dimensions and produce a structured evaluation report with a composite verdict (EXCELLENT/GOOD/ACCEPTABLE/NEEDS_IMPROVEMENT) and ranked recommendations.
Skill({ skill: 'pipeline-evaluator' })
Invoke when:
- Call TaskList() to retrieve all tasks associated with the pipeline.
- Call TaskGet({ taskId }) to fetch full metadata including:
  - status (completed/failed/blocked/cancelled)
  - metadata.summary
  - metadata.filesModified
  - metadata.deviations (array)
  - metadata.testResult (PASS/FAIL + counts)
  - metadata.completedAt vs task createdAt (for time efficiency)

Calculate each dimension score (0–100 unless noted):
completionRate = (completedTasks / totalTasks) * 100
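As a minimal sketch of the formula above, the following counts only tasks whose status is "completed" against the total. The `PipelineTask` type and the choice to score an empty pipeline as 0 are assumptions made for illustration, not part of the skill.

```typescript
// Hypothetical task shape; only the `status` values come from the skill text.
type TaskStatus = "completed" | "failed" | "blocked" | "cancelled";

interface PipelineTask {
  status: TaskStatus;
}

// completionRate = (completedTasks / totalTasks) * 100
function completionRate(tasks: PipelineTask[]): number {
  // Assumption: a pipeline with no tasks scores 0 rather than dividing by zero.
  if (tasks.length === 0) return 0;
  const completed = tasks.filter((t) => t.status === "completed").length;
  return (completed / tasks.length) * 100;
}

const rate = completionRate([
  { status: "completed" },
  { status: "completed" },
  { status: "failed" },
  { status: "cancelled" },
]);
console.log(rate); // 50
```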
Count tasks with status: "completed" as completed. Tasks with status: "failed" or status: "cancelled" count against the rate.

Score mapping:
Count total deviations logged across all task metadata deviations[] arrays.
Score mapping (inverse — fewer is better):
10 deviations → 10
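The counting step can be sketched as a sum over each task's deviations[] array. The `TaskWithDeviations` type is hypothetical, and treating a missing array as zero deviations is an assumption. The score mapping itself is not implemented here.

```typescript
// Hypothetical shape; the element type of deviations[] is not specified in the skill.
interface TaskWithDeviations {
  metadata?: { deviations?: unknown[] };
}

// Total deviations logged across all tasks.
function countDeviations(tasks: TaskWithDeviations[]): number {
  // Assumption: a task with no metadata or no deviations array contributes 0.
  return tasks.reduce(
    (sum, t) => sum + (t.metadata?.deviations?.length ?? 0),
    0,
  );
}

const totalDeviations = countDeviations([
  { metadata: { deviations: ["skipped lint", "changed scope"] } },
  { metadata: { deviations: [] } },
  {},
]);
console.log(totalDeviations); // 2
```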
Parse metadata.testResult fields. Extract pass counts and fail counts from strings like "PASS 42/42" or "FAIL 38/42".
testPassRate = (totalPassed / totalTests) * 100
If no test results reported: use 50 as neutral score.
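A possible way to parse these strings and aggregate the rate is sketched below. The regular expression and the decision to skip strings that do not match are assumptions; the skill only gives "PASS 42/42" and "FAIL 38/42" as example formats.

```typescript
// Aggregates pass/total counts from strings like "PASS 42/42" or "FAIL 38/42".
function testPassRate(results: string[]): number {
  let passed = 0;
  let total = 0;
  for (const r of results) {
    const m = /^(?:PASS|FAIL)\s+(\d+)\/(\d+)$/.exec(r.trim());
    if (m === null) continue; // Assumption: unparseable strings are ignored.
    passed += Number(m[1]);
    total += Number(m[2]);
  }
  // No test results reported: neutral score of 50, as stated above.
  if (total === 0) return 50;
  return (passed / total) * 100;
}

// 80 of 84 tests passed across two tasks.
console.log(testPassRate(["PASS 42/42", "FAIL 38/42"]).toFixed(1)); // "95.2"
console.log(testPassRate([])); // 50
```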
Compare actual pipeline duration vs estimated duration (if available in plan file).
efficiency = min(estimatedDuration / actualDuration, 1.0) * 100
If no estimate available: score 70 (neutral).
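The efficiency formula and its fallback might look like this in code. The parameter names are illustrative, durations can be in any consistent unit, and returning the neutral score for a non-positive actual duration is an assumption.

```typescript
// efficiency = min(estimatedDuration / actualDuration, 1.0) * 100
function timeEfficiency(actualDuration: number, estimatedDuration?: number): number {
  // No estimate available: neutral score of 70.
  // Assumption: a non-positive actual duration is also treated as "no usable data".
  if (estimatedDuration === undefined || actualDuration <= 0) return 70;
  return Math.min(estimatedDuration / actualDuration, 1.0) * 100;
}

console.log(timeEfficiency(120, 60)); // 50  (took twice as long as estimated)
console.log(timeEfficiency(30, 60)); // 100 (finishing early is capped at 100)
console.log(timeEfficiency(30)); // 70  (no estimate)
```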
Derived from code quality signals in task metadata:
- pnpm lint:fix with zero errors: +10 points base
- pnpm format with no changes: +5 points base

Cap at 100. Default to 60 if no quality signals present.
composite = (D1 * 0.30) + (D2 * 0.20) + (D3 * 0.25) + (D4 * 0.10) + (D5 * 0.15)
| Composite Score | Verdict           |
| --------------- | ----------------- |
| > 90            | EXCELLENT         |
| > 75            | GOOD              |
| > 60            | ACCEPTABLE        |
| ≤ 60            | NEEDS_IMPROVEMENT |
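The weighted sum and the verdict thresholds can be combined in a short sketch. The `DimensionScores` interface and function names are hypothetical, and the sample scores are made up for illustration.

```typescript
// Hypothetical container for the five dimension scores (each 0-100).
interface DimensionScores {
  d1: number; // Task Completion Rate
  d2: number; // Deviation Count
  d3: number; // Test Pass Rate
  d4: number; // Time Efficiency
  d5: number; // Quality Score
}

type Verdict = "EXCELLENT" | "GOOD" | "ACCEPTABLE" | "NEEDS_IMPROVEMENT";

// composite = (D1 * 0.30) + (D2 * 0.20) + (D3 * 0.25) + (D4 * 0.10) + (D5 * 0.15)
function compositeScore(s: DimensionScores): number {
  return s.d1 * 0.3 + s.d2 * 0.2 + s.d3 * 0.25 + s.d4 * 0.1 + s.d5 * 0.15;
}

// Thresholds are strict: exactly 90 is GOOD, exactly 60 is NEEDS_IMPROVEMENT.
function verdictFor(composite: number): Verdict {
  if (composite > 90) return "EXCELLENT";
  if (composite > 75) return "GOOD";
  if (composite > 60) return "ACCEPTABLE";
  return "NEEDS_IMPROVEMENT";
}

const sample: DimensionScores = { d1: 100, d2: 85, d3: 95, d4: 70, d5: 60 };
const composite = compositeScore(sample);
console.log(composite.toFixed(2), verdictFor(composite)); // 86.75 GOOD
```

Because the comparisons are strict, boundary values fall into the lower band, which matches the "> 90" and "≤ 60" rows of the table.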
For each dimension scoring below 80, generate a specific recommendation:
Sort recommendations by severity (lowest score dimension first).
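Selecting and ordering the dimensions that need a recommendation could be done as follows. This sketch only picks and ranks the dimensions; the recommendation wording is left out, and the `DimensionResult` type is hypothetical.

```typescript
// Hypothetical pairing of a dimension name with its score.
interface DimensionResult {
  name: string;
  score: number;
}

// Keeps dimensions scoring below 80, lowest score (most severe) first.
function dimensionsNeedingRecommendations(dims: DimensionResult[]): DimensionResult[] {
  // filter() returns a new array, so sorting it leaves the input untouched.
  return dims.filter((d) => d.score < 80).sort((a, b) => a.score - b.score);
}

const ranked = dimensionsNeedingRecommendations([
  { name: "Task Completion Rate", score: 100 },
  { name: "Deviation Count", score: 85 },
  { name: "Test Pass Rate", score: 95 },
  { name: "Time Efficiency", score: 70 },
  { name: "Quality Score", score: 60 },
]);
console.log(ranked.map((d) => d.name)); // [ 'Quality Score', 'Time Efficiency' ]
```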
Write the structured evaluation to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.md:
<!-- Agent: pipeline-evaluator | Task: #{taskId} | Session: {date} -->
# Pipeline Evaluation: {pipelineId}
**Evaluated At**: {ISO timestamp}
**Verdict**: {VERDICT}
**Composite Score**: {score}/100
## Dimension Scores
| Dimension | Score | Weight | Weighted |
| -------------------- | ----- | ------ | ---------- |
| Task Completion Rate | {D1} | 30% | {D1\*0.3} |
| Deviation Count | {D2} | 20% | {D2\*0.2} |
| Test Pass Rate | {D3} | 25% | {D3\*0.25} |
| Time Efficiency | {D4} | 10% | {D4\*0.1} |
| Quality Score | {D5} | 15% | {D5\*0.15} |
## Recommendations
{ranked recommendations list}
Also write the machine-readable JSON to .claude/context/reports/backend/pipeline-eval-{pipelineId}-{YYYY-MM-DD}.json conforming to pipeline-evaluation.schema.json.
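Both output paths share the same stem, so they can be derived together. In this sketch the function name is hypothetical, the pipeline ID and date are made-up sample values, and formatting the date in UTC is an assumption, since the skill does not say which time zone {YYYY-MM-DD} uses.

```typescript
// Builds the Markdown and JSON report paths for one evaluation.
function reportPaths(pipelineId: string, date: Date): { markdown: string; json: string } {
  // toISOString() is always UTC; the first 10 characters are YYYY-MM-DD.
  const day = date.toISOString().slice(0, 10);
  const base = `.claude/context/reports/backend/pipeline-eval-${pipelineId}-${day}`;
  return { markdown: `${base}.md`, json: `${base}.json` };
}

const paths = reportPaths("demo-pipeline", new Date("2025-01-15T12:00:00Z"));
console.log(paths.markdown);
// .claude/context/reports/backend/pipeline-eval-demo-pipeline-2025-01-15.md
console.log(paths.json);
// .claude/context/reports/backend/pipeline-eval-demo-pipeline-2025-01-15.json
```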
| Verdict           | Composite | Meaning                              |
| ----------------- | --------- | ------------------------------------ |
| EXCELLENT         | > 90      | Pipeline executed near-perfectly     |
| GOOD              | > 75      | Minor issues; no systemic problems   |
| ACCEPTABLE        | > 60      | Notable gaps but goals met           |
| NEEDS_IMPROVEMENT | ≤ 60      | Systemic issues require intervention |
- reflection: Uses pipeline evaluation scores in the rubric
- tdd: Informs the Test Pass Rate dimension
- verification-before-completion: Pre-completion gates that feed quality signals