Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

mpazaryna/orchestra-eval

Name: orchestra-eval
Author: mpazaryna

skills/orchestra-eval/SKILL.md

npx skillsauth add mpazaryna/agentic-factory orchestra-eval

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Orchestra Eval

Grade an orchestra skill's output against its eval assertions. This skill does not run the skill under test — it grades what was already produced.

Variables

SKILL_NAME: first word of $ARGUMENTS WORK_ITEM: second word of $ARGUMENTS (optional)

Step 1: Load the Eval Suite

Find the skill's evals.json:

${CLAUDE_SKILL_DIR}/../{SKILL_NAME}/evals/evals.json

If the file doesn't exist, stop: "No evals found for {SKILL_NAME}. Add evals/evals.json to the skill directory."

Read the file. Note how many test cases exist.

Step 2: Identify the Work Item

The grader needs to know which output to evaluate.

If WORK_ITEM is provided in $ARGUMENTS:

Use .orchestra/work/{WORK_ITEM}/ as the output directory

If no WORK_ITEM:

Glob .orchestra/work/*/ to find all work items
Find the most recently modified one:
```
ls -t .orchestra/work/ | head -1
```
Use that as the work item
Report which work item was auto-detected

If no work items exist, stop: "No .orchestra/work/ items found in this project. Run the skill first, then eval."

Step 3: Grade Each Test Case

For each test case in evals.json, grade every assertion. Record PASS or FAIL with specific evidence.

Assertion types and how to check them

File existence — "X file exists at path":

ls .orchestra/work/{id}/prd.md 2>/dev/null && echo "EXISTS" || echo "MISSING"

PASS if file exists and is non-empty. FAIL if missing or empty.

Section presence — "contains a ## X section":

grep -c "^## X" .orchestra/work/{id}/prd.md

PASS if count > 0. FAIL if absent.

Frontmatter field — "frontmatter has status: approved":

grep "^status:" .orchestra/work/{id}/prd.md

PASS if value matches. FAIL if missing or wrong value.

Absence checks — "contains no X": Read the file content. Look for the prohibited term or pattern.

For "no framework or library names": scan for: pytest, click, argparse, zoneinfo, pytz, flask, fastapi, django, httpx, requests, sqlalchemy, pydantic
For "no HTTP method names": scan for: GET, POST, PUT, DELETE, PATCH as standalone words
For "no endpoint paths": scan for patterns like /api/, /v1/, starting with /
For "no technical implementation language": use model judgment — read the section and evaluate whether it describes outcomes or implementations PASS if prohibited content is absent. FAIL with the specific term or line found.

Count assertions — "at least N X": Count the relevant items and compare.

Model judgment assertions — assertions about quality, tone, or semantic content that can't be mechanically verified: Read the relevant section and reason about whether it passes. State explicitly: "Model judgment: [reasoning]. Result: PASS/FAIL."

Grading rules

Require evidence for every PASS. Quote the file content or show the grep result.
Require specifics for every FAIL. Name the exact term, missing section, or wrong value.
Never give benefit of the doubt. If an assertion says "contains no technical language" and the file has one framework name, that's a FAIL even if everything else is clean.

Step 4: Write the Grading Report

Write to .orchestra/eval/{skill-name}/{date}-{work-item}.json:

{
  "skill": "{skill-name}",
  "work_item": "{id}",
  "graded_at": "{YYYY-MM-DD}",
  "test_cases": [
    {
      "id": 1,
      "prompt": "{prompt from evals.json}",
      "assertion_results": [
        {
          "text": "{assertion text}",
          "passed": true,
          "evidence": "{quoted content or command output that confirms it}"
        },
        {
          "text": "{assertion text}",
          "passed": false,
          "evidence": "{exactly what was found or missing}"
        }
      ],
      "summary": {
        "passed": N,
        "failed": N,
        "total": N,
        "pass_rate": 0.NN
      }
    }
  ],
  "overall": {
    "passed": N,
    "failed": N,
    "total": N,
    "pass_rate": 0.NN
  }
}

Create .orchestra/eval/ directory if it doesn't exist.

Step 5: Report to User

Present a clean summary:

## Eval: {skill-name} — {work-item}

Overall: {passed}/{total} assertions passed ({pass_rate}%)

### Test Case 1
✓ prd.md exists at .orchestra/work/{id}/prd.md
✓ prd.md frontmatter has status: approved
✓ prd.md contains a ## Problem section
✗ prd.md contains no framework or library names
  → Found "pytest" in line 12: "Success criteria: pytest tests pass"
✓ spec.md contains a ### Unit Tests subsection
...

### Failed Assertions ({N} total)
1. [test-1] prd.md contains no framework or library names
   → Found "pytest" in line 12
2. [test-1] gherkin-spec.md contains at least one error scenario
   → No scenario with error, invalid, or fail in its name

### Next Steps
Fix the {N} failing assertions in {skill-name}/SKILL.md then re-run the skill and eval.

Rules

Grade the assertions in evals.json for the relevant test case, not all test cases — match by which test case fits what was produced
If multiple test cases exist and only one was run, grade the one whose prompt best matches the produced work item
Do not modify the skill under test — only read and grade
The grading.json is the authoritative record; the console output is for the human

Hard Stop

After presenting the report and writing grading.json, stop completely. Do not:

Run any other skill
Start any new work
Interpret project state as a signal to do something
Invoke orchestra-plan, orchestra-implement, or any other skill

Your job is grading. It ends when the report is written.

mpazaryna/orchestra-eval

skills/orchestra-eval/SKILL.md

Grade the outputs of an orchestra skill run against its eval assertions — reads produced files, checks each assertion, and writes a grading report.

testing

Updated May 5, 2026

$ install --global

skillsauth

npx skillsauth add mpazaryna/agentic-factory orchestra-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 5, 2026, 2:04 AM197.5s1 file scanned

SKILL.md

name:: orchestra-eval
description:: Grade the outputs of an orchestra skill run against its eval assertions — reads produced files, checks each assertion, and writes a grading report.
when_to_use:: Use after running an orchestra skill to verify it produced correct outputs. Run orchestra-eval <skill-name> [work-item-id] in the project where the skill was just executed.
allowed-tools:: Read, Glob, Grep, Bash, Write
argument-hint:: <skill-name> [work-item-id]

Orchestra Eval

Grade an orchestra skill's output against its eval assertions. This skill does not run the skill under test — it grades what was already produced.

Variables

SKILL_NAME: first word of $ARGUMENTS WORK_ITEM: second word of $ARGUMENTS (optional)

Step 1: Load the Eval Suite

Find the skill's evals.json:

${CLAUDE_SKILL_DIR}/../{SKILL_NAME}/evals/evals.json

If the file doesn't exist, stop: "No evals found for {SKILL_NAME}. Add evals/evals.json to the skill directory."

Read the file. Note how many test cases exist.

Step 2: Identify the Work Item

The grader needs to know which output to evaluate.

If WORK_ITEM is provided in $ARGUMENTS:

Use .orchestra/work/{WORK_ITEM}/ as the output directory

If no WORK_ITEM:

Glob .orchestra/work/*/ to find all work items
Find the most recently modified one:
```
ls -t .orchestra/work/ | head -1
```
Use that as the work item
Report which work item was auto-detected

If no work items exist, stop: "No .orchestra/work/ items found in this project. Run the skill first, then eval."

Step 3: Grade Each Test Case

For each test case in evals.json, grade every assertion. Record PASS or FAIL with specific evidence.

Assertion types and how to check them

File existence — "X file exists at path":

ls .orchestra/work/{id}/prd.md 2>/dev/null && echo "EXISTS" || echo "MISSING"

PASS if file exists and is non-empty. FAIL if missing or empty.

Section presence — "contains a ## X section":

grep -c "^## X" .orchestra/work/{id}/prd.md

PASS if count > 0. FAIL if absent.

Frontmatter field — "frontmatter has status: approved":

grep "^status:" .orchestra/work/{id}/prd.md

PASS if value matches. FAIL if missing or wrong value.

Absence checks — "contains no X": Read the file content. Look for the prohibited term or pattern.

For "no framework or library names": scan for: pytest, click, argparse, zoneinfo, pytz, flask, fastapi, django, httpx, requests, sqlalchemy, pydantic
For "no HTTP method names": scan for: GET, POST, PUT, DELETE, PATCH as standalone words
For "no endpoint paths": scan for patterns like /api/, /v1/, starting with /
For "no technical implementation language": use model judgment — read the section and evaluate whether it describes outcomes or implementations PASS if prohibited content is absent. FAIL with the specific term or line found.

Count assertions — "at least N X": Count the relevant items and compare.

Grading rules

Require evidence for every PASS. Quote the file content or show the grep result.
Require specifics for every FAIL. Name the exact term, missing section, or wrong value.
Never give benefit of the doubt. If an assertion says "contains no technical language" and the file has one framework name, that's a FAIL even if everything else is clean.

Step 4: Write the Grading Report

Write to .orchestra/eval/{skill-name}/{date}-{work-item}.json:

{
  "skill": "{skill-name}",
  "work_item": "{id}",
  "graded_at": "{YYYY-MM-DD}",
  "test_cases": [
    {
      "id": 1,
      "prompt": "{prompt from evals.json}",
      "assertion_results": [
        {
          "text": "{assertion text}",
          "passed": true,
          "evidence": "{quoted content or command output that confirms it}"
        },
        {
          "text": "{assertion text}",
          "passed": false,
          "evidence": "{exactly what was found or missing}"
        }
      ],
      "summary": {
        "passed": N,
        "failed": N,
        "total": N,
        "pass_rate": 0.NN
      }
    }
  ],
  "overall": {
    "passed": N,
    "failed": N,
    "total": N,
    "pass_rate": 0.NN
  }
}

Create .orchestra/eval/ directory if it doesn't exist.

Step 5: Report to User

Present a clean summary:

## Eval: {skill-name} — {work-item}

Overall: {passed}/{total} assertions passed ({pass_rate}%)

### Test Case 1
✓ prd.md exists at .orchestra/work/{id}/prd.md
✓ prd.md frontmatter has status: approved
✓ prd.md contains a ## Problem section
✗ prd.md contains no framework or library names
  → Found "pytest" in line 12: "Success criteria: pytest tests pass"
✓ spec.md contains a ### Unit Tests subsection
...

### Failed Assertions ({N} total)
1. [test-1] prd.md contains no framework or library names
   → Found "pytest" in line 12
2. [test-1] gherkin-spec.md contains at least one error scenario
   → No scenario with error, invalid, or fail in its name

### Next Steps
Fix the {N} failing assertions in {skill-name}/SKILL.md then re-run the skill and eval.

Rules

Grade the assertions in evals.json for the relevant test case, not all test cases — match by which test case fits what was produced
If multiple test cases exist and only one was run, grade the one whose prompt best matches the produced work item
Do not modify the skill under test — only read and grade
The grading.json is the authoritative record; the console output is for the human

Hard Stop

After presenting the report and writing grading.json, stop completely. Do not:

Run any other skill
Start any new work
Interpret project state as a signal to do something
Invoke orchestra-plan, orchestra-implement, or any other skill

Your job is grading. It ends when the report is written.

Related Skills

mpazaryna/cloudflare

development

VerifiedTrustedCommunity

Comprehensive Cloudflare platform skill covering Workers, Pages, storage (KV, D1, R2), AI (Workers AI, Vectorize, Agents SDK), feature flags (Flagship), networking (Tunnel, Spectrum), security (WAF, DDoS), and infrastructure-as-code (Terraform, Pulumi). Use for any Cloudflare development task. Biases towards retrieval from Cloudflare docs over pre-trained knowledge.

1SKILL.mdUpdated Jun 4, 2026

mpazaryna/cloudflare-email-service

tools

VerifiedTrustedCommunity

Send and receive transactional emails with Cloudflare Email Service (Email Sending + Email Routing). Use when building email sending (Workers binding or REST API), email routing, Agents SDK email handling, or integrating email into any app — Workers, Node.js, Python, Go, etc. Also use for email deliverability, SPF/DKIM/DMARC, wrangler email setup, MCP email tools, or when a coding agent needs to send emails. Even for simple requests like "add email to my Worker" — this skill has critical config details.

1SKILL.mdUpdated Jun 4, 2026

mpazaryna/cloudflare-email-service

mpazaryna/agents-sdk

tools

VerifiedTrustedCommunity

Build AI agents on Cloudflare Workers using the Agents SDK. Load when creating stateful agents, durable workflows, real-time WebSocket apps, scheduled tasks, MCP servers, chat applications, voice agents, or browser automation. Covers Agent class, state management, callable RPC, Workflows, durable execution, queues, retries, observability, and React hooks. Biases towards retrieval from Cloudflare docs over pre-trained knowledge.

1SKILL.mdUpdated Jun 4, 2026

mpazaryna/orchestra-plan

tools

VerifiedTrustedCommunity

Planning conductor — runs the full PRD → Spec → Gherkin loop in a single interactive session with human approval gates.

SKILL.mdUpdated May 5, 2026

mpazaryna/orchestra-plan

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/mpazaryna/agentic-factory.git

# Copy into Claude Code skills folder (global)
cp -r agentic-factory/skills/orchestra-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

mpazaryna/agentic-factory

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT