skills/orchestra-eval/SKILL.md
Grade the outputs of an orchestra skill run against its eval assertions — reads produced files, checks each assertion, and writes a grading report.
npx skillsauth add mpazaryna/agentic-factory orchestra-eval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Grade an orchestra skill's output against its eval assertions. This skill does not run the skill under test — it grades what was already produced.
SKILL_NAME: first word of $ARGUMENTS
WORK_ITEM: second word of $ARGUMENTS (optional)
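The split described above can be sketched in plain shell. This is a minimal illustration only: `parse_arguments` is a hypothetical helper name, and the sketch assumes $ARGUMENTS arrives as a single whitespace-separated string.

```shell
# Sketch: split an $ARGUMENTS-style string into SKILL_NAME and WORK_ITEM.
# parse_arguments is a hypothetical helper, not part of the skill itself.
parse_arguments() {
  # Unquoted expansion is intentional: it word-splits on whitespace.
  # set -f keeps glob characters in the arguments from expanding.
  set -f
  set -- $1
  set +f
  SKILL_NAME=${1:-}
  WORK_ITEM=${2:-}   # stays empty when no work item was given
}
```

Calling `parse_arguments "$ARGUMENTS"` leaves WORK_ITEM empty when only a skill name is passed, which is the case the work-item lookup has to handle.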
Find the skill's evals.json:
${CLAUDE_SKILL_DIR}/../{SKILL_NAME}/evals/evals.json
If the file doesn't exist, stop: "No evals found for {SKILL_NAME}. Add evals/evals.json to the skill directory."
Read the file. Note how many test cases exist.
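As a rough sketch, the lookup and the stop condition could look like the following. `find_evals` is a hypothetical name, and the sketch assumes CLAUDE_SKILL_DIR points at this skill's own directory. Counting test cases is left out because the evals.json schema is not shown here.

```shell
# find_evals SKILL_NAME: print the evals.json path, or emit the stop message.
# Assumes CLAUDE_SKILL_DIR is set to this skill's own directory.
find_evals() {
  evals_file="${CLAUDE_SKILL_DIR}/../$1/evals/evals.json"
  if [ ! -f "$evals_file" ]; then
    echo "No evals found for $1. Add evals/evals.json to the skill directory." >&2
    return 1
  fi
  printf '%s\n' "$evals_file"
}
```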
The grader needs to know which output to evaluate.
If WORK_ITEM is provided in $ARGUMENTS:
Use .orchestra/work/{WORK_ITEM}/ as the output directory.
If no WORK_ITEM:
List .orchestra/work/*/ to find all work items, then take the most recently modified one:
ls -t .orchestra/work/ | head -1
If no work items exist, stop: "No .orchestra/work/ items found in this project. Run the skill first, then eval."
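The selection logic above can be sketched as one function. `resolve_work_item` is a hypothetical name; the sketch assumes it runs from the project root and that the most recently modified entry under .orchestra/work/ is the one to grade.

```shell
# resolve_work_item [WORK_ITEM]: print the work item to grade.
# Falls back to the most recently modified entry under .orchestra/work/.
resolve_work_item() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
    return 0
  fi
  latest=$(ls -t .orchestra/work 2>/dev/null | head -1)
  if [ -z "$latest" ]; then
    echo "No .orchestra/work/ items found in this project. Run the skill first, then eval." >&2
    return 1
  fi
  printf '%s\n' "$latest"
}
```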
For each test case in evals.json, grade every assertion. Record PASS or FAIL with specific evidence.
File existence — "X file exists at path":
[ -s .orchestra/work/{id}/prd.md ] && echo "EXISTS" || echo "MISSING"
PASS if file exists and is non-empty. FAIL if missing or empty.
Section presence — "contains a ## X section":
grep -c "^## X" .orchestra/work/{id}/prd.md
PASS if count > 0. FAIL if absent.
Frontmatter field — "frontmatter has status: approved":
grep "^status:" .orchestra/work/{id}/prd.md
PASS if value matches. FAIL if missing or wrong value.
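A plain grep for "^status:" also matches a status: line in the document body. One possible refinement, assuming the frontmatter is a YAML block delimited by --- lines at the top of the file, reads only that block. `frontmatter_value` is a hypothetical helper name.

```shell
# frontmatter_value FILE KEY: print KEY's value from the leading --- block.
# Assumes frontmatter starts on line 1 and is closed by a second --- line.
frontmatter_value() {
  awk -v key="$2" '
    NR == 1 && $0 != "---" { exit }   # no frontmatter at all
    NR > 1 && $0 == "---"  { exit }   # closing delimiter: stop reading
    NR > 1 && index($0, key ":") == 1 {
      value = substr($0, length(key) + 2)
      sub(/^[ \t]+/, "", value)       # trim leading whitespace
      sub(/[ \t]+$/, "", value)       # trim trailing whitespace
      print value
      exit
    }
  ' "$1"
}
```

The assertion passes when the printed value equals the expected one, for example `[ "$(frontmatter_value prd.md status)" = "approved" ]`.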
Absence checks — "contains no X": Read the file content. Look for the prohibited term or pattern.
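For prohibited terms that can be written as a pattern, a mechanical version of the absence check might look like this. `assert_absent` is a hypothetical helper, and the term list passed to it is an assumption, since the assertion text does not enumerate which names count.

```shell
# assert_absent FILE PATTERN: succeed when PATTERN does not occur in FILE.
# On a hit, print the matching lines with line numbers as evidence and fail.
assert_absent() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 2
  fi
  if matches=$(grep -n -i -E -- "$2" "$1"); then
    printf '%s\n' "$matches"
    return 1
  fi
  return 0
}
```

The printed `line:content` pairs can be quoted directly in the evidence field of a failing assertion.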
Count assertions — "at least N X": Count the relevant items and compare.
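When the items to count can be matched line by line, the comparison could be sketched as below. `assert_at_least` is a hypothetical helper, and the pattern that identifies an item (for example a Scenario: line) is an assumption that depends on the assertion being graded.

```shell
# assert_at_least FILE PATTERN N: succeed when at least N lines match PATTERN.
# Prints the count found so it can be recorded as evidence.
assert_at_least() {
  count=$(grep -c -E -- "$2" "$1" 2>/dev/null) || true
  count=${count:-0}   # a missing file counts as zero matches
  echo "found $count, need $3"
  [ "$count" -ge "$3" ]
}
```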
Model judgment assertions — assertions about quality, tone, or semantic content that can't be mechanically verified: Read the relevant section and reason about whether it passes. State explicitly: "Model judgment: [reasoning]. Result: PASS/FAIL."
Write to .orchestra/eval/{skill-name}/{date}-{work-item}.json:
{
  "skill": "{skill-name}",
  "work_item": "{id}",
  "graded_at": "{YYYY-MM-DD}",
  "test_cases": [
    {
      "id": 1,
      "prompt": "{prompt from evals.json}",
      "assertion_results": [
        {
          "text": "{assertion text}",
          "passed": true,
          "evidence": "{quoted content or command output that confirms it}"
        },
        {
          "text": "{assertion text}",
          "passed": false,
          "evidence": "{exactly what was found or missing}"
        }
      ],
      "summary": {
        "passed": N,
        "failed": N,
        "total": N,
        "pass_rate": 0.NN
      }
    }
  ],
  "overall": {
    "passed": N,
    "failed": N,
    "total": N,
    "pass_rate": 0.NN
  }
}
Create the .orchestra/eval/ directory if it doesn't exist.
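A small sketch of building the output path and creating its directory follows. `report_path` is a hypothetical helper, and using YYYY-MM-DD for the {date} part of the file name is an assumption borrowed from the graded_at field.

```shell
# report_path SKILL_NAME WORK_ITEM: create the eval directory if needed and
# print the report path. The YYYY-MM-DD date format is an assumption.
report_path() {
  eval_dir=".orchestra/eval/$1"
  mkdir -p "$eval_dir" || return 1
  printf '%s/%s-%s.json\n' "$eval_dir" "$(date +%Y-%m-%d)" "$2"
}
```

`mkdir -p` creates both .orchestra/eval/ and the per-skill subdirectory, and does nothing if they already exist.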
Present a clean summary:
## Eval: {skill-name} — {work-item}
Overall: {passed}/{total} assertions passed ({pass_rate}%)
### Test Case 1
✓ prd.md exists at .orchestra/work/{id}/prd.md
✓ prd.md frontmatter has status: approved
✓ prd.md contains a ## Problem section
✗ prd.md contains no framework or library names
→ Found "pytest" in line 12: "Success criteria: pytest tests pass"
✓ spec.md contains a ### Unit Tests subsection
...
### Failed Assertions ({N} total)
1. [test-1] prd.md contains no framework or library names
→ Found "pytest" in line 12
2. [test-1] gherkin-spec.md contains at least one error scenario
→ No scenario with error, invalid, or fail in its name
### Next Steps
Fix the {N} failing assertions in {skill-name}/SKILL.md then re-run the skill and eval.
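The JSON stores pass_rate as a fraction (0.NN), while the summary line shows a percentage, so the two need a conversion. One way to compute both from the same counts is sketched here. The helper names are hypothetical, and treating a total of zero as a rate of 0 is an assumption.

```shell
# pass_rate PASSED TOTAL: print the fraction with two decimals (0.NN).
pass_rate() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.2f\n", (t > 0 ? p / t : 0) }'
}

# pass_percent PASSED TOTAL: print a whole-number percentage for the report.
pass_percent() {
  awk -v p="$1" -v t="$2" 'BEGIN { printf "%.0f\n", (t > 0 ? 100 * p / t : 0) }'
}
```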
After presenting the report and writing the grading JSON file, stop completely.
Your job is grading. It ends when the report is written.