skills/orchestra-spec/SKILL.md
Generate an execution spec from an approved PRD — define approach, steps, deliverables, acceptance criteria, and risks.
Generate an execution spec from an approved PRD. The spec is the contract between the conductor (you) and the orchestra (agents).
.orchestra/work/{id}-{name}/prd.md
If the work involves code changes:
Break the work into concrete steps:
TDD is mandatory across three tiers. For every spec that involves code, design all three layers before writing the approach steps:
Unit tests — mock external boundaries (network, filesystem, DB). Fast, deterministic. Committed before the implementation they cover. Run as the project default suite — no external dependencies required.
Integration tests — exercise real subsystems at the boundary being tested (real API, real DB, real filesystem). No mocks at the boundary. Excluded from the default run — require external access. Must include any external API call so redirect behavior, auth, and response shape are verified against reality.
E2E tests — exercise the full user-facing interface from end to end (CLI invocation, HTTP endpoint, browser). Hits real external dependencies. Excluded from the default run — require the full running stack.
For each tier, specify:
Generate the spec with this structure:
Save to: .orchestra/work/{id}-{name}/spec.md (alongside the PRD)
Include YAML frontmatter at the top of the file:
---
ticket: {ticket-id or slug — matches the work item folder name}
status: draft
created_on: {today's date in YYYY-MM-DD format}
---
Status vocabulary for specs: draft → approved → in-progress → complete
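The status vocabulary above can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the names `SPEC_STATUSES`, `next_status`, and `is_valid_transition` are hypothetical, and the rule that a spec only moves one step forward at a time is an assumption layered on top of the stated ordering.

```python
from typing import Optional

# Ordering taken from the status vocabulary: draft -> approved -> in-progress -> complete
SPEC_STATUSES = ["draft", "approved", "in-progress", "complete"]


def next_status(current: str) -> Optional[str]:
    """Return the status that follows `current`, or None if `current` is final."""
    index = SPEC_STATUSES.index(current)  # raises ValueError for an unknown status
    if index + 1 < len(SPEC_STATUSES):
        return SPEC_STATUSES[index + 1]
    return None


def is_valid_transition(old: str, new: str) -> bool:
    """Assumption: a spec advances exactly one step, never skips or moves back."""
    return old in SPEC_STATUSES and next_status(old) == new
```

A check like this could run before the frontmatter `status` field is rewritten, so a spec cannot jump from draft straight to complete.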
Show the user the spec. Ask:
| | PRD | Spec |
|---|---|---|
| Answers | What and why | How |
| Audience | Composer (strategic) | Orchestra (execution) |
| Language | Business/outcome | Tactical/technical |
| Approval | Before spec | Before implementation |
| Lifespan | Until milestone closes | Until work is done |
Every spec that involves code must plan for three test tiers. All three are non-negotiable. Each tier has a different purpose, a different mocking stance, and a different run profile.
Purpose: Verify logic in isolation. Fast, deterministic, no I/O.
Mocking stance: Mock everything outside the unit under test — network, filesystem, DB, time. Use purpose-built mocking tools (e.g. pytest-httpx, monkeypatch, unittest.mock).
Run profile: Default — uv run pytest with no flags. Must pass on every commit.
What they catch: Logic errors, branching, edge cases, error paths.
What they do NOT catch: Redirect behavior, real API response shape, real DB schema changes, real filesystem permissions — anything that only exists at the real boundary.
TDD constraint: Test file committed before the implementation file it covers.
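As an illustration of the unit tier, here is a small sketch of a test that mocks the network boundary and checks only the logic. The `convert` function and the `rate_client.get_rate` interface are hypothetical stand-ins, not names from any real project; the standard library's `unittest.mock` is used so the example has no third-party dependencies.

```python
from unittest.mock import Mock


def convert(amount: float, base: str, quote: str, rate_client) -> float:
    """Hypothetical unit under test: pure logic around an injected client."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    rate = rate_client.get_rate(base, quote)  # network boundary, mocked in unit tests
    return round(amount * rate, 2)


def test_convert_multiplies_by_rate():
    client = Mock()
    client.get_rate.return_value = 1.25  # canned rate, no network involved
    assert convert(10, "USD", "EUR", client) == 12.5
    client.get_rate.assert_called_once_with("USD", "EUR")


def test_convert_rejects_negative_amount_without_calling_client():
    client = Mock()
    try:
        convert(-1, "USD", "EUR", client)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a negative amount")
    client.get_rate.assert_not_called()  # error path never reaches the boundary
```

Both tests run under plain `pytest` with no markers, which matches the default run profile. Note that the canned rate of 1.25 is an assumption baked into the mock; nothing here verifies what a real rate service returns.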
Purpose: Verify that the code works correctly against the real external boundary — the actual API, the actual DB, the actual filesystem. This is where mocks lie and reality bites.
Mocking stance: No mocks at the boundary being tested. If you are testing an HTTP API integration, hit the real HTTP endpoint. If you are testing a DB layer, use a real DB. Internal helpers may still be mocked if they are not the subject of the test.
Run profile: uv run pytest -m integration — excluded from the default run because they are slow and require network/external access.
What they catch: Redirect chains, auth failures, unexpected response shapes, rate limits, SSL issues, schema drift — anything that only reveals itself against the real system.
What they do NOT catch: Full user-facing behavior or multi-layer interaction.
TDD constraint: Integration test file committed before the integration implementation it covers.
Mandatory rule: Any spec that touches an external API, database, or remote service MUST include at least one integration test that hits the real endpoint. "We tested it with mocks" is not sufficient. Mocks encode assumptions; integration tests verify them.
Purpose: Verify the complete user-facing interface from entry point to output, exercising the full stack with real external dependencies.
Mocking stance: Nothing mocked. The test invokes the system exactly as a user would — CLI command, HTTP request, UI action.
Run profile: uv run pytest -m e2e — excluded from the default run. May require live credentials or network access.
What they catch: Wiring failures between layers, misconfigured entry points, broken CLI argument parsing, missing env vars in production config, deployment regressions.
What they do NOT catch: Fine-grained logic (that's unit territory) or boundary-specific behavior (that's integration territory).
TDD constraint: E2E test file committed before the final wiring that makes it pass.
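An E2E test for a CLI can be sketched with `subprocess`, invoking the program the same way a user's shell would. In the sketch below, `python -m json.tool` is only a placeholder so the example runs anywhere; the `CLI` constant and `run_cli` helper are hypothetical, and a real suite would point `CLI` at the project's own entry point and decorate the test with `@pytest.mark.e2e`.

```python
import json
import subprocess
import sys

# Placeholder command (assumption): swap in the project's real entry point.
CLI = [sys.executable, "-m", "json.tool"]


def run_cli(args, stdin_text=""):
    """Invoke the CLI as a separate process and capture its output."""
    return subprocess.run(
        CLI + list(args),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )


def test_cli_round_trip():
    result = run_cli([], stdin_text='{"rate": 1.25}')
    # Exit code and stdout are the user-facing contract being verified.
    assert result.returncode == 0, result.stderr
    assert json.loads(result.stdout) == {"rate": 1.25}
```

Going through a separate process is what lets this tier catch broken argument parsing or a misconfigured entry point, which an in-process function call would bypass.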
| Scenario | Unit | Integration | E2E |
|----------|------|-------------|-----|
| External HTTP API | Mock with pytest-httpx | Hit real endpoint | Hit real endpoint |
| Database | Mock or in-memory | Real DB | Real DB |
| Filesystem | tmp_path fixture | Real filesystem | Real filesystem |
| Time / clock | monkeypatch | Real time | Real time |
| Internal functions | Mock if needed | Real | Real |
Pick the right tools for the stack. The three-tier rule is language-agnostic; the tools are not.
| Tier | Tools |
|------|-------|
| Unit | pytest, pytest-httpx (HTTP mocking), monkeypatch (env/time/functions), tmp_path (filesystem) |
| Integration | pytest + @pytest.mark.integration, real httpx/requests calls, real DB via pytest-postgresql / pytest-mongo / SQLite |
| E2E | pytest + @pytest.mark.e2e, subprocess or click.testing.CliRunner against real dependencies |
Run profiles:
uv run pytest # unit only (default)
uv run pytest -m integration # integration suite
uv run pytest -m e2e # E2E suite
uv run pytest -m "integration or e2e" # all live tests
| Tier | Tools |
|------|-------|
| Unit | vitest or jest, msw (Mock Service Worker for HTTP), vi.mock / jest.mock |
| Integration | vitest / jest with real HTTP (fetch/axios), real DB via testcontainers or a local instance |
| E2E | Playwright — full browser automation, also works for API-level E2E via request context |
Run profiles:
npx vitest run # unit only
npx vitest run --project integration # integration suite
npx playwright test # E2E suite
For anything with a browser interface, Playwright is the default E2E tool. Prefer the Playwright MCP (dev-playwright skill) when generating tests — it drives a real browser, explores the UI, and produces assertions grounded in what the page actually renders rather than what the code assumes it renders.
npx playwright test # run all E2E specs
npx playwright test --ui # interactive mode
Key Playwright practices for specs:
- page.getByRole / page.getByLabel locators: more resilient than CSS selectors
Mocks are allowed in unit tests to keep them fast and isolated. Integration tests MUST hit the real system at the boundary being tested, with no exceptions. E2E tests MUST invoke the real user interface against the real stack.
Violating this rule means tests can pass while the system is broken in production. A common failure mode: all unit tests pass against mocked responses, but the real external service returns a redirect or unexpected shape that the mock never encoded — crashing the system at runtime in a way no unit test could have caught.
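The failure mode described above can be reproduced in a few lines. This is a sketch under stated assumptions: `parse_rate`, `MovedHandler`, and the `/rate` path are hypothetical, and a throwaway local HTTP server stands in for the external service purely so the example is self-contained. A real integration test would target the actual endpoint instead.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from unittest.mock import Mock


def parse_rate(status: int, body: bytes) -> float:
    """Hypothetical code under test: assumes a 200 response carrying JSON."""
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    return float(json.loads(body)["rate"])


class MovedHandler(BaseHTTPRequestHandler):
    """Stand-in for a service whose endpoint now answers with a redirect."""

    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "/v2/rate")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status_and_body(host: str, port: int, path: str):
    """Real HTTP request; http.client does not follow redirects on its own."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


def demo():
    # Unit view: the mock encodes the assumption "200 with JSON", so this passes.
    mocked_fetch = Mock(return_value=(200, b'{"rate": 1.1}'))
    unit_result = parse_rate(*mocked_fetch())

    # Boundary view: the same request over real HTTP reveals the redirect.
    server = HTTPServer(("127.0.0.1", 0), MovedHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        status, _body = fetch_status_and_body("127.0.0.1", server.server_port, "/rate")
    finally:
        server.shutdown()
        server.server_close()
    return unit_result, status
```

Calling `demo()` shows the mocked path succeeding while the real request comes back as a 301, which `parse_rate` would reject at runtime. Only a test that crosses the real boundary surfaces that gap.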
In human-paced development, a regression is caught within minutes — someone runs the app and notices it's broken. In agentic development, an autonomous agent can execute dozens of steps, commit multiple files, and close a milestone before anyone looks at the output. Regressions don't surface until review, or worse, production.
Tests are the agent's conscience. They are the only mechanism that runs continuously, independent of the agent's intent, and reports objective truth about the system state. An agent that writes no tests is operating on faith that its own reasoning is correct — and agents, like humans, make mistakes.
The "do no harm" principle: Every spec must treat the existing test suite as a hard constraint. Before implementing anything new:
Regressions are silent failures. A test that was passing and now fails is information. A test that was deleted to avoid a failure is deception. Agents must never remove or weaken a test to make a build green — if a test is failing, stop, surface it, and resolve the root cause.
Milestones build on each other. In a multi-milestone project like this one, each milestone's test suite is inherited by the next. The integration tests from Live Rates must still pass when Multi-Currency ships. The E2E tests from Basic Converter must still pass at Polish & Ship. The full suite across all milestones is the definition of "the system works."
For autonomous agents running long loops: run uv run pytest (or the equivalent) at the start of every session before touching code, and at the end of every step before committing. A red suite is a stop signal, not an obstacle to route around.
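One way an agent loop might enforce the stop signal is a small gate around the test command. This is a minimal sketch, assuming the suite reports failure through a non-zero exit code; `suite_is_green` and `gate` are hypothetical helper names, and the command is passed in so the project can supply `["uv", "run", "pytest"]` or its equivalent.

```python
import subprocess
import sys


def suite_is_green(command) -> bool:
    """Run the test command and report whether it exited with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


def gate(command) -> None:
    """Halt the session instead of continuing on a red suite."""
    if not suite_is_green(command):
        raise SystemExit("test suite is red: stop and surface the failure")


if __name__ == "__main__":
    # Assumed project command; replace with the stack's own test runner.
    gate(["uv", "run", "pytest"])
    print("suite is green", file=sys.stderr)
```

Calling `gate` at session start and again before each commit keeps the check independent of the agent's own judgment about whether a change was safe.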