plugins/powerups/skills/test-driven-development/SKILL.md
Use when building any feature (small or large), fixing bugs, or refactoring — always write failing tests first against real infrastructure
npx skillsauth add jackyliang/powerups test-driven-developmentInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Write the test first. Watch it fail. Write minimal code to pass. No exceptions.
If you didn't watch the test fail, you don't know if it tests the right thing. Tests written after implementation pass immediately — that proves nothing.
Always:
Exceptions (ask the user first):
NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST
Wrote code before the test? Delete it. Start over. Don't keep it as "reference" — you'll adapt it instead of writing from tests. Delete means delete.
Write one test that describes what the code should do.
Run it. Confirm it fails. The failure must be because the feature is missing, not because of a typo or import error.
pytest tests/test_{module}.py::TestClassName::test_method -v
If the test passes immediately, you're testing existing behavior. Fix the test.
Write the simplest code that makes the test pass. Don't add features, refactor, or "improve" beyond what the test requires.
pytest tests/test_{module}.py -v
All tests must pass — the new one and all existing ones.
Only after green. Remove duplication, improve names, extract helpers. Keep all tests green. Don't add behavior.
Next failing test for the next behavior.
Tests should verify business outcomes, not that frameworks work.
Good — tests actual behavior:
async def test_upsert_updates_existing_record(writer, test_engine, test_schema):
await writer.ensure_schema(test_schema)
await writer.ensure_table(test_schema, "users", {"id": "1", "name": "Alice", "status": "new"})
await writer.upsert_records(test_schema, "users", [{"id": "1", "name": "Alice", "status": "new"}])
# Upsert with updated status
count = await writer.upsert_records(
test_schema, "users", [{"id": "1", "name": "Alice", "status": "active"}]
)
assert count == 1
# Verify the actual data changed in the database
async with test_engine.connect() as conn:
result = await conn.execute(
text(f'SELECT "status" FROM "{test_schema}"."users" WHERE "id" = \'1\'')
)
assert result.scalar() == "active"
Bad — tests that the framework returns 200:
async def test_endpoint_works(client):
response = await client.get("/v1/health")
assert response.status_code == 200 # This tells you nothing useful
Name tests so failures are self-documenting:
# Good — describes the behavior
async def test_rejects_duplicate_sync_trigger_with_409(...)
async def test_expired_otp_returns_unauthorized(...)
async def test_upsert_sets_synced_at_timestamp(...)
# Bad — vague
async def test_sync_works(...)
async def test_auth(...)
async def test_api(...)
If the test name has "and" in it, split it into two tests.
Do not use unittest.mock, MagicMock, patch, or any mocking library.
Tests must hit real infrastructure:
| What | How to Test | |------|-------------| | Database operations | Real Postgres with cleanup-based isolation | | API endpoints | Full HTTP via FastAPI test client against real DB | | Service classes | Instantiate real classes, call real methods | | Provider APIs (Zendesk, etc.) | Real API with test credentials, or VCR cassettes as last resort | | Nango OAuth | Nango sandbox/test environment |
The only acceptable mock: A third-party API that has no sandbox, no test environment, and charges per call. Document why with a comment above the mock:
# Mock required: Stripe charges real money per API call and
# their sandbox is down. No viable alternative for CI.
"It's easier" and "it's faster" are not valid reasons to mock.
Always test against a real database — the same engine as production (Postgres, not SQLite). Never test against production — use a dedicated test database.
You need explicit cleanup so tests don't leave debris. Use one or both of these patterns:
Each test runs in a transaction that rolls back afterward. Fast, but can't test DDL operations.
@pytest.fixture
async def db_session(test_engine):
async with test_engine.connect() as conn:
trans = await conn.begin()
session = AsyncSession(bind=conn)
yield session
await trans.rollback() # Each test starts clean
When tests create schemas, tables, or other DDL that can't be rolled back:
@pytest.fixture(autouse=True)
async def _cleanup_schema(test_engine, test_schema):
yield
async with test_engine.begin() as conn:
await conn.execute(text(f'DROP SCHEMA IF EXISTS "{test_schema}" CASCADE'))
Rules for cleanup-based isolation:
test_{uuid.uuid4().hex[:12]}) to avoid collisionsyield fixture (runs even if test fails)pytest tests/ -v # All tests
pytest tests/test_{module}.py -v # Specific module
pytest tests/test_{module}.py -k "test_name" -v # Specific test
When using the plan-driven-development skill, every milestone lists test tasks before implementation tasks — PDD's milestone template owns that structure; this skill governs how each test is written.
When spawning subagents for parallel milestone work, every agent prompt must include:
"Write failing integration tests FIRST in
tests/test_{module}.py. Tests must use a real database — no mocks. Run them to confirm they fail. Then write implementation to make them pass."
The team lead must verify:
mock, patch, MagicMock)| Layer | Test Pattern | |-------|--------------| | API endpoints | Full HTTP request → verify response AND database state | | Service methods | Call real methods → verify side effects (DB writes, state changes) | | Data models | Create real instances → verify constraints, defaults, relationships | | Business logic | Pure functions → verify input/output for normal cases AND edge cases | | Error handling | Trigger real errors → verify correct error codes and messages | | Sync engine | Run real syncs → verify data lands in correct schema/table |
| Excuse | Reality | |--------|---------| | "Too simple to test" | Simple code breaks. Test takes 30 seconds. | | "I'll test after" | Tests passing immediately prove nothing. | | "Need to explore first" | Fine. Throw away exploration. Start with TDD. | | "Test is hard to write" | Listen to the test. Hard to test = hard to use. Simplify the design. | | "Mocking is faster" | Mocks test mock behavior, not real behavior. You'll ship bugs. | | "Just this once" | That's rationalization. Every time is "just this once." | | "The plan doesn't mention tests" | Every plan implicitly requires TDD. Add test tasks. |
Before marking any task or feature complete:
development
Run PM-grade discovery before building any user-facing feature — problem statement, jobs-to-be-done, core flow, decision matrix. Output is a short brief with open decisions surfaced as explicit questions. Invoked by plan-driven-development and give-me-five.
testing
Ultra-short replies — answer a quick question, draft a short text/social post, or draft a short email. No preamble, no offers to elaborate, drafts under 480 characters (280 for X), never em-dashes.
development
Reconcile shipped code with the plan in both directions — additive drift (unplanned work that landed) and subtractive drift (dead files, stale TODOs, completed deferred items). Run as part of PDD's post-completion audit, before /simplify.
data-ai
Generate 5 meaningfully distinct UI/UX variants of the same screen in parallel (one subagent each), reachable via ?style=1...5, so the user can compare and pick. Calling again on a chosen style refines within that direction.