Perfection Engine v4 — Pure Discovery

Reads any codebase. Discovers what matters. Measures it. Fixes it. Loops until perfect.

No templates. No hardcoded categories. Everything generated from the project itself.

Quick Start

# THE ULTIMATE COMMAND — runs everything, makes the project the best it can be
/perfection-engine max                      # All features: score + fix + ideas + implement + loop until complete or a safety cap is hit

# Individual modes
/perfection-engine                          # Quality loop only (score + fix until 10/10)
/perfection-engine --goal "description"     # Goal-driven: measure completeness against an end state
/perfection-engine --ideas                  # Quality loop + ideas discovery each cycle
/perfection-engine score                    # Score only (no fixes)
/perfection-engine score --changed-only     # Re-score only metrics affected by files changed since last cycle
/perfection-engine validate                 # Validate scoring quality
/perfection-engine fix                      # Fix failing metrics only
/perfection-engine fix --dry-run            # Show what fixes would be applied
/perfection-engine fix --quick-wins         # Fix easiest high-impact items first
/perfection-engine ideas                    # Discover improvement opportunities
/perfection-engine ideas --implement        # Discover AND auto-implement best ideas
/perfection-engine report                   # Display latest scorecard
/perfection-engine portfolio                # Cross-project dashboard
/perfection-engine export                   # Export scores to CSV/JSON
/perfection-engine reset                    # Delete state, start fresh

MAX Mode — The Ultimate Command

/perfection-engine max runs EVERY feature in a single continuous loop:

DISCOVER → SCORE → VALIDATE → FIX → IDEAS → IMPLEMENT → LOOP
    ↑                                                      |
    └──────────────────────────────────────────────────────┘

What it does each cycle:

DISCOVER — Read the project, discover capabilities, generate/refresh rubric
SCORE — Score every metric using all available tools (Playwright, bash, LLM, tests)
VALIDATE — Check scoring quality (environment, contradictions, coverage, flaky)
FIX — Auto-fix all failing quality metrics (security first, then bugs, then polish)
IDEAS — Launch parallel agents to discover improvements, new features, architecture changes
IMPLEMENT — Auto-implement the top small/medium ideas that score highest impact
LOOP — Create GitHub issues, update leaderboard, start next cycle

It combines:

Quality scoring + fixing (the base engine)
Validation (environment checks, contradiction detection, regression testing)
Ideas discovery (6 parallel research agents)
Auto-implementation (build new features, not just fix existing ones)
Continuous looping (Stop hook keeps it running)

The result: The engine doesn't just make your code better — it makes your PROJECT better. It finds what's missing, builds it, then makes sure it's high quality. It loops until every quality metric is 10/10 AND there are no more improvement ideas to implement.

Safety: All safety rails apply in MAX mode: max 20 cycles, $50/cycle cost threshold (pause for confirmation), simplicity gate, ratcheting, dev branch only, never deletes features.

How It Works

READ the project → DISCOVER what matters → SCORE everything →
FIX what's broken → LOOP (re-read, re-discover, re-score, re-fix)

The engine has zero opinions about what your project should look like. It reads the code, understands what the project does, generates categories and metrics that make sense for THIS project, scores them, fixes what it can, and repeats.

Each cycle, the rubric refreshes — new metrics added for new code, obsolete metrics removed for deleted code, priorities adjusted based on current state.

Phase 1: DISCOVER

The engine's intelligence lives here. Everything is generated dynamically.

Step 1: Load State

mkdir -p .claude/skills/perfection-engine-state

If perfection-rubric.json exists → load it (continuing from previous cycle)
If current-cycle.json shows in-progress and started less than 2 hours ago → STOP with message "Another cycle is in progress (started {time}). Wait or run /perfection-engine reset." Locks older than 2 hours are considered stale and ignored.
If .perfection-engine.yml exists in project root → load user config overrides
Write lock: {"cycle": N, "status": "in-progress|completed|paused|error", "started_at": "ISO", "completed_at": "ISO|null", "metrics_scored": 0, "metrics_fixed": 0}
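
The lock rules above can be sketched in a few lines. This is a minimal illustration rather than the engine's actual code: the function names are hypothetical, and it assumes started_at is stored as an ISO 8601 string with a UTC offset.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=2)  # staleness window from the step above


def lock_blocks_start(lock: dict, now: datetime) -> bool:
    """Return True when an existing lock should stop a new cycle from starting."""
    if lock.get("status") != "in-progress":
        return False  # completed, paused, and error states never block
    started = datetime.fromisoformat(lock["started_at"])
    return now - started < STALE_AFTER  # older locks are stale and ignored


def new_lock(cycle: int, now: datetime) -> dict:
    """Build the lock record that would be written to current-cycle.json."""
    return {
        "cycle": cycle,
        "status": "in-progress",
        "started_at": now.isoformat(),
        "completed_at": None,
        "metrics_scored": 0,
        "metrics_fixed": 0,
    }
```

Passing `now` in explicitly keeps the staleness rule easy to test without touching the clock.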

Step 2: Read the Project

Build a comprehensive understanding of the project by reading:

Project manifest: package.json, pyproject.toml, Cargo.toml, go.mod, pom.xml, etc.
Project docs: CLAUDE.md, README.md, any docs/ directory
Source structure: List all directories and files, understand the architecture
Configuration: .env.example, config files, CI workflows, deployment configs
Dependencies: What libraries, frameworks, APIs does this project use?
Test infrastructure: What test files exist? What test runner? What coverage?

Skip binary files: Do not read images, fonts, compiled binaries, databases, or other non-text files during discovery. Detect by extension (.png, .jpg, .gif, .woff, .ttf, .ico, .sqlite, .wasm, etc.) or by failed text decoding.
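
One way to apply both detection rules is shown below. It is a sketch under assumptions: the function name and the 4 KB sample idea are hypothetical, the extension set contains only the examples named above, and "failed text decoding" is taken to mean "not valid UTF-8".

```python
import codecs
from pathlib import PurePath

# Only the extensions named in the text; the real list is open-ended.
BINARY_EXTENSIONS = {".png", ".jpg", ".gif", ".woff", ".ttf", ".ico", ".sqlite", ".wasm"}


def looks_binary(filename: str, sample: bytes) -> bool:
    """Decide whether to skip a file, given its name and its first few KB."""
    if PurePath(filename).suffix.lower() in BINARY_EXTENSIONS:
        return True
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False
```

The caller would read something like the first 4096 bytes of each file and pass them in as `sample`.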

For large projects (>200 source files): Sample strategically — read entry points, config files, key directories, and a representative sample of source files. Use subagents to read different areas in parallel.

Save understanding as project-profile.json:

{
  "project_name": "detected from manifest",
  "description": "LLM-generated 1-sentence summary of what this project does",
  "languages": ["typescript", "python"],
  "frameworks": ["react", "express"],
  "features_detected": [
    "user authentication",
    "payment processing",
    "real-time chat",
    "admin dashboard"
  ],
  "architecture": "monolith|microservices|serverless|static|library|cli",
  "has_frontend": true,
  "has_backend": true,
  "has_tests": true,
  "has_ci": true,
  "has_deployment": true,
  "urls": { "dev": "...", "prod": "..." },
  "source_file_count": 150,
  "test_runner": "jest|pytest|cargo-test|go-test|make|null",
  "goal": "from --goal flag or .perfection-engine.yml or null"
}

Step 2.5: Bootstrap Guard

If source_file_count == 0 → report "No source files found. Nothing to measure." and exit.

If source_file_count < 5 → generate a simplified rubric focused on project setup (README quality, license, gitignore, basic linting, dependency management) with ~10 metrics. Suggest the user re-run after the project has more substance.

Step 3: Generate the Rubric

This is the core innovation. The LLM reads the project and generates ALL categories and metrics dynamically. No templates, no pre-defined lists.

Rubric generation prompt:

You just read this project: {project_profile}
Source files sampled: {file_list_with_summaries}

Generate a comprehensive quality rubric for THIS specific project.

For each category you discover:
- Give it a short ID (2-5 chars) and descriptive name
- Explain why it matters for THIS project
- Generate 5-25 specific, measurable metrics

For each metric:
- Give it an ID ({category}-{NNN})
- Name it specifically (not generic — reference actual files/features)
- Describe exactly how to measure it (what tool, what command, what to look for)
- Classify the scoring method:
    deterministic — run a command, check output (lint, grep, test, curl)
    playwright — navigate a URL, interact with UI, verify state
    llm_judged — read code/output, evaluate quality with structured rubric
    statistical — run N samples, verify distribution
    unit_test_proxy — run the project's own test suite
- Determine if it can be auto-fixed (true/false)
- Define what 10/10 looks like for this specific metric

Guidelines:
- Be SPECIFIC to this project. "Login page loads" not "Page loads"
- Reference actual file paths, function names, endpoints you found
- Cover ALL dimensions: does it work, is it secure, is it fast, is it accessible,
  is it well-coded, is it well-tested, is it well-documented
- If a goal was provided, generate COMPLETENESS metrics:
  does feature X exist, does flow Y work end-to-end
- Include metrics at every level: code, feature, flow, system
- Remove anything generic that doesn't apply

Return as JSON:
{
  "categories": {
    "category_id": {
      "name": "Category Name",
      "why": "Why this matters for this project",
      "metrics": [
        {
          "id": "CAT-001",
          "name": "Specific metric name",
          "description": "How to measure this",
          "type": "deterministic|playwright|llm_judged|statistical|unit_test_proxy",
          "auto_fixable": true,
          "what_10_looks_like": "Description of perfect score",
          "score": null,
          "incumbent_score": null,
          "target": 10,
          "scope": ["src/auth/*.js"],
          "goal_driven": false,
          "history": []
        }
      ]
    }
  }
}

Step 4: Merge with Previous Rubric (cycle 2+)

On subsequent cycles, the engine re-reads the project and re-generates the rubric. Then merges with the existing rubric:

New metrics (not in previous rubric) → add with score null
Existing metrics (in both) → keep score history, update description if changed
Obsolete metrics (in the previous rubric, but the code they measured was deleted) → mark deprecated, then remove
Duplicate metrics (LLM generated a variant of existing) → deduplicate
Goal change (goal text differs from previous cycle) → deprecate all previous GOAL-* metrics, regenerate from new goal

Rubric cleanup rules:

If a metric has scored 10/10 for 3+ consecutive cycles → mark as stable, carry forward score of 10 (included in aggregation but not re-measured)
If a metric has scored null for 2+ cycles → investigate why, remove if code gone
If total metrics > 300 → ask LLM to consolidate: "Merge metrics that test the same file with the same method, or that are threshold variants of each other"
Hard cap: 500 metrics (absolute maximum, config max_metrics range is 1-500)
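
The merge and the "stable" rule might look roughly like this. It is a simplified sketch: rubrics are flattened to a hypothetical metric-id → metric mapping, history is assumed to be a list of numeric scores, a metric counts as obsolete when it is absent from the regenerated rubric, and variant deduplication is left out.

```python
def merge_rubrics(previous: dict, regenerated: dict):
    """Merge a regenerated rubric into the previous one.

    Returns (merged, deprecated_ids).
    """
    merged = {}
    for metric_id, fresh in regenerated.items():
        old = previous.get(metric_id)
        if old is None:
            # New metric: starts unscored with empty history
            merged[metric_id] = {**fresh, "score": None, "history": []}
        else:
            # Existing metric: keep score and history, refresh the description
            merged[metric_id] = {**old, "description": fresh["description"]}
    deprecated = sorted(set(previous) - set(regenerated))
    return merged, deprecated


def is_stable(history: list, streak: int = 3) -> bool:
    """True when the last `streak` cycles all scored 10/10."""
    recent = history[-streak:]
    return len(recent) == streak and all(score == 10 for score in recent)
```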

Step 5: Goal-Driven Completeness (when --goal or config specifies a goal)

When a goal is provided, the LLM generates an additional layer of metrics:

Goal: "{goal_text}"
Current features detected: {features_detected}

What features/flows/capabilities are NEEDED to achieve this goal but
do NOT exist in the codebase? For each missing feature:
- Name it
- Describe what "complete" looks like
- Define 3-5 acceptance criteria
- Estimate complexity (small/medium/large)

Also: for features that PARTIALLY exist, what's missing?

These generate metrics scored using any of the five standard methods (usually playwright or deterministic), and are tagged with "goal_driven": true to distinguish them from structural metrics:

{
  "id": "GOAL-001",
  "name": "Payment flow exists and works end-to-end",
  "type": "playwright",
  "goal_driven": true,
  "score": 0,
  "what_10_looks_like": "User can complete checkout, payment processes, confirmation shows"
}

Step 6: Save State

git add .claude/skills/perfection-engine-state/
git commit -m "perfection-engine: cycle {N} DISCOVER"

If no git repo, write state files without committing.

Phase 2: SCORE

Score every metric in the rubric. The scoring methods are universal — they're tools, not opinions.

Pre-Loop Estimate (first cycle only)

Before starting, estimate and display:

Project: {name}
Metrics discovered: {N} across {M} categories
Estimated scoring time: {T} minutes
Estimated API cost: ${C}

Scoring Methods

Read references/scoring-methods.md for detailed protocols.

| Method | What It Does | When Used |
|--------|-------------|-----------|
| deterministic | Run a command, check output | Lint, grep, test suites, curl, file checks |
| playwright | Navigate URL, interact with UI, verify state | Frontend flows, forms, auth gates |
| llm_judged | Read code/output, evaluate quality | Code review, prompt quality, documentation |
| statistical | Run N samples, verify distribution | Randomness, balance, probability systems |
| unit_test_proxy | Run the project's test suite | Map pass/fail ratio to 0-10 |

Scoring scale: 0 (broken) → 5 (adequate) → 10 (perfect). Full definitions in references/scoring-methods.md.

Scoring Workflow

Group metrics by scoring method:
- Batch A: All deterministic → run in parallel subagents
- Batch B: All playwright → serial (shared browser)
- Batch C: All llm_judged → parallel subagents
- Batch D: Statistical + unit_test_proxy → parallel
Process each batch in groups of 20, compact context between groups
Dependency awareness: If metric A depends on metric B (e.g., "login works" gates "dashboard loads"), and B failed → skip A, score as null with reason
Incremental mode (--changed-only): Use git diff to identify changed files, only re-score metrics related to those files, carry forward unchanged scores
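
A small sketch of the grouping and batching described above, with hypothetical function names. It only shows how metrics could be partitioned by their type field and split into groups of 20; dispatching to subagents and context compaction are out of scope here.

```python
from collections import defaultdict
from itertools import islice


def group_by_method(metrics):
    """Partition metrics by their scoring method (the "type" field)."""
    groups = defaultdict(list)
    for metric in metrics:
        groups[metric["type"]].append(metric)
    return dict(groups)


def chunked(items, size=20):
    """Yield successive groups of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk
```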

Capability Discovery

Before scoring, discover what tools and capabilities are available in this session. The engine adapts its scoring methods based on what it finds — never assume a tool exists, always check first.

Discovery checklist (run at start of each cycle):

MCP servers: Check what MCP tools are available (Playwright, Exa, Tavily, etc.). If Playwright MCP is connected → browser-based scoring is available. If no browser tools → skip all playwright-type metrics or fall back to curl/API checks.
Installed skills: Check what skills are loaded in this session. If superpowers:writing-plans available → use for fix planning. If e2e-playwright-core available → leverage its patterns. If none → fall back to inline approaches.
Plugin tools: Check for any plugin-provided tools that could enhance scoring (web search, code analysis, database queries, etc.).
CLI tools on the system: Check what's installed locally. Run which npm node python3 cargo go java rustc docker kubectl gh curl (or equivalent). Only generate metrics that the local toolchain can actually measure.
Project's own tooling: Check package.json scripts, Makefile targets, CI workflows for project-specific commands that can be leveraged for scoring.
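
For the CLI part of the checklist, a Python equivalent of the `which` call could look like the sketch below. The function name is hypothetical; `shutil.which` is the standard-library lookup, and it is injected as a parameter only so the behavior can be exercised without depending on the local machine.

```python
import shutil

# Tool names taken from the checklist above
CANDIDATE_TOOLS = ["npm", "node", "python3", "cargo", "go", "java",
                   "rustc", "docker", "kubectl", "gh", "curl"]


def discover_cli_tools(candidates=CANDIDATE_TOOLS, which=shutil.which):
    """Return the candidates that resolve to an executable on PATH."""
    return [tool for tool in candidates if which(tool) is not None]
```

The result maps directly onto the cli_tools list in the capabilities record.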

Save discovered capabilities to project-profile.json:

{
  "capabilities": {
    "browser": true,
    "mcp_tools": ["playwright", "exa", "tavily"],
    "skills": ["superpowers:writing-plans", "superpowers:executing-plans"],
    "cli_tools": ["npm", "node", "python3", "docker", "kubectl", "gh"],
    "project_scripts": ["test", "lint", "build", "typecheck"],
    "can_deploy": true,
    "can_create_issues": true
  }
}

How capabilities affect metric generation:

No browser tools → no playwright-type metrics (use curl/API instead)
No gh CLI → no GitHub issues (log to markdown instead)
No test runner → no unit_test_proxy metrics
No deploy pipeline → skip deploy step in fix loop, validate locally only
Has web search (Exa/Tavily) → IDEAS mode can research competitors and trends

Use whatever is available. Adapt to whatever is missing. Never fail because a tool doesn't exist — degrade gracefully and note what was skipped and why.

Blind Scoring

When scoring will inform the FIX phase, the scores are recorded but detailed scoring criteria are NOT exposed to fix agents. Fix agents see: {metric_id, name, score, evidence, suggestion} — not weights or formulas. This prevents gaming.

Ratcheting

Scores move monotonically upward. Track incumbent_score per metric (best ever). After a fix, if new_score <= incumbent_score → reject the fix immediately.
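
The ratchet rule reduces to one comparison. In this sketch the function name is hypothetical, and it assumes that a metric with no incumbent yet (null) accepts its first score.

```python
def apply_ratchet(incumbent_score, new_score):
    """Return (accepted, incumbent_after) for a freshly re-scored metric."""
    if incumbent_score is not None and new_score <= incumbent_score:
        return False, incumbent_score  # not strictly better: reject the fix
    return True, new_score  # strictly better: becomes the new incumbent
```

Note that a tie is rejected: a fix that leaves the score unchanged does not pass.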

Aggregation

Category score = mean of all metrics in that category
Overall score = mean of all category scores
Null scores excluded (not measured ≠ zero)
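
A worked sketch of the aggregation rules, with hypothetical function names. It assumes a category whose metrics are all null is itself excluded from the overall mean.

```python
from statistics import mean


def category_score(scores):
    """Mean of the measured metrics in one category; None if none were measured."""
    measured = [s for s in scores if s is not None]
    return mean(measured) if measured else None


def overall_score(categories):
    """Mean of category scores (a mean of means, not a mean over all metrics)."""
    per_category = [category_score(scores) for scores in categories.values()]
    measured = [s for s in per_category if s is not None]
    return mean(measured) if measured else None
```

With `{"AUTH": [10, 6, None], "PERF": [4, None]}` the category scores are 8 and 4, so the overall score is 6. A flat mean over the three measured metrics would give about 6.67, which is why category size does not weight the result.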

Output

Write to docs/perfection-engine/cycle-{N}-scorecard.md:

Overall score with trend (↑↓→)
Per-category scores with delta
Top 10 worst + Top 10 best
Metrics that improved/regressed since last cycle
Feature completeness % (if goal specified)

Update docs/perfection-engine/leaderboard.md with best/failed fixes.

Phase 2.5: VALIDATE (Scoring Quality Assurance)

After scoring, validate the results before acting on them. This prevents fixing based on bad data.

Environment Validation (before scoring begins)

Before scoring any metrics, verify the environment is stable:

If project has URLs → check they're reachable (curl -s -o /dev/null -w "%{http_code}")
If project has a test runner → verify it can execute (npm test --bail or equivalent)
If project has a build step → verify it builds (npm run build or equivalent)
If any check fails → log warning, mark affected metrics as null with null_reason: "environment_unstable", do NOT score them as 0

Scoring Sanity Checks (after scoring completes)

Run these checks on the scored rubric:

Contradiction detection: If metric A says "feature works" (score 10) but metric B says "feature's output is broken" (score 0), flag both for manual review. Detect by finding metric pairs in the same category where scores differ by >7 points and they reference the same file or feature.
Unmeasurable metric detection: If a metric scored null for 2+ consecutive cycles and has never been successfully measured, it's likely unmeasurable. Remove it from the rubric and log: "Removed {id}: never successfully measured after {N} attempts".
Coverage analysis: Map all source files to the metrics that reference them via scope. If any source file has zero metrics → flag as uncovered. If >30% of files are uncovered → generate additional metrics for those files in the next DISCOVER phase.
Score confidence check: For deterministic metrics, add a confidence level:
- Command exited cleanly with parseable output → high
- Command timed out but produced partial output → medium
- Command failed or returned unexpected format → low
Low-confidence deterministic scores are treated like low-confidence LLM scores: excluded from auto-fix targeting and flagged for review.
Flaky metric detection: If a metric's score has oscillated (e.g., 8 → 3 → 9 → 2) across recent cycles, mark it as flaky: true. Flaky metrics get 3 measurements (median taken) instead of 1. If still oscillating → flag for human review.
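
The text does not define "oscillated" precisely, so the sketch below picks one possible reading: a metric is flaky when its history contains at least two direction reversals between swings of 3 or more points. Both thresholds and both function names are assumptions made for illustration.

```python
from statistics import median


def is_flaky(history, min_swing=3, min_reversals=2):
    """Detect oscillation: large swings that keep changing direction."""
    # Keep only cycle-to-cycle changes big enough to count as a swing
    deltas = [b - a for a, b in zip(history, history[1:]) if abs(b - a) >= min_swing]
    # A reversal is an up-swing followed by a down-swing, or vice versa
    reversals = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return reversals >= min_reversals


def measure(flaky, run_once):
    """Score a metric once, or three times with the median taken if it is flaky."""
    runs = 3 if flaky else 1
    return median([run_once() for _ in range(runs)])
```

The example history from the text, 8 → 3 → 9 → 2, has swings of -5, +6, and -7 and therefore two reversals, so it is marked flaky under these thresholds.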

Validation Output

Append validation results to the scorecard:

## Scoring Validation
- Contradictions found: {N} (details below)
- Unmeasurable metrics removed: {N}
- Uncovered source files: {N} of {total} ({%})
- Low-confidence scores: {N} (excluded from auto-fix)
- Flaky metrics: {N} (using 3x median)

Phase 3: FIX

Fix metrics scoring below target where auto_fixable is true.

Fix Ordering

Two modes:

Default (impact-priority): LLM looks at all failing metrics and orders by:

Security/safety issues first (any metric the LLM judges as security-critical)
Broken features (score 0-2)
Highest-impact improvements (biggest score gap × most users affected)
Quick wins last (easy fixes with small impact)

Quick-wins mode (--quick-wins): Sort by effort-adjusted impact:

priority = (target - score) × (1 / estimated_effort)

One-line boolean fixes before complex refactors.
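
As a sketch, the formula can be applied as a sort key. The estimated_effort field is hypothetical: the text does not say how effort is estimated or in what unit, only that larger values mean more work.

```python
def quick_win_priority(metric):
    """priority = (target - score) x (1 / estimated_effort)"""
    return (metric["target"] - metric["score"]) * (1 / metric["estimated_effort"])


def order_quick_wins(metrics):
    """Highest effort-adjusted impact first."""
    return sorted(metrics, key=quick_win_priority, reverse=True)
```

A 2-point gap with effort 1 has priority 2.0 and is ordered ahead of an 8-point gap with effort 8, which has priority 1.0.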

Fix Loop (per metric)

1. Check fix memory: Has this been tried before? What worked on similar metrics? Read fix-history.json; if the same approach failed 2x → skip, create issue
2. Capture baseline: Save before-state evidence:
   - Screenshot (if Playwright metric)
   - Current score + evidence
   - git stash of current working tree (safety net)
3. Plan: Use superpowers:writing-plans if available, otherwise plan inline
4. Implement: Write code, run tests
5. Time gate: If implementation exceeds timeout (default 60s) → abort, log. 3 timeouts in one cycle → pause the entire cycle.
6. Smoke test: Run the project's own test suite (if it exists). If any test FAILS that was previously PASSING → revert immediately, do not proceed to deploy.
7. Deploy: Push to dev branch if CI exists, otherwise validate locally
8. Health check: If URL known, verify health endpoint AND navigate to the affected feature (not just /health) to confirm it works
9. Re-score: Run the specific metric again
10. Evidence capture: Save after-state evidence:
    - Screenshot (if Playwright metric)
    - New score + evidence
    - Diff of changes (git diff HEAD~1)
    - Store as {metric_id}-cycle-{N}-before.png / {metric_id}-cycle-{N}-after.png
11. Ratchet check: new_score must exceed incumbent_score, else reject
12. Simplicity check: If lines_added > 50 AND score_delta < 1 → reject (bloat)
13. Full regression check: Re-score ALL metrics that share any file in the fix's changeset (not just "nearby" ones: check every metric whose scope overlaps with modified files). If any regressed → revert, create combined issue with evidence.
14. Rollback verification: If the fix was reverted in steps 6/11/12/13 (smoke test, ratchet check, simplicity check, or full regression check), verify the revert is clean: re-score the original metric to confirm score matches baseline.
15. Record: Log to fix-history.json: approach, files, lines, scores, evidence paths
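
Two of the gates in this loop are simple enough to sketch directly: the simplicity check and the selection of metrics for the full regression check. Function names are hypothetical, and matching scope entries with `fnmatch` globs is an assumption based on the "src/auth/*.js" example in the rubric schema.

```python
from fnmatch import fnmatch


def passes_simplicity_gate(lines_added, score_delta):
    """Reject bloat: more than 50 added lines for less than 1 point of gain."""
    return not (lines_added > 50 and score_delta < 1)


def metrics_to_recheck(metrics, changed_files):
    """IDs of every metric whose scope overlaps the fix's changeset."""
    return [
        metric["id"]
        for metric in metrics
        if any(
            fnmatch(path, pattern)
            for path in changed_files
            for pattern in metric.get("scope", [])
        )
    ]
```

Note that `fnmatch` lets `*` match across `/`, so a pattern such as `src/*` also covers nested directories.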

After max attempts (default 2) → create GitHub issue with:

All evidence (before/after screenshots, diffs, score history)
Why auto-fix failed (timeout, regression, ratchet rejection, etc.)
Suggested manual approach

Fix Memory

fix-history.json records every attempt:

{
  "metric_id": "AUTH-003",
  "approach": "Added CSRF middleware to Express",
  "files_changed": ["backend/middleware/csrf.js"],
  "lines_added": 25, "lines_removed": 0,
  "score_before": 2, "score_after": 9,
  "success": true, "cycle": 3
}

Before fixing, check: "What worked on similar metrics in past cycles?" This makes the engine smarter over time.

Crash Recovery

If any step fails with an exception:

Catch it, log to state
Score metric as null with null_reason: "fix_error: {message}"
Continue to next metric (NEVER block the loop)
After 5+ consecutive errors → pause, alert user
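
The recovery rules can be sketched as a loop that never lets one failure block the rest. The function name and the `fix_one` callable are hypothetical stand-ins for the per-metric fix loop.

```python
def run_fixes(metrics, fix_one, max_consecutive_errors=5):
    """Attempt a fix per metric; return "paused" or "completed"."""
    consecutive_errors = 0
    for metric in metrics:
        try:
            fix_one(metric)
            consecutive_errors = 0  # any success resets the streak
        except Exception as exc:
            # Record the failure as "not measured", never as a zero
            metric["score"] = None
            metric["null_reason"] = f"fix_error: {exc}"
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                return "paused"  # caller alerts the user
    return "completed"
```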

Dry-Run Mode

/perfection-engine fix --dry-run: Generate fix plans without modifying code. Output a report showing what would change, which files, estimated risk.

Phase 4: LOOP

GitHub Issue Management

Requires: A git repository with a GitHub remote. If no GitHub remote is detected, skip issue management entirely and log unfixable metrics to docs/perfection-engine/unfixable-metrics.md instead.

Create milestone: "Perfection Engine Cycle #{N} — {project_name}"
Create parent issues per category with score table
Create child issues per failing metric with evidence + suggested fix
Team assignment: If configured in .perfection-engine.yml
Close resolved issues from previous cycles
Rate limit handling: If GitHub API returns 429, wait and retry (max 3 retries per cycle)
Update leaderboard: Best fixes, failed fixes, patterns
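
The rate-limit rule could be sketched as a retry budget shared across a cycle. Everything here is illustrative: the class and function names are hypothetical, `request` stands in for whatever actually calls GitHub, and the fixed 60-second wait is an assumption because the text does not say how long to wait.

```python
import time


class RetryBudget:
    """Retries remaining for the current cycle (3 per the rule above)."""

    def __init__(self, max_retries=3):
        self.remaining = max_retries


def call_with_retries(request, budget, delay_seconds=60.0, sleep=time.sleep):
    """Call request() -> (status, body), retrying on 429 while budget remains."""
    while True:
        status, body = request()
        if status != 429 or budget.remaining == 0:
            return status, body
        budget.remaining -= 1
        sleep(delay_seconds)
```

Once the budget is spent, a 429 is returned to the caller, which can then fall back to logging the issue locally.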

Leaderboard

After each cycle, update docs/perfection-engine/leaderboard.md:

Best fixes (highest score improvement, what approach, which files)
Failed fixes (what didn't work, why)
Patterns (which types of fixes succeed most)
Fix agents read this in subsequent cycles to learn

Completion Check

COMPLETE when ALL of:

All auto-fixable metrics at their target (10 or custom)
Rubric has stabilized (LLM generates no new metrics for 2 cycles)
All non-fixable metrics have GitHub issues

NOT COMPLETE → compact context, commit state, start next cycle immediately.
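
A sketch of the completion check under stated assumptions: the issue_url field is hypothetical, an unmeasured (null) metric counts as not at target, and a non-fixable metric that is already at target is assumed not to need an issue.

```python
def is_complete(metrics, cycles_without_new_metrics):
    """True when all three completion conditions hold."""
    for metric in metrics:
        score = metric["score"]
        if score is not None and score >= metric["target"]:
            continue  # at target: nothing left to do for this metric
        if metric["auto_fixable"]:
            return False  # a fixable metric is still below target
        if not metric.get("issue_url"):
            return False  # a non-fixable failure has no GitHub issue yet
    # The rubric counts as stabilized after 2 cycles with no new metrics
    return cycles_without_new_metrics >= 2
```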

Compact context means: discard raw tool outputs and file contents from memory, retain only: metric scores, category scores, fix history summary, current cycle number, and the rubric JSON. This frees context window for the next cycle's DISCOVER phase.

Auto-Loop

The engine does NOT pause between cycles. It:

Commits state to git
Compacts context
Returns to Phase 1 (DISCOVER)
Re-reads the project (catches changes from fixes just applied)
Refreshes the rubric
Scores again
Fixes again
Repeats until done or safety cap hit

Special Commands

`/perfection-engine report`

Display the latest cycle scorecard. If none exists, suggest running score first.

`/perfection-engine reset`

Delete .claude/skills/perfection-engine-state/ and docs/perfection-engine/. Asks for confirmation. Next run starts completely fresh.

`/perfection-engine portfolio`

Scan the parent directory of the current project for sibling directories containing .claude/skills/perfection-engine-state/perfection-rubric.json, then generate a cross-project dashboard. Trend is the score delta from the previous cycle, shown as an arrow. If a project has no completed cycles, show --. Dashboard layout:

| Project | Score | Trend | Weakest | Strongest |
|---------|-------|-------|---------|-----------|

`/perfection-engine export`

Export all scores to docs/perfection-engine/scores-export.csv (columns: metric_id, category, name, score, target, type, auto_fixable, cycle) and .json.
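
The CSV half of the export could be written as below. This is a sketch with a hypothetical function name; only the column order comes from the text, and rows are assumed to be dictionaries keyed by those column names.

```python
import csv
import io

COLUMNS = ["metric_id", "category", "name", "score",
           "target", "type", "auto_fixable", "cycle"]


def export_csv(rows, out):
    """Write one header line plus one line per metric row to a text stream."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Usage: render to an in-memory buffer instead of a file
buffer = io.StringIO()
export_csv(
    [{"metric_id": "AUTH-003", "category": "AUTH", "name": "CSRF", "score": None,
      "target": 10, "type": "deterministic", "auto_fixable": True, "cycle": 3}],
    buffer,
)
```

An unmeasured score (None) is written as an empty field, which keeps "not measured" distinct from 0 in the export.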

IDEAS Mode — Discovery, Review, and Improvement

The ideas mode goes beyond quality — it discovers what the project COULD be, not just whether what exists is good. It uses parallel agents to research, analyze, and propose improvements, new features, architectural changes, and strategic opportunities.

`/perfection-engine ideas`

Runs a multi-agent review of the project and generates an improvement report.

Phase 1: UNDERSTAND — Read the entire project (same as DISCOVER)

Build project profile
Understand architecture, features, tech stack, domain

Phase 2: RESEARCH — Launch parallel agents to explore opportunities across 6 dimensions:

| Agent | What It Explores |
|-------|-----------------|
| Feature gaps | What features do similar projects have that this one doesn't? What would users expect? What's the next logical feature? |
| Architecture | Could the codebase be restructured for better maintainability? Are there patterns that should be refactored? Is the tech stack optimal? |
| Performance | What would make this significantly faster? CDN, caching, database optimization, lazy loading, edge computing? |
| User experience | What flows are confusing? What would make users happier? What's the onboarding experience like? |
| Developer experience | What would make contributing easier? Better docs, tests, CI, tooling, error messages? |
| Strategic | What's the competitive landscape? What trends in this domain should the project adopt? What's the biggest risk if nothing changes? |

Each agent reads the codebase, uses web search for research when helpful, and returns structured findings.

Phase 3: SYNTHESIZE — Combine all agent findings into a prioritized report:

# Improvement Report — {project_name}

## High-Impact Ideas (do these first)
| # | Idea | Category | Effort | Impact | Description |
|---|------|----------|--------|--------|-------------|
| 1 | Add real-time collaboration | Feature | Large | High | Users expect multiplayer editing... |
| 2 | Split monolithic router | Architecture | Medium | High | 17K-line file is unmaintainable... |

## Medium-Impact Ideas
...

## Exploratory Ideas (research further)
...

## Anti-Patterns Found
| Pattern | Where | Why It's Risky | Suggested Fix |
|---------|-------|---------------|---------------|

Save to docs/perfection-engine/ideas-{date}.md.

Phase 4: ISSUE CREATION — For each idea, create a GitHub issue:

Title: "💡 {idea_name}"
Labels: perfection-engine, idea, {category}, {effort}
Body: Description, rationale, suggested approach, estimated effort, impact assessment

`/perfection-engine ideas --implement`

Same as ideas but after generating the report, automatically implements the top 3 highest-impact ideas using the fix loop (plan → implement → deploy → validate).

Only implements ideas tagged as effort: small or effort: medium by default. Large-effort ideas always create issues instead of auto-implementing.

Safety: Ideas mode never deletes existing features or makes breaking changes. It only ADDS new capabilities or IMPROVES existing ones. If an idea requires removing code, it creates an issue instead of auto-implementing.

Ideas + Quality Loop Combined

You can run both together:

/perfection-engine --ideas    # Full loop + ideas discovery each cycle

This adds an IDEAS phase, giving the loop five phases:

DISCOVER → SCORE → FIX → IDEAS → LOOP

Each cycle: fix quality issues AND discover new improvements. The ideas feed into the next cycle's rubric — new features that get implemented become new metrics to score for quality.

Ideas State

Ideas are tracked in the per-project state:

{project}/.claude/skills/perfection-engine-state/
├── ideas-history.json     # All ideas generated across cycles
├── ideas-implemented.json # Ideas that were auto-implemented
└── ...existing state files

`/perfection-engine score --changed-only`

Incremental scoring via git diff — only metrics affected by recent changes.

`/perfection-engine --goal "description"`

Goal-driven mode: generates completeness metrics in addition to quality metrics. Measures both "does it exist?" and "is it good?"

Configuration

Optional .perfection-engine.yml in project root:

# All fields optional — engine auto-discovers everything
goal: "Build a weight loss telehealth platform"  # Adds completeness metrics

scoring:
  max_metrics: 300      # Default 300, hard cap 500
  batch_size: 20        # Metrics per scoring batch

fix:
  max_attempts: 2       # Per metric per cycle
  timeout_seconds: 60   # Per fix attempt
  strategy: "impact"    # impact (default) or quick-wins
  ratcheting: true      # Scores only go up
  simplicity_gate: true # Reject bloat fixes

github:
  max_issues_per_cycle: 50
  team_assignments:     # Optional
    security: "@security-team"
    default: "@tech-lead"

safety:
  max_cycles: 20
  cost_threshold: 50    # USD per cycle, pause for confirmation
  deploy_branch: "dev"  # Auto-detected if not specified

exclude:                # Skip specific things
  files: ["vendor/", "dist/", "node_modules/"]
  metrics: ["PERF-003"] # Skip specific metric IDs

State Files (per-project, git-committed)

{project}/.claude/skills/perfection-engine-state/
├── project-profile.json      # What the engine understood about the project
├── perfection-rubric.json    # Living rubric (regenerated each cycle)
├── cycle-history.json        # Score history across all cycles
├── current-cycle.json        # Lock + progress
├── fix-history.json          # What fixes were tried, what worked/failed
└── leaderboard-data.json     # Structured data for leaderboard.md

If no git repo → state files still written, just not committed.

Continuous Loop (Stop Hook)

The engine uses a Stop hook to keep looping without human intervention. When Claude finishes responding, the hook checks if a cycle is in-progress:

In-progress → hook blocks Claude from stopping, injects a continuation prompt
Completed/paused/errored → hook allows Claude to stop normally
Stale (>2 hours) → hook allows stop (treats as abandoned)
Max cycles (20) → hook allows stop

The hook is registered in hooks/hooks.json and fires on both Stop and SubagentStop events. It reads current-cycle.json to determine state.

How to stop the engine manually: Set current-cycle.json status to "paused" or "completed", or run /perfection-engine reset. The hook will then allow Claude to stop.

No hook installed? The engine still works — it just won't auto-continue between cycles. You'd need to manually say "continue" after each cycle completes.

Safety Rails

| Rail | Default |
|------|---------|
| Max fix attempts per metric per cycle | 2 |
| Max GitHub issues per cycle | 50 |
| Max total cycles | 20 |
| Fix timeout | 60s |
| Cost threshold | $50/cycle → pause for confirmation |
| Max metrics | 300 (hard cap 500) |
| Deployment target | Dev branch only |
| Simplicity gate | Reject >50 lines for <1 point |
| Ratcheting | Scores only go up |
| Consecutive errors | 5 → pause |
| Concurrency lock | 2-hour staleness window |
| Bootstrap guard | 0 files → exit; <5 files → simplified rubric |

Integration

| Skill | Usage | Required? |
|-------|-------|-----------|
| superpowers:writing-plans | Create fix plans | Optional (fallback: inline) |
| superpowers:executing-plans | Implement fixes | Optional (fallback: inline) |
| Playwright MCP | Browser testing | Only when URLs available |

File Structure

perfection-engine/
├── .claude-plugin/
│   └── plugin.json          # Plugin manifest
└── skills/
    └── perfection-engine/
        ├── SKILL.md          # Main skill documentation
        └── references/
            ├── config-schema.md    # .perfection-engine.yml schema
            ├── fix-process.md      # Fix workflow & crash recovery
            └── scoring-methods.md  # Scoring tools & protocols

Reference Files

| File | Purpose |
|------|---------|
| references/scoring-methods.md | How each scoring method works (tools, not content) |
| references/fix-process.md | Fix workflow, memory, crash recovery, simplicity |
| references/config-schema.md | .perfection-engine.yml full schema |
