Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

ShaheerKhawaja/auto-optimize

Name: auto-optimize
Author: ShaheerKhawaja

skills/auto-optimize/SKILL.md

npx skillsauth add ShaheerKhawaja/ProductionOS auto-optimize


auto-optimize

Self-improving agent optimization — generates challenger variants of any agent/command, benchmarks against baseline, promotes winners, logs learnings to instincts. Inspired by Karpathy's autoresearch pattern.

Inputs

| Parameter | Values | Default | Description |
|-----------|--------|---------|-------------|
| target | string | required | Agent or command to optimize (e.g., 'code-reviewer', 'security-hardener', '/production-upgrade') |
| challengers | string | 3 | Number of challenger variants to generate |
| benchmark | string | self-eval | Benchmark to evaluate against: 'self-eval', 'test-suite', 'llm-judge', or a path to a custom benchmark |
| hypothesis | string | -- | Specific hypothesis to test (e.g., 'add chain-of-thought to security-hardener'). If omitted, hypotheses are auto-generated. |
| max_cost | string | 5 | Maximum cost in USD for the optimization run |
| mode | string | prompt | Optimization mode: prompt (modify agent instructions), model (test different models), layers (test prompt composition layers), or params (test convergence parameters) |
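As an illustration of how the inputs above might be normalized before a run, here is a minimal Python sketch. The `OptimizeInputs` class and `parse_inputs` function are hypothetical names, not part of ProductionOS; the defaults and the four accepted modes are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional

# The four modes listed in the Inputs table.
VALID_MODES = {"prompt", "model", "layers", "params"}


@dataclass
class OptimizeInputs:
    """Hypothetical container for the skill's arguments."""
    target: str
    challengers: int = 3
    benchmark: str = "self-eval"
    hypothesis: Optional[str] = None
    max_cost: float = 5.0
    mode: str = "prompt"


def parse_inputs(raw: dict) -> OptimizeInputs:
    """Apply the table's defaults and coerce string values to numbers."""
    if not raw.get("target"):
        raise ValueError("target is required")
    mode = raw.get("mode", "prompt")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    return OptimizeInputs(
        target=raw["target"],
        challengers=int(raw.get("challengers", 3)),
        benchmark=raw.get("benchmark", "self-eval"),
        hypothesis=raw.get("hypothesis") or None,
        max_cost=float(raw.get("max_cost", 5)),
        mode=mode,
    )
```

Because every parameter arrives as a string, the numeric ones are coerced once here, so later phases can compare costs and counts without repeated parsing.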

Auto-Optimize — Self-Improving Agent Loop

You are the Auto-Optimize orchestrator. You implement Karpathy's autoresearch pattern for ProductionOS: generate challenger variants, benchmark against baseline, promote winners, harvest learnings.

The compound moat: Every optimization run makes ProductionOS measurably better. Run #10 benefits from all learnings of runs #1-9.

Step 0: Preamble

Before executing, run the shared ProductionOS preamble (templates/PREAMBLE.md).

Phase 1: Baseline Capture

1.1: Read Target Definition

# For agents:
cat agents/$ARGUMENTS.target.md

# For commands:
cat .claude/commands/$ARGUMENTS.target.md

1.2: Extract Current Metrics

Read existing performance data if available:

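# Last 20 usage records that mention the target (no error if the log is missing)
grep "$ARGUMENTS.target" ~/.productionos/analytics/skill-usage.jsonl 2>/dev/null | tail -20
# Lessons recorded for the target across all projects
cat ~/.productionos/instincts/project/*/lessons.json 2>/dev/null | grep "$ARGUMENTS.target"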
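When the matching records need to be processed rather than just displayed, the same filter can be sketched in Python. This assumes each line of skill-usage.jsonl is one JSON object, which the source does not confirm; `recent_records` is a hypothetical helper.

```python
import json
from typing import Iterable


def recent_records(lines: Iterable[str], target: str, limit: int = 20) -> list[dict]:
    """Return the last `limit` JSON records whose raw line mentions `target`."""
    matches = []
    for line in lines:
        if target not in line:  # plain substring match, like grep with a literal name
            continue
        try:
            matches.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the run
    return matches[-limit:]
```

Pass it an open file handle, or any iterable of lines, together with the target name.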

1.3: Record Baseline

Run the target against the benchmark to establish baseline:

BASELINE:
  target: $ARGUMENTS.target
  benchmark: $ARGUMENTS.benchmark
  timestamp: {ISO8601}
  metrics:
    score: {0-10 from self-eval or test pass rate or LLM-judge}
    tokens: {token count for the run}
    duration: {seconds}
    issues_found: {count, for auditors}
    false_positives: {count}
  prompt_length: {word count of instructions}
  model: {current model assignment}
  layers: {which prompt composition layers are active}

Write baseline to .productionos/AUTO-OPTIMIZE-BASELINE.md.
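A minimal sketch of assembling the baseline record in Python, assuming the metrics have already been collected by the benchmark run. `build_baseline` is a hypothetical helper; only the fields that can be derived mechanically (the ISO 8601 timestamp and the word count) are computed here.

```python
from datetime import datetime, timezone


def build_baseline(target, benchmark, instructions, score, tokens, duration_s):
    """Assemble a baseline record shaped like the template above."""
    return {
        "target": target,
        "benchmark": benchmark,
        # ISO 8601 timestamp in UTC, e.g. 2026-04-23T02:11:00+00:00
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "metrics": {"score": score, "tokens": tokens, "duration": duration_s},
        # prompt_length is defined above as the word count of the instructions
        "prompt_length": len(instructions.split()),
    }
```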

Phase 2: Hypothesis Generation

If $ARGUMENTS.hypothesis is provided:

Use the user's hypothesis directly. Create $ARGUMENTS.challengers variants that test this hypothesis.

If no hypothesis:

Read the prompt-optimizer agent definition from agents/prompt-optimizer.md and dispatch it to generate hypotheses. If the target is prompt-heavy or rubric-heavy, also dispatch textgrad-optimizer to propose gradient-style wording improvements before challengers are generated.

The prompt-optimizer should analyze:

- The target's current instructions (strengths, weaknesses)
- Recent metaclaw-learner lessons about this target
- The benchmark it will be evaluated against
- Prompt engineering research patterns (from templates/PROMPT-COMPOSITION.md)

Generate $ARGUMENTS.challengers distinct hypotheses, each with:

{
  "id": "challenger-{N}",
  "hypothesis": "{what change we're testing}",
  "change_type": "prompt|model|layers|params",
  "expected_improvement": "{what metric should improve and by how much}",
  "risk": "{what could get worse}",
  "modification": "{specific text changes to apply}"
}

Write hypotheses to .productionos/AUTO-OPTIMIZE-HYPOTHESES.md.
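Before challengers are generated, it may help to sanity-check the hypothesis batch. The sketch below is one possible check, not part of ProductionOS: `validate_hypotheses` is a hypothetical helper that verifies the keys and `change_type` values shown in the JSON template and flags repeated hypotheses, since the batch is meant to be distinct.

```python
REQUIRED_KEYS = {
    "id", "hypothesis", "change_type",
    "expected_improvement", "risk", "modification",
}
CHANGE_TYPES = {"prompt", "model", "layers", "params"}


def validate_hypotheses(hypotheses: list[dict], expected_count: int) -> list[str]:
    """Return a list of problems; an empty list means the batch is usable."""
    problems = []
    if len(hypotheses) != expected_count:
        problems.append(f"expected {expected_count} hypotheses, got {len(hypotheses)}")
    seen = set()
    for h in hypotheses:
        hid = h.get("id", "<missing id>")
        missing = REQUIRED_KEYS - set(h)
        if missing:
            problems.append(f"{hid}: missing {sorted(missing)}")
        if h.get("change_type") not in CHANGE_TYPES:
            problems.append(f"{hid}: invalid change_type {h.get('change_type')!r}")
        text = h.get("hypothesis", "").strip().lower()
        if text in seen:
            problems.append(f"{hid}: duplicate hypothesis")
        seen.add(text)
    return problems
```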

Phase 3: Challenger Generation

For each hypothesis, create a modified version of the target:

Mode: prompt (default)

1. Copy the target agent/command definition
2. Apply the hypothesis modification to the instructions
3. Keep all other fields (model, tools, stakes) identical
4. Write to .productionos/challengers/challenger-{N}.md

Mode: model

- Same instructions, different model assignment
- Test combinations: opus (planning), sonnet (execution), haiku (validation)

Mode: layers

- Same agent, different prompt composition layers
- Test with/without: Emotion, Meta, ToT, GoT, CoD, Distractor, Generated Knowledge

Mode: params

- Same agent, different convergence parameters
- Test: EMA alpha (0.1-0.5), convergence threshold (0.01-0.1), max iterations (3-10)
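To make the three convergence parameters concrete, here is a small Python sketch of how they could interact. The source does not define the convergence rule, so this is an assumption: scores are smoothed with an exponential moving average, EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1}, and the loop stops when the EMA moves by less than the threshold or when max iterations is reached. `run_until_converged` is a hypothetical name.

```python
def run_until_converged(scores, alpha=0.3, threshold=0.05, max_iterations=10):
    """Return (iterations_used, final_ema) for a stream of per-iteration scores."""
    ema = None
    used = 0
    for score in scores:
        if used >= max_iterations:
            break
        used += 1
        if ema is None:
            ema = score  # the first observation seeds the average
            continue
        new_ema = alpha * score + (1 - alpha) * ema
        converged = abs(new_ema - ema) < threshold
        ema = new_ema
        if converged:
            break
    return used, ema
```

Under this rule a higher alpha reacts faster to new scores, while a smaller threshold demands more stability before stopping, which is why the two are worth varying together.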

Mode: rubric

1. Same evaluation dimensions, different scoring rubric
2. Dispatch rubric-evolver agent (OPRO pattern)
3. Generate 5 rubric variants with different anchor points, weights, and criteria
4. Score each variant against calibration set in .productionos/calibration/
5. Promote the variant with highest ground-truth correlation

See templates/calibration-set.md for calibration sample format
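The promotion rule for rubric variants can be sketched as follows. The source says only "ground-truth correlation", so using Pearson correlation is an assumption, and `pearson` and `best_rubric` are hypothetical helpers. Each variant is represented by the scores it assigned to the calibration samples, in the same order as the ground-truth scores.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_rubric(variant_scores, ground_truth):
    """Return the name of the variant whose scores track ground truth most closely."""
    return max(variant_scores, key=lambda name: pearson(variant_scores[name], ground_truth))
```

A rank correlation such as Spearman's would be a reasonable alternative if only the ordering of calibration samples matters.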

Phase 4: Benchmark Execution

Run baseline and all challengers against the same benchmark. The benchmark MUST be identical for fair comparison.

Benchmark: self-eval

For each variant:

1. Dispatch the agent with a fixed test task
2. Run /self-eval on the output
3. Record the 7-question scores + overall grade

Benchmark: test-suite

For each variant:

1. Dispatch the agent against the codebase
2. Run bun test after the agent completes
3. Record pass rate, new test failures, coverage delta
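The three recorded quantities can be derived from two test runs as in the sketch below. It assumes the failing test names and coverage percentages have already been extracted from the test runner's output (parsing that output is not attempted here); `suite_delta` is a hypothetical helper.

```python
def suite_delta(baseline_failures, variant_failures, total_tests,
                baseline_cov=None, variant_cov=None):
    """Summarize a variant's test run relative to the baseline run."""
    failed = set(variant_failures)
    summary = {
        "pass_rate": (total_tests - len(failed)) / total_tests,
        # failures that the baseline run did not have
        "new_failures": sorted(failed - set(baseline_failures)),
    }
    if baseline_cov is not None and variant_cov is not None:
        summary["coverage_delta"] = variant_cov - baseline_cov
    return summary
```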

Benchmark: llm-judge

For each variant:

1. Dispatch the agent with a fixed test task
2. Submit output to llm-judge agent for blind evaluation
3. Record dimension scores, confidence intervals
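One way to obtain a confidence interval for a dimension score is to have the judge score the same output several times and apply a normal approximation. The source does not say how the intervals are produced, so this is an assumption, and `mean_with_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, low, high); z=1.96 gives an approximate 95% interval."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width
```

With only a handful of judge runs the normal approximation is rough, so a t-distribution multiplier would be the more careful choice.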

Execution Protocol

FOR variant IN [baseline, challenger-1, ..., challenger-N]:
  1. Reset to clean state (git stash or worktree isolation)
  2. Apply variant's modifications (if challenger)
  3. Run the target against the benchmark task
  4. Collect metrics: score, tokens, duration, output quality
  5. Revert changes
  6. Record results in .productionos/AUTO-OPTIMIZE-RESULTS.md

Cost tracking: Before each variant run, check accumulated cost against $ARGUMENTS.max_cost. Halt if exceeded.
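The execution protocol and the cost check can be sketched together in Python. Here `run_one` is a hypothetical callable standing in for steps 1-5 (reset, apply, run, collect, revert), and it is assumed to return a dict containing a `cost_usd` field; neither name comes from ProductionOS.

```python
from typing import Callable


def run_variants(variants: list[str],
                 run_one: Callable[[str], dict],
                 max_cost: float) -> tuple[list[dict], bool]:
    """Run variants in order; return (results, halted_early)."""
    results = []
    spent = 0.0
    for name in variants:
        # Check accumulated cost before each variant run; halt if exceeded.
        if spent > max_cost:
            return results, True
        outcome = run_one(name)
        spent += outcome["cost_usd"]
        results.append({"variant": name, **outcome})
    return results, False
```

Because the check happens before a run rather than during it, a single expensive variant can still overshoot the budget; the guard only prevents starting further runs.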

Phase 5: Harvest

5.1: Compare Results

RESULTS TABLE:
| Variant | Score | Tokens | Duration | Delta vs Baseline | p-value |
|---------|-------|--------|----------|-------------------|---------|
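The source does not say how the p-value column is computed. One option, sketched here as an assumption, is an exact permutation test over repeated trial scores for the baseline and a challenger; `permutation_p_value` is a hypothetical helper and needs several trials per variant to be meaningful.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(baseline, challenger):
    """Exact one-sided p-value for mean(challenger) > mean(baseline)."""
    observed = mean(challenger) - mean(baseline)
    pooled = list(baseline) + list(challenger)
    k = len(challenger)
    indices = range(len(pooled))
    at_least = total = 0
    # Enumerate every way of relabeling k pooled scores as "challenger".
    for chosen in combinations(indices, k):
        chosen_set = set(chosen)
        group = [pooled[i] for i in chosen]
        rest = [pooled[i] for i in indices if i not in chosen_set]
        if mean(group) - mean(rest) >= observed - 1e-12:
            at_least += 1
        total += 1
    return at_least / total
```

The "Delta vs Baseline" column is then `mean(challenger) - mean(baseline)`. Exhaustive enumeration grows quickly with the number of trials, so a random sample of relabelings is the usual substitute for larger runs.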

## Error Handling

| Scenario | Action |
|----------|--------|
| No target provided | Ask for clarification with examples |
| Target not found | Search for alternatives, suggest closest match |
| Agent dispatch fails | Fall back to manual execution, report the error |
| Ambiguous input | Present options, ask user to pick |
| Execution timeout | Save partial results, report what completed |

## Guardrails

1. Do not silently change scope or expand beyond the user request.
2. Prefer concrete outputs and verification over abstract descriptions.
3. Keep scope faithful to the user intent.
4. Preserve existing workflow guardrails and stop conditions.
5. Verify results before concluding. Run self-eval on output quality.


7 stars

data-ai

Updated Apr 23, 2026


npx skillsauth add ShaheerKhawaja/ProductionOS auto-optimize

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---------|-------------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 23, 2026, 2:11 AM (14.8s, 1 file scanned)

SKILL.md

name: auto-optimize
description: Self-improving agent optimization: generates challenger variants of any agent/command, benchmarks against baseline, promotes winners, logs learnings to instincts. Inspired by Karpathy's autoresearch pattern.
argument-hint: [repo path, target, or task context]




Install manually


# Clone the repo
git clone https://github.com/ShaheerKhawaja/ProductionOS.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r ProductionOS/skills/auto-optimize ~/.claude/skills/

For the official skills path, see the Claude Code Skills documentation.

Repository

ShaheerKhawaja/ProductionOS

7 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT