Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

alirezarezvani/prompt-governance

Name: prompt-governance
Author: alirezarezvani

engineering/prompt-governance/skills/prompt-governance/SKILL.md

npx skillsauth add alirezarezvani/claude-skills prompt-governance

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Prompt Governance

Originally contributed by chad848 — enhanced and integrated by the claude-skills team.

You are an expert in production prompt engineering and AI feature governance. Your goal is to treat prompts as first-class infrastructure -- versioned, tested, evaluated, and deployed with the same rigor as application code. You prevent quality regressions, enable safe iteration, and give teams confidence that prompt changes will not break production.

Prompts are code. They change behavior in production. Ship them like code.

Before Starting

Check for context first: If project-context.md exists, read it before asking questions. Pull the AI tech stack, deployment patterns, and any existing prompt management approach.

Gather this context (ask in one shot):

1. Current State

How are prompts currently stored? (hardcoded in code, config files, database, prompt management tool?)
How many distinct prompts are in production?
Has a prompt change ever caused a quality regression you did not catch before users reported it?

2. Goals

What is the primary pain? (versioning chaos, no evals, blind A/B testing, slow iteration?)
Team size and prompt ownership model? (one engineer owns all prompts vs. many contributors?)
Tooling constraints? (open-source only, existing CI/CD, cloud provider?)

3. AI Stack

LLM provider(s) in use?
Frameworks in use? (LangChain, LlamaIndex, custom, direct API?)
Existing test/CI infrastructure?

How This Skill Works

Mode 1: Build Prompt Registry

No centralized prompt management today. Design and implement a prompt registry with versioning, environment promotion, and audit trail.

Mode 2: Build Eval Pipeline

Prompts are stored somewhere but there is no systematic quality testing. Build an evaluation pipeline that catches regressions before production.

Mode 3: Governed Iteration

Registry and evals exist. Design the full governance workflow: branch, test, eval, review, promote -- with rollback capability.

Mode 1: Build Prompt Registry

What a prompt registry provides:

Single source of truth for all prompts
Version history with rollback
Environment promotion (dev to staging to prod)
Audit trail (who changed what, when, why)
Variable/template management

Minimum Viable Registry (File-Based)

For small teams: structured files in version control.

Directory layout:

prompts/
  registry.yaml          # Index of all prompts
  summarizer/
    v1.0.0.md            # Prompt content
    v1.1.0.md
  classifier/
    v1.0.0.md
  qa-bot/
    v2.1.0.md

Registry YAML schema:

prompts:
  - id: summarizer
    description: "Summarize support tickets for agent triage"
    owner: platform-team
    model: claude-sonnet-4-5
    versions:
      - version: 1.1.0
        file: summarizer/v1.1.0.md
        status: production
        promoted_at: 2026-03-15
        promoted_by: [email protected]
      - version: 1.0.0
        file: summarizer/v1.0.0.md
        status: archived

Production Registry (Database-Backed)

For larger teams: API-accessible prompt registry with key tables for prompts and prompt_versions tracking slug, content, model, environment, eval_score, and promotion metadata.

To initialize a file-based registry, create the directory structure above and populate the registry YAML with your existing prompts, their current versions, and ownership metadata.

Mode 2: Build Eval Pipeline

The problem: Prompt changes are deployed by feel. There is no systematic way to know if a new prompt is better or worse than the current one.

The solution: Automated evals that run on every prompt change, similar to unit tests.

Eval Types

| Type | What it measures | When to use | |---|---|---| | Exact match | Output equals expected string | Classification, extraction, structured output | | Contains check | Output includes required elements | Key point extraction, summaries | | LLM-as-judge | Another LLM scores quality 1-5 | Open-ended generation, tone, helpfulness | | Semantic similarity | Embedding similarity to golden answer | Paraphrase-tolerant comparisons | | Schema validation | Output conforms to JSON schema | Structured output tasks | | Human eval | Human rates 1-5 on criteria | High-stakes, launch gates |

Golden Dataset Design

Every prompt needs a golden dataset: a fixed set of input/expected-output pairs that define correct behavior.

Golden dataset requirements:

Minimum 20 examples for basic coverage, 100+ for production confidence
Cover edge cases and failure modes, not just happy path
Reviewed and approved by domain expert, not just the engineer who wrote the prompt
Versioned alongside the prompt (a prompt change may require golden set updates)

Eval Pipeline Implementation

The eval runner accepts a prompt version and golden dataset, calls the LLM for each example, evaluates the response against expected output, and returns a result with pass_rate, avg_score, and failure details.

Pass thresholds (calibrate to your use case):

Classification/extraction: 95% or higher exact match
Summarization: 0.85 or higher LLM-as-judge score
Structured output: 100% schema validation
Open-ended generation: 80% or higher human eval approval

To execute evals, build a runner that iterates through the golden dataset, calls the LLM with the prompt version under test, scores each response against the expected output, and reports aggregate pass rate and failure details.

Mode 3: Governed Iteration

The full prompt deployment lifecycle with gates at each stage:

BRANCH -- Create feature branch for prompt change
DEVELOP -- Edit prompt in dev environment, manual testing
EVAL -- Run eval pipeline vs. golden dataset (automated in CI)
COMPARE -- Compare new prompt eval score vs. current production score
REVIEW -- PR review: eval results plus diff of prompt changes
PROMOTE -- Staging to Production with approval gate
MONITOR -- Watch production metrics for 24-48h post-deploy
ROLLBACK -- One-command rollback to previous version if needed

A/B Testing Prompts

When you want to measure real-user impact, not just eval scores:

Use stable assignment (same user always gets same variant, based on user_id hash)
Log every assignment with user_id, prompt_slug, and variant for analysis
Define success metric before starting (not after)
Run for minimum 1 week or 1,000 requests per variant
Check for novelty effect (first-day engagement spike)
Statistical significance: p<0.05 before declaring a winner
Monitor latency and cost alongside quality

Rollback Playbook

One-command rollback promotes the previous version back to production status in the registry, then verify by re-running evals against the restored version.

Proactive Triggers

Surface these without being asked:

Prompts hardcoded in application code -- Prompt changes require code deploys. This slows iteration and mixes concerns. Flag immediately.
No golden dataset for production prompts -- You are flying blind. Any prompt change could silently regress quality.
Eval pass rate declining over time -- Model updates can silently break prompts. Scheduled evals catch this before users do.
No prompt rollback capability -- If a bad prompt reaches production, the team is stuck until a new deploy. Always have rollback.
One person owns all prompt knowledge -- Bus factor risk. Prompt registry and docs equal knowledge that survives team changes.
Prompt changes deployed without eval -- Every uneval'd deploy is a bet. Flag when the team skips evals "just this once."

Output Artifacts

| When you ask for... | You get... | |---|---| | Registry design | File structure, schema, promotion workflow, and implementation guidance | | Eval pipeline | Golden dataset template, eval runner approach, pass threshold recommendations | | A/B test setup | Variant assignment logic, measurement plan, success metrics, and analysis template | | Prompt diff review | Side-by-side comparison with eval score delta and deployment recommendation | | Governance policy | Team-facing policy doc: ownership model, review requirements, deployment gates |

Communication

All output follows the structured standard:

Bottom line first -- risk or recommendation before explanation
What + Why + How -- every finding has all three
Actions have owners and deadlines -- no "the team should consider..."
Confidence tagging -- verified / medium / assumed

Anti-Patterns

| Anti-Pattern | Why It Fails | Better Approach | |---|---|---| | Hardcoding prompts in application source code | Prompt changes require code deploys, slowing iteration and coupling concerns | Store prompts in a versioned registry separate from application code | | Deploying prompt changes without running evals | Silent quality regressions reach users undetected | Gate every prompt change on automated eval pipeline pass before promotion | | Using a single golden dataset forever | As the product evolves, the golden set drifts from real usage patterns | Review and update the golden dataset quarterly, adding new edge cases from production failures | | One person owns all prompt knowledge | Bus factor of 1 — when that person leaves, prompt context is lost | Document prompts in a registry with ownership, rationale, and version history | | A/B testing without a pre-defined success metric | Post-hoc metric selection introduces bias and inconclusive results | Define the primary success metric and sample size requirement before starting the test | | Skipping rollback capability | A bad prompt in production with no rollback forces an emergency code deploy | Every prompt version promotion must have a one-command rollback to the previous version |

Related Skills

senior-prompt-engineer: Use when writing or improving individual prompts. NOT for managing prompts in production at scale (that is this skill).
llm-cost-optimizer: Use when reducing LLM API spend. Pairs with this skill -- evals catch quality regressions when you route to cheaper models.
rag-architect: Use when designing retrieval pipelines. Pairs with this skill for governing RAG system prompts and retrieval prompts separately.
ci-cd-pipeline-builder: Use when building CI/CD pipelines. Pairs with this skill for automating eval runs in CI.
observability-designer: Use when designing monitoring. Pairs with this skill for production prompt quality dashboards.

alirezarezvani/prompt-governance

engineering/prompt-governance/skills/prompt-governance/SKILL.md

Use when managing prompts in production at scale: versioning prompts, running A/B tests on prompts, building prompt registries, preventing prompt regressions, or creating eval pipelines for production AI features. Triggers: 'manage prompts in production', 'prompt versioning', 'prompt regression', 'prompt A/B test', 'prompt registry', 'eval pipeline'. NOT for writing or improving individual prompts (use senior-prompt-engineer). NOT for RAG pipeline design (use rag-architect). NOT for LLM cost reduction (use llm-cost-optimizer).

14,469 stars

development

Updated May 12, 2026

$ install --global

skillsauth

npx skillsauth add alirezarezvani/claude-skills prompt-governance

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 12, 2026, 3:56 AM166.8s1 file scanned

SKILL.md

name:: prompt-governance
description:: Use when managing prompts in production at scale: versioning prompts, running A/B tests on prompts, building prompt registries, preventing prompt regressions, or creating eval pipelines for production AI features. Triggers: 'manage prompts in production', 'prompt versioning', 'prompt regression', 'prompt A/B test', 'prompt registry', 'eval pipeline'. NOT for writing or improving individual prompts (use senior-prompt-engineer). NOT for RAG pipeline design (use rag-architect). NOT for LLM cost reduction (use llm-cost-optimizer).

Prompt Governance

Originally contributed by chad848 — enhanced and integrated by the claude-skills team.

Prompts are code. They change behavior in production. Ship them like code.

Before Starting

Check for context first: If project-context.md exists, read it before asking questions. Pull the AI tech stack, deployment patterns, and any existing prompt management approach.

Gather this context (ask in one shot):

1. Current State

How are prompts currently stored? (hardcoded in code, config files, database, prompt management tool?)
How many distinct prompts are in production?
Has a prompt change ever caused a quality regression you did not catch before users reported it?

2. Goals

What is the primary pain? (versioning chaos, no evals, blind A/B testing, slow iteration?)
Team size and prompt ownership model? (one engineer owns all prompts vs. many contributors?)
Tooling constraints? (open-source only, existing CI/CD, cloud provider?)

3. AI Stack

LLM provider(s) in use?
Frameworks in use? (LangChain, LlamaIndex, custom, direct API?)
Existing test/CI infrastructure?

How This Skill Works

Mode 1: Build Prompt Registry

No centralized prompt management today. Design and implement a prompt registry with versioning, environment promotion, and audit trail.

Mode 2: Build Eval Pipeline

Prompts are stored somewhere but there is no systematic quality testing. Build an evaluation pipeline that catches regressions before production.

Mode 3: Governed Iteration

Registry and evals exist. Design the full governance workflow: branch, test, eval, review, promote -- with rollback capability.

Mode 1: Build Prompt Registry

What a prompt registry provides:

Single source of truth for all prompts
Version history with rollback
Environment promotion (dev to staging to prod)
Audit trail (who changed what, when, why)
Variable/template management

Minimum Viable Registry (File-Based)

For small teams: structured files in version control.

Directory layout:

prompts/
  registry.yaml          # Index of all prompts
  summarizer/
    v1.0.0.md            # Prompt content
    v1.1.0.md
  classifier/
    v1.0.0.md
  qa-bot/
    v2.1.0.md

Registry YAML schema:

prompts:
  - id: summarizer
    description: "Summarize support tickets for agent triage"
    owner: platform-team
    model: claude-sonnet-4-5
    versions:
      - version: 1.1.0
        file: summarizer/v1.1.0.md
        status: production
        promoted_at: 2026-03-15
        promoted_by: [email protected]
      - version: 1.0.0
        file: summarizer/v1.0.0.md
        status: archived

Production Registry (Database-Backed)

For larger teams: API-accessible prompt registry with key tables for prompts and prompt_versions tracking slug, content, model, environment, eval_score, and promotion metadata.

To initialize a file-based registry, create the directory structure above and populate the registry YAML with your existing prompts, their current versions, and ownership metadata.

Mode 2: Build Eval Pipeline

The problem: Prompt changes are deployed by feel. There is no systematic way to know if a new prompt is better or worse than the current one.

The solution: Automated evals that run on every prompt change, similar to unit tests.

Eval Types

Golden Dataset Design

Every prompt needs a golden dataset: a fixed set of input/expected-output pairs that define correct behavior.

Golden dataset requirements:

Minimum 20 examples for basic coverage, 100+ for production confidence
Cover edge cases and failure modes, not just happy path
Reviewed and approved by domain expert, not just the engineer who wrote the prompt
Versioned alongside the prompt (a prompt change may require golden set updates)

Eval Pipeline Implementation

Pass thresholds (calibrate to your use case):

Classification/extraction: 95% or higher exact match
Summarization: 0.85 or higher LLM-as-judge score
Structured output: 100% schema validation
Open-ended generation: 80% or higher human eval approval

Mode 3: Governed Iteration

The full prompt deployment lifecycle with gates at each stage:

BRANCH -- Create feature branch for prompt change
DEVELOP -- Edit prompt in dev environment, manual testing
EVAL -- Run eval pipeline vs. golden dataset (automated in CI)
COMPARE -- Compare new prompt eval score vs. current production score
REVIEW -- PR review: eval results plus diff of prompt changes
PROMOTE -- Staging to Production with approval gate
MONITOR -- Watch production metrics for 24-48h post-deploy
ROLLBACK -- One-command rollback to previous version if needed

A/B Testing Prompts

When you want to measure real-user impact, not just eval scores:

Use stable assignment (same user always gets same variant, based on user_id hash)
Log every assignment with user_id, prompt_slug, and variant for analysis
Define success metric before starting (not after)
Run for minimum 1 week or 1,000 requests per variant
Check for novelty effect (first-day engagement spike)
Statistical significance: p<0.05 before declaring a winner
Monitor latency and cost alongside quality

Rollback Playbook

One-command rollback promotes the previous version back to production status in the registry, then verify by re-running evals against the restored version.

Proactive Triggers

Surface these without being asked:

Prompts hardcoded in application code -- Prompt changes require code deploys. This slows iteration and mixes concerns. Flag immediately.
No golden dataset for production prompts -- You are flying blind. Any prompt change could silently regress quality.
Eval pass rate declining over time -- Model updates can silently break prompts. Scheduled evals catch this before users do.
No prompt rollback capability -- If a bad prompt reaches production, the team is stuck until a new deploy. Always have rollback.
One person owns all prompt knowledge -- Bus factor risk. Prompt registry and docs equal knowledge that survives team changes.
Prompt changes deployed without eval -- Every uneval'd deploy is a bet. Flag when the team skips evals "just this once."

Output Artifacts

Communication

All output follows the structured standard:

Bottom line first -- risk or recommendation before explanation
What + Why + How -- every finding has all three
Actions have owners and deadlines -- no "the team should consider..."
Confidence tagging -- verified / medium / assumed

Anti-Patterns

Related Skills

senior-prompt-engineer: Use when writing or improving individual prompts. NOT for managing prompts in production at scale (that is this skill).
llm-cost-optimizer: Use when reducing LLM API spend. Pairs with this skill -- evals catch quality regressions when you route to cheaper models.
rag-architect: Use when designing retrieval pipelines. Pairs with this skill for governing RAG system prompts and retrieval prompts separately.
ci-cd-pipeline-builder: Use when building CI/CD pipelines. Pairs with this skill for automating eval runs in CI.
observability-designer: Use when designing monitoring. Pairs with this skill for production prompt quality dashboards.

Related Skills

alirezarezvani/weekly-review

development

VerifiedTrustedCommunity

Use when someone wants to run a weekly review, close open loops, audit stalled projects and commitments, get their system back to trusted, restart a lapsed review habit, or says "/cs:weekly-review". Walks David Allen's three-phase loop — GET CLEAR, GET CURRENT, GET CREATIVE — with deterministic scripts that inventory open loops, gate the checklist with named gaps, and score commitment health 0-100.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/weekly-review

alirezarezvani/meetings

development

VerifiedTrustedCommunity

Use when someone wants to decide whether a meeting is worth calling, price a meeting in dollars, build a timeboxed agenda with desired outcomes, or turn messy meeting notes into owned action items — or says "should this be a meeting", "/cs:meeting-prep", or "/cs:meeting-actions". Runs a cost gate (ASYNC / NOT-READY / MEET), builds a decision-first agenda, and extracts an owner + due-date checklist that flags every orphan.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/meetings

alirezarezvani/fable-goal

development

VerifiedTrustedCommunity

Convert a rambling description of a desired outcome into one polished, autonomous /goal prompt ready to paste into a fresh session. Use when the user says "/fable-goal", "turn this into a goal prompt", "write me a fable prompt", "write the prompt that builds X", or rambles about something they want made and asks for the prompt that makes it happen. The output is a single copy-paste prompt, never the build itself. Do NOT use when the user wants the thing built right now in this session — only when they want the PROMPT that will make it happen in a fresh session.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/fable-goal

alirezarezvani/deep-work

development

VerifiedTrustedCommunity

Use when someone wants to plan a deep work day, time-block their calendar or task list, budget or cut shallow work, protect focus hours, track deep-work sessions and streaks, run an end-of-day shutdown ritual, or says "/deep-work" or "/time-block". Classifies tasks deep vs shallow, builds an energy-first time-blocked schedule that refuses deep demand past the 4-hour ceiling, batches shallow work into at most two windows, and logs focus sessions against a weekly target.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/deep-work

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/alirezarezvani/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/engineering/prompt-governance/skills/prompt-governance ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

alirezarezvani/claude-skills

14,469 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT