Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

alirezarezvani/senior-prompt-engineer

Name: senior-prompt-engineer
Author: alirezarezvani

engineering-team/skills/senior-prompt-engineer/SKILL.md

npx skillsauth add alirezarezvani/claude-skills senior-prompt-engineer

Senior Prompt Engineer

Eval-driven prompt engineering, RAG quality measurement, and agent workflow validation. Everything here is model-agnostic by design: techniques are framed by what they do, not by which model generation they were observed on, and the tools never hardcode model IDs or pricing — you supply your provider's current rates when you want dollar figures.

Operating Rules

Never change a prompt without a baseline. Capture metrics first (--analyze --json --output baseline.json), then compare every iteration against it.
Eval set before optimization. 10–20 representative cases with expected outputs minimum. If the user has no eval set, build one with them before touching the prompt — optimizing against vibes is the #1 failure mode.
Prefer platform features over prompt hacks. If the provider offers native structured outputs / JSON schema enforcement, tool-use APIs, or prompt caching, use those instead of "respond ONLY with JSON" incantations. Prompt-level format enforcement is the fallback, not the default.
Current-generation models need less scaffolding. Don't add chain-of-thought boilerplate, role framing, or few-shot examples reflexively — frontier models often do worse with redundant scaffolding. Add each element only when the eval set shows it helps.
Cost numbers are always user-supplied. Look up the provider's current per-Mtok pricing and pass it via --price-per-mtok (never trust a cached price table — including any you remember).

Tools (exact CLIs, all stdlib)

1. Prompt Optimizer — `scripts/prompt_optimizer.py`

Static analysis: token estimate, clarity/structure scores (0–100), ambiguity + redundancy detection, few-shot example extraction.

# Full analysis (human-readable report)
python3 scripts/prompt_optimizer.py prompt.txt --analyze

# Save machine-readable baseline for later comparison
python3 scripts/prompt_optimizer.py prompt.txt --analyze --json --output baseline.json

# Token estimate; cost only if you supply your provider's current rate
python3 scripts/prompt_optimizer.py prompt.txt --tokens --model claude --price-per-mtok 3.00

# Whitespace/redundancy-trimmed version
python3 scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt

# Extract Input/Output few-shot pairs to JSON
python3 scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json

# Compare a revision against the saved baseline
python3 scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json

--model accepts any string; only the tokenizer family is inferred (names containing "claude" → 3.5 chars/token, otherwise 4.0). Exit 0 on success, 1 on missing file.
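The character-count heuristic described above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the function name `estimate_tokens`, the case-insensitive match on "claude", and rounding up are all assumptions; only the 3.5 and 4.0 chars/token ratios come from the text.

```python
import math


def estimate_tokens(text: str, model: str = "") -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    # Ratios taken from the description above; the matching rule is an assumption.
    chars_per_token = 3.5 if "claude" in model.lower() else 4.0
    # Rounding up is an assumption, so a short non-empty prompt never counts as 0.
    return math.ceil(len(text) / chars_per_token)


prompt = "x" * 7000  # stand-in for the contents of prompt.txt
print(estimate_tokens(prompt, "claude"))  # 2000
print(estimate_tokens(prompt, "other"))   # 1750
```

Because the estimate depends only on length, it is useful for comparing two versions of the same prompt, and less so as an absolute count for billing.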

2. RAG Evaluator — `scripts/rag_evaluator.py`

Measures retrieval and grounding quality from two JSON files (formats printed in --help).

python3 scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --k 10 --json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --compare baseline_report.json

Reports context relevance, precision@k, coverage, answer faithfulness, groundedness. Treat relevance < 0.80 as a retrieval problem (chunking/embedding/filtering), not a prompt problem — fix retrieval before rewriting the generation prompt.
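For readers unfamiliar with precision@k, the sketch below shows the textbook definition: the fraction of the top k retrieved items that are relevant. The source does not show how `rag_evaluator.py` computes it, so treat this as an illustration of the metric rather than the tool's implementation; the function name and the document ids are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant.

    Divides by k even if fewer than k items were retrieved, which is one
    common convention (an assumption here, not taken from the tool).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


retrieved = ["d1", "d7", "d3", "d9"]  # ranked retrieval results (hypothetical ids)
relevant = ["d1", "d3", "d4"]         # ground-truth relevant documents
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(precision_at_k(retrieved, relevant, k=1))  # 1.0
```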

3. Agent Orchestrator — `scripts/agent_orchestrator.py`

Validates agent configs (YAML/JSON): tool wiring, missing required config, loop risk, token estimates.

python3 scripts/agent_orchestrator.py agent.yaml --validate
python3 scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
python3 scripts/agent_orchestrator.py agent.yaml --estimate-cost --runs 100 \
    --input-price-per-mtok 3.00 --output-price-per-mtok 15.00

Without the two price flags, --estimate-cost reports token estimates only. The model: field in the config is informational — any model name is accepted.
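The arithmetic behind a cost estimate with separate input and output prices can be sketched as follows. This is an assumption about the general shape of the calculation, not the script's code: the function name and the per-run token counts are hypothetical, and the prices are the placeholder values from the example command, not real provider rates.

```python
def estimate_run_cost(input_tokens, output_tokens, runs,
                      input_price_per_mtok, output_price_per_mtok):
    """Return (cost_per_run, total_cost) in dollars.

    Prices are per million tokens and must be supplied by the caller.
    """
    per_run = (input_tokens * input_price_per_mtok
               + output_tokens * output_price_per_mtok) / 1_000_000
    return per_run, per_run * runs


# Hypothetical token estimates for one agent run.
per_run, total = estimate_run_cost(
    input_tokens=4000, output_tokens=800, runs=100,
    input_price_per_mtok=3.00, output_price_per_mtok=15.00,
)
print(f"per run: ${per_run:.4f}")   # per run: $0.0240
print(f"100 runs: ${total:.2f}")    # 100 runs: $2.40
```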

Workflows

Prompt Optimization (eval-gated)

1. Baseline: `python3 scripts/prompt_optimizer.py current_prompt.txt --analyze --json --output baseline.json`
2. Diagnose from the report: ambiguous verbs ("analyze", "handle"), redundant blocks, missing output contract, token waste.
3. Apply one change at a time, in this order of leverage:

   | Symptom | Fix |
   |---------|-----|
   | Malformed/unparseable output | Native structured outputs / JSON schema if the API supports it; explicit schema-in-prompt otherwise |
   | Inconsistent answers across runs | Tighten instructions + add 2–3 contrastive examples (one near-miss showing what NOT to do) |
   | Misses edge cases | Enumerate the edge cases explicitly; add a "when uncertain, do X" rule |
   | Token bloat on repeated calls | Move stable prefix (system rules, examples) first so prompt caching applies; trim redundancy |
   | Wrong reasoning on hard cases | Ask for stepwise reasoning in a scratch field the consumer ignores, or use the provider's extended-thinking mode |

4. Re-analyze and compare: `python3 scripts/prompt_optimizer.py revised.txt --analyze --compare baseline.json`

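Eval gate (must pass before shipping): run the revised prompt over the eval set and write per-case pass/fail to eval_results.json. Then assert the structural half of the gate, which requires that the clarity score has not dropped and the token count has grown by at most 10%: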

python3 scripts/prompt_optimizer.py revised.txt --analyze --json --output revised.json \
  && python3 -c "
import json, sys
r = json.load(open('revised.json')); b = json.load(open('baseline.json'))
ok = r['clarity_score'] >= b['clarity_score'] and r['token_count'] <= b['token_count'] * 1.10
sys.exit(0 if ok else 1)"
echo "gate exit=$?"   # 0 = ship; 1 = regression, iterate again

Pair this structural gate with your task-level eval: the revision must not lose any previously-passing eval case (no-regression rule).
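The no-regression rule can be checked mechanically once per-case results exist for both prompt versions. The sketch below assumes a layout the source does not specify: each results file is a flat JSON object mapping a case id to true/false, loaded here as plain dicts. The function name and case ids are hypothetical.

```python
def lost_cases(baseline: dict, revised: dict) -> list:
    """Return ids of cases that passed in the baseline but not in the revision.

    A case missing from the revised results counts as lost.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not revised.get(case_id, False)
    )


# In practice these would come from json.load() on the two results files.
baseline_results = {"case-01": True, "case-02": True, "case-03": False}
revised_results = {"case-01": True, "case-02": False, "case-03": True}

lost = lost_cases(baseline_results, revised_results)
print(lost)  # ['case-02']
print("ship" if not lost else "regression, iterate again")
```

Note that case-03 newly passing does not offset case-02 failing: the rule is about not losing any previously passing case, not about the net pass count.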

Few-Shot Example Design

1. Define the task contract first (input shape, output shape, edge-case policy).
2. Start with zero examples and measure; current models often need none. Add examples only for failure clusters the eval reveals.
3. When adding: 3–5 max, ordered simple → edge → negative (what NOT to extract), formatted identically to the real output contract.
4. Validate consistency: run `python3 scripts/prompt_optimizer.py prompt_with_examples.txt --extract-examples --output examples.json` and inspect that every extracted pair parses against your schema.
5. Re-run the eval set; if a case passes only because it resembles an example, add a held-out variant to the eval set.

Structured Output Design

1. Write the JSON Schema first (types, enums, required, maxLength).
2. Prefer API-native enforcement: structured-outputs / response-schema / tool-call parameters guarantee shape; prompt text cannot.
3. Fallback (API without schema support): include the schema rendered as field-by-field rules + one valid example, and instruct "output only the JSON object".
4. Gate: pipe 10 eval outputs, one JSON object per line, through a schema validator. At minimum, confirm that every line parses with `python3 -c "import json,sys; [json.loads(l) for l in sys.stdin]"`. All 10 must parse; otherwise return to step 2.
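The one-liner in the gate only proves that each line is valid JSON. A slightly stronger stdlib-only check, sketched below, also verifies required fields and their top-level types. It is not a JSON Schema validator (full validation needs a third-party library such as `jsonschema`), and the contract in `REQUIRED` and the function name are hypothetical.

```python
import json

# Hypothetical output contract: field name -> expected Python type.
REQUIRED = {"label": str, "tags": list}


def check_line(line: str, required: dict) -> list:
    """Return a list of problems for one output line (empty list means OK)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in required.items():
        if field not in obj:
            problems.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


outputs = [
    '{"label": "spam", "tags": ["promo"]}',
    '{"label": "ham"}',
    'Sure! Here is the JSON you asked for',
]
failures = {i: p for i, line in enumerate(outputs) if (p := check_line(line, REQUIRED))}
print(f"{len(outputs) - len(failures)}/{len(outputs)} passed")  # 1/3 passed
```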

RAG Tuning Loop

1. Build questions.json (id, question, reference answer) and capture current retrievals to contexts.json.
2. Record the baseline: `python3 scripts/rag_evaluator.py --contexts contexts.json --questions questions.json --output rag_baseline.json`
3. Fix the lowest metric first: relevance → chunking/embeddings/metadata filters; faithfulness → grounding instructions + "answer only from context" + citation requirement; coverage → retrieval k / query expansion.
4. Gate: `python3 scripts/rag_evaluator.py --contexts new_contexts.json --questions questions.json --compare rag_baseline.json`. Every metric must be ≥ baseline; any regression blocks the change.
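The "every metric must be ≥ baseline" rule amounts to a per-metric comparison of two reports. The sketch below shows that logic on plain dicts. The metric key names and the flat report layout are assumptions, since the source does not show the JSON that the evaluator writes, and the numbers are made up for illustration.

```python
def metric_regressions(baseline: dict, current: dict) -> dict:
    """Return {metric: (baseline_value, current_value)} for metrics that dropped.

    A metric missing from `current` is reported with a current value of None.
    """
    dropped = {}
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is None or new_value < base_value:
            dropped[name] = (base_value, new_value)
    return dropped


# Hypothetical reports; real ones would be loaded from the report JSON files.
baseline = {"context_relevance": 0.82, "faithfulness": 0.90, "coverage": 0.75}
current = {"context_relevance": 0.86, "faithfulness": 0.88, "coverage": 0.75}

print(metric_regressions(baseline, current))  # {'faithfulness': (0.9, 0.88)}
```

Here relevance improved and coverage held, but the drop in faithfulness alone is enough to block the change.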

Agent Config Review

1. `python3 scripts/agent_orchestrator.py agent.yaml --validate` must exit with VALIDATION PASSED; fix every error and warning (missing tool config, unbounded iterations, loop risk).
2. Check context discipline: each tool description ≤ 1–2 sentences, tool count minimal for the job, stable system prompt placed first (cache-friendly), iteration cap + early-exit condition present.
3. Budget: run `--estimate-cost --runs N` with your current prices; if cost/run exceeds budget, cut tools or context before downgrading the model.
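Two of the context-discipline checks above, an iteration cap and short tool descriptions, are easy to automate. The sketch below is only an illustration: the config keys `max_iterations`, `tools`, `name`, and `description` are hypothetical and may not match the schema that `agent_orchestrator.py` expects, and the sentence count is a crude split on terminal punctuation.

```python
import re


def review_config(config: dict, max_sentences: int = 2) -> list:
    """Return warnings for a parsed agent config (hypothetical key names)."""
    warnings = []
    # A missing or zero cap is treated as unbounded iteration.
    if not config.get("max_iterations"):
        warnings.append("no iteration cap (max_iterations) set")
    for tool in config.get("tools", []):
        description = tool.get("description", "")
        # Crude sentence count: split on ., ! or ? and drop empty pieces.
        sentences = [s for s in re.split(r"[.!?]", description) if s.strip()]
        if len(sentences) > max_sentences:
            warnings.append(
                f"tool {tool.get('name', '?')}: description longer than "
                f"{max_sentences} sentences"
            )
    return warnings


config = {
    "model": "any-model-name",
    "tools": [
        {"name": "search", "description": "Search the docs. Returns top hits."},
        {"name": "calc", "description": "Adds. Subtracts. Multiplies."},
    ],
}
for warning in review_config(config):
    print(warning)
```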

References

| File | Contains | Load when user asks about |
|------|----------|---------------------------|
| references/prompt_engineering_patterns.md | 10 prompt patterns with input/output examples | "which pattern?", few-shot design, decomposition, meta-prompting |
| references/llm_evaluation_frameworks.md | Eval metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| references/agentic_system_design.md | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |

Related Skills

engineering-team/skills/senior-ml-engineer — model deployment and serving (this skill stops at the prompt/eval layer)
engineering/rag-architect — RAG system architecture (this skill measures RAG quality; that one designs the pipeline)
engineering/agent-designer — full agent system design (this skill validates configs; that one designs the architecture)

Use when the user asks to optimize prompts, design prompt templates, evaluate LLM outputs with an eval set, measure RAG retrieval quality, validate agent/tool configurations, analyze token usage, or design structured-output contracts. Covers eval-driven prompt iteration, RAG metrics (relevance, faithfulness, coverage), agent workflow validation, and token/cost budgeting — all model-agnostic, with three stdlib Python tools.

17,936 stars

tools

Updated Jun 13, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add alirezarezvani/claude-skills senior-prompt-engineer

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%).
Semgrep: Static code analysis for vulnerabilities. Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%).
Snyk (dep): Open source security scanning. Skipped (50%).
Socket.dev: Supply chain security analysis. Skipped (50%).
VirusTotal: Multi-engine malware detection. Skipped (50%).
CrowdStrike: Advanced threat intelligence. Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%).
OWASP Dep-Check: Skipped (50%).

Last scanned: Jun 13, 2026, 4:20 AM · 76.0s · 7 files scanned

Related Skills

alirezarezvani/weekly-review

development

Verified, Trusted, Community

Use when someone wants to run a weekly review, close open loops, audit stalled projects and commitments, get their system back to trusted, restart a lapsed review habit, or says "/cs:weekly-review". Walks David Allen's three-phase loop — GET CLEAR, GET CURRENT, GET CREATIVE — with deterministic scripts that inventory open loops, gate the checklist with named gaps, and score commitment health 0-100.

22,702 · SKILL.md · Updated Jul 18, 2026

alirezarezvani/meetings

development

Verified, Trusted, Community

Use when someone wants to decide whether a meeting is worth calling, price a meeting in dollars, build a timeboxed agenda with desired outcomes, or turn messy meeting notes into owned action items — or says "should this be a meeting", "/cs:meeting-prep", or "/cs:meeting-actions". Runs a cost gate (ASYNC / NOT-READY / MEET), builds a decision-first agenda, and extracts an owner + due-date checklist that flags every orphan.

22,702 · SKILL.md · Updated Jul 18, 2026

alirezarezvani/fable-goal

development

Verified, Trusted, Community

Convert a rambling description of a desired outcome into one polished, autonomous /goal prompt ready to paste into a fresh session. Use when the user says "/fable-goal", "turn this into a goal prompt", "write me a fable prompt", "write the prompt that builds X", or rambles about something they want made and asks for the prompt that makes it happen. The output is a single copy-paste prompt, never the build itself. Do NOT use when the user wants the thing built right now in this session — only when they want the PROMPT that will make it happen in a fresh session.

22,702 · SKILL.md · Updated Jul 18, 2026

alirezarezvani/deep-work

development

Verified, Trusted, Community

Use when someone wants to plan a deep work day, time-block their calendar or task list, budget or cut shallow work, protect focus hours, track deep-work sessions and streaks, run an end-of-day shutdown ritual, or says "/deep-work" or "/time-block". Classifies tasks deep vs shallow, builds an energy-first time-blocked schedule that refuses deep demand past the 4-hour ceiling, batches shallow work into at most two windows, and logs focus sessions against a weekly target.

22,702 · SKILL.md · Updated Jul 18, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/alirezarezvani/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/engineering-team/skills/senior-prompt-engineer ~/.claude/skills/

Repository

alirezarezvani/claude-skills

17,936 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
