engineering-team/skills/senior-prompt-engineer/SKILL.md
Use when the user asks to optimize prompts, design prompt templates, evaluate LLM outputs with an eval set, measure RAG retrieval quality, validate agent/tool configurations, analyze token usage, or design structured-output contracts. Covers eval-driven prompt iteration, RAG metrics (relevance, faithfulness, coverage), agent workflow validation, and token/cost budgeting — all model-agnostic, with three stdlib Python tools.
npx skillsauth add alirezarezvani/claude-skills senior-prompt-engineerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Eval-driven prompt engineering, RAG quality measurement, and agent workflow validation. Everything here is model-agnostic by design: techniques are framed by what they do, not by which model generation they were observed on, and the tools never hardcode model IDs or pricing — you supply your provider's current rates when you want dollar figures.
--analyze --output baseline.json), then compare every iteration against it.--price-per-mtok (never trust a cached price table — including any you remember).scripts/prompt_optimizer.pyStatic analysis: token estimate, clarity/structure scores (0–100), ambiguity + redundancy detection, few-shot example extraction.
# Full analysis (human-readable report)
python3 scripts/prompt_optimizer.py prompt.txt --analyze
# Save machine-readable baseline for later comparison
python3 scripts/prompt_optimizer.py prompt.txt --analyze --json --output baseline.json
# Token estimate; cost only if you supply your provider's current rate
python3 scripts/prompt_optimizer.py prompt.txt --tokens --model claude --price-per-mtok 3.00
# Whitespace/redundancy-trimmed version
python3 scripts/prompt_optimizer.py prompt.txt --optimize --output optimized.txt
# Extract Input/Output few-shot pairs to JSON
python3 scripts/prompt_optimizer.py prompt.txt --extract-examples --output examples.json
# Compare a revision against the saved baseline
python3 scripts/prompt_optimizer.py optimized.txt --analyze --compare baseline.json
--model accepts any string; only the tokenizer family is inferred (names containing "claude" → 3.5 chars/token, otherwise 4.0). Exit 0 on success, 1 on missing file.
scripts/rag_evaluator.pyMeasures retrieval and grounding quality from two JSON files (formats printed in --help).
python3 scripts/rag_evaluator.py --contexts retrieved.json --questions eval_set.json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --k 10 --json
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --output report.json --verbose
python3 scripts/rag_evaluator.py --contexts ctx.json --questions q.json --compare baseline_report.json
Reports context relevance, precision@k, coverage, answer faithfulness, groundedness. Treat relevance < 0.80 as a retrieval problem (chunking/embedding/filtering), not a prompt problem — fix retrieval before rewriting the generation prompt.
scripts/agent_orchestrator.pyValidates agent configs (YAML/JSON): tool wiring, missing required config, loop risk, token estimates.
python3 scripts/agent_orchestrator.py agent.yaml --validate
python3 scripts/agent_orchestrator.py agent.yaml --visualize --format mermaid
python3 scripts/agent_orchestrator.py agent.yaml --estimate-cost --runs 100 \
--input-price-per-mtok 3.00 --output-price-per-mtok 15.00
Without the two price flags, --estimate-cost reports token estimates only. The model: field in the config is informational — any model name is accepted.
python3 scripts/prompt_optimizer.py current_prompt.txt --analyze --json --output baseline.jsonpython3 scripts/prompt_optimizer.py revised.txt --analyze --compare baseline.jsoneval_results.json, then assert:
python3 scripts/prompt_optimizer.py revised.txt --analyze --json --output revised.json \
&& python3 -c "
import json, sys
r = json.load(open('revised.json')); b = json.load(open('baseline.json'))
ok = r['clarity_score'] >= b['clarity_score'] and r['token_count'] <= b['token_count'] * 1.10
sys.exit(0 if ok else 1)"
echo "gate exit=$?" # 0 = ship; 1 = regression, iterate again
Pair this structural gate with your task-level eval: the revision must not lose any previously-passing eval case (no-regression rule).python3 scripts/prompt_optimizer.py prompt_with_examples.txt --extract-examples --output examples.json and inspect that every extracted pair parses against your schema.python3 -c "import json,sys; [json.loads(l) for l in sys.stdin]" at minimum); 10/10 must parse, else return to step 2.questions.json (id, question, reference answer) and capture current retrievals to contexts.json.python3 scripts/rag_evaluator.py --contexts contexts.json --questions questions.json --output rag_baseline.jsonpython3 scripts/rag_evaluator.py --contexts new_contexts.json --questions questions.json --compare rag_baseline.json — every metric must be ≥ baseline; any regression blocks the change.python3 scripts/agent_orchestrator.py agent.yaml --validate — must exit with VALIDATION PASSED; fix every error and warning (missing tool config, unbounded iterations, loop risk).--estimate-cost --runs N with your current prices; if cost/run exceeds budget, cut tools or context before downgrading the model.| File | Contains | Load when user asks about |
|------|----------|---------------------------|
| references/prompt_engineering_patterns.md | 10 prompt patterns with input/output examples | "which pattern?", few-shot design, decomposition, meta-prompting |
| references/llm_evaluation_frameworks.md | Eval metrics, scoring methods, A/B testing | "how to evaluate?", "measure quality", "compare prompts" |
| references/agentic_system_design.md | Agent architectures (ReAct, Plan-Execute, Tool Use) | "build agent", "tool calling", "multi-agent" |
engineering-team/skills/senior-ml-engineer — model deployment and serving (this skill stops at the prompt/eval layer)engineering/rag-architect — RAG system architecture (this skill measures RAG quality; that one designs the pipeline)engineering/agent-designer — full agent system design (this skill validates configs; that one designs the architecture)data-ai
Use when you want to understand what Claude contributed vs what you drove in a session. Triggers on: /collab-proof, session retrospective, ai contribution analysis, collaboration evidence, what did claude do.
data-ai
Personal coach that teaches users to become Claude power users. Use this skill the FIRST time a user asks to "learn Claude", "be a power user", "coach me", "teach me Claude tricks", "what can Claude do", "make me better at prompting", or any variation. After activation, also use it on EVERY subsequent turn to detect missed optimization opportunities (vague prompts, ignored capabilities, manual work Claude could automate) and surface a single power-user tip. Trigger generously — most users do not know what they do not know, so err on the side of coaching.
development
Use when designing or revisiting product pricing — selecting a pricing model (subscription seat-based, usage-based, value-based, freemium, or hybrid), running Van Westendorp Price Sensitivity Meter analysis on WTP survey data, or designing Good/Better/Best packaging tiers. Recommends a model and a price range with trade-offs, never a single number. For Commercial leads, Product Marketing, and CMOs at the pricing-design moment — not deal-by-deal discounting, not brand positioning.
testing
Use when a startup is approached by a prospective partner and someone has to decide should we sign this partner, at what partner tier (referral / reseller / OEM / SI-consulting / strategic alliance), with what joint GTM commitment, and at what revshare. Classifies partner tier from independent-demand evidence vs. preferential-terms hunting, designs a 90-day joint GTM plan, models revshare against direct-sale margin, and surfaces kill criteria for unwinding under-performing partnerships. For Head of Partnerships, Head of BD, and Founder-CEOs doing reseller agreement, OEM deal, or strategic alliance review — not technical sale enablement, not channel cost economics, not M&A.