skills/prompt-lab/SKILL.md
LLM prompt engineering: analyzes failure modes, generates variants (direct, few-shot, CoT), designs rubrics, produces test suites. Triggers on: "prompt engineering", "generate prompt variants", "A/B test prompts", "optimize prompt", "improve this prompt". NOT for SKILL.md files, use skill-evaluator.
`npx skillsauth add mathews-tom/armory prompt-lab`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Replaces trial-and-error prompt engineering with structured methodology: objective definition, current prompt analysis, variant generation (instruction clarity, example strategies, output format specification), evaluation rubric design, test case creation, and failure mode identification.
| File | Contents | Load When |
| ---------------------------------- | ------------------------------------------------------------------------------ | -------------------------- |
| references/prompt-patterns.md | Prompt structure catalog: zero-shot, few-shot, CoT, persona, structured output | Always |
| references/evaluation-metrics.md | Quality metrics (accuracy, format compliance, completeness), rubric design | Evaluation needed |
| references/failure-modes.md | Common prompt failure taxonomy, detection strategies, mitigations | Failure analysis requested |
| references/output-constraints.md | Techniques for constraining LLM output format, JSON mode, schema enforcement | Format control needed |
If an existing prompt is provided:

- Which failure modes (see references/failure-modes.md) apply to this prompt?

Create 2-4 prompt variants, each testing a different hypothesis:
| Variant Type       | Hypothesis                           | When to Use                      |
| ------------------ | ------------------------------------ | -------------------------------- |
| Direct instruction | Clear instruction is sufficient      | Simple tasks, capable models     |
| Few-shot           | Examples improve output consistency  | Pattern-following tasks          |
| Chain-of-thought   | Reasoning improves accuracy          | Multi-step logic, math, analysis |
| Persona/role       | Role framing improves tone/expertise | Domain-specific tasks            |
| Structured output  | Format specification prevents errors | JSON, CSV, specific templates    |
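As a rough illustration of the first three rows of the variant table, the sketch below builds a direct, a few-shot, and a chain-of-thought variant from one task description. The names `Variant`, `build_variants`, and `render`, the `{input}` placeholder, and the exact prompt wording are all assumptions made for this example; they are not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    hypothesis: str
    prompt: str  # contains a literal "{input}" placeholder


def build_variants(task: str, examples: list[tuple[str, str]]) -> list[Variant]:
    """Build three variants of the same task, one hypothesis each (hypothetical helper)."""
    # Direct instruction: the task statement alone.
    direct = Variant(
        "Direct instruction",
        "Clear instruction is sufficient",
        f"{task}\n\nInput: {{input}}\nOutput:",
    )
    # Few-shot: the same task preceded by worked input/output pairs.
    shots = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    few_shot = Variant(
        "Few-shot",
        "Examples improve output consistency",
        f"{task}\n\n{shots}\n\nInput: {{input}}\nOutput:",
    )
    # Chain-of-thought: ask for reasoning before a clearly marked final answer.
    cot = Variant(
        "Chain-of-thought",
        "Reasoning improves accuracy",
        f"{task}\n\nThink step by step, then give the final answer "
        f"on a line starting with 'Answer:'.\n\nInput: {{input}}",
    )
    return [direct, few_shot, cot]


def render(variant: Variant, text: str) -> str:
    """Fill the placeholder; str.replace avoids str.format issues with braces in inputs."""
    return variant.prompt.replace("{input}", text)
```

Keeping every variant on the same task text and the same placeholder means the only thing that differs between them is the strategy under test.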
For each variant:
Rubric — Define weighted criteria:
| Criterion         | What It Measures               | Typical Weight |
| ----------------- | ------------------------------ | -------------- |
| Correctness       | Output matches expected answer | 30-50%         |
| Format compliance | Follows specified structure    | 15-25%         |
| Completeness      | All required elements present  | 15-25%         |
| Conciseness       | No unnecessary content         | 5-15%          |
| Tone/style        | Matches requested voice        | 5-10%          |
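A minimal sketch of how weighted criteria combine into a single score, assuming each criterion is first scored on a normalized 0-1 scale. The function name `weighted_score` and the specific weights (chosen from within the typical ranges above so that they sum to 100%) are illustrative assumptions.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (0-1) into one weighted total (hypothetical helper)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for criteria: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


# Example weights, each inside the typical range from the rubric table.
WEIGHTS = {
    "correctness": 0.40,
    "format_compliance": 0.20,
    "completeness": 0.20,
    "conciseness": 0.10,
    "tone_style": 0.10,
}

# One output: correct and well-formatted, half complete, concise, wrong tone.
example = {
    "correctness": 1.0,
    "format_compliance": 1.0,
    "completeness": 0.5,
    "conciseness": 1.0,
    "tone_style": 0.0,
}
total = weighted_score(example, WEIGHTS)  # 0.4 + 0.2 + 0.1 + 0.1 + 0.0 = 0.8
```

Rejecting weights that do not sum to 1.0 keeps scores comparable across variants when the rubric is edited between runs.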
Test cases — Minimum 5 cases covering:
Present variants, rubric, and test cases in a structured format ready for execution.
## Prompt Lab: {Task Name}
### Objective
{What the prompt should achieve — specific and measurable}
### Success Criteria
- [ ] {Criterion 1 — measurable}
- [ ] {Criterion 2 — measurable}
### Current Prompt Analysis
{If existing prompt provided}
- **Strengths:** {what works}
- **Weaknesses:** {what fails or is ambiguous}
- **Missing:** {what's not specified}
### Variants
#### Variant A: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant B: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
#### Variant C: {Strategy Name}
{Complete prompt text}
**Hypothesis:** {Why this approach might work}
**Risk:** {What could go wrong}
### Evaluation Rubric
| Criterion | Weight | Scoring |
|-----------|--------|---------|
| {criterion} | {%} | {how to score: 0-3 scale or pass/fail} |
### Test Cases
| # | Input | Expected Output | Tests Criteria |
|---|-------|-----------------|---------------|
| 1 | {standard input} | {expected} | Correctness, Format |
| 2 | {edge case} | {expected} | Completeness |
| 3 | {adversarial} | {expected} | Robustness |
### Failure Modes to Monitor
- {Failure mode 1}: {detection method}
- {Failure mode 2}: {detection method}
### Recommended Next Steps
1. Run all variants against the test suite
2. Score using the rubric
3. Select the highest-scoring variant
4. Iterate on the winner with targeted improvements
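Steps 1-3 above could be sketched as a small harness like the one below. It assumes the caller supplies `run_model` (whatever function sends a prompt to the model and returns its text) and a `score` function (for example, a rubric-based scorer); both, along with `evaluate_variants`, `pick_winner`, and the `{input}` placeholder, are hypothetical names used only for this illustration.

```python
from typing import Callable


def evaluate_variants(
    variants: dict[str, str],
    test_cases: list[tuple[str, str]],
    run_model: Callable[[str], str],
    score: Callable[[str, str], float],
) -> dict[str, float]:
    """Run every variant on every test case and return the mean score per variant."""
    if not test_cases:
        raise ValueError("at least one test case is required")
    results: dict[str, float] = {}
    for name, template in variants.items():
        total = 0.0
        for case_input, expected in test_cases:
            output = run_model(template.replace("{input}", case_input))
            total += score(output, expected)
        results[name] = total / len(test_cases)
    return results


def pick_winner(results: dict[str, float]) -> str:
    """Return the name of the highest-scoring variant."""
    return max(results, key=results.get)


# Usage with a stand-in model so the sketch runs without any API access.
def fake_model(prompt: str) -> str:
    text = prompt.split(": ", 1)[1]
    return text.upper() if prompt.startswith("Return uppercase") else text


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


results = evaluate_variants(
    {"A": "Echo: {input}", "B": "Return uppercase: {input}"},
    [("hi", "HI"), ("ok", "OK")],
    fake_model,
    exact_match,
)
winner = pick_winner(results)
```

Passing the model call in as a parameter keeps the harness independent of any particular provider and makes it testable with a stub.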
| Problem                                               | Resolution                                                                                    |
| ----------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| No clear objective                                    | Ask the user to define what "good output" looks like with 2-3 examples.                       |
| Prompt is for a task LLMs are bad at (math, counting) | Flag the limitation. Suggest tool-augmented approaches or pre/post-processing.                |
| Too many variables to test                            | Focus on the highest-impact variable first. Iterative refinement beats combinatorial testing. |
| No existing prompt to analyze                         | Start with the simplest possible prompt. The first variant IS the baseline.                   |
| Output format requirements are strict                 | Use structured output mode (JSON mode, function calling) instead of prompt-only constraints.  |
Push back if: