plugins/prompt-engineer/skills/prompt-testing/SKILL.md
A/B testing and performance metrics for prompts
Skill for testing, comparing, and measuring prompt performance.
1. DEFINE
└── Test objective
└── Metrics to measure
└── Success criteria
2. PREPARE
└── Variants A and B
└── Test dataset
└── Baseline (if existing)
3. EXECUTE
└── Run on dataset
└── Collect results
└── Document observations
4. ANALYZE
└── Calculate metrics
└── Compare variants
└── Identify patterns
5. DECIDE
└── Recommendation
└── Statistical confidence
└── Next iterations
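
The "Statistical confidence" item in step 5 can be backed by a significance test. Because both variants run on the same dataset, the per-case pass/fail results are paired, so one reasonable choice is an exact McNemar test on the cases where A and B disagree. The sketch below is an assumption about how to quantify confidence, not something this skill prescribes; the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(results_a: list[bool], results_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results.

    results_a[i] and results_b[i] must refer to the same test case.
    """
    if len(results_a) != len(results_b):
        raise ValueError("both variants must be scored on the same cases")

    # Only discordant cases (one variant passes, the other fails) carry information.
    b_only = sum(1 for a, b in zip(results_a, results_b) if b and not a)
    a_only = sum(1 for a, b in zip(results_a, results_b) if a and not b)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the variants never disagree, so there is no evidence either way

    # Under the null hypothesis, each discordant case favours A or B with probability 0.5.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value suggests the difference between A and B is unlikely to be noise. How p-values map onto the High / Medium / Low confidence labels is a team decision; the skill does not define thresholds.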
| Metric | Description | Calculation |
|--------|-------------|-------------|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Tokens Input | Prompt size | Token count |
| Tokens Output | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resist | Bypass resistance | Blocked / Attempts |
| Error Recovery | Error recovery | Recovered / Errors |
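
The ratio-style metrics in the tables above reduce to simple counts over per-case results. The following is a minimal sketch of that arithmetic. The `CaseResult` record, the separate input and output token prices, and the reading of "1 - Variance" as the population variance of scores in [0, 1] from repeated runs of one case are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class CaseResult:
    """Hypothetical record of one test case run against one prompt variant."""
    correct: bool       # response matched the expected output
    compliant: bool     # response followed the required format
    input_tokens: int
    output_tokens: int
    latency_ms: float


def summarize(results: list[CaseResult], price_in: float, price_out: float) -> dict:
    """Aggregate accuracy, compliance, latency, and cost over a list of results."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    # Cost = Tokens x Price, applied separately to input and output tokens (assumption).
    cost = sum(r.input_tokens * price_in + r.output_tokens * price_out for r in results)
    return {
        "accuracy": sum(r.correct for r in results) / total,      # Correct / Total
        "compliance": sum(r.compliant for r in results) / total,  # Compliant / Total
        "avg_latency_ms": sum(r.latency_ms for r in results) / total,
        "avg_cost": cost / total,
    }


def consistency(scores: list[float]) -> float:
    """1 - Variance, with scores in [0, 1] from repeated runs of the same case."""
    return 1.0 - pvariance(scores)
```

Running `summarize` once per variant gives the A and B columns of a comparison; the delta is then a plain subtraction.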
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
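
Before running a test, it can help to check that the dataset is well formed and to group cases by `type`. The sketch below assumes the five fields shown in the example are all required and that `id` values must be unique; the source does not state either rule, and the helper names `load_dataset` and `group_by_type` are hypothetical.

```python
import json

# Assumption: every case carries the five fields shown in the example dataset.
REQUIRED_KEYS = {"id", "type", "input", "expected", "tags"}


def load_dataset(text: str) -> dict:
    """Parse a dataset from JSON text and validate its cases."""
    data = json.loads(text)
    cases = data.get("cases")
    if not isinstance(cases, list) or not cases:
        raise ValueError("dataset needs a non-empty 'cases' list")

    seen = set()
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
        if case["id"] in seen:
            raise ValueError(f"duplicate case id: {case['id']}")
        seen.add(case["id"])
    return data


def group_by_type(cases: list[dict]) -> dict[str, list[dict]]:
    """Group cases by their 'type' field, e.g. 'standard' or 'edge_case'."""
    groups: dict[str, list[dict]] = {}
    for case in cases:
        groups.setdefault(case["type"], []).append(case)
    return groups
```

Grouping by type makes it straightforward to report separate scores for standard and edge cases.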
# A/B Test Report: {{TEST_NAME}}
## Configuration
| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |
## Tested Variants
### Variant A (Baseline)
[Description or link to prompt A]
### Variant B (Challenger)
[Description or link to prompt B]
## Results
### Overall Scores
| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |
### Detail by Case Type
| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |
### Problematic Cases
| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |
## Analysis
### B's Strengths
- [Improvement 1]
- [Improvement 2]
### B's Weaknesses
- [Regression 1]
### Observations
[Qualitative insights]
## Recommendation
**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]
## Next Steps
1. [Action 1]
2. [Action 2]
# Create a test
/prompt test create --name "Test v1" --dataset tests.json
# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json
# View results
/prompt test results --id test_001
# Compare two tests
/prompt test compare --tests test_001,test_002
IF:
  - Accuracy B >= Accuracy A
    AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
    AND no regression on edge cases
THEN:
  → Adopt B
ELSE IF:
  - Accuracy improvement > 10%
    AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)
ELSE:
  → Keep A or iterate
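
The decision rule above can be written as a small function. This is one possible reading, not the skill's own implementation: it assumes accuracy is given in percent (0 to 100), treats "accuracy improvement" as a difference in percentage points, and treats "token regression" as the relative growth in tokens from A to B. The function name `decide` and its return labels are hypothetical.

```python
def decide(acc_a: float, acc_b: float,
           tokens_a: float, tokens_b: float,
           edge_regression: bool) -> str:
    """Apply the adopt / consider / keep rule to two variants.

    acc_a, acc_b: accuracy in percent (assumption).
    tokens_a, tokens_b: average tokens per request; tokens_a must be > 0.
    edge_regression: True if B is worse than A on edge cases.
    """
    acc_gain = acc_b - acc_a                          # percentage points (assumption)
    token_growth = (tokens_b - tokens_a) / tokens_a   # relative growth, 0.15 means +15%

    # Branch 1: B is at least as accurate, affordable or clearly better, no edge regression.
    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or acc_gain > 5)
            and not edge_regression):
        return "adopt_b"

    # Branch 2: a large accuracy gain can justify a moderate token increase.
    if acc_gain > 10 and token_growth < 0.20:
        return "consider_b"

    return "keep_a_or_iterate"
```

Under this reading, the second branch is reached only when B regresses on edge cases, since any accuracy gain above 10 points already satisfies the first two conditions of the first branch.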