Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

fusengine/prompt-testing

Name: prompt-testing
Author: fusengine

plugins/prompt-engineer/skills/prompt-testing/SKILL.md

npx skillsauth add fusengine/agents prompt-testing


Prompt Testing

Skill for testing, comparing, and measuring prompt performance.

References

metrics.md - Load when: defining or scoring Quality/Efficiency/Robustness/UX metrics with thresholds and calculation formulas
methodology.md - Load when: running a full A/B test (hypothesis, dataset sizing, statistical significance, common pitfalls)
templates.md - Load when: writing a test dataset JSON or an A/B test report

Testing Workflow

1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if existing)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations
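
The EXECUTE and ANALYZE steps above can be sketched in a few lines of Python. This is only an illustration: the `Case` shape, `run_variant`, `accuracy`, and the `fake_model` stand-in are hypothetical names, not part of the skill, and a real run would call an actual model and use the dataset format from templates.md.

```python
from typing import Callable

# Hypothetical test case shape: (input text, expected answer).
Case = tuple[str, str]


def run_variant(prompt: str, model: Callable[[str, str], str],
                dataset: list[Case]) -> list[bool]:
    """EXECUTE: run one prompt variant on every case and record pass/fail."""
    return [model(prompt, text).strip() == expected
            for text, expected in dataset]


def accuracy(results: list[bool]) -> float:
    """ANALYZE: Correct / Total."""
    return sum(results) / len(results)


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return text.upper() if "uppercase" in prompt else text


dataset = [("a", "A"), ("b", "B"), ("C", "C")]
results_a = run_variant("Echo the input.", fake_model, dataset)
results_b = run_variant("Return the input in uppercase.", fake_model, dataset)
print(f"A: {accuracy(results_a):.2f}  B: {accuracy(results_b):.2f}")
```

Keeping the model behind a plain callable means the same loop can run both variants on the same dataset, which is what makes the comparison fair.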

Performance Metrics

Quality

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |
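
As a rough sketch, the Quality calculations above map onto a few small Python helpers. The function names are hypothetical, and the sketch assumes that "Variance" means the population variance of per-run scores scaled to the 0-1 range; metrics.md may define it differently.

```python
from statistics import mean, pvariance


def ratio(hits: int, total: int) -> float:
    """Accuracy (Correct / Total) and Compliance (Compliant / Total)."""
    if total == 0:
        raise ValueError("total must be greater than zero")
    return hits / total


def consistency(run_scores: list[float]) -> float:
    """1 - Variance, assuming each run is scored between 0 and 1."""
    return 1 - pvariance(run_scores)


def relevance(ratings: list[int]) -> float:
    """Average of reviewer ratings on the 1-5 scale."""
    return mean(ratings)


print(ratio(18, 20))                  # accuracy for 18 correct out of 20
print(consistency([0.8, 0.8, 0.8]))   # identical runs have zero variance
print(relevance([4, 5, 3]))
```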

Efficiency

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Tokens Input | Prompt size | Token count |
| Tokens Output | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |
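
The Cost row (Tokens × Price) can be worked through as below. This sketch assumes input and output tokens are billed at separate per-1,000-token rates, which is common but not stated in the table; the function name and the prices are made up for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Tokens × Price, summed over the input and output sides of one request."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


# Illustrative prices only, not real provider rates.
cost = request_cost(input_tokens=1000, output_tokens=500,
                    input_price_per_1k=0.5, output_price_per_1k=2.0)
print(cost)
```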

Robustness

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resist | Bypass resistance | Blocked / Attempts |
| Error Recovery | Error recovery | Recovered / Errors |

For full definitions, thresholds, and the UX metrics category, see metrics.md. For the test dataset and report formats, see templates.md.

Commands

# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002

Decision Criteria

When to adopt variant B?

IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
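
The rule above could be encoded roughly as follows. Two readings are assumptions here rather than something the skill states: accuracy improvement is taken as a difference in percentage points (0.05 for 5%), and token regression as relative growth over variant A. `VariantStats` and `decide` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    accuracy: float             # fraction between 0 and 1
    tokens: float               # average tokens per request
    edge_case_pass_rate: float  # fraction between 0 and 1


def decide(a: VariantStats, b: VariantStats) -> str:
    gain = b.accuracy - a.accuracy                # assumption: percentage points
    token_growth = (b.tokens - a.tokens) / a.tokens
    no_edge_regression = b.edge_case_pass_rate >= a.edge_case_pass_rate

    if (b.accuracy >= a.accuracy
            and (b.tokens <= a.tokens * 1.1 or gain > 0.05)
            and no_edge_regression):
        return "adopt B"
    if gain > 0.10 and token_growth < 0.20:
        return "consider B"
    return "keep A or iterate"


baseline = VariantStats(accuracy=0.80, tokens=1000, edge_case_pass_rate=0.90)
print(decide(baseline, VariantStats(0.82, 1050, 0.90)))
print(decide(baseline, VariantStats(0.95, 1100, 0.80)))
print(decide(baseline, VariantStats(0.78, 900, 0.90)))
```

Under this reading, the "consider B" branch is only reached when B regresses on edge cases, because any accuracy gain above 10% with no edge-case regression already satisfies the first branch.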

Best Practices

Use at least 20 test cases for significance
Make edge cases 15-20% of the dataset
Run each test multiple times to check consistency
Document hypotheses before testing
Version the prompts being tested
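
Two of these practices can be checked mechanically. The sketch below uses a two-proportion z-test to ask whether an accuracy difference is likely to be noise, and a helper that checks the edge-case share. The choice of test and both function names are assumptions for illustration: methodology.md may prescribe a different procedure, and the normal approximation is rough at 20 cases per variant.

```python
import math


def two_proportion_p_value(correct_a: int, n_a: int,
                           correct_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two accuracy rates."""
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both variants are all-correct or all-wrong
    z = (correct_b / n_b - correct_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def edge_case_share_ok(edge_cases: int, total: int) -> bool:
    """True when edge cases make up 15-20% of the dataset."""
    return 0.15 <= edge_cases / total <= 0.20


# 14/20 vs 16/20 correct: a gap this small is not significant at 20 cases.
print(round(two_proportion_p_value(14, 20, 16, 20), 3))
# 10/20 vs 20/20 correct: a large gap clears the usual 0.05 threshold.
print(two_proportion_p_value(10, 20, 20, 20) < 0.05)
print(edge_case_share_ok(4, 20))
```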


A/B testing and performance metrics for prompts. Use when: comparing two prompt variants, defining quality/efficiency/robustness metrics, or deciding whether to adopt a challenger prompt over a baseline.

19 stars

testing

Updated Jul 16, 2026

npx skillsauth add fusengine/agents prompt-testing

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy - Container and dependency vulnerability scanner: Clean (95%)
Semgrep - Static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation: Clean (95%)
Snyk (dep) - Open source security scanning: Skipped (50%)
Socket.dev - Supply chain security analysis: Skipped (50%)
VirusTotal - Multi-engine malware detection: Skipped (50%)
CrowdStrike - Advanced threat intelligence: Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Jul 16, 2026, 2:48 AM · 138.9s · 6 files scanned

SKILL.md

name: prompt-testing
description: A/B testing and performance metrics for prompts. Use when: comparing two prompt variants, defining quality/efficiency/robustness metrics, or deciding whether to adopt a challenger prompt over a baseline.
allowed-tools: Read, Write, Bash

Related Skills

fusengine/ux-copy

testing

Copy self-audit and ban-lists: filler verbs/hype adjectives, slop placeholder names, fake-precise numbers, Title Case headlines, humor in error copy ('Oops!'), em-dash crutch, one copy register per page.

20 · SKILL.md · Updated Jul 15, 2026

fusengine/design-webapp

development

Logged-in web apps: dashboards, auth flows, settings, onboarding, data tables, command palettes, modals, toasts. Register `product`: density and glance-speed over marketing polish, no hero/CTA-tricks, every data surface covers empty/loading/error explicitly, tables and dataviz follow preattentive-processing rules.

20 · SKILL.md · Updated Jul 15, 2026

fusengine/design-web

development

Marketing sites, landing pages, campaign pages: register `brand` (design IS the product). Structure comes from the register's POV + a macrostructure pick, never from copying an inspiration site's section flow. Hero discipline, deviated section order, asymmetric grids, and a silhouette lookalike-test gate before ship.

20 · SKILL.md · Updated Jul 15, 2026

fusengine/design-system

development

Token-strategy core: OKLCH color rules, neutral tinting, accent-commitment levels, type scale, 8pt spacing grid, touch targets, and the canonical output format of design-system.md (the file the harness gates on). This is routing step 1 of design-method/SKILL.md: read it before design-web/design-webapp/design-ios/design-android, before picking or auditing a single color/type/spacing value.

20 · SKILL.md · Updated Jul 13, 2026

Install manually

# Clone the repo
git clone https://github.com/fusengine/agents.git

# Copy into Claude Code skills folder (global)
cp -r agents/plugins/prompt-engineer/skills/prompt-testing ~/.claude/skills/

Claude Code Skills: official documentation for the skills path.

Repository

fusengine/agents

19 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT