Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

jamesgray-ai/test

Name: test
Author: jamesgray-ai

.claude/skills/test/SKILL.md

npx skillsauth add jamesgray-ai/handsonai test


Test Workflow

Structured testing and evaluation of AI workflow artifacts. Walk the user through running their workflow against real scenarios, scoring output quality, diagnosing issues back to specific building blocks, and deciding whether the workflow is ready for deployment.

Workflow

1. Load context

Read the Building Block Spec (including Evaluation Criteria) to understand what was built, expected behavior, and how to evaluate. Identify the test scenarios and scoring dimensions defined during Design.

2. Quick smoke test

Run one representative input and check the result manually: does the workflow run end-to-end and produce something reasonable? This is a sanity check before systematic evaluation, meant to catch showstoppers early.

3. Run eval suite

Execute each test scenario from the Evaluation Criteria (defined in Design). For each scenario:

Run the workflow with the scenario's input
Score output on each eval dimension (1–5 scale)
Note specific issues with concrete examples

Guide the user through scoring with plain-language prompts:

"On accuracy, was this a 1 (not close), 3 (mostly right), or 5 (nailed it)?"
"On completeness, did it cover everything you'd expect? 1 means major gaps, 5 means nothing missing."
"On tone/style, does this sound like it came from you? 1 means completely off, 5 means indistinguishable."

Adapt the dimension names and prompts to match whatever eval dimensions were defined in the spec.
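The scoring loop above could be recorded with a small structure like the one below. This is a minimal sketch, not part of the skill itself: the names `ScenarioScore` and `record_score`, the example scenario, and the example dimensions are hypothetical. Only the 1–5 scale and the idea of noting concrete issues come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioScore:
    """Scores and issues for one test scenario (hypothetical structure)."""
    scenario: str
    scores: dict[str, int] = field(default_factory=dict)  # dimension -> score (1..5)
    issues: list[str] = field(default_factory=list)       # concrete examples of problems


def record_score(result: ScenarioScore, dimension: str, score: int, issue: str = "") -> None:
    """Store one dimension score, enforcing the 1-5 scale described in the text."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    result.scores[dimension] = score
    if issue:
        result.issues.append(f"{dimension}: {issue}")


# Illustrative data only; real scenarios and dimensions come from the spec.
example = ScenarioScore(scenario="refund request email")
record_score(example, "accuracy", 4)
record_score(example, "completeness", 2, "omitted the refund deadline")
```

Keeping the issue text next to the score makes it easier to cite a concrete example when diagnosing the problem later.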

4. Building block evals

Test individual skills and prompts in isolation — not just end-to-end. For each skill or prompt in the workflow:

Run it with a known input
Check: did this specific building block produce the right output?
Isolating components helps pinpoint where problems originate vs. where they cascade
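One way to picture an isolated building block check is a tiny harness that runs a single step on a known input and applies a pass/fail check. The sketch below assumes each building block can be called as a plain function from text to text, which may not match how a given skill or prompt is actually invoked; `eval_block` and `summarize_step` are hypothetical names.

```python
from typing import Callable


def eval_block(block: Callable[[str], str], known_input: str, check: Callable[[str], bool]) -> bool:
    """Run one building block on a known input and apply a pass/fail check to its output."""
    return check(block(known_input))


# Hypothetical stand-in for a single skill or prompt step: keep the first sentence.
def summarize_step(text: str) -> str:
    return text.split(".")[0].strip() + "."


passed = eval_block(
    summarize_step,
    "Order 1042 shipped late. The customer asked for a refund.",
    check=lambda out: "1042" in out,  # the order number must survive this step
)
```

If a block passes in isolation but the end-to-end output is still wrong, the problem more likely sits in an earlier step or in how the steps are chained.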

5. Establish baseline

Record the eval scores as the reference point for future regression testing in Step 7 (Improve) of the Business-First AI Framework. This baseline captures:

Scores per scenario per dimension
Overall averages
Known limitations and accepted tradeoffs
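The baseline described above might be summarized as follows. This is a sketch under assumptions: scores are held as a nested mapping of scenario to dimension to score, averages are plain arithmetic means, and the function names `build_baseline` and `regressions` are hypothetical. The skill does not prescribe a storage format.

```python
from statistics import mean


def build_baseline(scores: dict[str, dict[str, int]], known_limitations: tuple[str, ...] = ()) -> dict:
    """Summarize scenario -> dimension -> score into a baseline record."""
    dimensions = sorted({dim for per_dim in scores.values() for dim in per_dim})
    per_dimension = {
        dim: mean(s[dim] for s in scores.values() if dim in s) for dim in dimensions
    }
    overall = mean(v for s in scores.values() for v in s.values())
    return {
        "scores": scores,
        "per_dimension": per_dimension,
        "overall": overall,
        "known_limitations": list(known_limitations),
    }


def regressions(baseline: dict, current: dict, tolerance: float = 0.0) -> list[str]:
    """Return the dimensions whose average fell below the baseline by more than the tolerance."""
    return [
        dim
        for dim, base in baseline["per_dimension"].items()
        if current["per_dimension"].get(dim, 0) < base - tolerance
    ]


# Illustrative numbers only.
baseline = build_baseline(
    {
        "scenario A": {"accuracy": 4, "completeness": 3},
        "scenario B": {"accuracy": 5, "completeness": 4},
    },
    known_limitations=("long inputs are truncated",),
)
```

A later test round can be passed through `build_baseline` in the same way and compared with `regressions(baseline, later_round)` to see which dimensions slipped.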

6. Diagnose issues

For each problem identified in the eval, map it to the building block that needs adjusting:

| Symptom | Building Block to Adjust |
|---------|--------------------------|
| Generic output | Add more Context (examples, style guides, reference materials) |
| Steps skipped or misunderstood | Refine the Prompt (more explicit instructions) |
| Missing expertise | Build a Skill for that step (codify domain knowledge) |
| Unpredictable decisions | Convert to Agent (let AI plan its approach) |
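The symptom table can also be read as a simple lookup. The sketch below copies the four rows from the table; the names `DIAGNOSIS` and `diagnose`, and the fallback message for unlisted symptoms, are assumptions added for illustration.

```python
# Symptom -> suggested adjustment, copied from the table above.
DIAGNOSIS = {
    "generic output": "Add more Context (examples, style guides, reference materials)",
    "steps skipped or misunderstood": "Refine the Prompt (more explicit instructions)",
    "missing expertise": "Build a Skill for that step (codify domain knowledge)",
    "unpredictable decisions": "Convert to Agent (let AI plan its approach)",
}


def diagnose(symptom: str) -> str:
    """Look up the suggested adjustment; unlisted symptoms need manual review."""
    return DIAGNOSIS.get(symptom.strip().lower(), "No mapping in the table; review manually")
```

In practice a symptom rarely arrives as an exact string, so a lookup like this is best treated as a checklist aid rather than an automatic classifier.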

7. Readiness decision

Based on eval scores across all scenarios:

Ready: scores meet the minimum quality bar defined in the spec → proceed to framework Step 6: Run
Not ready: document the specific adjustments needed, return to framework Step 4: Build, then re-test
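The readiness decision could be sketched as a threshold check. The text only says the spec defines a minimum quality bar; treating that bar as a single minimum score applied to every scenario and dimension is an assumption made here, and the function name `readiness` is hypothetical.

```python
def readiness(scores: dict[str, dict[str, int]], minimum: int) -> tuple[str, list[str]]:
    """Return the decision plus every scenario/dimension pair that falls below the bar."""
    below = [
        f"{scenario} / {dim}: {score}"
        for scenario, per_dim in scores.items()
        for dim, score in per_dim.items()
        if score < minimum
    ]
    return ("Ready" if not below else "Not ready", below)


# Illustrative scores only; the real minimum comes from the spec.
decision, gaps = readiness(
    {
        "scenario A": {"accuracy": 4, "tone": 2},
        "scenario B": {"accuracy": 5, "tone": 4},
    },
    minimum=3,
)
```

The list of gaps doubles as the starting point for the adjustments to document before returning to the Build step.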

Output

Write results to outputs/[workflow-name]-test-results.md.

Include an eval scorecard with this format:

Scenarios tested — list each scenario with its input description
Scores per dimension — table of scenario × dimension scores (1–5)
Issues identified — specific problems with concrete examples and diagnosed building block
Baseline established — summary scores to use as regression reference in Step 7 (Improve) of the framework
Overall readiness assessment — Ready or Not Ready, with rationale
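As an illustration of the scorecard layout, the sketch below renders the scenario × dimension table and the readiness line as markdown, and builds the output path named above. It covers only part of the scorecard (the scenario list, issues, and baseline sections are left out), and `results_path` and `render_scorecard` are hypothetical helper names.

```python
from pathlib import Path


def results_path(workflow_name: str, root: Path = Path("outputs")) -> Path:
    """Build outputs/[workflow-name]-test-results.md for a given workflow name."""
    return root / f"{workflow_name}-test-results.md"


def render_scorecard(scores: dict[str, dict[str, int]], readiness: str, rationale: str) -> str:
    """Render the scenario x dimension score table and the readiness line as markdown."""
    dims = sorted({d for per_dim in scores.values() for d in per_dim})
    lines = [
        "| Scenario | " + " | ".join(dims) + " |",
        "|---" * (len(dims) + 1) + "|",
    ]
    for scenario, per_dim in scores.items():
        cells = [str(per_dim.get(d, "")) for d in dims]  # blank cell if a dimension was not scored
        lines.append(f"| {scenario} | " + " | ".join(cells) + " |")
    lines.append("")
    lines.append(f"Overall readiness assessment: {readiness}. {rationale}")
    return "\n".join(lines)


# Illustrative data only.
text = render_scorecard({"A": {"accuracy": 4, "tone": 3}}, "Ready", "All scores meet the bar.")
```

Writing the file is then a matter of creating the `outputs` folder if needed and calling `results_path(name).write_text(text)`.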

Guidelines

Two to four testing iterations are normal before reaching readiness. Issues in the first round are expected, not a sign of failure.
Use plain-language scoring guidance. Never say "write an eval" — instead say "rate your output across real scenarios."
Keep the user focused on concrete examples, not abstract quality judgments. "Show me the sentence that's wrong" beats "was it good?"
If no Evaluation Criteria were defined in the spec, help the user create them now — but note this as a gap to fix in the Design step for future workflows.


Guide structured testing of AI workflow artifacts, evaluate output quality, identify which building blocks need adjustment, and determine readiness for deployment. Use when the user has built workflow artifacts and needs to test them. This is Step 5 (Test) of the Business-First AI Framework.

3 stars

development

Updated Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add jamesgray-ai/handsonai test

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 15, 2026, 2:05 PM · 25.8s · 1 file scanned

SKILL.md

name: test
description: >
user-invocable: true


Related Skills

jamesgray-ai/writing-workflow-sops

documentation

Verified · Trusted · Community

Write Standard Operating Procedure documentation for workflows and save as markdown files. Selects full or lightweight SOP template based on autonomy level (deterministic vs. guided/autonomous), then adapts for workflow type (Manual, Augmented, Automated). Use when the user asks to write an SOP, document a workflow, create procedure documentation, or capture how a workflow is executed. Triggers on "write an SOP", "document this workflow", "create operating instructions", "how is this workflow executed".

3 · SKILL.md · Updated Apr 15, 2026

jamesgray-ai/writing-process-guides

documentation

Verified · Trusted · Community

Write Business Process Guide documentation that explains when, why, and how to execute a complete business process with its component workflows, and save as markdown files. Use when documenting a business process end-to-end, creating playbooks, or explaining how multiple workflows fit together. Triggers on "write process guide", "document this process", "create a playbook for", "how do these workflows connect".

3 · SKILL.md · Updated Apr 15, 2026

jamesgray-ai/syncing-skills-to-github

development

Verified · Trusted · Community

This skill should be used when the user wants to sync skills to GitHub, push skill changes to a remote repository, or back up local skills. Syncs Claude Agent Skills from ~/.claude/skills/ (local) to GitHub repository using git commands. Commits changes, pushes to remote, and updates Notion AI Building Blocks with GitHub URLs.

3 · SKILL.md · Updated Apr 15, 2026

jamesgray-ai/registering-building-blocks

development

Verified · Trusted · Community

This skill should be used when the user wants to register or update AI building blocks (Skills, Agents, Prompts, Context MDs) in the Notion AI Building Blocks database. Triggers after skill creation, agent creation, prompt authoring, context MD updates, or when the user asks to register, add, or track a building block in Notion.

3 · SKILL.md · Updated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/jamesgray-ai/handsonai.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r handsonai/.claude/skills/test ~/.claude/skills/


Repository

jamesgray-ai/handsonai

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT