pkuppens/skill-benchmark

Name: skill-benchmark
Author: pkuppens

skills/validation/skill-benchmark/SKILL.md

npx skillsauth add pkuppens/pkuppens skill-benchmark

Skill Benchmark

Measures the real-world impact of a skill by comparing agent output with and without skill activation.

When to use:

- Validating that a new or revised skill improves output quality
- Creating evidence for a skill-related pull request or issue
- Regression-testing a skill after edits
- Demonstrating AI skill value to stakeholders

Instructions:

1. Select a benchmark task:
   - Choose a concrete, reproducible task relevant to the skill under test
   - Example tasks:
     - Create a GitHub issue for adding BM25 retrieval to on_prem_rag
     - Write an ADR for choosing FastAPI over Flask
     - Produce a deployment checklist for on_prem_rag 0.9.0
   - Record the task prompt in tmp/skills/benchmark/<skill-name>/task.md
2. Run baseline (without skill):
   - Invoke the agent with the task prompt only, with no skill context
   - Save the full output to tmp/skills/benchmark/<skill-name>/output-baseline.md
   - Do NOT provide any skill text, examples, or hints
3. Run with skill activated:
   - Invoke the agent with the task prompt + skill name/trigger
   - Save the full output to tmp/skills/benchmark/<skill-name>/output-with-skill.md
4. Score both outputs on 4 dimensions (1–5 scale each):

   | Dimension | Description |
   |-----------|-------------|
   | Coverage | Did the output address all required sub-tasks? |
   | Specificity | Are commands, formats, and values concrete (not vague)? |
   | Correctness | Is the output factually and technically correct? |
   | Completeness | Could an engineer act on this output without further clarification? |

5. Produce comparison table:

   | Dimension | Without Skill | With Skill | Delta |
   |-----------|--------------|------------|-------|
   | Coverage | /5 | /5 | +N |
   | Specificity | /5 | /5 | +N |
   | Correctness | /5 | /5 | +N |
   | Completeness | /5 | /5 | +N |
   | Total | /20 | /20 | +N |

6. Write benchmark report in tmp/skills/benchmark/<skill-name>/report.md:
   - Task prompt used
   - Scoring rationale for each dimension
   - Key qualitative differences (specific examples from outputs)
   - Verdict: skill improves output (delta ≥ +4) / marginal / no measurable improvement
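
The scoring and comparison steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names are hypothetical, and only the `delta ≥ +4` threshold comes from the instructions. The boundary between "marginal" and "no measurable improvement" is not defined in the skill text, so the sketch assumes a total delta of +1 to +3 counts as marginal.

```python
DIMENSIONS = ("Coverage", "Specificity", "Correctness", "Completeness")


def validate(scores):
    """Check that every dimension has an integer score from 1 to 5."""
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")


def verdict(total_delta):
    """Map the total delta to a verdict label."""
    # Only the +4 threshold is stated in the skill text. Treating +1..+3 as
    # "marginal" and anything lower as "no measurable improvement" is an assumption.
    if total_delta >= 4:
        return "skill improves output"
    if total_delta >= 1:
        return "marginal"
    return "no measurable improvement"


def comparison_table(baseline, with_skill):
    """Return (markdown_table, total_delta) for two score dictionaries."""
    validate(baseline)
    validate(with_skill)
    rows = [
        "| Dimension | Without Skill | With Skill | Delta |",
        "|-----------|--------------|------------|-------|",
    ]
    for dim in DIMENSIONS:
        delta = with_skill[dim] - baseline[dim]
        rows.append(f"| {dim} | {baseline[dim]}/5 | {with_skill[dim]}/5 | {delta:+d} |")
    total_without = sum(baseline[dim] for dim in DIMENSIONS)
    total_with = sum(with_skill[dim] for dim in DIMENSIONS)
    total_delta = total_with - total_without
    rows.append(f"| Total | {total_without}/20 | {total_with}/20 | {total_delta:+d} |")
    return "\n".join(rows), total_delta


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not real benchmark results.
    baseline = {"Coverage": 3, "Specificity": 2, "Correctness": 4, "Completeness": 2}
    with_skill = {"Coverage": 4, "Specificity": 4, "Correctness": 4, "Completeness": 4}
    table, total_delta = comparison_table(baseline, with_skill)
    print(table)
    print("Verdict:", verdict(total_delta))
```

Keeping the per-dimension scores in a dictionary makes it easy to reject a run where a dimension was skipped or scored outside the 1–5 range before any delta is reported.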

Output format:

## Skill Benchmark — <skill-name>
- Task: <one-line description>
- Without skill total: <N>/20
- With skill total: <N>/20
- Delta: +<N>
- Verdict: significant improvement | marginal | no improvement
- Key observation: <one sentence on the most important qualitative difference>
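
As a rough sketch, the summary block above could be rendered from the two totals with a small helper. The function name and parameters are hypothetical and not part of the skill. The verdict is passed in rather than computed, because the template lists three labels without defining thresholds for each.

```python
ALLOWED_VERDICTS = ("significant improvement", "marginal", "no improvement")


def format_summary(skill_name, task, without_total, with_total, verdict, key_observation):
    """Render the benchmark summary in the output format described above."""
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}, got {verdict!r}")
    for total in (without_total, with_total):
        # Four dimensions scored 1-5 each give totals between 4 and 20.
        if not 4 <= total <= 20:
            raise ValueError(f"totals must be between 4 and 20, got {total}")
    delta = with_total - without_total
    lines = [
        # \u2014 is the dash used in the template heading.
        f"## Skill Benchmark \u2014 {skill_name}",
        f"- Task: {task}",
        f"- Without skill total: {without_total}/20",
        f"- With skill total: {with_total}/20",
        f"- Delta: {delta:+d}",
        f"- Verdict: {verdict}",
        f"- Key observation: {key_observation}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only; not real benchmark results.
    print(format_summary(
        skill_name="<skill-name>",
        task="<one-line description>",
        without_total=11,
        with_total=16,
        verdict="significant improvement",
        key_observation="<one sentence on the most important qualitative difference>",
    ))
```

Formatting the delta with an explicit sign keeps a regression (for example `-2`) visible instead of silently printing it as a positive-looking number.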

Benchmark task library: Reusable prompts live under docs/skills/benchmark/tasks/ in this repo (versioned). Local copies may also sit in tmp/skills/benchmark/tasks/ during runs.

- issue-creation.md: create a GitHub issue for a feature
- adr-authoring.md: write an ADR for a technology choice
- deployment-checklist.md: produce a deployment checklist for a release

Committed reports (examples / evidence): docs/skills/benchmark/<skill-name>/report.md. Session transcripts can remain in tmp/skills/benchmark/<skill-name>/.
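
The file locations named in the instructions can be collected in one place so that both runs write to predictable paths. The helper below is a sketch under the assumption that scripts run from the repository root; `run_paths` and `save_output` are hypothetical names, and only the directory and file names come from the skill text.

```python
from pathlib import Path


def run_paths(skill_name, root=Path(".")):
    """Return the scratch and committed file locations for one skill benchmark."""
    scratch = root / "tmp" / "skills" / "benchmark" / skill_name
    committed = root / "docs" / "skills" / "benchmark" / skill_name
    return {
        "task": scratch / "task.md",
        "baseline": scratch / "output-baseline.md",
        "with_skill": scratch / "output-with-skill.md",
        "scratch_report": scratch / "report.md",
        "committed_report": committed / "report.md",
    }


def save_output(path, text):
    """Write text to path, creating parent directories when needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    for label, path in run_paths("skill-benchmark").items():
        print(f"{label}: {path.as_posix()}")
```

Separating the scratch report from the committed report mirrors the split between `tmp/` for session artifacts and `docs/` for versioned evidence.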

Anti-patterns:

- Using the same session for both runs (context contamination)
- Choosing tasks where the skill is explicitly referenced in the prompt
- Scoring on style rather than actionability
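
The second anti-pattern can be caught mechanically before a run. The check below is an illustrative sketch, not part of the skill: `prompt_mentions_skill` is a hypothetical helper that reports whether a task prompt names the skill or one of its triggers, treating hyphens, underscores, and spaces as equivalent. It cannot detect the other two anti-patterns, which depend on how the sessions are run and scored.

```python
import re


def _normalize(text):
    """Lowercase and collapse hyphens, underscores, and whitespace to single spaces."""
    return re.sub(r"[\s_\-]+", " ", text.lower()).strip()


def prompt_mentions_skill(prompt, skill_name, triggers=()):
    """Return the skill name or triggers that appear as whole words in the prompt."""
    haystack = _normalize(prompt)
    found = []
    for needle in (skill_name, *triggers):
        normalized = _normalize(needle)
        if not normalized:
            continue
        # Word boundaries avoid flagging short names that sit inside longer words.
        if re.search(r"\b" + re.escape(normalized) + r"\b", haystack):
            found.append(needle)
    return found


if __name__ == "__main__":
    clean = "Write an ADR for choosing FastAPI over Flask"
    leaky = "Use the adr_authoring skill to write an ADR for choosing FastAPI over Flask"
    print(prompt_mentions_skill(clean, "adr-authoring"))
    print(prompt_mentions_skill(leaky, "adr-authoring"))
```

An empty result means the prompt is safe to reuse for the baseline run; any match suggests rewording the task so the skill is not hinted at.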

Integration: Output feeds architecture-risks-debt if skills have known gaps. Store runnable evidence under docs/skills/benchmark/ (or tmp/skills/benchmark/ for scratch). See COOPERATION.md.

Evaluates skill effectiveness by comparing agent output on a task with and without skills activated. Use when validating that a skill improves output quality, when creating evidence for a skill PR, or when regression-testing skill changes.

testing

Updated May 15, 2026

npx skillsauth add pkuppens/pkuppens skill-benchmark

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 15, 2026, 5:43 AM (54.4s, 1 file scanned)

SKILL.md

name: skill-benchmark
description: Evaluates skill effectiveness by comparing agent output on a task with and without skills activated. Use when validating that a skill improves output quality, when creating evidence for a skill PR, or when regression-testing skill changes.

Related Skills

pkuppens/sync-branch

testing

Syncs remote default branch locally (checkout, fetch --prune, pull) and returns to the previous branch when it still exists. Reports stashes and worktrees not yet handled. Use when the user asks to sync main, update default branch, fetch/pull origin, or run /sync-branch.

SKILL.md, updated Jun 6, 2026

pkuppens/azure-devops-work-items

tools

Creates, queries, updates, and links Azure Boards work items via az boards CLI. Use when filing ADO work items, running WIQL queries, or setting area path, iteration, tags, and assignee.

SKILL.md, updated May 29, 2026

pkuppens/azure-devops-repos

tools

Creates, reviews, and completes Azure Repos pull requests and branch policies via az repos CLI. Use when opening ADO PRs, setting required reviewers, or configuring build validation policies.

SKILL.md, updated May 29, 2026

pkuppens/azure-devops-pipelines

development

Guides Azure Pipelines YAML structure, build validation on PRs, and staged deployment with environments and approvals. Use when authoring azure-pipelines.yml or configuring CI/CD on Azure DevOps.

SKILL.md, updated May 29, 2026

Install manually

# Clone the repo
git clone https://github.com/pkuppens/pkuppens.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r pkuppens/skills/validation/skill-benchmark ~/.claude/skills/

Repository

pkuppens/pkuppens

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT