jungho-git/benchmark

Name: benchmark
Author: jungho-git

public/claude/skills/quality/benchmark/SKILL.md

npx skillsauth add jungho-git/jllm benchmark

Benchmark

Shared examples and formatting reference: references/EXAMPLE.md.
Keep new guidance, snippets, and edits aligned with that file.

Scope

Use this rule when:

comparing two or more agent configurations
measuring token, cost, latency, cache, or tool-use impact
choosing between models for a fixed task
validating prompt or system-instruction changes
checking whether a tool, context, or cache option changes performance

Benchmarking should isolate one variable and report measured evidence.

Fixed Task Contract

Keep the task, input artifacts, allowed tools, and success criteria fixed.
Change one variable per comparison.
Run enough samples to reduce single-run noise.
Record cache state for each run.
If runs differ in tool-call count or task completion quality, treat the task surface as changed and do not compare them directly.
Do not compare costs across different task definitions.

Without a fixed task, token and latency deltas are not meaningful.
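The one-variable rule above can be checked mechanically before any run starts. The following is a minimal sketch, assuming each benchmark arm is described by a small configuration record. `RunConfig` and its field names are hypothetical and not part of the skill.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one benchmark arm (field names are illustrative)."""
    model: str
    system_prompt: str
    tools: tuple[str, ...]
    cache_enabled: bool


def changed_variables(baseline: RunConfig, variant: RunConfig) -> list[str]:
    """Return the names of the fields that differ between the two arms."""
    base, var = asdict(baseline), asdict(variant)
    return [name for name in base if base[name] != var[name]]


def require_single_variable(baseline: RunConfig, variant: RunConfig) -> str:
    """Return the single changed field, or raise if zero or several fields changed."""
    changed = changed_variables(baseline, variant)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed variable, got {changed}")
    return changed[0]
```

Calling `require_single_variable(baseline, variant)` before the first run gives an early failure when a comparison accidentally changes, for example, both the model and the tool list.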

What To Measure

input tokens
output tokens
cache creation tokens
cache read tokens
tool-call count and tool latency
time to first token when streaming
total wall time
success/failure outcome
answer quality or pass/fail score
price table or pricing date used for cost calculation

Report raw usage before derived cost.
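One way to keep these measurements together is a per-run record that stores raw usage as-is and leaves cost to be derived later. This is a sketch only: `RunRecord`, its field names, and the `"cold"`/`"warm"` labels are assumptions for illustration, not a format the skill defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record covering the measurements listed above."""
    run_id: str
    arm: str                       # "baseline" or "variant"
    cache_state: str               # "cold" or "warm"
    input_tokens: int
    output_tokens: int
    cache_creation_tokens: int
    cache_read_tokens: int
    tool_calls: int
    tool_latency_s: float
    ttft_s: Optional[float]        # None when the run was not streamed
    wall_time_s: float
    succeeded: bool
    quality_score: Optional[float]
    pricing_date: Optional[str]    # None when pricing was unavailable


def raw_usage(record: RunRecord) -> dict[str, int]:
    """Return only the raw token counts, so they can be reported before any cost."""
    return {
        "input_tokens": record.input_tokens,
        "output_tokens": record.output_tokens,
        "cache_creation_tokens": record.cache_creation_tokens,
        "cache_read_tokens": record.cache_read_tokens,
    }
```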

Protocol

Define baseline, variant, task, inputs, and success criteria.
Run a cold baseline and cold variant when cache matters.
Run warm passes when prompt caching is part of the benchmark.
Keep environment and network assumptions stable.
Capture usage objects, timestamps, and command or harness version when available.
Calculate per-run and aggregate deltas.
Mark invalid runs when the model used different tools, failed the task, or changed output requirements.

Prefer a small reliable benchmark over broad anecdotal comparison.
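The delta and invalid-run steps of the protocol can be sketched as follows. The run shape (a dict with a metric value and a `valid` flag) and the choice of the median as the aggregate are assumptions made for this example. Invalid runs are excluded from the aggregate but still counted, so they stay visible in the report.

```python
from statistics import median


def aggregate_delta(baseline_runs: list[dict], variant_runs: list[dict], metric: str) -> dict:
    """Compare the median of `metric` across valid runs and count excluded runs."""
    base_vals = [r[metric] for r in baseline_runs if r["valid"]]
    var_vals = [r[metric] for r in variant_runs if r["valid"]]
    if not base_vals or not var_vals:
        raise ValueError("each arm needs at least one valid run")
    base_med, var_med = median(base_vals), median(var_vals)
    delta = var_med - base_med
    return {
        "metric": metric,
        "baseline_median": base_med,
        "variant_median": var_med,
        "delta": delta,
        # Percentage change is undefined when the baseline median is zero.
        "delta_pct": (delta / base_med * 100) if base_med else None,
        "excluded_baseline": len(baseline_runs) - len(base_vals),
        "excluded_variant": len(variant_runs) - len(var_vals),
    }
```

The median is used here because it is less sensitive to a single slow or token-heavy run than the mean. With only a handful of samples, reporting the per-run values alongside it remains necessary.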

Cost Rules

Include token counts and pricing inputs.
Separate cold cost, warm steady-state cost, and amortized break-even when cache is involved.
Do not conflate cache creation with steady-state read cost.
Report time to first token (TTFT) and total time separately for streaming experiences.
State when pricing was unavailable or estimated.

Cost claims must be reproducible from reported numbers.
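As an illustration of reproducible cost arithmetic, the sketch below derives cost from raw token counts and an explicit price table, and computes a break-even point for caching. The `PriceTable` type, the usage key names, and the assumption that `input_tokens` excludes cached tokens are all hypothetical. Real prices must come from the provider's price table for the stated pricing date.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceTable:
    """Hypothetical price table; prices are per one million tokens."""
    pricing_date: str
    input_per_mtok: float
    output_per_mtok: float
    cache_write_per_mtok: float
    cache_read_per_mtok: float


def run_cost(usage: dict, prices: PriceTable) -> float:
    """Derive the cost of one run from its raw token counts."""
    total = (
        usage.get("input_tokens", 0) * prices.input_per_mtok
        + usage.get("output_tokens", 0) * prices.output_per_mtok
        + usage.get("cache_creation_tokens", 0) * prices.cache_write_per_mtok
        + usage.get("cache_read_tokens", 0) * prices.cache_read_per_mtok
    )
    return total / 1_000_000


def break_even_warm_runs(uncached: float, cold: float, warm: float) -> Optional[int]:
    """Smallest n with cold + n * warm <= (n + 1) * uncached, or None if never."""
    if cold <= uncached:
        return 0
    saving_per_warm_run = uncached - warm
    if saving_per_warm_run <= 0:
        return None
    return math.ceil((cold - uncached) / saving_per_warm_run)
```

Keeping the cold cost (cache creation), the warm steady-state cost (cache reads), and the break-even count as three separate numbers avoids folding the one-time write cost into the steady-state figure.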

Output Shape

benchmark name
fixed task and input
compared variable
run table
aggregate token, cost, and latency deltas
validity notes
decision and trade-off
residual uncertainty

Keep conclusions tied to measured data.
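A report in this shape can be assembled with a small helper that refuses to render when a section is missing. This is a sketch under the assumption that each section is supplied as pre-formatted text. The function name and the plain-text layout are illustrative only.

```python
REQUIRED_SECTIONS = (
    "benchmark name",
    "fixed task and input",
    "compared variable",
    "run table",
    "aggregate token, cost, and latency deltas",
    "validity notes",
    "decision and trade-off",
    "residual uncertainty",
)


def render_report(sections: dict[str, str]) -> str:
    """Render all required sections in order; raise if any is missing or empty."""
    missing = [name for name in REQUIRED_SECTIONS if not sections.get(name, "").strip()]
    if missing:
        raise ValueError("report is missing sections: " + "; ".join(missing))
    blocks = [f"{name.capitalize()}\n{sections[name].strip()}" for name in REQUIRED_SECTIONS]
    return "\n\n".join(blocks)
```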

Boundaries

Do not use a single run as a final result unless the user explicitly accepts a smoke check.
Do not report only cost without usage.
Do not hide failed or invalid runs.
Do not change model, prompt, tools, and context at the same time.
Do not claim quality parity without a stated quality check.

Agent benchmark workflow for comparing token usage, latency, cost, cache behavior, model choice, prompt variants, tool availability, and configuration changes with fixed tasks and measured evidence.

1 star

tools

Updated May 14, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add jungho-git/jllm benchmark

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: May 13, 2026, 7:19 AM. 156.3s. 3 files scanned.

Related Skills

jungho-git/workflow (testing)
Verified, Trusted, Community
Required phase order for non-trivial tasks: Plan, Explore, Implement, Verify, Finalize. Use for multi-step work, scoped exploration, re-planning, validation, and final synthesis.
1, SKILL.md, Updated May 19, 2026

jungho-git/response (development)
Verified, Trusted, Community
Final response format: Korean-first, concise Process / Checks / Issues / Updates, optional Usage, with only actual changes, actual validation, real blockers, changed files, and measured token data when available.
1, SKILL.md, Updated May 19, 2026

jungho-git/execution (development)
Verified, Trusted, Community
Smallest complete change rule: preserve local code shape, extend existing patterns, avoid speculative extraction or cleanup, and include required coupled updates for correctness.
1, SKILL.md, Updated May 19, 2026

jungho-git/comments (development)
Verified, Trusted, Community
Code comment policy: numbered one-line `―` dividers for touched declarations and logical sections, paired outer blocks only for long regions, concise purpose comments, and no comment churn.
1, SKILL.md, Updated May 19, 2026

Install manually

# Clone the repo
git clone https://github.com/jungho-git/jllm.git

# Create the Claude Code skills folder (global) if it does not exist yet, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r jllm/public/claude/skills/quality/benchmark ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository: jungho-git/jllm (1 star)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT