Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

aiskillstore/judge

Name: judge
Author: aiskillstore

skills/2389-research/test-kitchen/judge/SKILL.md

npx skillsauth add aiskillstore/marketplace judge

Test Kitchen Judge

Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.

Terminology: This skill uses "impl" but works for both:

- Cookoff: impl-1, impl-2, impl-3 (same design, different implementations)
- Omakase: variant-a, variant-b (different approaches/designs)

REQUIRED OUTPUT FORMAT

You MUST produce this exact structure. Do not summarize or abbreviate.

## Gate Check
| Impl | Tests Pass | Design Adherence |
|------|------------|------------------|
| impl-1 | X/X ✓ or ✗ | Yes/No |
| impl-2 | X/X ✓ or ✗ | Yes/No |

## Feasibility Check
| Impl | Status | Notes |
|------|--------|-------|
| impl-1 | ✓ OK / ⚠️ Flag | Details |
| impl-2 | ✓ OK / ⚠️ Flag | Details |

## Scoring Worksheet

### impl-1
**Fitness for Purpose** (Does it solve the actual problem?)

*Functional requirements:*
- [ ] Primary use case works end-to-end?
- [ ] All explicitly stated requirements implemented?
- [ ] Handles realistic scenarios, not just happy path?

*User needs (beyond literal requirements):*
- [ ] Would the user actually use this, or just demo it?
- [ ] Does it solve the real problem, not just the literal request?
- [ ] Does deployment/distribution match stated needs?

*Future considerations (if relevant):*
- [ ] If growth/scaling mentioned, does architecture support it?
- [ ] If team/collaboration mentioned, is it maintainable by others?

Checklist: _/8 YES → **Score: _/5** (7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1)
*Note: Not all items apply to every project. Score based on relevant items.*

**Justified Complexity** (Every line earning its keep?)
- Unnecessary abstractions: ___
- Dead code: ___
- Bloat estimate: ___%

*Line count comparison (if multiple impls):*
- This impl: ___ lines
- Smallest impl: ___ lines
- Extra lines justified by: ___

→ **Score: _/5** (5=minimal, 4=slight bloat <10%, 3=10-25% bloat, 2=25-50%, 1=>50%)

**Readability** (Understand core flow in 5 min?)
Violations:
- [ ] Single-letter vars (not loop index): +1 each = __
- [ ] Functions >50 lines: +1 each = __
- [ ] Nesting >3 levels: +1 each = __
- [ ] Magic numbers: +1 each = __
- [ ] Bad function names: +1 each = __
Total violations: __ → **Score: _/5** (0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1)

**Robustness & Scale** (Handles unexpected + growth?)
- [ ] Input validation?
- [ ] External call error handling?
- [ ] Useful error messages?
- [ ] Null/empty handling?
- [ ] Async timeouts?
- [ ] No unbounded loops?
- [ ] O(n log n) or better?
- [ ] Bounded memory?
- [ ] Queries paginated?
- [ ] No blocking I/O in hot path?
- [ ] Backoff/retry logic?
- [ ] Handles 10x load?
Checklist: _/12 YES + feasibility flags → **Score: _/5**
(11-12 + no flags=5, 9-10 or minor flag=4, 7-8=3, 5-6 or major flag=2, <5 or critical flag=1)

**Maintainability** (Pain of next change?)
- [ ] Single responsibility per function?
- [ ] Explicit dependencies (no globals)?
- [ ] Business logic separated from infra?
- [ ] New feature = ≤3 files changed?
- [ ] Config externalized?
- [ ] Tests catch regressions?
Checklist: _/6 YES → **Score: _/5** (6=5, 5=4, 4=3, 2-3=2, 0-1=1)

### impl-2
[REPEAT SAME FORMAT]

### impl-3 (if applicable)
[REPEAT SAME FORMAT]

## Judge Scorecard
| Criterion | impl-1 | impl-2 | impl-3 | Best |
|-----------|--------|--------|--------|------|
| Fitness for Purpose | | | | |
| Justified Complexity | | | | |
| Readability | | | | |
| Robustness & Scale | | | | |
| Maintainability | | | | |
| **TOTAL** | /25 | /25 | /25 | |

## Hard Gates
| Gate | Result |
|------|--------|
| Fitness Gate (Δ ≥ 2) | Triggered/Not triggered |
| Critical Flaw (any = 1) | Triggered/Not triggered |

## Winner Selection
**Winner: impl-X** (Score: __/25)

**Selection rationale:**
[2-3 sentences explaining WHY this implementation won]

**Trade-offs acknowledged:**
[What the other implementations did better]

Scoring Reference

Scores Meaning

| Score | Meaning |
|-------|---------|
| 5 | Excellent - exceeds expectations |
| 4 | Good - fully meets requirements |
| 3 | Adequate - core works, some gaps |
| 2 | Poor - significant issues |
| 1 | Critical flaw - disqualifying |
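The checklist thresholds stated in the worksheet (Fitness, Readability, Maintainability) can be expressed as small lookup functions. The following is a minimal sketch, not part of the skill itself: the function names are hypothetical, and it counts raw YES answers without adjusting for checklist items that do not apply to a project, since the worksheet does not say how to rescale in that case.

```python
def fitness_score(yes_count: int) -> int:
    """Fitness for Purpose: 7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1 (out of 8 items)."""
    if not 0 <= yes_count <= 8:
        raise ValueError("yes_count must be between 0 and 8")
    if yes_count >= 7:
        return 5
    if yes_count >= 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1


def readability_score(violations: int) -> int:
    """Readability: 0 violations=5, 1-2=4, 3-4=3, 5-7=2, 8+=1."""
    if violations < 0:
        raise ValueError("violations cannot be negative")
    if violations == 0:
        return 5
    if violations <= 2:
        return 4
    if violations <= 4:
        return 3
    if violations <= 7:
        return 2
    return 1


def maintainability_score(yes_count: int) -> int:
    """Maintainability: 6=5, 5=4, 4=3, 2-3=2, 0-1=1 (out of 6 items)."""
    if not 0 <= yes_count <= 6:
        raise ValueError("yes_count must be between 0 and 6")
    if yes_count == 6:
        return 5
    if yes_count == 5:
        return 4
    if yes_count == 4:
        return 3
    if yes_count >= 2:
        return 2
    return 1
```

Each function returns an integer from 1 to 5, which matches the rule that half points are not allowed.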

Hard Gates (Automatic)

- Fitness Gate: If Fitness Δ ≥ 2 between impls → Higher fitness WINS immediately
- Critical Flaw: If ANY criterion = 1 → That impl is ELIMINATED

Fitness Gate Interpretation

The Fitness Gate triggers the same way in both contexts, but means different things:

| Context | What Fitness Δ ≥ 2 Means |
|---------|--------------------------|
| Cookoff | One implementation deviated from or misunderstood the design. All impls should have similar Fitness since they're implementing the same spec. A large gap is a red flag. |
| Omakase | One approach genuinely solves the problem better. Different approaches can legitimately have different Fitness. A large gap means one approach is clearly superior. |

In both cases, higher Fitness wins. The interpretation just explains why the gap exists.
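The two hard gates and the fallback to total score can be sketched as a single selection function. This is an illustration under stated assumptions, not the skill's own logic: the name `select_winner` and the short criterion keys are hypothetical, the Critical Flaw gate is applied before the Fitness Gate, the Fitness Gate compares the two highest Fitness scores among the survivors, and ties on total score go to the first implementation listed. The source does not specify any of these ordering or tie-breaking details.

```python
from typing import Dict, Optional, Tuple

# Hypothetical short keys for the five criteria.
CRITERIA = ("fitness", "complexity", "readability", "robustness", "maintainability")


def select_winner(scores: Dict[str, Dict[str, int]]) -> Tuple[Optional[str], str]:
    """Return (winner, reason) after applying both hard gates."""
    # Critical Flaw gate: any criterion scored 1 eliminates that implementation.
    survivors = {
        name: s for name, s in scores.items()
        if min(s[c] for c in CRITERIA) > 1
    }
    if not survivors:
        return None, "all implementations eliminated by Critical Flaw gate"
    if len(survivors) == 1:
        return next(iter(survivors)), "only implementation without a critical flaw"

    # Fitness Gate (assumption: compare the top two Fitness scores).
    by_fitness = sorted(survivors, key=lambda n: survivors[n]["fitness"], reverse=True)
    top, runner_up = by_fitness[0], by_fitness[1]
    if survivors[top]["fitness"] - survivors[runner_up]["fitness"] >= 2:
        return top, "Fitness Gate triggered"

    # No gate decided the result: highest total out of 25 wins.
    winner = max(survivors, key=lambda n: sum(survivors[n][c] for c in CRITERIA))
    return winner, "highest total score"
```

With this sketch, an implementation scoring 5 on Fitness beats one scoring 3 on Fitness even if the latter has the higher total, which is the behavior the Fitness Gate describes.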

Feasibility Red Flags

Check before scoring:

- O(n²) or worse on unbounded data
- Unbounded memory growth
- Self-DDoS patterns (polling, no backoff)
- Missing pagination
- Blocking I/O in hot path
- No error recovery
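The Robustness & Scale rule in the worksheet combines the 12-item checklist with the severity of any feasibility flag raised here. One way to read that rule is that a flag caps the score the checklist would otherwise earn. The sketch below takes that reading as an assumption; the names `checklist_band`, `robustness_score`, and `FLAG_CAPS` are hypothetical, and the source does not define how to classify a flag as minor, major, or critical.

```python
# Assumed interpretation: each flag severity sets a ceiling on the score.
FLAG_CAPS = {"none": 5, "minor": 4, "major": 2, "critical": 1}


def checklist_band(yes_count: int) -> int:
    """Checklist-only band: 11-12=5, 9-10=4, 7-8=3, 5-6=2, <5=1."""
    if not 0 <= yes_count <= 12:
        raise ValueError("yes_count must be between 0 and 12")
    if yes_count >= 11:
        return 5
    if yes_count >= 9:
        return 4
    if yes_count >= 7:
        return 3
    if yes_count >= 5:
        return 2
    return 1


def robustness_score(yes_count: int, worst_flag: str = "none") -> int:
    """Take the lower of the checklist band and the cap for the worst flag."""
    if worst_flag not in FLAG_CAPS:
        raise ValueError(f"unknown flag severity: {worst_flag!r}")
    return min(checklist_band(yes_count), FLAG_CAPS[worst_flag])
```

Under this reading, a full 12/12 checklist with a major flag still scores 2, and a critical flag always produces a 1, which in turn triggers the Critical Flaw gate.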

Process

1. Read all implementation code (should already be in context)
2. Fill out the worksheet for EACH implementation - do not skip sections
3. Check hard gates
4. Announce winner with rationale

CRITICAL: Use integer scores only (1-5). Do not use half points like 4.5.

CRITICAL: Fill out every checkbox. Do not summarize or abbreviate the worksheet.
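The integer-only rule and the Judge Scorecard layout can be checked mechanically. The following is a rough sketch with hypothetical helper names (`validate_scores`, `render_scorecard`); it assumes every criterion has been scored and that the Best column lists all tied implementations, which the source does not specify.

```python
CRITERIA = [
    "Fitness for Purpose",
    "Justified Complexity",
    "Readability",
    "Robustness & Scale",
    "Maintainability",
]


def validate_scores(scores: dict) -> None:
    """Reject missing criteria, half points, and values outside 1-5."""
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing score for {criterion}")
        value = scores[criterion]
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{criterion}: expected an integer 1-5, got {value!r}")


def render_scorecard(impls: dict) -> str:
    """Build the Judge Scorecard table with per-impl totals out of 25."""
    names = list(impls)
    for name in names:
        validate_scores(impls[name])
    lines = [
        "| Criterion | " + " | ".join(names) + " | Best |",
        "|" + "---|" * (len(names) + 2),
    ]
    for criterion in CRITERIA:
        row = [impls[name][criterion] for name in names]
        best = max(row)
        best_names = ", ".join(n for n in names if impls[n][criterion] == best)
        cells = " | ".join(str(v) for v in row)
        lines.append(f"| {criterion} | {cells} | {best_names} |")
    totals = [sum(impls[name][c] for c in CRITERIA) for name in names]
    lines.append("| **TOTAL** | " + " | ".join(f"{t}/25" for t in totals) + " | |")
    return "\n".join(lines)
```

Passing a score such as 4.5 raises a `ValueError` before any table is produced, so a half point cannot slip into the totals.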

Scoring framework for test-kitchen cookoff and omakase-off. Invoked at Phase 4 to evaluate implementations using 5-criteria scoring. Do not invoke directly - called by cookoff/omakase-off.

230 stars

development

Updated Mar 27, 2026

npx skillsauth add aiskillstore/marketplace judge

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Mar 28, 2026, 12:20 PM. Duration: 36.2s. 1 file scanned.

Install manually

# Clone the repo
git clone https://github.com/aiskillstore/marketplace.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r marketplace/skills/2389-research/test-kitchen/judge ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: aiskillstore/marketplace (230 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT