
ronniegeraghty/grader-system

Name: grader-system
Author: ronniegeraghty

.agents/skills/grader-system/SKILL.md

npx skillsauth add ronniegeraghty/hyoka grader-system


Context

Hyoka's grading system is pluggable and multi-layered. Six independent grader types inspect the generated code and the agent's action timeline from different angles, and their results are consolidated into a holistic assessment. Graders are advisory: they report findings but do not gate evaluation completion.

Grader Architecture

All graders implement:

type Grader interface {
    Kind() string
    Name() string
    Grade(ctx context.Context, input GraderInput) (GraderResult, error)
}

GraderInput

type GraderInput struct {
    Code           string          // Generated code
    Language       string          // e.g., "python"
    ActionLog      []ActionEvent   // Timeline of agent actions
    BuildStatus    string          // "success", "failed", "skipped"
    BuildOutput    string          // Compiler/interpreter output
}

GraderResult

type GraderResult struct {
    Kind    string                  // e.g., "behavior", "lint"
    Name    string                  // Grader instance name
    Pass    bool                    // Critical gate (true = safe to deploy)
    Score   float64                 // 0.0-1.0 numeric score
    Message string                  // Human-readable summary
    Details interface{}             // Type-specific details
}

Six Grader Types

1. Behavior Grader

Inspects action timeline for required/forbidden tool usage and turn limits.

graders:
  - kind: behavior
    name: tool_compliance
    required_tools: [file_write, read_file]
    forbidden_tools: [rm, sudo]
    max_turns: 25

Details: BehaviorGraderDetails with ToolsUsed, MaxTurns, Violations
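The check described above could look roughly like the sketch below. It assumes an `ActionEvent` with a single `Tool` field and treats each logged event as one turn; both are assumptions, since the original does not show the event structure or how turns are counted. `BehaviorConfig` and `checkBehavior` are hypothetical names.

```go
package main

import "fmt"

// ActionEvent is hypothetical; only the tool name is modeled here.
type ActionEvent struct{ Tool string }

// BehaviorConfig mirrors the YAML keys; the Go field names are assumptions.
type BehaviorConfig struct {
    RequiredTools  []string
    ForbiddenTools []string
    MaxTurns       int
}

// checkBehavior returns human-readable violations; an empty slice means compliant.
func checkBehavior(log []ActionEvent, cfg BehaviorConfig) []string {
    used := make(map[string]bool)
    for _, ev := range log {
        used[ev.Tool] = true
    }
    var violations []string
    for _, t := range cfg.RequiredTools {
        if !used[t] {
            violations = append(violations, "missing required tool: "+t)
        }
    }
    for _, t := range cfg.ForbiddenTools {
        if used[t] {
            violations = append(violations, "used forbidden tool: "+t)
        }
    }
    // Assumption: each logged event counts as one turn.
    if cfg.MaxTurns > 0 && len(log) > cfg.MaxTurns {
        violations = append(violations,
            fmt.Sprintf("turn limit exceeded: %d > %d", len(log), cfg.MaxTurns))
    }
    return violations
}

func main() {
    cfg := BehaviorConfig{
        RequiredTools:  []string{"file_write", "read_file"},
        ForbiddenTools: []string{"rm", "sudo"},
        MaxTurns:       25,
    }
    log := []ActionEvent{{Tool: "read_file"}, {Tool: "sudo"}}
    for _, v := range checkBehavior(log, cfg) {
        fmt.Println(v)
    }
}
```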

2. Lint Grader

Runs language-specific linters on generated code.

graders:
  - kind: lint
    name: python_lint
    linters: [pylint, black, mypy]
    threshold: 0.8  # Must pass 80% of linters

Details: LintGraderDetails with per-linter pass/fail, warnings
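To make the threshold concrete, here is a small sketch that scores a lint run as the fraction of linters that passed, following the "must pass 80% of linters" comment in the config. The `lintScore` function and the choice of `>=` for the comparison are assumptions, not the confirmed Hyoka behavior.

```go
package main

import "fmt"

// lintScore returns the fraction of linters that passed and whether that
// fraction meets the threshold. With no linters configured, nothing can
// fail, so the score is 1.0.
func lintScore(results map[string]bool, threshold float64) (float64, bool) {
    if len(results) == 0 {
        return 1.0, true
    }
    passed := 0
    for _, ok := range results {
        if ok {
            passed++
        }
    }
    score := float64(passed) / float64(len(results))
    return score, score >= threshold
}

func main() {
    results := map[string]bool{"pylint": true, "black": true, "mypy": false}
    score, pass := lintScore(results, 0.8)
    fmt.Printf("score=%.2f pass=%v\n", score, pass) // score=0.67 pass=false
}
```

Under this reading, a threshold of 0.8 with three linters means all three must pass, because two out of three is only about 0.67.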

3. Build Grader

Verifies code builds (or interprets) without errors.

graders:
  - kind: build
    name: cargo_build

Details: BuildGraderDetails with exit code, stderr excerpt

4. File Grader

Checks generated file structure (count, naming, organization).

graders:
  - kind: file
    name: file_structure
    min_files: 2
    max_files: 50
    required_files: [main.py, tests.py]

Details: FileGraderDetails with file list, violations
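A structural check along these lines might be sketched as follows. `FileConfig` and `checkFiles` are hypothetical names, and matching required files by base name in any directory is an assumption; the original does not say how paths are compared.

```go
package main

import (
    "fmt"
    "path/filepath"
)

// FileConfig mirrors the YAML keys; the Go field names are assumptions.
type FileConfig struct {
    MinFiles      int
    MaxFiles      int
    RequiredFiles []string
}

// checkFiles returns violations for the generated file list.
func checkFiles(paths []string, cfg FileConfig) []string {
    var violations []string
    if len(paths) < cfg.MinFiles {
        violations = append(violations,
            fmt.Sprintf("too few files: %d < %d", len(paths), cfg.MinFiles))
    }
    if cfg.MaxFiles > 0 && len(paths) > cfg.MaxFiles {
        violations = append(violations,
            fmt.Sprintf("too many files: %d > %d", len(paths), cfg.MaxFiles))
    }
    // Assumption: a required file matches by base name, in any directory.
    present := make(map[string]bool)
    for _, p := range paths {
        present[filepath.Base(p)] = true
    }
    for _, r := range cfg.RequiredFiles {
        if !present[r] {
            violations = append(violations, "missing required file: "+r)
        }
    }
    return violations
}

func main() {
    cfg := FileConfig{MinFiles: 2, MaxFiles: 50, RequiredFiles: []string{"main.py", "tests.py"}}
    for _, v := range checkFiles([]string{"src/main.py"}, cfg) {
        fmt.Println(v)
    }
}
```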

5. Program Grader

Runs generated code and checks output against expected results.

graders:
  - kind: program
    name: integration_test
    test_command: python tests.py
    expected_output: "All tests passed"

Details: ProgramGraderDetails with actual vs. expected output

6. Prompt Grader

Uses an LLM to score code against semantic criteria (a.k.a. "LLM-as-judge").

graders:
  - kind: prompt
    name: semantic_correctness
    rubric: "Does the code correctly implement the requested feature?"
    model: claude-opus-4.6

Details: PromptGraderDetails with rubric reasoning, score breakdown

Gate Semantics

Soft gates (reporting):

Graders run independently in parallel
Timeout on one grader doesn't block others
Failure on one grader doesn't prevent report generation

Hard gates (evaluation completion):

If generation or review phases hard-fail (e.g., timeout, SDK crash), eval stops
Grader failures do NOT stop evaluation (graders are advisory)

Pluggable Registry

Graders are registered via factory functions:

type GraderFactory func(name string, cfg map[string]any) (Grader, error)

var registry = map[string]GraderFactory{
    "behavior": NewBehaviorGrader,
    "lint":     NewLintGrader,
    "build":    NewBuildGrader,
    "file":     NewFileGrader,
    "program":  NewProgramGrader,
    "prompt":   NewPromptGrader,
}

// New grader types can be added by updating registry
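As a rough sketch of how the registry might be used, the following looks up a factory by kind and adds a new kind at runtime. `Build`, `Register`, and `stubGrader` are hypothetical helpers introduced for illustration, and the types are trimmed to what the example needs.

```go
package main

import (
    "context"
    "fmt"
)

type GraderInput struct{ Code string }

type GraderResult struct {
    Kind string
    Name string
    Pass bool
}

type Grader interface {
    Kind() string
    Name() string
    Grade(ctx context.Context, input GraderInput) (GraderResult, error)
}

type GraderFactory func(name string, cfg map[string]any) (Grader, error)

// stubGrader is a hypothetical grader that always passes.
type stubGrader struct{ kind, name string }

func (s stubGrader) Kind() string { return s.kind }
func (s stubGrader) Name() string { return s.name }
func (s stubGrader) Grade(ctx context.Context, in GraderInput) (GraderResult, error) {
    return GraderResult{Kind: s.kind, Name: s.name, Pass: true}, nil
}

var registry = map[string]GraderFactory{
    "build": func(name string, cfg map[string]any) (Grader, error) {
        return stubGrader{kind: "build", name: name}, nil
    },
}

// Register adds a factory for a new kind (hypothetical helper).
func Register(kind string, f GraderFactory) error {
    if _, exists := registry[kind]; exists {
        return fmt.Errorf("grader kind %q already registered", kind)
    }
    registry[kind] = f
    return nil
}

// Build constructs a grader from a config entry's kind, name, and settings.
func Build(kind, name string, cfg map[string]any) (Grader, error) {
    f, ok := registry[kind]
    if !ok {
        return nil, fmt.Errorf("unknown grader kind %q", kind)
    }
    return f(name, cfg)
}

func main() {
    g, err := Build("build", "cargo_build", nil)
    fmt.Println(g.Name(), err)
    _, err = Build("security", "scan", nil)
    fmt.Println(err)
}
```

Returning an error for an unknown kind lets a typo in the config surface at load time instead of silently dropping a grader.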

Configuration

Graders are defined in config YAML:

graders:
  - kind: behavior
    name: required_tools
    required_tools: [file_write, bash]
  
  - kind: lint
    name: python_style
    linters: [pylint]
    threshold: 0.9

  - kind: prompt
    name: correctness
    model: gpt-5.4

Error Handling

Each grader catches its own errors:

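func (g *LintGrader) Grade(ctx context.Context, input GraderInput) (GraderResult, error) {
    // Run the linter; CommandContext kills the process when ctx expires.
    // Linter arguments are elided here, as in the original snippet.
    cmd := exec.CommandContext(ctx, "pylint" /* , args... */)
    out, runErr := cmd.CombinedOutput()

    // Timeout? Report it in the result and return a nil error,
    // so the failure is recorded without aborting the evaluation.
    if ctx.Err() != nil {
        return GraderResult{
            Pass:    false,
            Message: "Linter timeout",
        }, nil
    }

    // Sketch (assumption): a non-zero exit is also reported, not returned.
    return GraderResult{
        Pass:    runErr == nil,
        Message: string(out),
    }, nil
}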

Grader errors are not fatal — they're reported in the grader result.

Code Locations

Grader interface and registry: hyoka/internal/graders/grader.go
Individual grader implementations: hyoka/internal/graders/{kind}_grader.go
Example grader tests: hyoka/internal/graders/*_test.go

Anti-Patterns

Using grader failures as eval blockers (they're advisory only)
Hardcoding grader lists in engine (add via config)
Ignoring grader timeout errors (report them)
Assuming all graders finish synchronously (they may time out)


Pluggable grader architecture (6 types, gate semantics)

1 stars

tools

Updated Apr 16, 2026

$ install --global

skillsauth

npx skillsauth add ronniegeraghty/hyoka grader-system

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 16, 2026, 5:47 PM (5.8s, 1 file scanned)

SKILL.md

name: grader-system
description: Pluggable grader architecture (6 types, gate semantics)
domain: architecture
confidence: high
source: hyoka/internal/graders/grader.go, hyoka/internal/graders/registry.go




Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ronniegeraghty/hyoka.git

# Copy into Claude Code skills folder (global)
cp -r hyoka/.agents/skills/grader-system ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ronniegeraghty/hyoka

1 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT