.agents/skills/grader-system/SKILL.md
Pluggable grader architecture (6 types, gate semantics)
Hyoka's grading system is pluggable and multi-layered. Six independent grader types inspect generated code and action timelines from different angles, then consolidate into a holistic assessment. Graders are advisory — they report findings, they don't gate evaluation completion.
All graders implement:
```go
type Grader interface {
	Kind() string
	Name() string
	Grade(ctx context.Context, input GraderInput) (GraderResult, error)
}

type GraderInput struct {
	Code        string        // Generated code
	Language    string        // e.g., "python"
	ActionLog   []ActionEvent // Timeline of agent actions
	BuildStatus string        // "success", "failed", "skipped"
	BuildOutput string        // Compiler/interpreter output
}

type GraderResult struct {
	Kind    string      // e.g., "behavior", "lint"
	Name    string      // Grader instance name
	Pass    bool        // Critical gate (true = safe to deploy)
	Score   float64     // 0.0-1.0 numeric score
	Message string      // Human-readable summary
	Details interface{} // Type-specific details
}
```
The behavior grader inspects the action timeline for required or forbidden tool usage and for turn limits.

```yaml
graders:
  - kind: behavior
    name: tool_compliance
    required_tools: [file_write, read_file]
    forbidden_tools: [rm, sudo]
    max_turns: 25
```

Details: `BehaviorGraderDetails` with `ToolsUsed`, `MaxTurns`, and `Violations`.
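The following sketch shows how the checks listed above might work. It is not Hyoka's implementation. It assumes each `ActionEvent` carries the tool name and the turn number. The `ActionEvent` fields, `checkBehavior`, and the violation strings are all hypothetical.

```go
package main

import "fmt"

// ActionEvent is a hypothetical stand-in for Hyoka's timeline entry:
// the tool the agent invoked and the turn it happened in.
type ActionEvent struct {
	Tool string
	Turn int
}

// checkBehavior returns one violation per broken rule: a required tool
// that was never used, a forbidden tool that was used, or too many turns.
func checkBehavior(log []ActionEvent, required, forbidden []string, maxTurns int) []string {
	used := map[string]bool{}
	turns := 0
	for _, ev := range log {
		used[ev.Tool] = true
		if ev.Turn > turns {
			turns = ev.Turn // highest turn seen = turns taken
		}
	}
	var violations []string
	for _, t := range required {
		if !used[t] {
			violations = append(violations, "required tool not used: "+t)
		}
	}
	for _, t := range forbidden {
		if used[t] {
			violations = append(violations, "forbidden tool used: "+t)
		}
	}
	if maxTurns > 0 && turns > maxTurns {
		violations = append(violations, fmt.Sprintf("turn limit exceeded: %d > %d", turns, maxTurns))
	}
	return violations
}

func main() {
	log := []ActionEvent{{"read_file", 1}, {"file_write", 2}, {"sudo", 3}}
	v := checkBehavior(log, []string{"file_write", "read_file"}, []string{"rm", "sudo"}, 25)
	fmt.Println(v) // [forbidden tool used: sudo]
}
```

A grader built on this would likely set `Pass` to `len(violations) == 0`. The source does not say how violations map to `Score`.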
The lint grader runs language-specific linters on the generated code.

```yaml
graders:
  - kind: lint
    name: python_lint
    linters: [pylint, black, mypy]
    threshold: 0.8  # Must pass 80% of linters
```

Details: `LintGraderDetails` with per-linter pass/fail and warnings.
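One reading of `threshold` is "the fraction of configured linters that must pass." The sketch below assumes that reading. `lintScore` is a hypothetical helper, and treating an empty linter set as a failure is this sketch's own choice.

```go
package main

import "fmt"

// lintScore returns the fraction of linters that passed and whether that
// fraction meets the configured threshold (e.g. 0.8 = 80% of linters).
func lintScore(results map[string]bool, threshold float64) (float64, bool) {
	if len(results) == 0 {
		return 0, false // no linters ran; treated as not passing (assumption)
	}
	passed := 0
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	score := float64(passed) / float64(len(results))
	return score, score >= threshold
}

func main() {
	results := map[string]bool{"pylint": true, "black": true, "mypy": false}
	score, pass := lintScore(results, 0.8)
	fmt.Printf("score=%.2f pass=%v\n", score, pass) // score=0.67 pass=false
}
```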
The build grader verifies that the code builds (or, for interpreted languages, is accepted by the interpreter) without errors.

```yaml
graders:
  - kind: build
    name: cargo_build
```

Details: `BuildGraderDetails` with the exit code and an stderr excerpt.
The file grader checks the generated file structure (count, naming, organization).

```yaml
graders:
  - kind: file
    name: file_structure
    min_files: 2
    max_files: 50
    required_files: [main.py, tests.py]
```

Details: `FileGraderDetails` with the file list and violations.
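Here is a minimal sketch of the file-structure checks. It assumes `required_files` entries are matched by base name anywhere in the output tree. `checkFiles` and its messages are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// checkFiles applies three rules: minimum file count, maximum file count,
// and required file names (matched by base name, an assumption).
func checkFiles(files []string, minFiles, maxFiles int, required []string) []string {
	var violations []string
	if len(files) < minFiles {
		violations = append(violations, fmt.Sprintf("too few files: %d < %d", len(files), minFiles))
	}
	if maxFiles > 0 && len(files) > maxFiles {
		violations = append(violations, fmt.Sprintf("too many files: %d > %d", len(files), maxFiles))
	}
	present := map[string]bool{}
	for _, f := range files {
		present[filepath.Base(f)] = true
	}
	for _, r := range required {
		if !present[r] {
			violations = append(violations, "missing required file: "+r)
		}
	}
	return violations
}

func main() {
	files := []string{"src/main.py", "README.md"}
	fmt.Println(checkFiles(files, 2, 50, []string{"main.py", "tests.py"}))
	// [missing required file: tests.py]
}
```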
The program grader runs the generated code and checks its output against the expected results.

```yaml
graders:
  - kind: program
    name: integration_test
    test_command: python tests.py
    expected_output: "All tests passed"
```

Details: `ProgramGraderDetails` with actual vs. expected output.
The prompt grader uses an LLM to score code against semantic criteria (a.k.a. "LLM-as-judge").

```yaml
graders:
  - kind: prompt
    name: semantic_correctness
    rubric: "Does the code correctly implement the requested feature?"
    model: claude-opus-4.6
```

Details: `PromptGraderDetails` with rubric reasoning and a score breakdown.
Soft gates (reporting):
Hard gates (evaluation completion):
Graders are registered via factory functions:
```go
type GraderFactory func(name string, cfg map[string]any) (Grader, error)

var registry = map[string]GraderFactory{
	"behavior": NewBehaviorGrader,
	"lint":     NewLintGrader,
	"build":    NewBuildGrader,
	"file":     NewFileGrader,
	"program":  NewProgramGrader,
	"prompt":   NewPromptGrader,
}

// New grader types can be added by registering a factory in the map above.
```
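A registry like this is usually paired with a lookup that rejects unknown kinds. The sketch below is self-contained, so it trims the types and registers a placeholder `noopGrader` instead of the real factories. `newGrader` and `noopGrader` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

type GraderInput struct{ Code string }

type GraderResult struct {
	Pass    bool
	Message string
}

type Grader interface {
	Kind() string
	Name() string
	Grade(ctx context.Context, input GraderInput) (GraderResult, error)
}

type GraderFactory func(name string, cfg map[string]any) (Grader, error)

// noopGrader is a placeholder; a real factory such as NewBuildGrader
// would validate cfg and return a working grader.
type noopGrader struct{ kind, name string }

func (g noopGrader) Kind() string { return g.kind }
func (g noopGrader) Name() string { return g.name }
func (g noopGrader) Grade(context.Context, GraderInput) (GraderResult, error) {
	return GraderResult{Pass: true}, nil
}

var registry = map[string]GraderFactory{
	"build": func(name string, _ map[string]any) (Grader, error) {
		return noopGrader{kind: "build", name: name}, nil
	},
}

// newGrader looks up the factory for kind and fails loudly on unknown
// kinds instead of silently skipping them.
func newGrader(kind, name string, cfg map[string]any) (Grader, error) {
	factory, ok := registry[kind]
	if !ok {
		return nil, fmt.Errorf("unknown grader kind %q", kind)
	}
	return factory(name, cfg)
}

func main() {
	g, err := newGrader("build", "cargo_build", nil)
	fmt.Println(g.Name(), err) // cargo_build <nil>
	_, err = newGrader("security", "x", nil)
	fmt.Println(err) // unknown grader kind "security"
}
```

With this split, adding a grader type means writing one factory and adding one map entry. The lookup code does not change.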
Graders are defined in config YAML:
```yaml
graders:
  - kind: behavior
    name: required_tools
    required_tools: [file_write, bash]
  - kind: lint
    name: python_style
    linters: [pylint]
    threshold: 0.9
  - kind: prompt
    name: correctness
    model: gpt-5.4
```
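Each entry mixes the common `kind` and `name` keys with kind-specific settings. The sketch below splits them apart. It assumes a YAML library has already decoded the file into `map[string]any`. `splitEntry` is hypothetical. The resulting `cfg` is what a `GraderFactory` would receive.

```go
package main

import (
	"fmt"
	"sort"
)

// splitEntry separates kind/name from the kind-specific keys of one
// decoded `graders:` entry; the remaining keys become the factory's cfg.
func splitEntry(entry map[string]any) (kind, name string, cfg map[string]any, err error) {
	kind, _ = entry["kind"].(string)
	name, _ = entry["name"].(string)
	if kind == "" || name == "" {
		return "", "", nil, fmt.Errorf("grader entry needs string kind and name: %v", entry)
	}
	cfg = make(map[string]any, len(entry))
	for k, v := range entry {
		if k != "kind" && k != "name" {
			cfg[k] = v
		}
	}
	return kind, name, cfg, nil
}

func main() {
	entry := map[string]any{"kind": "lint", "name": "python_style", "linters": []any{"pylint"}, "threshold": 0.9}
	kind, name, cfg, _ := splitEntry(entry)
	keys := make([]string, 0, len(cfg))
	for k := range cfg {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map order is random; sort for stable output
	fmt.Println(kind, name, keys) // lint python_style [linters threshold]
}
```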
Each grader catches its own errors:
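```go
func (g *LintGrader) Grade(ctx context.Context, input GraderInput) (GraderResult, error) {
	// Run the linter; ctx carries the grader's deadline (arguments elided)
	cmd := exec.CommandContext(ctx, "pylint", ...)
	out, err := cmd.CombinedOutput()

	// Timed out? Report it as a failed result rather than returning an error
	if ctx.Err() != nil {
		return GraderResult{
			Kind:    g.Kind(),
			Name:    g.Name(),
			Pass:    false,
			Message: "Linter timeout",
		}, nil // failure is carried in the result; the error return stays nil
	}

	// ... score out/err and build the normal result ...
}
```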
Grader errors are not fatal — they're reported in the grader result.
Source files:

- `hyoka/internal/graders/grader.go`
- `hyoka/internal/graders/{kind}_grader.go`
- `hyoka/internal/graders/*_test.go`