skills/meta-code-standards-calibration/SKILL.md
Evaluate `code-standards-gate` against human PR/MR review evidence or explicit conversation-history corrections. Use when the user asks to benchmark, score, validate, calibrate, or improve the review skill; compare human findings with skill findings; diagnose missed review standards; or decide whether evidence belongs in `code-standards-gate`, project rules, tooling, or should stay local. Do not use for ordinary code review; use `code-standards-gate` for reviewing code directly.
npx skillsauth add plimeor/agent-skills meta-code-standards-calibrationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Evaluate code-standards-gate against human review evidence and recommend the smallest durable improvement, without promoting sample-specific rules into global skill behavior.
This skill replaces standalone standards extraction for review evidence. Use code-standards-gate when the task is only to run a review.
A good evaluation:
matched, partial, missed, not-reviewable, or out-of-scope.valid-extra, weak-extra, invalid-extra, or duplicate.code-standards-gate, project rules, tooling/checks, and keep-local evidence.Keep all evaluation artifacts under a local workspace, for example:
code-standards-gate-workspace/
<case-id>/
human-review/
skill-run/
comparison/
iterations/
Do not write sample-specific findings into code-standards-gate/SKILL.md or code-standards-gate/sub-agent.md. Use sample evidence only in evaluation artifacts and iteration summaries.
Do not paste tokens into artifacts. If evaluating hosted review evidence, use an authenticated gh / glab session or equivalent local credentials without recording secrets.
Evaluation may recommend edits. Apply edits only when the user explicitly asks to optimize or modify files.
Every evaluation needs:
~/.agents/skills/code-standards-gateAsk one narrow question only when a missing input changes target, risk, authorization, or comparability.
For PR/MR evidence, collect enough to identify the review target, changed files, top-level comments/reviews, inline threads/discussions, resolution/outdated state, and head/base commits.
Continue retrieval only when comments are truncated, the review target is unclear, branch/head commit is missing, a human finding cannot be located, or the user asked for exhaustive coverage.
Do not retrieve again for phrasing, background, or nonessential examples. Store raw evidence first, then normalize it; if raw capture is impossible, record the source and limitation.
For conversation-history corrections, collect the relevant user corrections, concrete before/after examples, and explicit boundaries. If the relevant conversation is not in context, ask for the excerpt or session pointer.
For GitHub, collect PR metadata and review comments:
gh pr view "$PR_URL" \
--json url,number,title,body,state,baseRefName,headRefName,baseRefOid,headRefOid,files,reviews,comments \
> human-review/pr.json
Collect inline review threads with GraphQL:
gh api graphql \
-F owner="$OWNER" \
-F repo="$REPO" \
-F number="$PR_NUMBER" \
-f query='
query($owner: String!, $repo: String!, $number: Int!) {
repository(owner: $owner, name: $repo) {
pullRequest(number: $number) {
reviewThreads(first: 100) {
nodes {
id
isResolved
isOutdated
path
line
originalLine
startLine
diffSide
comments(first: 100) {
nodes {
id
author { login }
body
createdAt
updatedAt
url
path
line
originalLine
diffHunk
commit { oid }
}
}
}
}
}
}
}' > human-review/review-threads.json
If the PR has more than 100 threads, rerun with pagination rather than truncating.
For GitLab:
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID" > human-review/mr.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/discussions" --paginate > human-review/discussions.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/notes" --paginate > human-review/notes.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/changes" > human-review/changes.json
Normalize human review into human-review/human-findings.md:
H01 title
- source: review thread/comment URL, comment ID, or conversation pointer
- file/line: path:line when available
- status: unresolved | resolved | outdated | discussion
- issue: what the reviewer objected to
- why: review principle behind the objection
- expected correction: smallest code/spec/test change implied by the comment
- category: contract | type-shape | persisted-state | parse-rewrite | wrapper | generated-output | tests | package | other
Normalize skill review into skill-run/skill-findings.md with the same atomic shape and S01 ids.
Do not normalize broad theme findings as one issue when they imply several independent edits. Split them before scoring and note if the final synthesis lost granularity.
Run from the local project path so rules, dependencies, and diff context are real. Use the installed skill when testing actual Codex behavior.
For large or multi-surface diffs, use one isolated run per associated risk batch when available. For small diffs, record why one run is sufficient. If local branch state does not match the target head, create or checkout a matching worktree and record the exact commit in skill-run/run-metadata.md.
The run output should preserve batch plan, raw batch finding count, final finding count, dropped/merged finding reasons, inventory map, and stable finding ids.
Create comparison/mapping.md.
Human finding statuses:
matched: skill found the same issue with the same or stronger correction.partial: skill found the area but missed an important surface, reason, or correction.missed: skill did not find it.not-reviewable: human finding depends on private intent or unavailable context.out-of-scope: outside the requested review boundary.Skill-only statuses:
valid-extra: valid issue not present in human review.weak-extra: plausible but lower-confidence or lower-value issue.invalid-extra: incorrect, contradicted by code, or outside scope.duplicate: same issue as another skill finding.Use compact tables:
| Human ID | Skill ID | Status | Category | Notes |
|---|---|---|---|---|
| H01 | S03 | matched | persisted-state | Same field and same deletion correction. |
| H02 | S07 | partial | wrapper | Found URL rewrite but missed native-owner failure boundary. |
| H03 | - | missed | tests | No tests batch reviewed this failure path. |
| Skill ID | Status | Category | Notes |
|---|---|---|
| S12 | valid-extra | package | Reproduced packed install failure. |
Score useful review replacement value, not raw finding count.
human atomic recall: (matched + 0.5 * partial) / reviewable human findingsprecision: valid skill findings divided by all skill findings, with weak extras weighted as 0.5granularity: final findings preserve atomic batch findingsbatch discipline: associated risk-ranked batches match the diff riskactionability: findings name concrete surface, evidence, why, and smallest correctionUse this summary:
human_reviewable_findings:
matched:
partial:
missed:
not_reviewable:
out_of_scope:
skill_final_findings:
valid_extra:
weak_extra:
invalid_extra:
duplicates:
human_atomic_recall:
precision:
granularity:
batch_discipline:
actionability:
score_100:
Rough guide: 90+ close substitute on this PR/MR; 80-89 useful with a human checker; 70-79 mechanism works but misses too much; <70 needs improvement before trust.
For every missed or partial human finding, classify the failure:
collection missscope missbatch misssubagent misssynthesis missstandard missguide misstooling missDo not edit SKILL.md for collection, scope, or tooling misses unless the skill itself caused them.
Use the smallest durable home:
code-standards-gate: reusable cross-project review standards, batching, output granularity, or subagent guidance.Do not treat every human comment as a reason to edit code-standards-gate.
Write comparison/recommendations.md with Keep, Change, Do Not Add, Placement, and Next Eval.
Iterate only when the user asks to optimize. Snapshot current files under iterations/iteration-XX/before/, apply the smallest generalizable edit, sync the installed skill if the runner loads ~/.agents/skills, rerun the same evaluation, and compare against the previous iteration.
Stop when the target score is reached or the next change would be sample-specific.
Minimum artifacts:
human-review/human-findings.mdskill-run/skill-findings.mdcomparison/mapping.mdcomparison/score.mdcomparison/recommendations.mdFinal response should state the evaluation target, score, important misses, recommended durable changes, validation performed, and any blockers.
Stop when the raw evidence, normalized findings, comparable skill run, mapping, score, miss diagnosis, and placement recommendations are complete or blocked with the blocker named.
Do not promote sample-specific rules. Do not continue searching, scoring, or iterating after the core comparison can answer the user's request.
development
Set up, resume, or repair a compact active execution workbench for long-horizon, multi-session or checkpointed work. Use when a task needs durable handoff, unattended iteration, human gates, auditable evidence, or active-vs-archive routing that keeps a current packet separate from stale historical context. Do not use for one-session tasks, ordinary plans/reviews/audits, one-session bug fixes, direct code edits, or simple docs cleanup; complete those directly.
tools
Decide whether and how to use authorized sub-agents, then coordinate delegated work while preserving the main agent's context. Use when the user asks for orchestration, parallel agents, delegation, background workers, context isolation, or when another skill needs delegated research, review, implementation, or verification. Owns host-policy checks, delegation packets, non-overlap, report verification, and stop rules. Do not use to bypass tool policy, infer user authorization, or add coordination overhead to simple single-threaded tasks.
development
Use before finalizing a non-trivial answer, recommendation, review, or decision to reconsider it and raise its quality, especially when shallow reasoning, context inertia, false framing, overconfidence, unfit analogy transfer, or an obvious-but-missed defect could distort the result. Trigger especially before applying external evidence, familiar frameworks, or comparisons to the user's specific request, and when the user asks to reconsider, double-check, take a second look, or sanity-check an answer.
tools
Route durable rules and context to the right layer — task, project, skill, tooling, hooks, MCP, or global. Use for global rules files (~/.claude/CLAUDE.md, global AGENTS.md), repo-local AGENTS.md/CLAUDE.md, task context packs, hook placement (Codex/Claude Code settings.json), collaboration friction diagnosis, and rule-placement decisions.