Code Standards Calibration

Goal

Evaluate code-standards-gate against human review evidence and recommend the smallest durable improvement, without promoting sample-specific rules into global skill behavior.

This skill replaces standalone standards extraction for review evidence. Use code-standards-gate when the task is only to run a review.

Success Criteria

A good evaluation:

Normalizes human review evidence into atomic findings, or names the missing evidence.
Runs the skill against the same review target, local commit, and requested boundary, or names the blocker.
Normalizes skill findings at the same granularity as human findings.
Classifies every human finding as matched, partial, missed, not-reviewable, or out-of-scope.
Classifies every skill-only finding as valid-extra, weak-extra, invalid-extra, or duplicate.
Diagnoses misses by failure type.
Separates recommendations for code-standards-gate, project rules, tooling/checks, and keep-local evidence.

Constraints

Keep all evaluation artifacts under a local workspace, for example:

code-standards-gate-workspace/
  <case-id>/
    human-review/
    skill-run/
    comparison/
    iterations/
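The layout above could be created up front with a short helper. This is a minimal sketch that assumes the four subdirectory names shown; the function name `create_workspace` and the `case_id` argument are illustrative and not part of the skill.

```python
from pathlib import Path

# Subdirectories taken from the example layout above.
SUBDIRS = ("human-review", "skill-run", "comparison", "iterations")


def create_workspace(root: Path, case_id: str) -> Path:
    """Create <root>/<case_id>/ with one directory per artifact type."""
    case_dir = root / case_id
    for name in SUBDIRS:
        # exist_ok keeps a rerun of the same case from failing.
        (case_dir / name).mkdir(parents=True, exist_ok=True)
    return case_dir
```

Calling `create_workspace(Path("code-standards-gate-workspace"), "<case-id>")` twice is harmless, which suits reruns of the same evaluation.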

Do not write sample-specific findings into code-standards-gate/SKILL.md or code-standards-gate/sub-agent.md. Use sample evidence only in evaluation artifacts and iteration summaries.

Do not paste tokens into artifacts. If evaluating hosted review evidence, use an authenticated gh / glab session or equivalent local credentials without recording secrets.

Evaluation may recommend edits. Apply edits only when the user explicitly asks to optimize or modify files.

Required Inputs

Every evaluation needs:

local project path
review target: GitHub PR URL, GitLab MR URL, local review evidence, or conversation evidence
branch/range when it is not obvious from PR/MR metadata
current installed skill path, usually ~/.agents/skills/code-standards-gate
output directory for artifacts

Ask one narrow question only when a missing input changes target, risk, authorization, or comparability.

Evidence Budget

For PR/MR evidence, collect enough to identify the review target, changed files, top-level comments/reviews, inline threads/discussions, resolution/outdated state, and head/base commits.

Continue retrieval only when comments are truncated, the review target is unclear, branch/head commit is missing, a human finding cannot be located, or the user asked for exhaustive coverage.

Do not retrieve again for phrasing, background, or nonessential examples. Store raw evidence first, then normalize it; if raw capture is impossible, record the source and limitation.

For conversation-history corrections, collect the relevant user corrections, concrete before/after examples, and explicit boundaries. If the relevant conversation is not in context, ask for the excerpt or session pointer.

Collection Recipes

For GitHub, collect PR metadata, top-level comments, and review summaries:

gh pr view "$PR_URL" \
  --json url,number,title,body,state,baseRefName,headRefName,baseRefOid,headRefOid,files,reviews,comments \
  > human-review/pr.json

Collect inline review threads with GraphQL:

gh api graphql \
  -F owner="$OWNER" \
  -F repo="$REPO" \
  -F number="$PR_NUMBER" \
  -f query='
query($owner: String!, $repo: String!, $number: Int!) {
  repository(owner: $owner, name: $repo) {
    pullRequest(number: $number) {
      reviewThreads(first: 100) {
        pageInfo { hasNextPage endCursor }
        nodes {
          id
          isResolved
          isOutdated
          path
          line
          originalLine
          startLine
          diffSide
          comments(first: 100) {
            nodes {
              id
              author { login }
              body
              createdAt
              updatedAt
              url
              path
              line
              originalLine
              diffHunk
              commit { oid }
            }
          }
        }
      }
    }
  }
}' > human-review/review-threads.json

If the PR has more than 100 review threads, or any thread has more than 100 comments, page through the remaining results with cursor-based pagination rather than truncating.
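One way to avoid truncation is gh's `--paginate` mode, which for GraphQL requires the query to accept an `$endCursor: String` variable and to select `pageInfo { hasNextPage endCursor }`. The sketch below wraps that call from Python. The trimmed field list, the function names, and the assumption that gh prints one JSON document per page back to back are illustrative. `--paginate` follows only the outer `reviewThreads` connection, so a thread with more than 100 comments would still need its own follow-up query.

```python
import json
import subprocess

# Trimmed version of the thread query, with the cursor variable and pageInfo
# that gh's --paginate mode needs in order to request further pages.
QUERY = """
query($owner: String!, $repo: String!, $number: Int!, $endCursor: String) {
  repository(owner: $owner, name: $repo) {
    pullRequest(number: $number) {
      reviewThreads(first: 100, after: $endCursor) {
        pageInfo { hasNextPage endCursor }
        nodes { id isResolved isOutdated path line }
      }
    }
  }
}
"""


def build_command(owner: str, repo: str, number: int) -> list[str]:
    """Return the gh invocation as an argument list (no shell quoting needed)."""
    return [
        "gh", "api", "graphql", "--paginate",
        "-f", f"owner={owner}",
        "-f", f"repo={repo}",
        "-F", f"number={number}",  # -F lets gh send the value as an integer
        "-f", f"query={QUERY}",
    ]


def parse_pages(text: str) -> list[dict]:
    """Split back-to-back JSON documents (assumed: one per page) into a list."""
    decoder = json.JSONDecoder()
    pages, pos = [], 0
    while True:
        # Skip whitespace between documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        page, pos = decoder.raw_decode(text, pos)
        pages.append(page)
    return pages


def fetch_threads(owner: str, repo: str, number: int) -> list[dict]:
    """Run gh (requires an authenticated session) and flatten thread nodes."""
    result = subprocess.run(
        build_command(owner, repo, number),
        check=True, capture_output=True, text=True,
    )
    threads = []
    for page in parse_pages(result.stdout):
        connection = page["data"]["repository"]["pullRequest"]["reviewThreads"]
        threads.extend(connection["nodes"])
    return threads
```

Store the raw gh output under human-review/ before flattening it, so the normalized findings can always be traced back to the captured evidence.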

For GitLab:

glab api "projects/$PROJECT_ID/merge_requests/$MR_IID" > human-review/mr.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/discussions" --paginate > human-review/discussions.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/notes" --paginate > human-review/notes.json
glab api "projects/$PROJECT_ID/merge_requests/$MR_IID/changes" > human-review/changes.json

Finding Shape

Normalize human review into human-review/human-findings.md:

H01 title
- source: review thread/comment URL, comment ID, or conversation pointer
- file/line: path:line when available
- status: unresolved | resolved | outdated | discussion
- issue: what the reviewer objected to
- why: review principle behind the objection
- expected correction: smallest code/spec/test change implied by the comment
- category: contract | type-shape | persisted-state | parse-rewrite | wrapper | generated-output | tests | package | other

Normalize skill review into skill-run/skill-findings.md with the same atomic shape and S01 ids.

Do not normalize broad theme findings as one issue when they imply several independent edits. Split them before scoring and note if the final synthesis lost granularity.
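The finding shape above can be held in a small record type so that human and skill findings are validated and rendered the same way. This is a sketch only: the class name `Finding`, its field names, and the choice to reject unknown status or category values are assumptions layered on the template.

```python
from dataclasses import dataclass

# Allowed values copied from the finding template.
STATUSES = {"unresolved", "resolved", "outdated", "discussion"}
CATEGORIES = {
    "contract", "type-shape", "persisted-state", "parse-rewrite",
    "wrapper", "generated-output", "tests", "package", "other",
}


@dataclass
class Finding:
    id: str  # H01, H02, ... for human findings; S01, S02, ... for skill findings
    title: str
    source: str
    file_line: str
    status: str
    issue: str
    why: str
    expected_correction: str
    category: str

    def __post_init__(self) -> None:
        # Fail early so a typo does not silently create a new status or category.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

    def to_markdown(self) -> str:
        """Render the finding in the same shape as the template."""
        return "\n".join([
            f"{self.id} {self.title}",
            f"- source: {self.source}",
            f"- file/line: {self.file_line}",
            f"- status: {self.status}",
            f"- issue: {self.issue}",
            f"- why: {self.why}",
            f"- expected correction: {self.expected_correction}",
            f"- category: {self.category}",
        ])
```

One record per independent edit keeps the granularity rule checkable: a broad theme that implies three edits becomes three `Finding` objects before scoring.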

Run Skill Under Evaluation

Run from the local project path so rules, dependencies, and diff context are real. Use the installed skill when testing actual Codex behavior.

For large or multi-surface diffs, use one isolated run per associated risk batch when available. For small diffs, record why one run is sufficient. If the local branch state does not match the target head, create or check out a matching worktree and record the exact commit in skill-run/run-metadata.md.

The run output should preserve batch plan, raw batch finding count, final finding count, dropped/merged finding reasons, inventory map, and stable finding ids.

Compare Reviews

Create comparison/mapping.md.

Human finding statuses:

matched: skill found the same issue with the same or stronger correction.
partial: skill found the area but missed an important surface, reason, or correction.
missed: skill did not find it.
not-reviewable: human finding depends on private intent or unavailable context.
out-of-scope: outside the requested review boundary.

Skill-only statuses:

valid-extra: valid issue not present in human review.
weak-extra: plausible but lower-confidence or lower-value issue.
invalid-extra: incorrect, contradicted by code, or outside scope.
duplicate: same issue as another skill finding.

Use compact tables:

| Human ID | Skill ID | Status | Category | Notes |
|---|---|---|---|---|
| H01 | S03 | matched | persisted-state | Same field and same deletion correction. |
| H02 | S07 | partial | wrapper | Found URL rewrite but missed native-owner failure boundary. |
| H03 | - | missed | tests | No tests batch reviewed this failure path. |

| Skill ID | Status | Category | Notes |
|---|---|---|---|
| S12 | valid-extra | package | Reproduced packed install failure. |
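Before scoring, it can help to confirm that the mapping leaves no finding unclassified. The check below is a sketch under assumed data shapes: the function name `check_mapping` and the two dictionaries are hypothetical stand-ins for the rows of the two tables.

```python
HUMAN_STATUSES = {"matched", "partial", "missed", "not-reviewable", "out-of-scope"}
SKILL_ONLY_STATUSES = {"valid-extra", "weak-extra", "invalid-extra", "duplicate"}


def check_mapping(human_ids, skill_ids, human_rows, skill_only_rows):
    """Return a list of problems; an empty list means every finding is classified.

    human_rows:      {human_id: (skill_id or None, status)}
    skill_only_rows: {skill_id: status}
    """
    problems = []
    for hid in human_ids:
        if hid not in human_rows:
            problems.append(f"{hid}: unclassified")
        elif human_rows[hid][1] not in HUMAN_STATUSES:
            problems.append(f"{hid}: unknown status {human_rows[hid][1]}")

    # Skill findings already tied to a human finding need no skill-only status.
    mapped = {sid for sid, _ in human_rows.values() if sid}
    for sid in skill_ids:
        if sid in mapped:
            continue
        if sid not in skill_only_rows:
            problems.append(f"{sid}: unclassified")
        elif skill_only_rows[sid] not in SKILL_ONLY_STATUSES:
            problems.append(f"{sid}: unknown status {skill_only_rows[sid]}")
    return problems
```

For the example rows above, H01 to H03 and S03, S07, and S12 all carry a status, so the check would return an empty list.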

Scoring

Score useful review replacement value, not raw finding count.

human atomic recall: (matched + 0.5 * partial) / reviewable human findings
precision: (valid skill findings + 0.5 * weak extras) / all skill findings
granularity: final findings preserve atomic batch findings
batch discipline: associated risk-ranked batches match the diff risk
actionability: findings name concrete surface, evidence, why, and smallest correction
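The two ratio metrics can be worked through in code. This sketch makes two assumptions that the text does not spell out: reviewable human findings are matched + partial + missed (not-reviewable and out-of-scope are excluded), and valid skill findings are those mapped to a human finding plus valid extras. The source does not define how the five components combine into score_100, so the sketch stops at the ratios.

```python
def human_atomic_recall(matched: int, partial: int, missed: int) -> float:
    """(matched + 0.5 * partial) / reviewable human findings."""
    # Assumption: reviewable = matched + partial + missed.
    reviewable = matched + partial + missed
    if reviewable == 0:
        raise ValueError("no reviewable human findings")
    return (matched + 0.5 * partial) / reviewable


def precision(mapped: int, valid_extra: int, weak_extra: int,
              invalid_extra: int, duplicates: int) -> float:
    """Valid skill findings over all skill findings, weak extras at 0.5."""
    # Assumption: `mapped` counts skill findings tied to a matched or partial
    # human finding; invalid extras and duplicates contribute nothing.
    total = mapped + valid_extra + weak_extra + invalid_extra + duplicates
    if total == 0:
        raise ValueError("no skill findings")
    return (mapped + valid_extra + 0.5 * weak_extra) / total
```

With 6 matched, 2 partial, and 2 missed findings, recall is (6 + 1) / 10 = 0.7. With 8 mapped, 2 valid-extra, 2 weak-extra, 1 invalid-extra, and 1 duplicate, precision is 11 / 14, roughly 0.786.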

Use this summary:

human_reviewable_findings:
matched:
partial:
missed:
not_reviewable:
out_of_scope:
skill_final_findings:
valid_extra:
weak_extra:
invalid_extra:
duplicates:
human_atomic_recall:
precision:
granularity:
batch_discipline:
actionability:
score_100:

Rough guide: 90+ close substitute on this PR/MR; 80-89 useful with a human checker; 70-79 mechanism works but misses too much; <70 needs improvement before trust.
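The rough guide maps directly onto a lookup. In this sketch the function name `trust_band` is hypothetical, and fractional scores such as 89.5 are assumed to fall into the lower band.

```python
def trust_band(score_100: float) -> str:
    """Translate score_100 into the rough-guide wording."""
    if score_100 >= 90:
        return "close substitute on this PR/MR"
    if score_100 >= 80:
        return "useful with a human checker"
    if score_100 >= 70:
        return "mechanism works but misses too much"
    return "needs improvement before trust"
```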

Diagnose Misses

For every missed or partial human finding, classify the failure:

collection miss
scope miss
batch miss
subagent miss
synthesis miss
standard miss
guide miss
tooling miss

Do not edit SKILL.md for collection, scope, or tooling misses unless the skill itself caused them.
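The rule above can be expressed as a small guard. The names `FAILURE_TYPES`, `EXTERNAL_BY_DEFAULT`, and `skill_edit_allowed` are illustrative, and a `True` result means only that this rule does not exclude a SKILL.md edit; the placement guidance still decides whether one is warranted.

```python
# Failure types from the list above, without the trailing "miss".
FAILURE_TYPES = {
    "collection", "scope", "batch", "subagent",
    "synthesis", "standard", "guide", "tooling",
}

# Misses that point away from SKILL.md unless the skill itself caused them.
EXTERNAL_BY_DEFAULT = {"collection", "scope", "tooling"}


def skill_edit_allowed(failure_type: str, caused_by_skill: bool = False) -> bool:
    """Return False when the rule above rules out editing SKILL.md."""
    if failure_type not in FAILURE_TYPES:
        raise ValueError(f"unknown failure type: {failure_type}")
    if failure_type in EXTERNAL_BY_DEFAULT:
        return caused_by_skill
    return True
```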

Recommend Placement

Use the smallest durable home:

code-standards-gate: reusable cross-project review standards, batching, output granularity, or subagent guidance.
Project rules: repo-specific recurring expectations.
Tooling/checks: deterministic formatting, naming, schema, or test requirements.
Keep local: one-off findings, weak signals, or implementation-specific lessons.

Do not treat every human comment as a reason to edit code-standards-gate.

Write comparison/recommendations.md with Keep, Change, Do Not Add, Placement, and Next Eval.

Iteration

Iterate only when the user asks to optimize. Snapshot current files under iterations/iteration-XX/before/, apply the smallest generalizable edit, sync the installed skill if the runner loads ~/.agents/skills, rerun the same evaluation, and compare against the previous iteration.
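The snapshot step could look like the sketch below. It assumes that XX is a zero-padded iteration number and that the whole skill directory is copied; the function name `snapshot_before` and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def snapshot_before(skill_dir: Path, iterations_dir: Path, iteration: int) -> Path:
    """Copy the current skill files to iterations/iteration-XX/before/."""
    target = iterations_dir / f"iteration-{iteration:02d}" / "before"
    if target.exists():
        # Refuse to overwrite: the snapshot is the baseline for comparison.
        raise FileExistsError(f"snapshot already exists: {target}")
    # copytree creates the missing parent directories itself.
    shutil.copytree(skill_dir, target)
    return target
```

Refusing to overwrite an existing snapshot keeps the baseline for each iteration stable, so the rerun is always compared against the files as they were before the edit.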

Stop when the target score is reached or the next change would be sample-specific.

Output

Minimum artifacts:

human-review/human-findings.md
skill-run/skill-findings.md
comparison/mapping.md
comparison/score.md
comparison/recommendations.md
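A quick completeness check over the minimum artifacts might look like this. The helper name `missing_artifacts` and the `case_dir` argument are assumptions; the relative paths are the ones listed above.

```python
from pathlib import Path

# Relative paths copied from the minimum artifact list.
MINIMUM_ARTIFACTS = (
    "human-review/human-findings.md",
    "skill-run/skill-findings.md",
    "comparison/mapping.md",
    "comparison/score.md",
    "comparison/recommendations.md",
)


def missing_artifacts(case_dir: Path) -> list[str]:
    """Return the minimum artifacts that do not yet exist under case_dir."""
    return [rel for rel in MINIMUM_ARTIFACTS if not (case_dir / rel).is_file()]
```

An empty result means the file set is complete; anything returned is either still to be written or should be reported as a blocker in the final response.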

Final response should state the evaluation target, score, important misses, recommended durable changes, validation performed, and any blockers.

Stop Rules

Stop when the raw evidence, normalized findings, comparable skill run, mapping, score, miss diagnosis, and placement recommendations are complete or blocked with the blocker named.

Do not promote sample-specific rules. Do not continue searching, scoring, or iterating after the core comparison can answer the user's request.


Adoption

plimeor/meta-code-standards-calibration

$ install --global
