src/autoskillit/skills_extended/resolve-claims-review/SKILL.md
Fetch claim findings from audit-claims, run citation-aware intent validation (ACCEPT/REJECT/DISCUSS), apply targeted citation fixes, escalate findings requiring experiment reruns, and post inline replies.
npx skillsauth add talont-org/autoskillit resolve-claims-reviewInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Apply changes_requested claim findings from a research PR to the research
worktree. Reads open review threads from audit-claims, runs citation-aware
intent validation, applies targeted citation fixes by fix-strategy taxonomy,
escalates unrerunnable findings, resolves addressed threads, and posts inline
replies so a follow-up push closes the review cycle.
/autoskillit:resolve-claims-review {worktree_path} {base_branch}
Called by the research recipe when audit_claims routes changes_requested via
route_claims_resolve. Bounded by retries: 2 — on exhaustion routes to
merge_escalations.
NEVER:
re_push_research step handles push{{AUTOSKILLIT_TEMP}}/resolve-claims-review/run_in_background: true is prohibited)ALWAYS:
gh is unavailable or no PR is foundThe skill reads the claims_review namespace from .autoskillit/config.yaml:
claims_review:
validation_command: null # command to validate research artifacts after fixes; null = skip validation
validation_timeout: 120 # seconds before treating validation as a failure
This namespace is not in defaults.yaml (it has no global default value); if absent, the
skill uses the in-code defaults shown above.
Read claims_review.validation_command (default: null) and
claims_review.validation_timeout (default: 120) from .autoskillit/config.yaml.
Parse two positional arguments: worktree_path and base_branch.
Derive feature_branch via:
feature_branch=$(git -C "$worktree_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
if [ -z "$feature_branch" ]; then
echo "Error: could not determine feature_branch from '$worktree_path'" >&2
exit 1
fi
Read config:
import yaml, pathlib
cfg = yaml.safe_load(pathlib.Path(".autoskillit/config.yaml").read_text()) if pathlib.Path(".autoskillit/config.yaml").exists() else {}
cr_cfg = cfg.get("claims_review", {})
validation_command = cr_cfg.get("validation_command", None)
validation_timeout = cr_cfg.get("validation_timeout", 120)
If either positional arg is missing, abort with:
"Usage: /autoskillit:resolve-claims-review <worktree_path> <base_branch>"
PR_LIST_OUTPUT=$(gh pr list --head "$feature_branch" --base "$base_branch" \
--json number,url -q '.[0] | "\(.number) \(.url)"' 2>/dev/null || echo "")
PR_NUMBER=$(echo "$PR_LIST_OUTPUT" | awk '{print $1}')
PR_URL=$(echo "$PR_LIST_OUTPUT" | awk '{print $2}')
# Graceful degradation: .[0] returns null when no PR matches; PR_NUMBER will be empty
if [ -z "$PR_NUMBER" ] || [ "$PR_NUMBER" = "null" ]; then
echo "No PR found or gh unavailable — skipping claims review resolution"
exit 0
fi
Get owner/repo:
gh repo view --json nameWithOwner -q .nameWithOwner
If gh is unavailable or not authenticated, or no PR is found:
Fetch inline comments (anchored to specific file lines):
gh api repos/{owner}/{repo}/pulls/{number}/comments --paginate
Fetch top-level review bodies (summary reviews):
gh api repos/{owner}/{repo}/pulls/{number}/reviews --paginate
Fetch review thread node IDs using cursor-based pagination to handle PRs with more than 100 threads:
gh api graphql \
-f query='query($owner:String!,$repo:String!,$number:Int!,$after:String){repository(owner:$owner,name:$repo){pullRequest(number:$number){reviewThreads(first:100,after:$after){pageInfo{hasNextPage endCursor}nodes{id isResolved comments(first:1){nodes{databaseId}}}}}}}' \
-F owner="$owner" \
-F repo="$repo" \
-F number=$number \
-F after=""
Build comment_id_to_thread_id: dict[int, str] map. Skip threads where isResolved
is already true.
If the GraphQL call fails, log a warning and set comment_id_to_thread_id = {}.
Thread resolution will be silently skipped in Step 6.
Save to:
{{AUTOSKILLIT_TEMP}}/resolve-claims-review/inline_comments_{pr}.json{{AUTOSKILLIT_TEMP}}/resolve-claims-review/reviews_{pr}.json{{AUTOSKILLIT_TEMP}}/resolve-claims-review/threads_{pr}.jsonFrom inline comments, extract per comment:
path — file path relative to repo rootline — the line being commented onbody — the reviewer's messagediff_hunk — surrounding contextid — the comment's REST database IDthread_node_id — look up comment_id_to_thread_id.get(id)Classify each finding by severity (same as resolve-review):
critical — body contains: "must", "critical", "security", "data loss", "wrong",
"broken", "incorrect", "bug", "error", "never"warning — body contains: "should", "consider", "recommend", "prefer", "suggest",
"missing", "lacks"info — body contains: "nit", "optional", "minor", "style", "cosmetic", "could"Include critical and warning only. Skip info findings.
Dimension extraction — comments posted by audit-claims have format
[severity] dimension: message. Extract the dimension label using:
import re
DIMENSION_PATTERN = re.compile(r'^\[(?:critical|warning|info)\]\s+(\S+):\s+')
Apply DIMENSION_PATTERN to each comment body to extract the dimension label.
Dimension group mapping:
| Comment dimension | Group key |
|---|---|
| external | citations |
| methodological | methodology |
| comparative | comparisons |
| (unparseable / no match) | unknown |
Note: experimental findings never appear — they are self-evidencing and generate no
findings in audit-claims.
Save dimension_groups_{pr}.json with findings keyed by group.
Before applying any fix, validate every critical and warning finding against the actual codebase and git history. This analysis phase runs entirely before code changes are made.
Dimension grouping: Group findings by their extracted dimension group key
(citations, methodology, comparisons, unknown).
This is dimension-based grouping, NOT file-path grouping.
Launch one parallel subagent (Task tool, model: "sonnet") per non-empty dimension
group. Each subagent receives:
ACCEPT, REJECT, or DISCUSS with:
verdict — the classificationevidence — specific references (line numbers, section names, citation markers)category (REJECT only): one of claim_is_supported, citation_not_required,
experimental_claim_misclassified, out_of_scope, stale_commentfix_strategy (ACCEPT only): one of add_citation, qualify_claim, remove_claim,
rerun_required (escalated), design_flaw (escalated)escalate: true if fix_strategy is rerun_required or design_flawdimension — the extracted dimension labelREJECT category guidance:
claim_is_supported — evidence for the claim exists in the diff but subagent missed itcitation_not_required — claim is general knowledge not requiring citationexperimental_claim_misclassified — claim is self-evidencing (derived from experiment data)out_of_scope — comment addresses content outside this PRstale_comment — comment refers to text no longer presentfix_strategy guidance (ACCEPT only):
add_citation — add a missing reference in the report documentqualify_claim — add hedging language ("in our experiments", "preliminary evidence suggests")remove_claim — delete an unsupported claim with no path to citationrerun_required — fix requires re-running experiment to generate supporting datadesign_flaw — fundamental scope/methodology issue; cannot fix with citation aloneProtocol deviation rule (rerun_required):
When a claim depends on experimental results, and the experiment plan specifies a
replication count, sample size, or other methodological parameter that the actual
execution did not follow, classify as rerun_required if the deviation materially
undermines the evidence supporting the claim. Qualifying the claim with hedging
language (qualify_claim) does not constitute remediation when the underlying data
is statistically invalid — the claim needs new supporting data, not softer language.
Exception — justified deviations: If the research report provides a substantive
rationale for why the conclusions remain valid despite the methodological difference,
and that rationale withstands scrutiny, then qualify_claim or remove_claim may
be appropriate. The key test is: does this deviation materially undermine the
evidence cited in support of the claim?
Invalid statistics rule (rerun_required):
When a finding identifies confidence intervals, p-values, or significance claims
cited as evidence for a claim, and those statistics are computed from the wrong unit
of analysis (e.g., within-run iterations treated as independent replicates), classify
as rerun_required if the invalid statistics remain in the report. Classification
as remove_claim applies if the claim and its invalid statistical evidence are both
fully removed. Classification as qualify_claim applies only if the statistical
artifacts are fully removed and the claim is restated without statistical backing.
Fallback: If a subagent fails or times out, discard any partial output from that
subagent (partial JSON must not be used — using partial output risks silently demoting
ACCEPT findings to DISCUSS without surfacing the data loss). Classify all comments in the
failed group as DISCUSS and log the failure with the error message, domain group name,
and affected comment IDs.
Merge results into classification_map: dict[comment_id, verdict_entry].
Save classification_map_{pr}.json.
Write analysis report to {{AUTOSKILLIT_TEMP}}/resolve-claims-review/analysis_{pr}_{ts}.md
with banner (BEFORE any code changes):
Analysis complete (BEFORE any code changes)
ACCEPT: N | REJECT: N | DISCUSS: N
(add_citation: N, qualify_claim: N, remove_claim: N, rerun_required: N ESCALATED, design_flaw: N ESCALATED)
Track accept_count, reject_count, discuss_count, and per-strategy counts.
Initialize before processing:
addressed_thread_ids: list[str] = []
escalation_records: list = []
Processing order within ACCEPT findings (critical before warning within each tier):
add_citation — add reference entries or inline citationsqualify_claim — add qualifying languageremove_claim — delete unsupported assertionsFor each ACCEPT finding, route by fix_strategy:
rerun_required or design_flaw → ESCALATE:
escalation_records with full finding details and strategy fieldaddressed_thread_idsadd_citation / qualify_claim / remove_claim → apply edit → commit:
git -C "$worktree_path" add {file}
# If pre-commit hooks exist:
pre-commit run --files {file} && git -C "$worktree_path" add {file}
git -C "$worktree_path" commit -m "fix(claims-review): {description} [{dimension}]"
Append thread_node_id to addressed_thread_ids (if not None).
Classification gate — REJECT/DISCUSS bypass:
addressed_thread_idsif validation_command is None:
# Skip validation step entirely — null validation_command means skip
validation_status = "SKIPPED"
else:
# Run with retry logic (max 3 iterations)
# Timeout expiration is treated as failure and counts toward the iteration limit.
for iteration in range(1, 4):
result = run(validation_command, timeout=validation_timeout)
# returncode non-zero OR timeout (TimeoutExpired) both count as failure.
if result.returncode == 0:
validation_status = "PASS"
break
if iteration >= 3:
validation_status = "FAIL"
# Report failure, leave working directory intact, exit non-zero
exit(1)
# Analyze failures, revert/adjust problematic commit, retry
When validation_command is null, skip validation — do not run any command.
When configured, enforce max 3 iteration retry loop before exiting non-zero.
For each thread_id in addressed_thread_ids:
gh api graphql \
-f query='mutation($threadId:ID!){resolveReviewThread(input:{threadId:$threadId}){thread{isResolved}}}' \
-f threadId="$thread_id"
isResolved: true): increment resolved_count"Warning: could not resolve thread {thread_id}: {error}",
continue. Do not modify exit code.Track resolved_count and resolve_failed_count. Thread resolution failure must never
cause exit non-zero.
For every analyzed comment (critical + warning), post one reply using the reply API.
Reply templates:
ACCEPT (applied): "Addressed in {sha}: {evidence}"
ACCEPT (skipped): "Investigated but could not apply: {reason}"
REJECT: "Investigated — {category}. {evidence}"
DISCUSS: "Valid observation — flagged for human judgment. {evidence}"
ESCALATION (rerun): "[ESCALATION] Requires re-running experiment. {message}"
ESCALATION (design): "[ESCALATION] Fundamental design issue. {message}"
API endpoint:
gh api repos/{owner}/{repo}/pulls/{pr_number}/comments/{comment_id}/replies \
--method POST --field body="..."
Track reply_posted_count and reply_failed_count. Best-effort — failure to post
any reply must not affect exit code.
Save all REJECT-classified comments to:
{{AUTOSKILLIT_TEMP}}/resolve-claims-review/reject_patterns_{pr}_{ts}.json
Schema:
{
"comment_id": 123,
"path": "research/report.md",
"line": 42,
"body": "...",
"evidence": "...",
"category": "citation_not_required",
"dimension": "external",
"pr_number": 99,
"feature_branch": "..."
}
Save escalation records to:
{{AUTOSKILLIT_TEMP}}/resolve-claims-review/escalation_records_{pr}.json
Each escalation record must include a "strategy" field set to the fix_strategy value
(either "rerun_required" or "design_flaw"). This field is read by the structured
output determination logic.
Print structured summary:
resolve-claims-review complete
PR: #{pr_number} ({feature_branch} → {base_branch})
Findings fetched: {total}
- critical: {n}, warning: {n}, info: {n} (skipped)
Intent validation:
- ACCEPT: {n} (add_citation: {n}, qualify_claim: {n}, remove_claim: {n},
rerun_required: {n} ESCALATED, design_flaw: {n} ESCALATED)
- REJECT: {n}
- DISCUSS: {n}
Fixes applied: {n}
Escalations: {n}
Validation: {SKIPPED | PASS | FAIL}
Threads resolved: {n}/{total}
Inline replies: {reply_posted_count} posted / {reply_failed_count} failed
Status: PASS
Save full report to {{AUTOSKILLIT_TEMP}}/resolve-claims-review/report_{pr}_{ts}.md.
Exit 0.
{{AUTOSKILLIT_TEMP}}/resolve-claims-review/
├── inline_comments_{pr}.json
├── reviews_{pr}.json
├── threads_{pr}.json
├── dimension_groups_{pr}.json
├── classification_map_{pr}.json
├── escalation_records_{pr}.json
├── analysis_{pr}_{ts}.md (written BEFORE code changes)
├── reject_patterns_{pr}_{ts}.json
└── report_{pr}_{ts}.md
After completing all thread processing (addressed + escalated), emit structured output tokens:
Verdict decision:
Fixes applied: N where N >= 1): verdict = real_fixverdict = already_greenneeds_rerun = {true|false}
verdict = {real_fix|already_green}
fixes_applied = {N}
needs_rerun = true: At least one finding was classified as rerun_required in
the escalation records. This includes: (a) fixes requiring re-running the experiment
to generate supporting data, (b) protocol deviations where the experiment execution
diverged from the plan in ways that materially undermine the report's claims, or
(c) invalid statistical analyses (e.g., CIs from the wrong unit of analysis) that
remain cited as evidence for claims.needs_rerun = false: No rerun_required escalations exist. May still have
design_flaw escalations (these are informational and do not require re-running).Determination logic: After writing escalation_records_{pr}.json, check whether any
entry has "strategy": "rerun_required". If yes → true. If no entries or all entries
are design_flaw → false.
needs_rerun is mandatory. The recipe captures it as claims_needs_rerun to route via
merge_escalations.
Emit the structured output tokens as the very last lines as your final output:
IMPORTANT: Emit the tokens as literal plain text with no code fences, no markdown formatting. The recipe capture system reads raw stdout.
needs_rerun = {true|false}
verdict = {real_fix|already_green}
fixes_applied = {N}
Summary: {{AUTOSKILLIT_TEMP}}/resolve-claims-review/report_{pr}_{ts}.md (relative to the current working directory)
development
Generate YAML recipes for .autoskillit/recipes/. Use when user says "make script skill", "generate script", "script a workflow", "write a script", "create a script", "new recipe", "write a pipeline", or when loaded by other skills for script formatting.
data-ai
Create Uncertainty Representation visualization planning spec showing error bar definitions, distribution-aware alternatives, and multi-seed variance protocols. Statistical lens answering "How is uncertainty honestly represented?"
data-ai
Create Temporal Dynamics visualization planning spec showing axis scaling (linear vs log), smoothing disclosure, epoch/step alignment, run aggregation (mean + variance bands), early-stopping markers, and wall-clock vs step-count x-axis. Temporal lens answering "Are training dynamics shown clearly and honestly?"
data-ai
Create Narrative Story Arc visualization planning spec showing visual consistency across the report (same color = same model everywhere), logical figure progression, redundant figure detection, and narrative dependency between figures. Narrative lens answering "Do the figures tell a coherent story across the report?"