Result-to-Claim Gate

Codex assurance: the deterministic check that cited evidence exists can be recorded as accepted, while the base semantic claim judgment records review_independence: same-family and acceptance_status: provisional. Cross-family overlays may record accepted; a reviewer failure emits BLOCKED.

Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a secondary Codex judgment, then auto-route based on the verdict.

Context: $ARGUMENTS

When to Use

- After a set of experiments completes (main results, not just sanity checks)
- Before committing to claims in a paper or review response
- When results are ambiguous and you need an objective second opinion

Workflow

Step 1: Collect Results

Gather experiment data from whatever sources are available in the project:

- W&B (preferred): wandb.Api().run("<entity>/<project>/<run_id>").history() for metrics, training curves, and comparisons
- EXPERIMENT_LOG.md: full results table with baselines and verdicts
- EXPERIMENT_TRACKER.md: check which experiments are DONE vs still running
- Log files: ssh server "tail -100 /path/to/training.log" if no other source
- idea-stage/docs/research_contract.md (legacy fallback: docs/research_contract.md): intended claims and experiment design

Assemble the key information:

- What experiments were run (method, dataset, config)
- Main metrics and baseline comparisons (deltas)
- The intended claim these experiments were designed to test
- Any known confounds or caveats
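
The checklist above can be captured in one small structure before the reviewer call. A minimal sketch, assuming a hypothetical ResultSummary container; the field names and the sample numbers are placeholders for illustration, not project results:

```python
from dataclasses import dataclass, field


@dataclass
class ResultSummary:
    """Hypothetical container for the key information listed above."""
    intended_claim: str
    experiments: list[str]
    metrics: dict[str, float]      # our method, keyed by metric name
    baselines: dict[str, float]    # baseline values for the same metrics
    caveats: list[str] = field(default_factory=list)

    def deltas(self) -> dict[str, float]:
        """Method minus baseline, only for metrics that have a baseline."""
        return {
            name: round(value - self.baselines[name], 6)
            for name, value in self.metrics.items()
            if name in self.baselines
        }


# Placeholder values only; replace with numbers read from W&B or EXPERIMENT_LOG.md.
summary = ResultSummary(
    intended_claim="[the claim these experiments test]",
    experiments=["[method, dataset, config]"],
    metrics={"accuracy": 0.75, "f1": 0.5},
    baselines={"accuracy": 0.5},
)
print(summary.deltas())
```

Metrics without a baseline are left out of the deltas on purpose, so a missing comparison shows up as a gap instead of a silent zero.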

Step 1.5: Deterministic evidence pre-check

Before the reviewer call, resolve and run evidence_check.py per evidence-precheck.md:

if [ -z "${ARIS_REPO:-}" ] && [ -f .aris/installed-skills-codex.txt ]; then
  ARIS_REPO=$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null) || true
fi
EVIDENCE_CHECK=""
[ -n "${ARIS_REPO:-}" ] && [ -f "$ARIS_REPO/tools/evidence_check.py" ] && EVIDENCE_CHECK="$ARIS_REPO/tools/evidence_check.py"
[ -z "$EVIDENCE_CHECK" ] && [ -f tools/evidence_check.py ] && EVIDENCE_CHECK="tools/evidence_check.py"
mkdir -p .aris
if [ -n "$EVIDENCE_CHECK" ]; then
  python3 "$EVIDENCE_CHECK" . --batch .aris/claims.json \
    > .aris/evidence_precheck.json 2>.aris/evidence_precheck.err || true
else
  echo "WARN: evidence_check.py unresolved; semantic review will still run" >&2
fi

Treat path_missing and value_not_found as unsupported evidence before the semantic review. verified means only that the cited value exists; it does not prove the claim. Pass the pre-check JSON path to the fresh reviewer. The Codex reviewer's positive result remains review_independence: same-family and acceptance_status: provisional; a deterministic evidence check never upgrades a semantic claim to accepted by itself.
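
The triage rule above can be sketched as a small filter over the pre-check output. This assumes each record is a dict with claim_id and status keys; that shape is an assumption for illustration, and the real output of evidence_check.py may differ (see evidence-precheck.md):

```python
import json

# Statuses named in this step that count as unsupported evidence.
UNSUPPORTED = {"path_missing", "value_not_found"}


def triage_precheck(records):
    """Split pre-check records into unsupported evidence and existing values.

    ASSUMPTION: records is a list of {"claim_id": ..., "status": ...} dicts.
    """
    unsupported = [r["claim_id"] for r in records if r["status"] in UNSUPPORTED]
    # "verified" only means the cited value exists; it does not prove the claim.
    value_exists = [r["claim_id"] for r in records if r["status"] == "verified"]
    return {"unsupported": unsupported, "value_exists": value_exists}


# Hypothetical sample input, not real pre-check output.
sample = json.loads("""[
  {"claim_id": "c1", "status": "verified"},
  {"claim_id": "c2", "status": "path_missing"},
  {"claim_id": "c3", "status": "value_not_found"}
]""")
result = triage_precheck(sample)
print(result)
```

Claims in value_exists still go through the semantic review; the filter only removes evidence that cannot be found.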

Step 2: Codex Judgment

Send the collected results to a secondary Codex agent for objective evaluation:

spawn_agent:
  model: gpt-5.6-sol
  reasoning_effort: ultra
  message: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources — reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.

Step 3: Parse and Normalize

Extract structured fields from the secondary Codex response:

- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
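
A minimal parsing sketch for the fields above, assuming the reviewer answers with one "name: value" pair per line (optionally numbered or bulleted). The function name parse_verdict is hypothetical, and multi-line values would need a more tolerant parser:

```python
import re

FIELDS = (
    "claim_supported", "what_results_support", "what_results_dont_support",
    "missing_evidence", "suggested_claim_revision",
    "next_experiments_needed", "confidence",
)
ALLOWED = {
    "claim_supported": {"yes", "partial", "no"},
    "confidence": {"high", "medium", "low"},
}


def parse_verdict(text):
    """Extract the seven fields; reject off-vocabulary enumerated values."""
    parsed = {}
    for name in FIELDS:
        # Accept optional numbering or bullets, e.g. "1. claim_supported: yes".
        match = re.search(
            rf"^\s*(?:\d+\.|-)?\s*{name}\s*:\s*(.+)$",
            text,
            re.MULTILINE | re.IGNORECASE,
        )
        value = match.group(1).strip().strip('"') if match else ""
        if name in ALLOWED:
            value = value.lower()
            if value not in ALLOWED[name]:
                raise ValueError(
                    f"{name} must be one of {sorted(ALLOWED[name])}, got {value!r}"
                )
        parsed[name] = value
    return parsed


reply = (
    "1. claim_supported: Partial\n"
    "2. what_results_support: gains on one dataset\n"
    "7. confidence: medium\n"
)
verdict = parse_verdict(reply)
print(verdict["claim_supported"], verdict["confidence"])
```

Raising on an unknown claim_supported or confidence value keeps a malformed reply from being routed as if it were a verdict.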

Step 3.5: Check Experiment Integrity (if audit exists)

If EXPERIMENT_AUDIT.json does not exist, there is no audit to read: set integrity_status to unavailable, label the verdict provisional, and continue.

if EXPERIMENT_AUDIT.json exists:
    read integrity_status from file
    attach to verdict output:
        integrity_status: pass | warn | fail

    if integrity_status == "fail":
        append to verdict: "[INTEGRITY CONCERN] — audit found issues, see EXPERIMENT_AUDIT.md"
        downgrade confidence to "low" regardless of Codex judgment

    if integrity_status == "warn":
        append to verdict: "[INTEGRITY: WARN] — audit flagged potential issues"
else:
    integrity_status = "unavailable"
    verdict is labeled "provisional — no integrity audit run"
    (this does NOT block anything — pipeline continues normally)

See shared-references/experiment-integrity.md for the full integrity protocol.
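
The integrity logic above can be sketched in Python as follows. This assumes integrity_status is a top-level key in EXPERIMENT_AUDIT.json; the function names and the integrity_notes field are hypothetical, and the note strings are abbreviated versions of the labels above:

```python
import json
from pathlib import Path


def load_audit_status(path="EXPERIMENT_AUDIT.json"):
    """Return integrity_status from the audit file, or None if no audit exists.

    ASSUMPTION: integrity_status is a top-level key in the JSON file.
    """
    audit = Path(path)
    if not audit.exists():
        return None
    return json.loads(audit.read_text())["integrity_status"]


def apply_integrity(verdict, status):
    """Return a copy of the parsed verdict with integrity information attached."""
    out = dict(verdict)
    notes = []
    if status is None:
        out["integrity_status"] = "unavailable"
        # Labelled provisional, but this does not block the pipeline.
        notes.append("provisional: no integrity audit run")
    else:
        out["integrity_status"] = status
        if status == "fail":
            notes.append("[INTEGRITY CONCERN] audit found issues, see EXPERIMENT_AUDIT.md")
            out["confidence"] = "low"  # overrides the reviewer's confidence
        elif status == "warn":
            notes.append("[INTEGRITY: WARN] audit flagged potential issues")
    out["integrity_notes"] = notes
    return out


checked = apply_integrity(
    {"claim_supported": "yes", "confidence": "high"}, load_audit_status()
)
print(checked["integrity_status"])
```

Only a fail status changes the reviewer's confidence; warn and unavailable add a label and leave the judgment as reported.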

Step 4: Route Based on Verdict

`no` — Claim not supported

- Record postmortem in findings.md (Research Findings section):
  - What was tested, what failed, hypotheses for why
  - Constraints for future attempts (what NOT to try again)
- Update the project pipeline status in AGENTS.md or project notes
- Decide whether to pivot to next idea from IDEA_CANDIDATES.md or try an alternative approach

`partial` — Claim partially supported

- Update the working claim to reflect what IS supported
- Record the gap in findings.md
- Design and run supplementary experiments to fill evidence gaps
- Re-run result-to-claim after supplementary experiments complete
- Multiple rounds of partial on the same claim → record analysis in findings.md, consider whether to narrow the claim scope or switch ideas

`yes` — Claim supported

- Record confirmed claim in project notes
- If ablation studies are incomplete → trigger /ablation-planner
- If all evidence is in → ready for paper writing
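
The three branches can be summarized as a lookup table. A minimal sketch: the action strings paraphrase the lists above, the route function is hypothetical, and letting low confidence override the branch is one reading of the low-confidence rule in Rules, not a confirmed behavior:

```python
ROUTES = {
    "no": [
        "record postmortem in findings.md",
        "update pipeline status in AGENTS.md or project notes",
        "pivot to next idea in IDEA_CANDIDATES.md or try an alternative approach",
    ],
    "partial": [
        "update the working claim to what is supported",
        "record the gap in findings.md",
        "design and run supplementary experiments",
        "re-run result-to-claim",
    ],
    "yes": [
        "record confirmed claim in project notes",
        "trigger /ablation-planner if ablations are incomplete",
        "proceed to paper writing if all evidence is in",
    ],
}


def route(claim_supported, confidence):
    """Return the follow-up actions for a parsed verdict."""
    if claim_supported not in ROUTES:
        raise ValueError(f"unknown verdict: {claim_supported!r}")
    if confidence == "low":
        # ASSUMPTION: a low-confidence judgment is inconclusive for every branch.
        return [
            "record verdict and reasoning in findings.md",
            "add experiments before committing to a claim",
        ]
    return list(ROUTES[claim_supported])


print(route("partial", "high"))
```

An unknown verdict raises instead of falling through to a default branch, so a parsing error cannot be routed as "no".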

Step 5: Update Research Wiki (if active)

Skip this step entirely if research-wiki/ does not exist.

if research-wiki/ exists:
    # Resolve the helper (Codex chain). If unavailable, skip wiki writes; still report verdict.
    ARIS_REPO="${ARIS_REPO:-$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null)}"
    WIKI_SCRIPT=""
    [ -n "$ARIS_REPO" ] && [ -f "$ARIS_REPO/tools/research_wiki.py" ] && WIKI_SCRIPT="$ARIS_REPO/tools/research_wiki.py"
    [ -z "$WIKI_SCRIPT" ] && [ -f tools/research_wiki.py ] && WIKI_SCRIPT="tools/research_wiki.py"
    [ -z "$WIKI_SCRIPT" ] && [ -f ~/.codex/skills/research-wiki/research_wiki.py ] && WIKI_SCRIPT="$HOME/.codex/skills/research-wiki/research_wiki.py"
    [ -n "$WIKI_SCRIPT" ] || echo "WARN: research_wiki.py unreachable; skipping wiki writes (verdict still reported)." >&2

    # 1. Create/refresh the experiment node FIRST. This step owns the verdict, so pass
    #    --update-on-exist to let a re-judge overwrite the stale verdict. The supports/invalidates
    #    edges in #2 point FROM exp:<id>, and add_edge does NOT verify node existence, so add them
    #    only if the experiment node was created (EXP_NODE_OK); otherwise skip the wiki edges.
    EXP_NODE_OK=0
    [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" add_experiment research-wiki/ \
         --slug "<exp_id>" --idea "idea:<active_idea>" \
         --verdict "<yes|partial|no>" --confidence "<high|medium|low>" \
         --date "<date>" --hardware "<hw>" --duration "<dur>" \
         --metrics "<key metrics>" --reasoning "<one-line why this verdict>" \
         --provenance "<EXPERIMENT_AUDIT.md / run dir>" --update-on-exist && EXP_NODE_OK=1

    # 2. Record empirical support as EDGES ONLY, and ONLY if EXP_NODE_OK. NEVER edit a
    #    claim page's `status`: that is the PROOF axis (verified / refuted / unproven /
    #    sound-modulo-imports / drafted / retracted), owned by /proof-checker (the claim
    #    birth point) — the ARIS helper REJECTS "supported"/"partial"/"invalidated".
    if EXP_NODE_OK == 1:
        for each claim resolved by this verdict:
            if verdict == "yes":
                python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "<metric>"
            elif verdict == "partial":
                python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type supports --evidence "partial: <metric>"
            else:
                python3 "$WIKI_SCRIPT" add_edge research-wiki/ --from "exp:<id>" --to "claim:<cid>" --type invalidates --evidence "<why>"

    # 3. Update idea outcome (raw markdown, helper-free — preserves the rich idea body)
    Update research-wiki/ideas/<idea_id>.md:
      - outcome: positive | mixed | negative
      - If negative: fill "Failure / Risk Notes" and "Lessons Learned"
      - If positive: fill "Actual Outcome" and "Reusable Components"

    # 4. Rebuild + log (reflect the new edges; only if WIKI_SCRIPT resolved)
    [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" rebuild_query_pack research-wiki/
    [ -n "$WIKI_SCRIPT" ] && python3 "$WIKI_SCRIPT" log research-wiki/ "result-to-claim: exp:<id> verdict=<verdict> for idea:<idea_id>"

    # 5. Re-ideation suggestion
    Count failed/partial ideas since last /idea-creator run.
    If >= 3: print "💡 3+ ideas tested since last ideation. Consider re-running /idea-creator — the wiki now knows what doesn't work."
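
The edge selection in item 2 can be sketched as a pure function that builds the add_edge arguments. The flags are the ones shown above; the function name wiki_edge_args is hypothetical, and rejecting unknown verdicts (instead of treating them as invalidates) is a choice made for this sketch:

```python
def wiki_edge_args(verdict, exp_id, claim_id, evidence):
    """Build the add_edge argument list for one claim resolved by a verdict."""
    if verdict not in {"yes", "partial", "no"}:
        raise ValueError(f"unknown verdict: {verdict!r}")
    # "yes" and "partial" both record a supports edge; only "no" invalidates.
    edge_type = "invalidates" if verdict == "no" else "supports"
    if verdict == "partial":
        evidence = f"partial: {evidence}"
    return [
        "add_edge", "research-wiki/",
        "--from", f"exp:{exp_id}",
        "--to", f"claim:{claim_id}",
        "--type", edge_type,
        "--evidence", evidence,
    ]


# Placeholder identifiers for illustration.
args = wiki_edge_args("partial", "exp_id", "cid", "metric")
print(args)
```

The list can be passed to the helper with subprocess.run(["python3", wiki_script, *args], check=True) once WIKI_SCRIPT has resolved and EXP_NODE_OK is 1. The claim page's status field is never touched.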

Rules

- The secondary Codex agent is the judge, not the local executor. The local executor collects evidence and routes; the reviewer agent evaluates. This prevents post-hoc rationalization.
- Do not inflate claims beyond what the data supports. If Codex says "partial", do not round up to "yes".
- A single positive result on one dataset does not support a general claim. Be honest about scope.
- If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.
- Fail closed if the reviewer is unavailable. Follow the capability fallback in reviewer-routing.md (gpt-5.6-sol + ultra → gpt-5.6-sol + xhigh → gpt-5.5 + xhigh), and never downgrade on timeout, rate-limit, auth, transport, server, or context errors. If no allowed pair succeeds, write a traced BLOCKED review record with the unavailable route and evidence paths, write CLAIMS_FROM_RESULTS.md containing only verdict: REVIEW_UNAVAILABLE, record the same in findings.md, and stop. Do not emit a local PASS/WARN substitute or advance a submission-facing claim; only an explicitly non-submission evidence-gathering phase may continue.
- Always record the verdict and reasoning in findings.md, regardless of outcome.
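
The fail-closed rule can be sketched as a loop over the allowed pairs. The two exception classes and the call_reviewer callable are hypothetical stand-ins, and blocking immediately on a transient error (rather than retrying the same pair first) is an assumption; reviewer-routing.md is the authority:

```python
class CapabilityUnavailable(Exception):
    """Hypothetical: the requested model/effort pair is not offered."""


class TransientFailure(Exception):
    """Hypothetical: timeout, rate-limit, auth, transport, server, or context error."""


FALLBACK_CHAIN = [
    ("gpt-5.6-sol", "ultra"),
    ("gpt-5.6-sol", "xhigh"),
    ("gpt-5.5", "xhigh"),
]


def run_review(call_reviewer, prompt):
    """Try each allowed pair; return a BLOCKED record if none succeeds."""
    tried = []
    for model, effort in FALLBACK_CHAIN:
        tried.append((model, effort))
        try:
            response = call_reviewer(model, effort, prompt)
            return {"status": "OK", "route": (model, effort), "response": response}
        except CapabilityUnavailable:
            continue  # only a capability gap moves down the chain
        except TransientFailure as err:
            # Never downgrade on these errors: fail closed instead.
            return {"status": "BLOCKED", "verdict": "REVIEW_UNAVAILABLE",
                    "tried": tried, "reason": str(err)}
    return {"status": "BLOCKED", "verdict": "REVIEW_UNAVAILABLE",
            "tried": tried, "reason": "no allowed pair available"}


def fake_reviewer(model, effort, prompt):
    """Stand-in reviewer for the demo: pretends the ultra effort is not offered."""
    if effort == "ultra":
        raise CapabilityUnavailable("ultra not offered")
    return "claim_supported: partial"


outcome = run_review(fake_reviewer, "RESULT-TO-CLAIM EVALUATION ...")
print(outcome["status"], outcome["route"])
```

A BLOCKED record carries the routes that were tried, which is what the traced review record and findings.md entry need.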

Review Tracing

After the secondary Codex judgment, save a trace following ../shared-references/review-tracing.md. Write files directly to .aris/traces/result-to-claim/<date>_run<NN>/ and include the prompt, raw reviewer response, parsed verdict, routing action, and whether the result is [pending external review]. Respect the --- trace: parameter when present (default: full).
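
A minimal sketch of the trace write, assuming an ISO date, a zero-padded two-digit run counter, and hypothetical file names (prompt.txt, response_raw.txt, verdict.json); review-tracing.md defines the actual layout:

```python
import json
import re
import tempfile
from datetime import date
from pathlib import Path


def next_trace_dir(root, today=None):
    """Return root/<date>_run<NN>, with NN one past the highest run for that date."""
    root = Path(root)
    day = (today or date.today()).isoformat()
    pattern = re.compile(rf"^{re.escape(day)}_run(\d+)$")
    runs = []
    if root.is_dir():
        for entry in root.iterdir():
            match = pattern.match(entry.name)
            if match:
                runs.append(int(match.group(1)))
    return root / f"{day}_run{max(runs, default=0) + 1:02d}"


def write_trace(root, prompt, raw_response, parsed, routing_action, pending, today=None):
    """Write one trace directory; the file names here are hypothetical."""
    trace_dir = next_trace_dir(root, today)
    trace_dir.mkdir(parents=True)
    (trace_dir / "prompt.txt").write_text(prompt)
    (trace_dir / "response_raw.txt").write_text(raw_response)
    (trace_dir / "verdict.json").write_text(json.dumps({
        "parsed_verdict": parsed,
        "routing_action": routing_action,
        "pending_external_review": pending,
    }, indent=2))
    return trace_dir


# Demo in a temporary directory so nothing is written into a real project.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".aris" / "traces" / "result-to-claim"
    day = date(2025, 1, 2)
    first = write_trace(root, "prompt", "raw", {"claim_supported": "partial"}, "partial", True, day)
    second = write_trace(root, "prompt", "raw", {"claim_supported": "yes"}, "yes", True, day)
    names = [first.name, second.name]
print(names)
```

Scanning for the highest existing run number keeps a re-judge on the same day from overwriting an earlier trace.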
