Auto Review Loop: Autonomous Research Improvement

Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.

Context: $ARGUMENTS

Constants

MAX_ROUNDS = 4
POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
REVIEW_DOC: AUTO_REVIEW.md in project root (cumulative log)
REVIEWER_MODEL = gpt-5.4 — Model used via Codex MCP. Must be an OpenAI model (e.g., gpt-5.4, o3, gpt-4o)
HUMAN_CHECKPOINT = false — When true, pause after each round's review (Phase B) and present the score + weaknesses to the user. Wait for user input before proceeding to Phase C. The user can: approve the suggested fixes, provide custom modification instructions, skip specific fixes, or stop the loop early. When false (default), the loop runs fully autonomously.
COMPACT = false — When true, (1) read EXPERIMENT_LOG.md and findings.md instead of parsing full logs on session recovery, (2) append key findings to findings.md after each round.

💡 Override: /auto-review-loop "topic" — compact: true, human checkpoint: true

State Persistence (Compact Recovery)

Long-running loops may hit the context window limit, triggering automatic compaction. To survive this, persist state to REVIEW_STATE.json after each round:

{
  "round": 2,
  "threadId": "019cd392-...",
  "status": "in_progress",
  "last_score": 5.0,
  "last_verdict": "not ready",
  "pending_experiments": ["screen_name_1"],
  "timestamp": "2026-03-13T21:00:00"
}

Write this file at the end of every Phase E (after documenting the round). Overwrite each time — only the latest state matters.

On completion (positive assessment or max rounds), set "status": "completed" so future invocations don't accidentally resume a finished loop.
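
A minimal sketch of the state write, assuming Python. The helper name `save_review_state` is hypothetical; the field names come from the example above. Writing to a temporary file and renaming it is an added assumption, meant to avoid a half-written state file if the session dies mid-write.

```python
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path

STATE_FILE = "REVIEW_STATE.json"


def save_review_state(project_root, round_num, thread_id, score, verdict,
                      pending_experiments=(), status="in_progress"):
    """Overwrite REVIEW_STATE.json with the latest loop state (hypothetical helper)."""
    state = {
        "round": round_num,
        "threadId": thread_id,
        "status": status,
        "last_score": score,
        "last_verdict": verdict,
        "pending_experiments": list(pending_experiments),
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
    root = Path(project_root)
    # Assumption: write to a temp file in the same directory, then rename it
    # over the target, so a crash never leaves a truncated state file.
    fd, tmp_path = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, root / STATE_FILE)
    return state
```

Calling the same helper with `status="completed"` at termination covers the completion case described above.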

Workflow

Initialization

Check for REVIEW_STATE.json in project root:
- If it does not exist: fresh start (normal case, identical to behavior before this feature existed)
- If it exists AND status is "completed": fresh start (previous loop finished normally)
- If it exists AND status is "in_progress" AND timestamp is older than 24 hours: fresh start (stale state from a killed/abandoned run — delete the file and start over)
- If it exists AND status is "in_progress" AND timestamp is within 24 hours: resume
  - Read the state file to recover round, threadId, last_score, pending_experiments
  - Read AUTO_REVIEW.md to restore full context of prior rounds
  - If pending_experiments is non-empty, check if they have completed (e.g., check screen sessions)
  - Resume from the next round (round = saved round + 1)
  - Log: "Recovered from context compaction. Resuming at Round N."
Read project narrative documents, memory files, and any prior review documents. When COMPACT = true and the compact files exist, read findings.md and EXPERIMENT_LOG.md instead of the full AUTO_REVIEW.md and raw logs to save context window space.
Read recent experiment results (check output directories, logs)
Identify current weaknesses and open TODOs from prior reviews
Initialize round counter = 1 (unless recovered from state file)
Create/update AUTO_REVIEW.md with header and timestamp
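
The fresh-start versus resume decision above can be sketched as follows. The function names `decide_start` and `starting_round` are hypothetical, and the sketch assumes the timestamp is a naive local ISO 8601 string as in the example state file. It treats any status other than "completed" as in progress, which is also an assumption.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

STALE_AFTER = timedelta(hours=24)


def decide_start(state_path, now=None):
    """Return ("fresh", None) or ("resume", state) per the initialization rules."""
    path = Path(state_path)
    now = now or datetime.now()
    if not path.exists():
        return "fresh", None
    state = json.loads(path.read_text(encoding="utf-8"))
    if state.get("status") == "completed":
        return "fresh", None
    saved_at = datetime.fromisoformat(state["timestamp"])
    if now - saved_at > STALE_AFTER:
        path.unlink()  # stale state from a killed or abandoned run
        return "fresh", None
    return "resume", state


def starting_round(decision, state):
    """Resume at saved round + 1; otherwise start at round 1."""
    return state["round"] + 1 if decision == "resume" else 1
```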

Loop (repeat up to MAX_ROUNDS)

Phase A: Review

Send comprehensive context to the external reviewer:

mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N/MAX_ROUNDS of autonomous review loop]

    [Full research context: claims, methods, results, known weaknesses]
    [Changes since last round, if any]

    Please act as a senior ML reviewer (NeurIPS/ICML level).

    1. Score this work 1-10 for a top venue
    2. List remaining critical weaknesses (ranked by severity)
    3. For each weakness, specify the MINIMUM fix (experiment, analysis, or reframing)
    4. State clearly: is this READY for submission? Yes/No/Almost

    Be brutally honest. If the work is ready, say so clearly.

If this is round 2+, use mcp__codex__codex-reply with the saved threadId to maintain conversation context.

Phase B: Parse Assessment

CRITICAL: Save the FULL raw response from the external reviewer verbatim (store in a variable for Phase E). Do NOT discard or summarize — the raw text is the primary record.

Then extract structured fields:

Score (numeric 1-10)
Verdict ("ready" / "almost" / "not ready")
Action items (ranked list of fixes)

STOP CONDITION: If score >= 6 AND the verdict is "ready" or "almost" → stop loop, document final state. Match the whole verdict rather than a substring, since "not ready" also contains "ready".
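
A rough sketch of the extraction and stop check, assuming the reviewer writes the score as "N/10" and uses one of the three verdict phrases somewhere in the reply. `parse_assessment` and `should_stop` are hypothetical names, and real reviewer output may need more robust parsing than this.

```python
import re


def parse_assessment(raw):
    """Extract (score, verdict) from the raw reviewer text; either may be None."""
    score_match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", raw)
    score = float(score_match.group(1)) if score_match else None
    text = raw.lower()
    # Check "not ready" first: it contains "ready" as a substring.
    if "not ready" in text:
        verdict = "not ready"
    elif "almost" in text:
        verdict = "almost"
    elif "ready" in text:
        verdict = "ready"
    else:
        verdict = None
    return score, verdict


def should_stop(score, verdict, round_num, max_rounds=4, threshold=6.0):
    """Stop on a positive assessment or once the round limit is reached."""
    positive = (
        score is not None
        and score >= threshold
        and verdict in ("ready", "almost")
    )
    return positive or round_num >= max_rounds
```

The raw response is still kept verbatim; these fields only drive the loop control.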

Human Checkpoint (if enabled)

Skip this step entirely if HUMAN_CHECKPOINT = false.

When HUMAN_CHECKPOINT = true, present the review results and wait for user input:

📋 Round N/MAX_ROUNDS review complete.

Score: X/10 — [verdict]
Top weaknesses:
1. [weakness 1]
2. [weakness 2]
3. [weakness 3]

Suggested fixes:
1. [fix 1]
2. [fix 2]
3. [fix 3]

Options:
- Reply "go" or "continue" → implement all suggested fixes
- Reply with custom instructions → implement your modifications instead
- Reply "skip 2" → skip fix #2, implement the rest
- Reply "stop" → end the loop, document current state

Wait for the user's response. Parse their input:

Approval ("go", "continue", "ok", "proceed"): proceed to Phase C with all suggested fixes
Custom instructions (any other text): treat as additional/replacement guidance for Phase C. Merge with reviewer suggestions where appropriate
Skip specific fixes ("skip 1,3"): remove those fixes from the action list
Stop ("stop", "enough", "done"): terminate the loop, jump to Termination
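
The reply handling above might look like this in Python. `parse_checkpoint_reply` is a hypothetical helper, and the keyword sets are taken from the list above. Matching is case-insensitive, which is an assumption.

```python
import re

APPROVE = {"go", "continue", "ok", "proceed"}
STOP = {"stop", "enough", "done"}


def parse_checkpoint_reply(reply, fixes):
    """Return (action, fixes_to_apply, custom_instructions)."""
    text = reply.strip()
    lowered = text.lower()
    if lowered in STOP:
        return "stop", [], None
    if lowered in APPROVE:
        return "proceed", list(fixes), None
    # "skip 2" or "skip 1,3": drop the numbered fixes (1-based), keep the rest.
    skip = re.fullmatch(r"skip\s+(\d+(?:\s*,\s*\d+)*)", lowered)
    if skip:
        skipped = {int(n) for n in re.split(r"\s*,\s*", skip.group(1))}
        kept = [f for i, f in enumerate(fixes, start=1) if i not in skipped]
        return "proceed", kept, None
    # Anything else is treated as custom guidance for Phase C.
    return "custom", list(fixes), text
```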

Feishu Notification (if configured)

After parsing the score, check if ~/.claude/feishu.json exists and mode is not "off":

Send a review_scored notification: "Round N: X/10 — [verdict]" with top 3 weaknesses
If the mode is interactive and the verdict is "almost": send as a checkpoint and wait for the user's reply on whether to continue or stop
If config absent or mode off: skip entirely (no-op)
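
A small sketch of the gating logic only, not of the Feishu call itself. It assumes the config is a JSON object with a "mode" key and that interactive mode is spelled "interactive"; the source does not confirm either detail. `feishu_mode` and `notification_kind` are hypothetical names.

```python
import json
from pathlib import Path


def feishu_mode(config_path=None):
    """Return the configured mode, or None when notifications should be skipped."""
    path = Path(config_path) if config_path else Path.home() / ".claude" / "feishu.json"
    if not path.exists():
        return None
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None  # assumption: an unreadable config is treated like an absent one
    mode = config.get("mode", "off")  # assumption: a missing key means "off"
    return None if mode == "off" else mode


def notification_kind(mode, verdict):
    """Pick between a plain review_scored notification and a blocking checkpoint."""
    if mode is None:
        return None  # no-op
    if mode == "interactive" and verdict == "almost":
        return "checkpoint"
    return "review_scored"
```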

Phase C: Implement Fixes (if not stopping)

For each action item (highest priority first):

Code changes: Write/modify experiment scripts, model code, analysis scripts
Run experiments: Deploy to GPU server via SSH + screen/tmux
Analysis: Run evaluation, collect results, update figures/tables
Documentation: Update project notes and review document

Prioritization rules:

Skip fixes requiring excessive compute (flag for manual follow-up)
Skip fixes requiring external data/models not available
Prefer reframing/analysis over new experiments when both address the concern
Always implement metric additions (cheap, high impact)

Phase D: Wait for Results

If experiments were launched:

Monitor remote sessions for completion
Collect results from output files and logs
Training quality check: if W&B is configured, invoke /training-check to verify training was healthy (no NaN, no divergence, no plateau). If W&B is not available, skip silently. Flag any quality issues in the next review round.

Phase E: Document Round

Append to AUTO_REVIEW.md:

## Round N (timestamp)

### Assessment (Summary)
- Score: X/10
- Verdict: [ready/almost/not ready]
- Key criticisms: [bullet list]

### Reviewer Raw Response

<details>
<summary>Click to expand full reviewer response</summary>

[Paste the COMPLETE raw response from the external reviewer here — verbatim, unedited.
This is the authoritative record. Do NOT truncate or paraphrase.]

</details>

### Actions Taken
- [what was implemented/changed]

### Results
- [experiment outcomes, if any]

### Status
- [continuing to round N+1 / stopping]

Write REVIEW_STATE.json with current round, threadId, score, verdict, and any pending experiments.

Append to findings.md (when COMPACT = true): one-line entry per key finding this round:

- [Round N] [positive/negative/unexpected]: [one-sentence finding] (metric: X.XX → Y.YY)
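
One way to produce entries in that shape, as a sketch. `format_finding` and `append_findings` are hypothetical helpers. Keeping the square brackets around the round and the finding type is a reading of the template above, not something the source states outright.

```python
def format_finding(round_num, kind, finding, metric=None, before=None, after=None):
    """Build a one-line findings.md entry in the template's shape."""
    if kind not in {"positive", "negative", "unexpected"}:
        raise ValueError(f"unknown finding type: {kind}")
    line = f"- [Round {round_num}] [{kind}]: {finding}"
    if metric is not None and before is not None and after is not None:
        line += f" ({metric}: {before:.2f} → {after:.2f})"
    return line


def append_findings(path, lines):
    """Append entries to findings.md without rewriting earlier rounds."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
```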

Increment round counter → back to Phase A.

Termination

When loop ends (positive assessment or max rounds):

Update REVIEW_STATE.json with "status": "completed"
Write final summary to AUTO_REVIEW.md
Update project notes with conclusions
Write method/pipeline description to AUTO_REVIEW.md under a ## Method Description section — a concise 1-2 paragraph description of the final method, its architecture, and data flow. This serves as input for /paper-illustration in Workflow 3 (so it can generate architecture diagrams automatically).
Generate claims from results — invoke /result-to-claim to convert experiment results from AUTO_REVIEW.md into structured paper claims. Output: CLAIMS_FROM_RESULTS.md. This bridges Workflow 2 → Workflow 3 so /paper-plan can directly use validated claims instead of extracting them from scratch. If /result-to-claim is not available, skip silently.
If stopped at max rounds without positive assessment:
- List remaining blockers
- Estimate effort needed for each
- Suggest whether to continue manually or pivot
Feishu notification (if configured): Send pipeline_done with final score progression table

Key Rules

Large file handling: If the Write tool fails due to file size, immediately retry using Bash to write in chunks (cat << 'EOF' > file for the first chunk, then cat << 'EOF' >> file to append each later chunk). Do NOT ask the user for permission; just do it silently.
ALWAYS use config: {"model_reasoning_effort": "xhigh"} for maximum reasoning depth
Save threadId from first call, use mcp__codex__codex-reply for subsequent rounds
Anti-hallucination citations: When adding references during fixes, NEVER fabricate BibTeX. Use the same DBLP → CrossRef → [VERIFY] chain as /paper-write: (1) curl -s "https://dblp.org/search/publ/api?q=TITLE&format=json" → get key → curl -s "https://dblp.org/rec/{key}.bib", (2) if not found, curl -sLH "Accept: application/x-bibtex" "https://doi.org/{doi}", (3) if both fail, mark with % [VERIFY]. Do NOT generate BibTeX from memory.
Be honest — include negative results and failed experiments
Do NOT hide weaknesses to game a positive score
Implement fixes BEFORE re-reviewing (don't just promise to fix)
Exhaust before surrendering — before marking any reviewer concern as "cannot address": (1) try at least 2 different solution paths, (2) for experiment issues, adjust hyperparameters or try an alternative baseline, (3) for theory issues, provide a weaker version of the result or an alternative argument, (4) only then concede narrowly and bound the damage. Never give up on the first attempt.
If an experiment is expected to take more than 30 minutes, launch it and continue with other fixes while waiting
Document EVERYTHING — the review log should be self-contained
Update project notes after each round, not just at the end
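
The citation rule above (DBLP, then the DOI resolver, then a [VERIFY] marker) can be sketched in Python with only the standard library. `fetch_bibtex` and `http_get` are hypothetical names. The DBLP response layout used here (result → hits → hit → info → key) is an assumption worth checking against the DBLP API documentation, and the first search hit is not guaranteed to be the intended paper, so the returned entry still needs a title check.

```python
import json
import urllib.parse
import urllib.request


def http_get(url, headers=None):
    """Fetch a URL as text; urllib follows the doi.org redirect by default."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


def fetch_bibtex(title, doi=None, get=http_get):
    """Return (bibtex, source), where source is "dblp", "doi", or "verify"."""
    try:
        query = urllib.parse.urlencode({"q": title, "format": "json"})
        data = json.loads(get(f"https://dblp.org/search/publ/api?{query}"))
        hits = data["result"]["hits"].get("hit", [])
        if hits:
            key = hits[0]["info"]["key"]  # assumed response layout
            return get(f"https://dblp.org/rec/{key}.bib"), "dblp"
    except Exception:
        pass  # fall through to the DOI lookup
    if doi:
        try:
            bib = get(f"https://doi.org/{doi}", {"Accept": "application/x-bibtex"})
            return bib, "doi"
        except Exception:
            pass
    # Never invent an entry: leave a marker for manual verification instead.
    return f"% [VERIFY] no BibTeX found for: {title}", "verify"
```

The `get` parameter exists so the lookup can be exercised offline with a stub fetcher.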

Prompt Template for Round 2+

mcp__codex__codex-reply:
  threadId: [saved from round 1]
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    [Round N update]

    Since your last review, we have:
    1. [Action 1]: [result]
    2. [Action 2]: [result]
    3. [Action 3]: [result]

    Updated results table:
    [paste metrics]

    Please re-score and re-assess. Are the remaining concerns addressed?
    Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.

