Auto Paper Improvement Loop: Review → Fix → Recompile

Autonomously improve the paper at: $ARGUMENTS

Context

This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.

Unlike /auto-review-loop (which iterates on research — running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.

Constants

MAX_ROUNDS = 2 — Two rounds of review→fix→recompile. Empirically, Round 1 catches structural issues (4→6/10), Round 2 catches remaining presentation issues (6→7/10). Diminishing returns beyond 2 rounds for writing-only improvements.
REVIEWER_MODEL = gpt-5.4 — Model used via a secondary Codex agent for paper review.
REVIEW_LOG = PAPER_IMPROVEMENT_LOG.md — Cumulative log of all rounds, stored in paper directory.
HUMAN_CHECKPOINT = false — When true, pause after each round's review and present score + weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When false (default), runs fully autonomously.

💡 Override: /auto-paper-improvement-loop "paper/" — human checkpoint: true

Inputs

Compiled paper — paper/main.pdf + LaTeX source files
All section .tex files — concatenated for review prompt

State Persistence (Compact Recovery)

If the context window fills up mid-loop, Codex auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:

{
  "current_round": 1,
  "agent_id": "019ce736-...",
  "last_score": 6,
  "status": "in_progress",
  "timestamp": "2026-03-13T21:00:00"
}

On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.

After each round: overwrite the state file. On completion: set "status": "completed".
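
The resume-or-restart decision above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `load_resumable_state` is hypothetical, the timestamp is assumed to be an ISO 8601 string as in the example state file, and an unreadable state file is treated the same as an absent one.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

MAX_AGE = timedelta(hours=24)  # freshness window stated in the recovery rule


def load_resumable_state(path, now=None):
    """Return the saved state dict if the loop should resume, else None."""
    now = now or datetime.now()
    state_file = Path(path)
    if not state_file.exists():
        return None  # file absent: start fresh
    try:
        state = json.loads(state_file.read_text(encoding="utf-8"))
        if state.get("status") != "in_progress":
            return None  # completed (or unknown) runs are not resumed
        age = now - datetime.fromisoformat(state["timestamp"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return None  # assumption: a malformed state file means start fresh
    return state if age <= MAX_AGE else None
```

If the function returns a dict, the loop would continue from `current_round + 1` after re-reading PAPER_IMPROVEMENT_LOG.md; if it returns `None`, the loop starts from Step 0.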

Workflow

Step 0: Preserve Original

cp paper/main.pdf paper/main_round0_original.pdf

Step 1: Collect Paper Text

Concatenate all section files into a single text block for the review prompt:

# Collect all sections in order
for f in paper/sections/*.tex; do
    echo "% === $(basename "$f") ==="
    cat "$f"
done > /tmp/paper_full_text.txt

Step 2: Round 1 Review

Send the full paper text to GPT-5.4 xhigh:

spawn_agent:
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    You are reviewing a [VENUE] paper. Please provide a detailed, structured review.

    ## Full Paper Text:
    [paste concatenated sections]

    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Verdict**: Ready for submission? Yes / Almost / No

    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency.

Save the agent id for Round 2.
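
To record `last_score` in the state file and fill in the score table, the score and verdict have to be pulled out of the reviewer's free-form reply. The following is a rough sketch, assuming the reviewer follows the requested headings ("Overall Score", "Verdict"); the function name `parse_review` and the regular expressions are illustrative, and either value comes back as `None` when the reply deviates from that format.

```python
import re

# Both patterns tolerate Markdown emphasis and punctuation after the heading.
SCORE_RE = re.compile(r"Overall\s+Score[^0-9]{0,20}(\d+(?:\.\d+)?)", re.IGNORECASE)
VERDICT_RE = re.compile(r"Verdict[^A-Za-z]{0,20}(Yes|Almost|No)\b", re.IGNORECASE)


def parse_review(review_text):
    """Return (score, verdict); each is None if it cannot be found."""
    score_match = SCORE_RE.search(review_text)
    verdict_match = VERDICT_RE.search(review_text)
    score = float(score_match.group(1)) if score_match else None
    verdict = verdict_match.group(1).capitalize() if verdict_match else None
    return score, verdict
```

When parsing fails, falling back to reading the raw review by hand is safer than guessing a score.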

Step 2b: Human Checkpoint (if enabled)

Skip if HUMAN_CHECKPOINT = false.

Present the review results and wait for user input:

📋 Round 1 review complete.

Score: X/10 — [verdict]
Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...

Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.

Parse the user's response the same way as /auto-review-loop: approve / custom instructions / skip / stop.
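
The four reply types can be told apart with a small parser. This is only a sketch: the exact parsing used by /auto-review-loop is not shown here, so the function name `parse_checkpoint_reply`, the returned dictionary shape, and the choice to treat anything unrecognized as custom instructions are all assumptions.

```python
import re


def parse_checkpoint_reply(reply):
    """Classify a checkpoint reply as approve, stop, skip, or custom."""
    text = reply.strip()
    lowered = text.lower()
    if lowered == "go":
        return {"action": "approve", "skip": [], "instructions": ""}
    if lowered == "stop":
        return {"action": "stop", "skip": [], "instructions": ""}
    # "skip 2" or "skip 2, 3": numbers refer to the listed weaknesses
    match = re.fullmatch(r"skip\s+(\d+(?:[\s,]+\d+)*)", lowered)
    if match:
        ids = [int(n) for n in re.findall(r"\d+", match.group(1))]
        return {"action": "skip", "skip": ids, "instructions": ""}
    # Anything else is passed through verbatim as custom instructions
    return {"action": "custom", "skip": [], "instructions": text}
```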

Step 3: Implement Round 1 Fixes

Parse the review and implement fixes by severity:

Priority order:

CRITICAL fixes (assumption mismatches, internal contradictions)
MAJOR fixes (overclaims, missing content, notation issues)
MINOR fixes (if time permits)
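
The priority order amounts to a stable sort on severity. A minimal sketch, assuming each weakness has already been parsed into a dictionary with a `severity` key (a hypothetical representation, not something the reviewer returns directly):

```python
SEVERITY_RANK = {"CRITICAL": 0, "MAJOR": 1, "MINOR": 2}


def order_fixes(weaknesses):
    """Sort weaknesses CRITICAL first, keeping the reviewer's order within a level."""
    # Unknown severity labels sort last; sorted() is stable, so ties keep input order.
    return sorted(
        weaknesses,
        key=lambda w: SEVERITY_RANK.get(w["severity"].upper(), len(SEVERITY_RANK)),
    )
```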

Common fix patterns:

| Issue | Fix Pattern |
|-------|-------------|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to references.bib, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |

Step 4: Recompile Round 1

cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf

Verify: 0 undefined references, 0 undefined citations.
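
One way to perform this verification is to scan main.log for the warnings LaTeX emits for unresolved `\ref` and `\cite` keys. The sketch below assumes the standard warning wording ("Reference `key' on page N undefined", "Citation `key' on page N undefined"); the function name `find_undefined` is hypothetical. Because TeX wraps log lines at a fixed width by default, very long keys may be split across lines and missed.

```python
import re

# Matches both "LaTeX Warning: ..." and "Package natbib Warning: ..." forms.
UNDEFINED_RE = re.compile(r"Warning: (Reference|Citation) `([^']+)'.*?undefined")


def find_undefined(log_text):
    """Return the sets of undefined reference and citation keys found in a LaTeX log."""
    found = {"Reference": set(), "Citation": set()}
    for kind, key in UNDEFINED_RE.findall(log_text):
        found[kind].add(key)
    return found
```

The round passes this check when both sets are empty, for example after calling `find_undefined(open("paper/main.log", errors="replace").read())`.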

Step 5: Round 2 Review

Use send_input with the saved agent id:

send_input:
  id: [saved from Round 1]
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    [Round 2 update]

    Since your last review, we have implemented:
    1. [Fix 1]: [description]
    2. [Fix 2]: [description]
    ...

    Please re-score and re-assess. Same format:
    Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.
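
The Round 2 message can be assembled from the list of fixes applied in Step 3. A small sketch, assuming the fixes are kept as `(title, description)` pairs; the function name `build_round2_message` is hypothetical:

```python
def build_round2_message(fixes):
    """Render the Round 2 update message from (title, description) pairs."""
    lines = ["[Round 2 update]", "", "Since your last review, we have implemented:"]
    for number, (title, description) in enumerate(fixes, start=1):
        lines.append(f"{number}. {title}: {description}")
    lines += [
        "",
        "Please re-score and re-assess. Same format:",
        "Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.",
    ]
    return "\n".join(lines)
```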

Step 5b: Human Checkpoint (if enabled)

Skip if HUMAN_CHECKPOINT = false. Same as Step 2b — present Round 2 review, wait for user input.

Step 6: Implement Round 2 Fixes

Same process as Step 3. Typical Round 2 fixes:

Add controlled synthetic experiments validating theory
Further soften any remaining overclaims
Formalize informal arguments (e.g., truncation → formal proposition)
Strengthen limitations section

Step 7: Recompile Round 2

cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf

Step 8: Format Check

After the final recompilation, run a format compliance check:

# 1. Page count vs venue limit
PAGES=$(pdfinfo paper/main.pdf | grep Pages | awk '{print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"

# 2. Overfull hbox warnings (content exceeding margins)
# grep -c already prints 0 when nothing matches, so default to 0 only if the log is missing
OVERFULL=$(grep -c "Overfull" paper/main.log 2>/dev/null)
echo "Overfull hbox warnings: ${OVERFULL:-0}"
grep "Overfull" paper/main.log 2>/dev/null | head -10

# 3. Underfull hbox warnings (loose spacing)
UNDERFULL=$(grep -c "Underfull" paper/main.log 2>/dev/null)
echo "Underfull hbox warnings: ${UNDERFULL:-0}"

# 4. Bad boxes summary
BADNESS=$(grep -c "badness" paper/main.log 2>/dev/null)
echo "Badness warnings: ${BADNESS:-0}"

Auto-fix patterns:

| Issue | Fix |
|-------|-----|
| Overfull hbox in equation | Wrap in \resizebox or split with \split/aligned |
| Overfull hbox in table | Reduce font (\small/\footnotesize) or use \resizebox{\linewidth}{!}{...} |
| Overfull hbox in text | Rephrase sentence or add \allowbreak / \- hints |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
| Underfull hbox (loose) | Rephrase for better line filling or add \looseness=-1 |

If any overfull hbox > 10pt is found, fix it and recompile before documenting.
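
The 10pt threshold can be checked by reading the overflow width out of each warning. This sketch assumes the usual TeX log wording, `Overfull \hbox (12.3pt too wide) in paragraph at lines 10--12`; the function name `severe_overfull` and its `threshold_pt` parameter are hypothetical.

```python
import re

OVERFULL_RE = re.compile(r"Overfull \\hbox \((\d+(?:\.\d+)?)pt too wide\)(.*)")


def severe_overfull(log_text, threshold_pt=10.0):
    """Return (width_pt, location) for each overfull hbox wider than the threshold."""
    hits = []
    for match in OVERFULL_RE.finditer(log_text):
        width = float(match.group(1))
        if width > threshold_pt:
            hits.append((width, match.group(2).strip()))
    return hits
```

An empty result means no fix-and-recompile pass is needed before moving on to documentation.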

Step 9: Document Results

Create PAPER_IMPROVEMENT_LOG.md in the paper directory:

# Paper Improvement Log

## Score Progression

| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |

## Round 1 Review & Fixes

<details>
<summary>GPT-5.4 xhigh Review (Round 1)</summary>

[Full raw review text, verbatim]

</details>

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## Round 2 Review & Fixes

<details>
<summary>GPT-5.4 xhigh Review (Round 2)</summary>

[Full raw review text, verbatim]

</details>

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes

Step 10: Summary

Report to user:

Score progression table
Number of CRITICAL/MAJOR/MINOR issues fixed per round
Final page count
Remaining issues (if any)

Feishu Notification (if configured)

After each round's review AND at final completion, check ~/.codex/feishu.json:

After each round: Send review_scored — "Round N: X/10 — [key changes]"
After final round: Send pipeline_done — score progression table + final page count
If config absent or mode "off": skip entirely (no-op)
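
The skip rule can be expressed as a small guard that runs before any notification is sent. This is a sketch only: the sole config field taken from the text above is `mode`, the function name `feishu_enabled` is hypothetical, and treating an unparseable config or one without a `mode` key is an assumption (unparseable is treated as absent, a missing `mode` as enabled).

```python
import json
from pathlib import Path


def feishu_enabled(config_path="~/.codex/feishu.json"):
    """Return True only if the config file exists and its mode is not "off"."""
    path = Path(config_path).expanduser()
    if not path.exists():
        return False  # config absent: no-op
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        return False  # assumption: an unreadable config is treated as absent
    return isinstance(config, dict) and config.get("mode") != "off"
```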

Output

paper/
├── main_round0_original.pdf    # Original
├── main_round1.pdf             # After Round 1
├── main_round2.pdf             # After Round 2 (final)
├── main.pdf                    # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md    # Full review log with scores

Key Rules

Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
Preserve all PDF versions — user needs to compare progression
Save FULL raw review text — do not summarize or truncate GPT-5.4 responses
Use send_input for Round 2 to maintain conversation context
Always recompile after fixes — verify 0 errors before proceeding
Do not fabricate experimental results — synthetic validation must describe methodology, not invent numbers
Respect the paper's claims — soften overclaims rather than adding unsupported new claims
Global consistency — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)

Typical Score Progression

Based on end-to-end testing on a 9-page ICLR 2026 theory paper:

| Round | Score | Key Improvements |
|-------|-------|------------------|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |

+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.
