skills/42-wanshuiyin-ARIS/skills/skills-codex/auto-paper-improvement-loop/SKILL.md
Autonomously improve a generated paper via GPT-5.4 xhigh review → implement fixes → recompile, for 2 rounds. Use when the user says "改论文" ("revise the paper"), "improve paper", "论文润色循环" ("paper polishing loop"), "auto improve", or wants to iteratively polish a generated paper.
Autonomously improve the paper at: $ARGUMENTS
This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.
Unlike /auto-review-loop (which iterates on research — running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.
- gpt-5.4: model used via a secondary Codex agent for paper review.
- PAPER_IMPROVEMENT_LOG.md: cumulative log of all rounds, stored in the paper directory.
- HUMAN_CHECKPOINT: when true, pause after each round's review and present the score and weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When false (default), the loop runs fully autonomously.

💡 Override:
/auto-paper-improvement-loop "paper/" — human checkpoint: true
- paper/main.pdf + LaTeX source files
- Section .tex files: concatenated for the review prompt

If the context window fills up mid-loop, Codex auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:
{
"current_round": 1,
"agent_id": "019ce736-...",
"last_score": 6,
"status": "in_progress",
"timestamp": "2026-03-13T21:00:00"
}
On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.
After each round: overwrite the state file. On completion: set "status": "completed".
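The startup decision described above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not the skill's actual implementation: the function names `json_string_field` and `recovery_mode` are hypothetical, the JSON is assumed to be flat as in the example state file, and the timestamp parsing relies on GNU `date -d`, which reads a zone-less timestamp in the local time zone.

```shell
# Hypothetical helpers; names are illustrative and not part of the skill.

# Extract a string field (e.g. "status") from a flat JSON file.
# Assumes values contain no escaped quotes; jq would be more robust.
json_string_field() {
  local file=$1 key=$2
  sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | head -n 1
}

# Print "resume" if the state file says in_progress and is under 24 hours old,
# otherwise print "fresh". The optional second argument is the current time in
# epoch seconds, so the check can be exercised with a fixed clock.
recovery_mode() {
  local file=$1 now=${2:-$(date +%s)}
  local status ts ts_epoch
  [ -f "$file" ] || { echo fresh; return; }
  status=$(json_string_field "$file" status)
  ts=$(json_string_field "$file" timestamp)
  [ -n "$ts" ] || { echo fresh; return; }
  # Assumption: GNU date; "2026-03-13T21:00:00" becomes "2026-03-13 21:00:00".
  ts_epoch=$(date -d "$(printf '%s' "$ts" | tr 'T' ' ')" +%s 2>/dev/null) || { echo fresh; return; }
  if [ "$status" = "in_progress" ] && [ $((now - ts_epoch)) -lt 86400 ]; then
    echo resume
  else
    echo fresh
  fi
}
```

A call such as `recovery_mode paper/PAPER_IMPROVEMENT_STATE.json` would print `resume` or `fresh`. The path here is an assumption, since the source does not say where the state file is stored.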
cp paper/main.pdf paper/main_round0_original.pdf
Concatenate all section files into a single text block for the review prompt:
# Collect all sections (the glob expands in lexical filename order)
for f in paper/sections/*.tex; do
  echo "% === $(basename "$f") ==="
  cat "$f"
done > /tmp/paper_full_text.txt
Send the full paper text to GPT-5.4 xhigh:
spawn_agent:
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    You are reviewing a [VENUE] paper. Please provide a detailed, structured review.
    ## Full Paper Text:
    [paste concatenated sections]
    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Verdict**: Ready for submission? Yes / Almost / No
    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency.
Save the agent id for Round 2.
Skip if HUMAN_CHECKPOINT = false.
Present the review results and wait for user input:
📋 Round 1 review complete.
Score: X/10 — [verdict]
Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...
Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.
Parse the user's response the same way as /auto-review-loop: approve / custom instructions / skip / stop.
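One way to map the four reply types onto actions is a `case` statement. This is a hedged sketch: `parse_checkpoint_reply` is a hypothetical name, and the actual parsing in /auto-review-loop is not shown in this document, so the matching rules below are assumptions based only on the prompt text ("go", "skip 2", "stop", or free-form instructions).

```shell
# Hypothetical helper: classify a checkpoint reply.
# Echoes one of: approve | stop | skip <numbers> | custom <original text>
parse_checkpoint_reply() {
  local reply=$1 lowered
  # Compare case-insensitively, but keep the original text for custom instructions.
  lowered=$(printf '%s' "$reply" | tr '[:upper:]' '[:lower:]')
  case "$lowered" in
    go)      echo "approve" ;;
    stop)    echo "stop" ;;
    skip\ *) echo "skip ${lowered#skip }" ;;
    *)       echo "custom $reply" ;;
  esac
}
```

Anything that is not "go", "stop", or "skip N" falls through to the custom-instructions branch, so an unrecognized reply is never silently treated as approval.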
Parse the review and implement fixes by severity:
Priority order: CRITICAL first, then MAJOR, then MINOR.
Common fix patterns:
| Issue | Fix Pattern |
|-------|-------------|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to references.bib, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf
Verify: 0 undefined references, 0 undefined citations.
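The verification step can be approximated by counting the relevant warnings in the LaTeX log. This is a sketch, not the skill's own check: `count_undefined` is a hypothetical helper, and the patterns assume the usual `Warning: Reference ... undefined` and `Warning: Citation ... undefined` messages. TeX wraps long log lines, so a warning split across two lines may be missed.

```shell
# Hypothetical helper: count undefined references and citations in a LaTeX log.
# Prints "<refs> <cites>"; returns 0 only when both counts are zero.
count_undefined() {
  local log=$1 refs cites
  [ -f "$log" ] || { echo "missing log: $log" >&2; return 2; }
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  refs=$(grep -c "Warning: Reference .* undefined" "$log" || true)
  cites=$(grep -c "Warning: Citation .* undefined" "$log" || true)
  echo "${refs:-0} ${cites:-0}"
  [ "${refs:-0}" -eq 0 ] && [ "${cites:-0}" -eq 0 ]
}
```

For example, `count_undefined paper/main.log || echo "fix references before continuing"` would stop the round from being treated as clean while warnings remain.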
Use send_input with the saved agent id:
send_input:
  id: [saved from Round 1]
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    [Round 2 update]
    Since your last review, we have implemented:
    1. [Fix 1]: [description]
    2. [Fix 2]: [description]
    ...
    Please re-score and re-assess. Same format:
    Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.
Skip if HUMAN_CHECKPOINT = false. Same as Step 2b — present Round 2 review, wait for user input.
Same process as Step 3. Typical Round 2 fixes:
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf
After the final recompilation, run a format compliance check:
# 1. Page count vs venue limit (pdfinfo reports total pages, including references and appendix)
PAGES=$(pdfinfo paper/main.pdf | awk '/^Pages:/ {print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"
# 2. Overfull hbox warnings (content exceeding margins)
# grep -c already prints 0 when nothing matches (and exits non-zero), so "|| echo 0" would print a second 0
OVERFULL=$(grep -c "Overfull" paper/main.log 2>/dev/null || true)
echo "Overfull hbox warnings: ${OVERFULL:-0}"
grep "Overfull" paper/main.log 2>/dev/null | head -10
# 3. Underfull hbox warnings (loose spacing)
UNDERFULL=$(grep -c "Underfull" paper/main.log 2>/dev/null || true)
echo "Underfull hbox warnings: ${UNDERFULL:-0}"
# 4. Bad boxes summary
BADNESS=$(grep -c "badness" paper/main.log 2>/dev/null || true)
echo "Badness warnings: ${BADNESS:-0}"
Auto-fix patterns:
| Issue | Fix |
|-------|-----|
| Overfull hbox in equation | Wrap in \resizebox or split with \split/aligned |
| Overfull hbox in table | Reduce font (\small/\footnotesize) or use \resizebox{\linewidth}{!}{...} |
| Overfull hbox in text | Rephrase sentence or add \allowbreak / \- hints |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
| Underfull hbox (loose) | Rephrase for better line filling or add \looseness=-1 |
If any overfull hbox > 10pt is found, fix it and recompile before documenting.
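To find only the boxes that cross the 10pt threshold, the overfull width can be pulled out of each warning line. This is a minimal sketch under stated assumptions: `overfull_above` is a hypothetical helper, and it assumes the standard TeX log wording `Overfull \hbox (12.3pt too wide) ...` appears at the start of a line.

```shell
# Hypothetical helper: print overfull \hbox warnings wider than a limit (default 10pt).
overfull_above() {
  local log=$1 limit=${2:-10}
  awk -v limit="$limit" '
    /^Overfull \\hbox \(/ {
      width = $3                     # third field looks like "(12.3pt"
      gsub(/[^0-9.]/, "", width)     # keep only the number
      if (width + 0 > limit + 0) print
    }
  ' "$log"
}
```

Running `overfull_above paper/main.log` prints nothing when every overfull box is within 10pt, so empty output can serve as the signal that no recompile is needed for this check.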
Create PAPER_IMPROVEMENT_LOG.md in the paper directory:
# Paper Improvement Log
## Score Progression
| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |
## Round 1 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 1)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## Round 2 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 2)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes
Report to user:
After each round's review AND at final completion, check ~/.codex/feishu.json:
- review_scored: "Round N: X/10 — [key changes]"
- pipeline_done: score progression table + final page count
- "off": skip entirely (no-op)

paper/
├── main_round0_original.pdf # Original
├── main_round1.pdf # After Round 1
├── main_round2.pdf # After Round 2 (final)
├── main.pdf # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md # Full review log with scores
Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
Preserve all PDF versions — user needs to compare progression
Save FULL raw review text — do not summarize or truncate GPT-5.4 responses
Use send_input for Round 2 to maintain conversation context
Always recompile after fixes — verify 0 errors before proceeding
Do not fabricate experimental results — synthetic validation must describe methodology, not invent numbers
Respect the paper's claims — soften overclaims rather than adding unsupported new claims
Global consistency — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)
Based on end-to-end testing on a 9-page ICLR 2026 theory paper:
| Round | Score | Key Improvements |
|-------|-------|-----------------|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |
+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.