skills/42-wanshuiyin-ARIS/skills/skills-codex/auto-review-loop-llm/SKILL.md
Autonomous research review loop using any OpenAI-compatible LLM API. Configure via llm-chat MCP server or environment variables. Trigger with "auto review loop llm" or "llm review".
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research auto-review-loop-llmInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.
AUTO_REVIEW.md in project root (cumulative log)This skill uses any OpenAI-compatible API for external review via the llm-chat MCP server.
Add to ~/.codex/settings.json:
{
"mcpServers": {
"llm-chat": {
"command": "/usr/bin/python3",
"args": ["/Users/yourname/.codex/mcp-servers/llm-chat/server.py"],
"env": {
"LLM_API_KEY": "your-api-key",
"LLM_BASE_URL": "https://api.deepseek.com/v1",
"LLM_MODEL": "deepseek-chat"
}
}
}
}
| Provider | LLM_BASE_URL | LLM_MODEL |
|----------|--------------|-----------|
| OpenAI | https://api.openai.com/v1 | gpt-4o, o3 |
| DeepSeek | https://api.deepseek.com/v1 | deepseek-chat, deepseek-reasoner |
| MiniMax | https://api.minimax.io/v1 | MiniMax-M2.7 |
| Kimi (Moonshot) | https://api.moonshot.cn/v1 | moonshot-v1-8k, moonshot-v1-32k |
| ZhiPu (GLM) | https://open.bigmodel.cn/api/paas/v4 | glm-4, glm-4-plus |
| SiliconFlow | https://api.siliconflow.cn/v1 | Qwen/Qwen2.5-72B-Instruct |
| 阿里云百炼 | https://dashscope.aliyuncs.com/compatible-mode/v1 | qwen-max |
| 零一万物 | https://api.lingyiwanwu.com/v1 | yi-large |
Primary: MCP Tool
mcp__llm-chat__chat:
message: |
[Review prompt content]
model: "deepseek-chat"
system: "You are a senior ML reviewer..."
Fallback: curl
curl -s "${LLM_BASE_URL}/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${LLM_API_KEY}" \
-d '{
"model": "${LLM_MODEL}",
"messages": [
{"role": "system", "content": "You are a senior ML reviewer..."},
{"role": "user", "content": "[review prompt]"}
],
"max_tokens": 4096
}'
Persist state to REVIEW_STATE.json after each round:
{
"round": 2,
"status": "in_progress",
"last_score": 5.0,
"last_verdict": "not ready",
"pending_experiments": [],
"timestamp": "2026-03-15T10:00:00"
}
Write this file at the end of every Phase E (after documenting the round).
On completion, set "status": "completed".
REVIEW_STATE.json for recoveryIf MCP available:
mcp__llm-chat__chat:
system: "You are a senior ML reviewer (NeurIPS/ICML level)."
message: |
[Round N/MAX_ROUNDS of autonomous review loop]
[Full research context: claims, methods, results, known weaknesses]
[Changes since last round, if any]
1. Score this work 1-10 for a top venue
2. List remaining critical weaknesses (ranked by severity)
3. For each weakness, specify the MINIMUM fix
4. State clearly: is this READY for submission? Yes/No/Almost
Be brutally honest. If the work is ready, say so clearly.
If MCP NOT available:
curl -s "${LLM_BASE_URL}/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${LLM_API_KEY}" \
-d '{
"model": "${LLM_MODEL}",
"messages": [
{"role": "system", "content": "You are a senior ML reviewer (NeurIPS/ICML level)."},
{"role": "user", "content": "[Full review prompt]"}
],
"max_tokens": 4096
}'
CRITICAL: Save the FULL raw response verbatim. Then extract:
STOP: If score >= 6 AND verdict contains "ready/almost"
Priority: metric additions > reframing > new experiments
Monitor remote experiments
Append to AUTO_REVIEW.md:
## Round N (timestamp)
### Assessment (Summary)
- Score: X/10
- Verdict: [ready/almost/not ready]
- Key criticisms: [bullet list]
### Reviewer Raw Response
<details>
<summary>Click to expand full reviewer response</summary>
[Paste the COMPLETE raw response here — verbatim, unedited.]
</details>
### Actions Taken
- [what was implemented/changed]
### Results
- [experiment outcomes, if any]
### Status
- [continuing to round N+1 / stopping]
Write REVIEW_STATE.json with current state.
REVIEW_STATE.json status to "completed"Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
Be honest about weaknesses
Implement fixes BEFORE re-reviewing
Document everything
Include previous context in round 2+ prompts
Prefer MCP tool over curl when available
mcp__llm-chat__chat:
system: "You are a senior ML reviewer (NeurIPS/ICML level)."
message: |
[Round N/MAX_ROUNDS of autonomous review loop]
## Previous Review Summary (Round N-1)
- Previous Score: X/10
- Previous Verdict: [ready/almost/not ready]
- Previous Key Weaknesses: [list]
## Changes Since Last Review
1. [Action 1]: [result]
2. [Action 2]: [result]
## Updated Results
[paste updated metrics/tables]
Please re-score and re-assess:
1. Score this work 1-10 for a top venue
2. List remaining critical weaknesses (ranked by severity)
3. For each weakness, specify the MINIMUM fix
4. State clearly: is this READY for submission? Yes/No/Almost
Be brutally honest. If the work is ready, say so clearly.
development
Track dataset lineage, transformation steps, merge logic, and reproducibility risks in Stata workflows. Use when the user needs to explain where data came from, how it changed, or why a pipeline can be trusted.
development
Audit datasets for structure, missingness, labeling, suspicious values, duplicate identifiers, and documentation readiness. Use when a researcher asks for data QA, codebook review, sanity checks, or pre-analysis cleanup guidance.
data-ai
Design, run, and critique causal inference workflows in Stata. Use when the user is working on identification, treatment effects, DiD, IV, event studies, RD, or assumption-sensitive empirical claims.
tools
Complete survival analysis library in Python. Handles right-censored data, Kaplan-Meier curves, and Cox regression. Standard for clinical trial analysis and epidemiology.