skills/paper-pipeline/SKILL.md
End-to-end paper generation pipeline ported from AutoResearchClaw (Aiming Lab). It has 14 phases covering topic initiation through export/publish, with human-in-the-loop gates and quality gating at each handoff. Use this when the user wants a full paper pipeline run, from topic to submission-ready manuscript. It delegates to researcher/reviewer/writer/verifier subagents for stage execution and to autonomous-iteration for experiment optimization loops.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add moralespanitz/research-loop paper-pipeline
One-shot paper generation from topic to export, ported from AutoResearchClaw (Aiming Lab). Each phase produces structured artifacts in the session directory. Quality gates at every handoff catch issues before they compound. The pipeline is NOT linear — phases 8-10 form a tight experiment loop that may iterate.
| Skill | When to Use |
|-------|-------------|
| paper-pipeline (this) | Full end-to-end run — topic to submission-ready manuscript |
| writing-papers | Standalone paper drafting from existing results |
| loop | Hypothesis testing with experiment ranking |
| autonomous-iteration | Metric-driven experiment optimization |
| experiment-sandbox | Running experiments in sandboxed environments |
| figure-agent | Generating publication-quality figures |
| review-prep | Handling reviews post-submission |
Phase A — Topic & Scope (Phases 1-2)
Phase 1: Topic Init → SMART research goal
Phase 2: Literature Coll → Paper search [GATE]
Phase B — Knowledge & Synthesis (Phases 3-6)
Phase 3: Literature Screen → Relevance + quality scoring
Phase 4: Knowledge Extract → Structured evidence cards
Phase 5: Hypothesis Gen → Falsifiable predictions [GATE]
Phase 6: Synthesis → Topic clusters + research gaps
Phase C — Experiment (Phases 7-10)
Phase 7: Experiment Design → Compute budget, ablations, baselines [GATE]
Phase 8: Code Generation → Runnable experiment code
Phase 9: Execution → Sandbox running (ref experiment-sandbox)
Phase 10: Result Analysis → Statistical interpretation
Phase D — Paper & Review (Phases 11-14)
Phase 11: Paper Draft → NeurIPS/ICML/ICLR quality [GATE]
Phase 12: Peer Review → Simulated review + inline annotations
Phase 13: Revision → Address reviewer feedback [GATE]
Phase 14: Export/Publish → Conference template formatting
| Phase | Gate | Fail Action |
|-------|------|-------------|
| 2 (Literature Coll) | Sufficient papers collected? (< 10 = fail) | Retry search with broader queries |
| 5 (Hypothesis Gen) | At least 2 falsifiable hypotheses? | Iterate synthesis |
| 7 (Experiment Design) | Budget + baselines + ablations defined? | Refine design |
| 11 (Paper Draft) | All sections min word count met? | Expand draft |
| 13 (Revision) | All reviewer points addressed? | Iterate revision |
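The gate table can be read as a mapping from phase number to a pass/fail predicate plus a fail action. The following is a minimal Python sketch of that reading, not the skill's actual implementation. The `state` dictionary and all of its keys (`papers`, `hypotheses`, `budget`, `sections_below_min`, `open_reviewer_points`) are hypothetical names chosen for illustration. Note that the phase 5 check only counts hypotheses; falsifiability still has to be judged by a person or a reviewer subagent.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]  # hypothetical container; the skill does not define this structure

# phase number -> (predicate that must hold, fail action from the gate table)
GATES: Dict[int, Tuple[Callable[[State], bool], str]] = {
    2: (lambda s: len(s.get("papers", [])) >= 10,
        "Retry search with broader queries"),
    5: (lambda s: len(s.get("hypotheses", [])) >= 2,
        "Iterate synthesis"),
    7: (lambda s: all(s.get(k) for k in ("budget", "baselines", "ablations")),
        "Refine design"),
    11: (lambda s: s.get("sections_below_min", 1) == 0,
         "Expand draft"),
    13: (lambda s: s.get("open_reviewer_points", 1) == 0,
         "Iterate revision"),
}


def check_gate(phase: int, state: State) -> Tuple[bool, str]:
    """Return (passed, action). Phases without a gate always pass."""
    if phase not in GATES:
        return True, "proceed"
    predicate, fail_action = GATES[phase]
    if predicate(state):
        return True, "proceed"
    return False, fail_action
```

A caller would run `check_gate(phase, state)` after each phase and, on failure, repeat the phase with the returned action before asking the user for approval.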
Goal: Establish SMART research goal, scope, and success criteria.
Artifact: sessions/<slug>/01-topic-init.md
Steps:
Quality gate: Goal must be ONE sentence. If it takes a paragraph to describe, the topic is too broad. Split into sub-projects.
Delegation: Dispatch the researcher subagent if domain context is needed.
Goal: Gather candidate papers from arXiv, Semantic Scholar, and web sources.
Artifact: sessions/<slug>/02-literature-collection.md
Steps:
researcher subagent:
GATE: Show the candidate list to the user. Ask: "Do you want to adjust the search direction before we screen?" Wait for approval before proceeding.
Quality thresholds:
Goal: Filter candidates to a shortlist of high-quality, relevant papers.
Artifact: sessions/<slug>/03-literature-screen.md
Steps:
Output: Shortlist of 8-15 papers with scores and keep reasons.
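As an illustration of the screening step, the sketch below filters candidates by a minimum relevance score and caps the shortlist at 15. The default threshold of 0.4 mirrors the `quality_threshold` value in the configuration; the `relevance` field name and the function names are assumptions made for this example only.

```python
from typing import Any, Dict, List, Sequence


def screen(candidates: List[Dict[str, Any]],
           threshold: float = 0.4,
           max_keep: int = 15) -> List[Dict[str, Any]]:
    """Keep candidates scoring at or above threshold, best first, capped at max_keep."""
    kept = [c for c in candidates if c["relevance"] >= threshold]  # "relevance" is a hypothetical key
    kept.sort(key=lambda c: c["relevance"], reverse=True)
    return kept[:max_keep]


def shortlist_in_range(shortlist: Sequence[Any], lo: int = 8, hi: int = 15) -> bool:
    """Check the shortlist against the 8-15 paper target for this phase."""
    return lo <= len(shortlist) <= hi
```

If `shortlist_in_range` returns `False` because too few papers survive, one option is to return to literature collection rather than lower the threshold.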
Anti-patterns:
Goal: Extract structured evidence cards from the shortlisted papers.
Artifact: sessions/<slug>/04-evidence-cards.md
Steps:
Output: Structured evidence table with one row per paper, cross-referenced.
Delegation: This can be delegated to the researcher subagent for bulk card
extraction. Provide the shortlist file as input.
Goal: Produce at least 2 falsifiable, testable hypotheses from the evidence.
Artifact: sessions/<slug>/05-hypotheses.md
Steps:
GATE: Show hypotheses to user. Ask: "Which hypothesis should we pursue first? Do you want to add, remove, or reorder any?" Wait for confirmation.
Quality check: Each hypothesis must be falsifiable. If it cannot be proven wrong by an experiment, it is not a scientific hypothesis. Reformulate.
Goal: Organize knowledge into topic clusters and identify research gaps.
Artifact: sessions/<slug>/06-synthesis.md
Steps:
Output: Cluster overview + prioritized gap list with recommended hypothesis.
Reference: The idea-selection skill can be loaded here if the gaps need
a formal evaluation matrix.
Goal: Design a concrete experiment plan with compute budget, ablations, and baselines.
Artifact: sessions/<slug>/07-experiment-design.md
Steps:
GATE: Present the experiment plan to the user. Ask: "Does the budget look right? Are baselines fair? Ready to generate code?" Wait for approval.
Format: YAML experiment plan that can be directly consumed by the code
generation phase and the experiment-sandbox skill.
Goal: Generate executable experiment code implementing real algorithms.
Artifact: sessions/<slug>/08-code/ — directory with project files
Steps:
- researcher or direct LLM
- TIME_ESTIMATE: Xs print before main loop
- results.json structured output
- iterative_repair sub-prompt pattern
- filename:xxx.py format
Quality checks:
Reference: The experiment-sandbox skill for sandbox configuration and
the autonomous-iteration skill for optimization loops.
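To make the `TIME_ESTIMATE: Xs` and `results.json` conventions concrete, here is a hedged sketch of what a generated experiment script might look like. The loop body is a toy placeholder standing in for a real algorithm, and the `results.json` fields (`primary_metric`, `n_steps`, `elapsed_sec`) are assumptions; the skill does not specify a schema beyond structured output.

```python
import json
import time
from pathlib import Path


def run_experiment(n_steps: int, seconds_per_step: float, out_dir: Path) -> dict:
    """Toy harness: print a time estimate, run the loop, write results.json."""
    # The estimate is printed before the main loop starts.
    print(f"TIME_ESTIMATE: {n_steps * seconds_per_step:.0f}s")

    start = time.time()
    losses = []
    for step in range(n_steps):
        # Placeholder update; a generated script would run the real algorithm here.
        losses.append(1.0 / (step + 1))

    results = {
        "primary_metric": losses[-1],       # key name borrowed from metric_key in the config
        "n_steps": n_steps,
        "elapsed_sec": time.time() - start,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "results.json").write_text(json.dumps(results, indent=2))
    return results
```

Writing the metrics to a file rather than only printing them lets the execution and analysis phases read the same values without parsing logs.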
Goal: Run experiments in sandboxed environment and collect results.
Artifact: sessions/<slug>/09-results/ — metrics + logs
Steps:
- local (venv): for quick, low-resource experiments
- docker: for isolated, reproducible environments
- ssh_remote: for GPU compute on remote servers
- colab: for Google Colab workflows
- results.json
Quality checks:
- results.json contains all declared metrics
Reference: Load the experiment-sandbox skill for detailed sandbox setup and execution procedures.
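The check that `results.json` contains all declared metrics can be sketched in a few lines. This assumes `results.json` is a flat JSON object keyed by metric name, which the skill does not confirm; `missing_metrics` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Iterable, List


def missing_metrics(results_path: Path, declared: Iterable[str]) -> List[str]:
    """Return declared metric names that are absent from results.json."""
    results = json.loads(Path(results_path).read_text())
    return [name for name in declared if name not in results]
```

An empty return value means the quality check passes; any listed names point to metrics the experiment code failed to record.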
Goal: Produce statistical interpretation of experimental results.
Artifact: sessions/<slug>/10-analysis.md
Steps:
- results.json
Quality check: Every number in the report must trace to an actual experiment output. No approximations, no rounding without disclosure.
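One plausible form of statistical interpretation is a per-metric summary across repeated runs. The sketch below computes the mean, sample standard deviation, and a 95% interval using a normal approximation (z = 1.96); that approximation is a choice made for this example and should be disclosed in the report, since a t-based interval is more appropriate for a handful of seeds. The function name and output keys are hypothetical.

```python
import math
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Summarize one metric across runs; values come straight from experiment outputs."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values) if n > 1 else 0.0
    # Normal-approximation half width; zero when only one run is available.
    half_width = 1.96 * std / math.sqrt(n) if n > 1 else 0.0
    return {
        "n": n,
        "mean": mean,
        "std": std,
        "ci95_low": mean - half_width,
        "ci95_high": mean + half_width,
    }
```

Keeping the raw per-run values alongside the summary makes it possible to trace every reported number back to an experiment output.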
Goal: Write a full-length conference-quality paper draft.
Artifact: sessions/<slug>/11-draft.md
Steps:
- writer subagent for draft generation
- figure-agent output
GATE: Present the draft to the user. Ask for initial feedback before the peer review phase. Do NOT skip this; first impressions matter.
Quality check: Total word count must be 5000-6500 words in main body. If any section is below minimum, expand with substantive content — not filler.
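The word-count check can be approximated with whitespace tokenization. In this sketch the 5000-6500 total range comes from the quality check above, while the per-section minimums are supplied by the caller because the skill text here does not list them; `check_draft` and its return keys are hypothetical.

```python
from typing import Any, Dict, Tuple


def word_count(text: str) -> int:
    """Count whitespace-separated tokens; a rough proxy for words."""
    return len(text.split())


def check_draft(sections: Dict[str, str],
                minimums: Dict[str, int],
                total_range: Tuple[int, int] = (5000, 6500)) -> Dict[str, Any]:
    """Report per-section counts, sections below their minimum, and the total check."""
    counts = {name: word_count(body) for name, body in sections.items()}
    below = [name for name, minimum in minimums.items()
             if counts.get(name, 0) < minimum]
    total = sum(counts.values())
    return {
        "counts": counts,
        "total": total,
        "below_minimum": below,
        "total_ok": total_range[0] <= total <= total_range[1],
    }
```

Sections listed in `below_minimum` are the ones to expand with substantive content before the draft goes to the user.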
Goal: Simulate peer review with at least 2 reviewer perspectives.
Artifact: sessions/<slug>/12-reviews.md
Steps:
- reviewer subagent for review generation
Output: Structured review with inline annotations and actionable revision requests.
Goal: Address all reviewer feedback while maintaining or increasing word count.
Artifact: sessions/<slug>/13-revision.md
Steps:
verifier subagent to check quality:
GATE: Present the revised draft with a change-log. Ask: "Ready for export? Any lingering concerns?" Wait for approval.
Quality check: The revised paper must be at least as long as the draft. If the revision shortened the paper, that is a failure.
Goal: Format the final paper for submission to a conference.
Artifact: sessions/<slug>/14-export/ — final formatted artifacts
Steps:
- figure-agent skill
Output: Final paper file + figures + submission checklist.
The pipeline can be configured via inline YAML:
pipeline:
  skip_phases: []          # Phases to skip (e.g., [3, 6])
  quality_threshold: 0.4   # Min relevance score for literature screen
  target_venue: "NeurIPS"  # One of: NeurIPS, ICML, ICLR
experiment:
  mode: "sandbox"          # local | docker | ssh_remote | colab
  time_budget_sec: 300
  metric_key: "primary_metric"
  metric_direction: "minimize"
figure_agent:
  enabled: true
  min_figures: 3
  max_figures: 10
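A configuration like this can be sanity-checked before a run. The sketch below takes the three sections as already-parsed dictionaries, so it makes no assumption about how they are nested in the YAML. The allowed venues and the 14-phase range come from this document; treating `quality_threshold` as a value in [0, 1] and accepting `"maximize"` as the counterpart of `"minimize"` are assumptions, and `validate_config` is a hypothetical helper.

```python
from typing import Any, Dict, List

ALLOWED_VENUES = {"NeurIPS", "ICML", "ICLR"}   # from the target_venue comment
ALLOWED_DIRECTIONS = {"minimize", "maximize"}  # "maximize" is assumed, not stated
N_PHASES = 14


def validate_config(pipeline: Dict[str, Any],
                    experiment: Dict[str, Any],
                    figure_agent: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    errors: List[str] = []
    if pipeline.get("target_venue") not in ALLOWED_VENUES:
        errors.append("target_venue must be one of NeurIPS, ICML, ICLR")
    if not 0.0 <= pipeline.get("quality_threshold", 0.4) <= 1.0:
        errors.append("quality_threshold should lie in [0, 1] (assumed range)")
    bad = [p for p in pipeline.get("skip_phases", []) if p not in range(1, N_PHASES + 1)]
    if bad:
        errors.append(f"skip_phases contains unknown phases: {bad}")
    if experiment.get("metric_direction") not in ALLOWED_DIRECTIONS:
        errors.append("metric_direction must be minimize or maximize")
    if experiment.get("time_budget_sec", 0) <= 0:
        errors.append("time_budget_sec must be positive")
    if figure_agent.get("min_figures", 0) > figure_agent.get("max_figures", 0):
        errors.append("min_figures must not exceed max_figures")
    return errors
```

The `mode` field is deliberately left unchecked here, because the sample value `"sandbox"` does not appear among the options listed in its own comment.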