.autolab/acquired_skills/scientific-analysis-review/SKILL.md
Critically review AI-agent-conducted scientific analyses for correctness, rigor, and completeness. Use this skill whenever an analysis session has completed and needs validation, when a user asks to "review," "validate," "check," or "audit" a computational analysis, or when an agent pipeline produces scientific results that require quality control before reporting. Also trigger when the user references an execution trace, notebook, or conversation history from a prior analysis session. This skill should run as the final step of any autonomous scientific analysis pipeline.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add albert-ying/autonomous-lab scientific-analysis-review
This skill validates AI-agent-conducted scientific analyses by systematically checking for bugs, hallucinations, logical errors, and scientific rigor issues. It produces a structured review report with severity-graded findings and actionable recommendations.
The review operates on the execution trace of a completed analysis: conversation history, code cells, intermediate outputs, and final claims. It does NOT re-run the analysis. It audits what was done.
The reviewer needs access to one or more of the following (in order of preference):
If the input is empty or contains no analysis, state this in one sentence and stop. Do not fill out N/A tables for empty sessions.
Run every category in order. Each category contains specific checks. Skip a category only if it is entirely inapplicable (e.g., no code was executed).
Check whether the code ran as intended and produced the expected outputs.
[] but the analysis continues as if data was retrieved.)
Check whether numbers, calculations, and statistical claims are correct.
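One way to make a numerical check concrete is to recompute a reported statistic from the values visible in the trace and compare it with the claimed figure. The following is a minimal sketch, assuming the reviewer has already extracted the raw values and the claimed number by hand; `check_reported_mean` and its tolerance are hypothetical and not part of the skill.

```python
import math
from statistics import mean


def check_reported_mean(values, reported, rel_tol=1e-3):
    """Recompute the mean of `values` and compare it with the reported figure.

    Hypothetical helper: returns (matches, recomputed).
    """
    if not values:
        raise ValueError("no values to recompute from")
    recomputed = mean(values)
    return math.isclose(recomputed, reported, rel_tol=rel_tol), recomputed


# Illustrative values only: the summary claims a mean of 2.5 (or 3.1)
# while the trace output shows the three values below.
ok, recomputed = check_reported_mean([2.0, 2.5, 3.0], reported=2.5)
mismatch, _ = check_reported_mean([2.0, 2.5, 3.0], reported=3.1)
```

A mismatch here is evidence for a finding, not a finding by itself: the reviewer still has to decide whether the discrepancy comes from rounding, a different subset of data, or a fabricated number.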
Trace whether each conclusion follows from the preceding analysis step.
Check whether the agent answered the question that was actually asked.
Check whether claims in the output are supported by the actual computed results.
Domain-specific checks for biological and computational analyses.
Check whether the analysis could be independently reproduced.
Every finding must be assigned exactly one severity level:
| Severity | Definition | Criteria |
|----------|-----------|----------|
| Critical | Invalidates a primary conclusion | A main claim of the analysis is unsupported, fabricated, or logically unfounded |
| Major | Materially affects interpretation | A significant methodological flaw, unjustified threshold, or logical gap that changes what can be concluded, but does not fully invalidate the work |
| Minor | Should be fixed but does not change conclusions | Cosmetic issues, incomplete documentation, missing edge cases, or limitations that are real but acknowledged |
| Note | Observation for improvement | Suggestions for better practice that do not affect current results |
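The four levels form a strict ordering, and the report lists findings Critical -> Major -> Minor -> Note. A small sketch of that ordering follows, assuming findings are held as `(severity, title)` pairs; the `Severity` enum, `sort_findings`, and the example titles are illustrative names, not something the skill defines.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table; a lower value sorts first."""
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    NOTE = 3


def sort_findings(findings):
    """Order (severity, title) pairs Critical -> Major -> Minor -> Note.

    sorted() is stable, so findings of equal severity keep their input order.
    """
    return sorted(findings, key=lambda finding: finding[0])


# Hypothetical findings, listed in the order they were discovered.
findings = [
    (Severity.MINOR, "Random seed not recorded"),
    (Severity.CRITICAL, "Main claim has no supporting output"),
    (Severity.NOTE, "Plot axes unlabeled"),
]
ordered = sort_findings(findings)
```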
For every category, the reviewer must state what it CANNOT check. Do not silently skip a check. Instead, declare:
CANNOT VERIFY: [specific claim]. Reason: [would require literature search /
wet-lab validation / access to raw data / domain expertise in X].
This is a distinct output from "no issues found." "No issues found" means the check was performed and passed. "CANNOT VERIFY" means the check could not be performed.
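The distinction can be kept explicit by giving every check one of three states rather than a pass/fail boolean. The sketch below assumes a reviewer that records results as small objects; `CheckStatus`, `CheckResult`, and the example claim are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class CheckStatus(Enum):
    NO_ISSUES = "No issues found"    # the check was performed and passed
    ISSUES_FOUND = "Issues found"    # the check was performed and failed
    CANNOT_VERIFY = "CANNOT VERIFY"  # the check could not be performed


@dataclass
class CheckResult:
    claim: str
    status: CheckStatus
    reason: str = ""  # must be filled in when status is CANNOT_VERIFY

    def render(self) -> str:
        """Format the result; refuse a CANNOT VERIFY with no stated reason."""
        if self.status is CheckStatus.CANNOT_VERIFY:
            if not self.reason:
                raise ValueError("CANNOT VERIFY needs a stated reason")
            return f"CANNOT VERIFY: {self.claim}. Reason: {self.reason}."
        return f"{self.status.value}: {self.claim}"


# Illustrative claim; the wording follows the declaration template above.
line = CheckResult(
    claim="reported effect size agrees with published values",
    status=CheckStatus.CANNOT_VERIFY,
    reason="would require literature search",
).render()
```

Raising on a missing reason is a design choice in this sketch: it prevents a check from being skipped silently under the CANNOT VERIFY label.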
The review produces a single structured report. Use this exact structure:
# Analysis Review: [Brief title]
## Session Summary
- **Original question**: [one sentence]
- **Analysis performed**: [two to three sentences]
- **Primary conclusions**: [numbered list of main claims]
## Findings
### [SEVERITY]: [Short title]
**Category**: [which review category]
**Evidence**: [specific code block, output, or claim that triggered this finding]
**Impact**: [what this means for the conclusions]
**Recommendation**: [specific action to resolve]
[Repeat for each finding, ordered by severity: Critical -> Major -> Minor -> Note]
## Capability Boundaries
[List of CANNOT VERIFY declarations]
## Verdict
**Overall assessment**: [One of: VALID, VALID WITH CAVEATS, MAJOR REVISION NEEDED, UNRELIABLE]
- VALID: No critical or major issues. Conclusions are supported.
- VALID WITH CAVEATS: No critical issues. Major issues exist but are acknowledged
limitations, not errors. Conclusions hold with stated caveats.
- MAJOR REVISION NEEDED: Major issues that change interpretation. Analysis should
be revised before conclusions are reported.
- UNRELIABLE: Critical issues found. Primary conclusions are not supported by the
analysis as conducted.
**Summary**: [Two to three sentences stating what can and cannot be concluded from
this analysis.]
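The four verdicts follow mechanically from the severities of the findings. The sketch below shows one reading of those rules, assuming each Major finding has already been judged either an error or an acknowledged limitation; the `Finding` class and its `acknowledged_limitation` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str  # "Critical", "Major", "Minor", or "Note"
    acknowledged_limitation: bool = False  # hypothetical flag set by the reviewer


def verdict(findings):
    """Map a list of findings to one of the four overall assessments."""
    if any(f.severity == "Critical" for f in findings):
        return "UNRELIABLE"
    majors = [f for f in findings if f.severity == "Major"]
    if not majors:
        return "VALID"
    if all(f.acknowledged_limitation for f in majors):
        return "VALID WITH CAVEATS"
    return "MAJOR REVISION NEEDED"
```

For example, `verdict([Finding("Major", acknowledged_limitation=True), Finding("Minor")])` gives `"VALID WITH CAVEATS"`, while a single unacknowledged Major finding gives `"MAJOR REVISION NEEDED"`.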
Be specific. "The threshold is arbitrary" is not useful. "The filtering step uses a cutoff of X, which excludes N data points (including borderline cases at X+0.1); no justification is provided for this value versus alternative cutoffs" is useful.
Distinguish between errors and limitations. An error is something the analysis got wrong. A limitation is something the analysis cannot address given available data. Both matter, but they require different responses (fix vs. acknowledge).
Do not grade on effort. An analysis that tried many things but reached unsupported conclusions is worse than a simple analysis with well-supported conclusions.
Do not invent issues. If a category has no findings, say "No issues identified" for that category. Do not manufacture minor concerns to appear thorough.
Check the obvious first. Before looking for subtle issues, verify: Did the code actually run? Did it load the right data? Do the numbers in the summary match the numbers in the output? Most serious errors are mundane, not exotic.
Trace claims backward. Start from the final conclusions and trace each one back through the analysis to its data source. This catches more issues than reading the analysis forward.
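Backward tracing can be pictured as a walk over a dependency graph from each conclusion to the inputs it rests on. The sketch below assumes the reviewer has summarized the trace as a mapping from each step to the steps it was computed from; that mapping, the step names, and `trace_back` are all hypothetical.

```python
def trace_back(claim, derived_from):
    """Walk from a conclusion back to its roots.

    `derived_from` maps each step to the steps it was computed from.
    Returns the set of root steps, i.e. steps with no recorded inputs.
    """
    roots, seen, stack = set(), set(), [claim]
    while stack:
        step = stack.pop()
        if step in seen:
            continue
        seen.add(step)
        parents = derived_from.get(step, [])
        if not parents:
            roots.add(step)
        stack.extend(parents)
    return roots


# Hypothetical summary of an execution trace.
trace = {
    "conclusion_1": ["stat_test"],
    "stat_test": ["filtered_table"],
    "filtered_table": ["raw_data.csv"],
    "conclusion_2": [],  # no upstream step: nothing in the trace supports it
}
data_sources = {"raw_data.csv"}

# A conclusion is flagged when any of its roots is not a known data source.
unsupported = [
    claim for claim in ("conclusion_1", "conclusion_2")
    if not trace_back(claim, trace) <= data_sources
]
```

Here `conclusion_2` is flagged because its only root is itself, which is how a claim with no computed basis shows up in this representation.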
When the session is empty or contains no analysis, state this in one sentence. Do not produce the full report template with N/A entries.