skills/tooluniverse-fastq-qc/SKILL.md
--- name: tooluniverse-fastq-qc description: FASTQ quality control and adapter/quality-trimming decisions with local NGS tools — run FastQC on raw reads, summarize a project with MultiQC, interpret per-base sequence quality, per-base N content, adapter content, overrepresented sequences, sequence duplication and GC content, and decide whether (and how) to trim with fastp / Cutadapt before downstream analysis. seqkit for read counts/stats/subsampling. Use when someone asks "run QC on my FASTQs",
Run quality control on raw sequencing reads, interpret the report, and make an evidence-based decision about whether to trim — using real local command-line tools (FastQC, MultiQC, fastp, Cutadapt, seqkit).
This skill drives real binaries. It must never fabricate QC numbers.
- Reads are trimmed only with `--mode trim`.
- All outputs are written under `--workdir`.
The input directory is read-only. Trimmed reads are written as NEW files.

Use this skill when the user wants to:
- … `.fastq`, `.fq`, `.gz`) files

Do NOT use this skill for (route elsewhere):
- tooluniverse-rnaseq-deseq2
- tooluniverse-sequence-analysis
- tooluniverse-variant-analysis
- tooluniverse-single-cell

Before running, confirm with the user (ask if unstated):
- … `*_R1.fastq.gz` / `*_R2.fastq.gz`).
- `--workdir` SEPARATE from the input folder.

The bundled script preflights for you, but the decision logic is:
import shutil

# Print the resolved path of each tool, or MISSING if it is not on PATH.
for tool in ("fastqc", "fastp", "seqkit"):
    print(tool, shutil.which(tool) or "MISSING")
command -v fastqc / shutil.which("fastqc") returning nothing means the
tool is absent. If a required tool (FastQC for QC; FastQC+fastp for trim)
is missing, emit:
mamba install -c bioconda -c conda-forge fastqc fastp seqkit multiqc
# or
conda install -c bioconda -c conda-forge fastqc fastp seqkit multiqc
and stop. Do not proceed to fabricate output.
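The preflight rule above (FastQC required for QC, FastQC plus fastp required for trim, stop with an install hint otherwise) could be sketched roughly as follows. This is an illustration only, not the bundled script: the names `REQUIRED`, `missing_tools`, and `preflight` are hypothetical, and the injectable `which` argument exists purely to make the check testable.

```python
import shutil
import sys

# Required tools per mode, as stated above: FastQC for QC; FastQC + fastp for trim.
REQUIRED = {"qc": ("fastqc",), "trim": ("fastqc", "fastp")}
INSTALL_HINT = "mamba install -c bioconda -c conda-forge fastqc fastp seqkit multiqc"


def missing_tools(mode, which=shutil.which):
    """Return the required tools for `mode` that are not found on PATH."""
    return [tool for tool in REQUIRED[mode] if which(tool) is None]


def preflight(mode, which=shutil.which):
    """Exit with an install hint if a required tool is absent; otherwise return."""
    missing = missing_tools(mode, which)
    if missing:
        print("Missing required tools:", ", ".join(missing), file=sys.stderr)
        print("Install with:", INSTALL_HINT, file=sys.stderr)
        sys.exit(1)  # stop here rather than produce any QC output
```

Calling `preflight("trim")` before any QC step returns silently when both tools resolve and exits with status 1 otherwise.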
| Tool | Role | Install (bioconda) |
|-----------|----------------------------------------------------------------|--------------------|
| FastQC | Per-file raw read QC; produces the module PASS/WARN/FAIL report | fastqc |
| MultiQC | Aggregates many FastQC (and fastp) reports into one summary | multiqc |
| fastp | All-in-one QC + adapter + quality trimming (fast, auto-detect) | fastp |
| Cutadapt | Explicit, precise adapter/primer removal (amplicons, custom) | cutadapt |
| seqkit | Read counts, length/GC stats, subsampling | seqkit |
Rule of thumb: FastQC to diagnose, fastp to fix general adapter/quality, Cutadapt to fix a known primer/adapter precisely, seqkit to count/stat.
scripts/run_fastq_qc.py does the preflight + run-if-available + plan-if-missing
flow, with workspace isolation built in.
# QC only (default) — never modifies reads
python scripts/run_fastq_qc.py \
--fastq reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz \
--workdir /tmp/fastq_qc_run
# QC + trim (explicit) — fastp writes NEW trimmed files into --workdir
python scripts/run_fastq_qc.py \
--fastq reads/sample_R1.fastq.gz reads/sample_R2.fastq.gz \
--workdir /tmp/fastq_qc_run \
--mode trim
Behavior:
- … `--workdir`.
- With `--mode trim`, runs fastp writing `*.trimmed.fastq.gz` into `--workdir/trimmed/`; raw inputs are never touched.
- … `--workdir` equals an input directory (overwrite guard).

For a project-level summary after FastQC, run MultiQC over the workdir:
multiqc /tmp/fastq_qc_run -o /tmp/fastq_qc_run/multiqc
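The overwrite guard and the trimmed-output layout described under Behavior might look something like the sketch below. It is not taken from `scripts/run_fastq_qc.py`: the function names `check_workdir` and `fastp_command` are hypothetical, and the way the output stem is derived from the input name is an assumption. The only fastp options used are `-i`/`-o` (read 1) and `-I`/`-O` (read 2). The command is built but not executed.

```python
from pathlib import Path


def check_workdir(fastqs, workdir):
    """Raise ValueError if workdir is the directory holding any input FASTQ."""
    work = Path(workdir).resolve()
    for fq in fastqs:
        if Path(fq).resolve().parent == work:
            raise ValueError(f"--workdir {work} contains input {fq}; use a separate directory")
    return work


def fastp_command(fastqs, workdir):
    """Build (not run) a fastp argv that writes NEW files under workdir/trimmed/."""
    out_dir = check_workdir(fastqs, workdir) / "trimmed"
    cmd = ["fastp"]
    # Pair the first file with -i/-o and the second (if any) with -I/-O.
    for (in_flag, out_flag), fq in zip((("-i", "-o"), ("-I", "-O")), fastqs):
        name = Path(fq).name
        # Assumed naming rule: strip a known FASTQ suffix, then add .trimmed.fastq.gz.
        for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
        cmd += [in_flag, str(fq), out_flag, str(out_dir / f"{name}.trimmed.fastq.gz")]
    return cmd
```

Returning the argv as a list keeps the sketch side-effect free; a caller would create `trimmed/` and pass the list to `subprocess.run` only after the guard has passed.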
This table is the core value-add. Map each FastQC module to what PASS/WARN/FAIL
means and what to actually do. (See references/fastqc_interpretation.md for the
long form with thresholds and worked cases.)
| FastQC module | Typical PASS | WARN / FAIL means | Suggested action |
|------------------------------|---------------------------|----------------------------------------------------------------|------------------|
| Per base sequence quality | All positions Q>=28 | 3' tail drops below Q20-Q28 (common, esp. R2) | Quality-trim 3' (fastp -q/sliding window). Proceed if only the last few bases dip. |
| Per base N content | Near 0% N | Spike of N at a position = sequencer/base-call problem | Investigate: cycle-specific issue; consider hard-trim that position or re-sequence. |
| Adapter content | Flat, no adapter ramp | Rising adapter % toward 3' end = read-through into adapter | Trim adapters (fastp auto-detect, or Cutadapt with the known adapter). |
| Overrepresented sequences | None / <0.1% | A sequence is a large fraction: adapter, primer-dimer, rRNA, or low-complexity | Investigate the hit (BLAST it). If adapter/primer -> trim. If biology (rRNA/highly-expressed) -> proceed. |
| Sequence Duplication Levels | Low (diverse library) | High duplication = PCR over-amplification OR expected (amplicon/RNA-seq) | Investigate, usually proceed. Do NOT dedup blindly — expected high in amplicon/targeted/RNA-seq. Mark-duplicates belongs post-alignment, not here. |
| Per sequence GC content | Single peak at expected GC| Bimodal / shifted peak = contamination or mixed species | Investigate contamination (needs a reference screen; see Limitations). Not fixed by trimming. |
| Per base sequence content | Flat after first ~10 bp | Bias in first bases (random-hexamer priming) or adapter | Random-priming bias: usually proceed (expected in RNA-seq). Persistent bias at 3' -> adapter -> trim. |
| Sequence Length Distribution | Single length (raw) | Multiple lengths AFTER trimming is normal; before trimming may indicate mixed input | Usually proceed; only a concern on supposedly-raw uniform-length data. |
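As a rough illustration of how the table could be applied mechanically, the sketch below reads the text of a FastQC `summary.txt` (one tab-separated `STATUS`, module, file line per module) and lists the non-PASS modules with a condensed action. The `ACTIONS` mapping is a shortened paraphrase of the table, and `triage` and `needs_trim` are hypothetical names. A flag here marks a candidate only: the table's caveats (for example, proceeding when only the last few bases dip) still need a human look at the report.

```python
# Condensed from the table above: lowercased module name -> (category, action).
ACTIONS = {
    "per base sequence quality": ("trim-candidate", "quality-trim the 3' end"),
    "adapter content": ("trim-candidate", "trim adapters"),
    "per base n content": ("investigate", "check for a cycle-specific problem"),
    "overrepresented sequences": ("investigate", "identify the sequence first"),
    "sequence duplication levels": ("investigate", "usually proceed; do not dedup blindly"),
    "per sequence gc content": ("investigate", "screen for contamination"),
    "per base sequence content": ("investigate", "usually proceed if only early bases are biased"),
    "sequence length distribution": ("proceed", "only a concern on raw uniform-length data"),
}


def triage(summary_text):
    """Return (module, status, category, action) for each non-PASS module in ACTIONS."""
    flagged = []
    for line in summary_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, module = parts[0], parts[1]
        hit = ACTIONS.get(module.lower())  # case-insensitive module lookup
        if status != "PASS" and hit:
            flagged.append((module, status, *hit))
    return flagged


def needs_trim(summary_text):
    """True if any flagged module is one the table addresses by trimming."""
    return any(category == "trim-candidate" for _, _, category, _ in triage(summary_text))
```

Modules not listed in the table (such as Basic Statistics) are ignored by design, so the output stays aligned with the guidance above.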
Decision summary for "do I need to trim?"
- … `--mode qc`): FastQC + seqkit -> read the report.
- … `--mode trim` (fastp) or Cutadapt for precise primer removal; re-run FastQC on the trimmed output to confirm the fix.
- … sample to subsample for a quick look on huge files.

- references/fastqc_interpretation.md: full module-by-module thresholds + cases
- references/tools_and_install.md: install commands, tool flags, command recipes
- references/trimming_decisions.md: when/how to trim (fastp vs Cutadapt), pitfalls