read-qc/fastp-workflow/SKILL.md
All-in-one read preprocessing with fastp including adapter trimming, quality filtering, deduplication, base correction, and HTML report generation. Use when preprocessing Illumina data and wanting a single fast tool instead of separate Cutadapt, Trimmomatic, and FastQC steps.
Reference examples tested with: FastQC 0.12+, fastp 0.23+
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package> then help(module.function) to check signatures
- <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
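The version check described above can be scripted. The following is a minimal sketch, assuming that `fastp --version` prints a banner containing a dotted version such as `fastp 0.23.4`; the helper names (`parse_version`, `installed_fastp_version`, `is_supported`) are hypothetical and not part of fastp.

```python
import re
import subprocess

# Minimum taken from "Reference examples tested with: fastp 0.23+".
MIN_FASTP = (0, 23, 0)


def parse_version(text):
    """Return the first dotted version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component (e.g. "0.23") is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_fastp_version():
    """Run `fastp --version` and parse the banner.

    Assumption: the banner may be written to stderr rather than stdout,
    so both streams are searched.
    """
    result = subprocess.run(
        ["fastp", "--version"], capture_output=True, text=True, check=False
    )
    return parse_version(result.stdout + result.stderr)


def is_supported(version, minimum=MIN_FASTP):
    """True when a parsed version tuple is at least the tested minimum."""
    return version is not None and version >= minimum
```

Calling `is_supported(installed_fastp_version())` raises `FileNotFoundError` when fastp is not on the PATH, which is itself a useful signal before running any of the examples.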
All-in-one preprocessing tool that handles adapter trimming, quality filtering, deduplication, and report generation in a single pass.
"Preprocess FASTQ reads with fastp" -> Run adapter trimming, quality filtering, and QC reporting in a single pass.
fastp -i R1.fq -I R2.fq -o clean_R1.fq -O clean_R2.fq --html report.html
fastp -i input.fastq.gz -o output.fastq.gz
fastp -i R1.fastq.gz -I R2.fastq.gz -o R1_clean.fastq.gz -O R2_clean.fastq.gz
fastp -i R1.fq.gz -I R2.fq.gz \
-o R1_clean.fq.gz -O R2_clean.fq.gz \
-h sample_report.html \
-j sample_report.json
fastp auto-detects Illumina adapters by default.
# Auto-detect (default)
fastp -i in.fq -o out.fq
# Specify adapters manually
fastp -i in.fq -o out.fq \
--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# Paired-end with manual adapters
fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq \
--adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
# Disable adapter trimming
fastp -i in.fq -o out.fq --disable_adapter_trimming
# Adapter FASTA file
fastp -i in.fq -o out.fq --adapter_fasta adapters.fa
# Per-base quality threshold (default Q15)
fastp -i in.fq -o out.fq -q 20
# Mean read quality threshold
fastp -i in.fq -o out.fq -e 25
# Max unqualified bases percent (default 40)
fastp -i in.fq -o out.fq -q 20 --unqualified_percent_limit 30
# Disable quality filtering
fastp -i in.fq -o out.fq --disable_quality_filtering
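To make the interaction between `-q`, `--unqualified_percent_limit`, and `-e` concrete, here is a simplified sketch of the per-read decision. It is an illustration, not fastp's source: the function name is hypothetical, and the boundary handling (strict versus non-strict comparisons) is an assumption.

```python
def passes_quality_filter(quals, qualified_phred=15,
                          unqualified_percent_limit=40, average_qual=0):
    """Decide whether a read survives quality filtering (simplified model).

    quals: list of per-base Phred scores for one read.
    qualified_phred: bases below this score count as unqualified (-q).
    unqualified_percent_limit: max percent of unqualified bases allowed.
    average_qual: minimum mean quality (-e); 0 means no requirement.
    """
    if not quals:
        return False
    unqualified = sum(1 for q in quals if q < qualified_phred)
    # Assumption: a read exactly at the limit is kept.
    if unqualified * 100 / len(quals) > unqualified_percent_limit:
        return False
    if average_qual > 0 and sum(quals) / len(quals) < average_qual:
        return False
    return True


# Six good bases and four poor ones: 40% of bases fall below Q20.
read = [30] * 6 + [10] * 4
print(passes_quality_filter(read, qualified_phred=20))
print(passes_quality_filter(read, qualified_phred=20,
                            unqualified_percent_limit=30))
```

Tightening `--unqualified_percent_limit` from 40 to 30 is enough to reject this example read, even though `-q` is unchanged.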
# Sliding window scanned 5' to 3'; trims from the first failing window to the 3' end (recommended)
fastp -i in.fq -o out.fq \
--cut_right \
--cut_right_window_size 4 \
--cut_right_mean_quality 20
# Sliding window from 5' end
fastp -i in.fq -o out.fq \
--cut_front \
--cut_front_window_size 4 \
--cut_front_mean_quality 20
# Both ends
fastp -i in.fq -o out.fq \
--cut_front --cut_tail \
--cut_front_window_size 4 \
--cut_front_mean_quality 20 \
--cut_tail_window_size 4 \
--cut_tail_mean_quality 20
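The sliding-window behaviour of `--cut_right` can be sketched in a few lines. This is a simplified model under stated assumptions, not fastp's implementation: the window moves from the 5' end toward the 3' end, and the read is cut at the start of the first window whose mean quality falls below the threshold. The function name and the handling of reads shorter than the window are hypothetical.

```python
def cut_right_length(quals, window_size=4, mean_quality=20):
    """Return how many leading bases are kept by a cut_right-style trim.

    quals: list of per-base Phred scores, 5' to 3'.
    """
    n = len(quals)
    # Assumption: reads shorter than one window are left untouched.
    if n < window_size:
        return n
    for start in range(n - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < mean_quality:
            # Drop this window and everything to its right.
            return start
    return n


# Ten high-quality bases followed by four low-quality bases.
quals = [30] * 10 + [10] * 4
print(cut_right_length(quals))
```

Because the cut happens at the first failing window, one low-quality stretch in the middle of a read removes everything after it. `--cut_tail` only trims inward from the 3' end.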
# Minimum length (default 15)
fastp -i in.fq -o out.fq -l 36
# Maximum length
fastp -i in.fq -o out.fq --length_limit 150
# Required length (discard shorter AND longer)
fastp -i in.fq -o out.fq -l 100 --length_limit 100
# Trim poly-G (NovaSeq/NextSeq artifacts) - auto-enabled for these platforms
fastp -i in.fq -o out.fq --trim_poly_g
# Disable poly-G trimming
fastp -i in.fq -o out.fq --disable_trim_poly_g
# Trim poly-X (any homopolymer)
fastp -i in.fq -o out.fq --trim_poly_x
# Custom poly-G minimum length (default 10)
fastp -i in.fq -o out.fq --trim_poly_g --poly_g_min_len 5
# Max N bases (default 5)
fastp -i in.fq -o out.fq -n 3
# Relax N filtering by raising the limit (reads with more than 50 N bases are still discarded)
fastp -i in.fq -o out.fq --n_base_limit 50
# Enable deduplication
fastp -i in.fq -o out.fq --dedup
# Accuracy level (1-6, higher = more memory, default 3)
fastp -i in.fq -o out.fq --dedup --dup_calc_accuracy 4
# Enable overlap-based correction
fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq --correction
# Required overlap length (default 30)
fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq \
--correction --overlap_len_require 20
# Merge overlapping paired reads
fastp -i R1.fq -I R2.fq \
--merge --merged_out merged.fq \
-o R1_unmerged.fq -O R2_unmerged.fq
# UMI in read (extract to header)
fastp -i in.fq -o out.fq \
--umi --umi_loc read1 --umi_len 8
# UMI in separate read
fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq \
--umi --umi_loc index1
# UMI locations: index1, index2, read1, read2, per_index, per_read
fastp \
-i raw_R1.fastq.gz -I raw_R2.fastq.gz \
-o clean_R1.fastq.gz -O clean_R2.fastq.gz \
--detect_adapter_for_pe \
--cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
-q 20 -l 36 \
--thread 8 \
-h sample_fastp.html -j sample_fastp.json
fastp \
-i raw_R1.fastq.gz -I raw_R2.fastq.gz \
-o clean_R1.fastq.gz -O clean_R2.fastq.gz \
--detect_adapter_for_pe \
--trim_poly_g \
--cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
-q 20 -l 36 \
--thread 8 \
-h sample_fastp.html -j sample_fastp.json
fastp \
-i raw_R1.fastq.gz -I raw_R2.fastq.gz \
-o clean_R1.fastq.gz -O clean_R2.fastq.gz \
--detect_adapter_for_pe \
--cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
-q 20 -l 50 \
--thread 8 \
-h sample_fastp.html -j sample_fastp.json
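The three paired-end commands above differ only in `--trim_poly_g` and the `-l` value, so they can be generated from one helper when processing many samples. The sketch below only builds the argument list from flags already shown; the function name, the `out_prefix` naming scheme, and the parameter names are hypothetical.

```python
import shlex


def build_fastp_pe_command(r1, r2, out_prefix, min_length=36,
                           trim_poly_g=False, threads=8):
    """Build a paired-end fastp command as an argument list."""
    cmd = [
        "fastp",
        "-i", r1, "-I", r2,
        "-o", f"{out_prefix}_R1.fastq.gz",
        "-O", f"{out_prefix}_R2.fastq.gz",
        "--detect_adapter_for_pe",
    ]
    if trim_poly_g:
        # Used in the second command above (poly-G artifacts).
        cmd.append("--trim_poly_g")
    cmd += [
        "--cut_right",
        "--cut_right_window_size", "4",
        "--cut_right_mean_quality", "20",
        "-q", "20",
        "-l", str(min_length),
        "--thread", str(threads),
        "-h", f"{out_prefix}_fastp.html",
        "-j", f"{out_prefix}_fastp.json",
    ]
    return cmd


cmd = build_fastp_pe_command("raw_R1.fastq.gz", "raw_R2.fastq.gz", "clean",
                             trim_poly_g=True)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. Keeping the arguments as a list avoids shell quoting problems with unusual file names.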
| File | Description |
|------|-------------|
| *.html | Interactive HTML report |
| *.json | Machine-readable statistics |
| Output FASTQ | Processed reads |
import json

with open('sample_fastp.json') as f:
    report = json.load(f)

summary = report['summary']
print(f"Total reads: {summary['before_filtering']['total_reads']}")
print(f"Passed reads: {summary['after_filtering']['total_reads']}")
print(f"Q20 rate: {summary['after_filtering']['q20_rate']:.2%}")
print(f"Q30 rate: {summary['after_filtering']['q30_rate']:.2%}")
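When many samples are processed, the same fields can be collected into one record per report. The following sketch uses only the JSON keys read above; the function name `summarize_fastp_report` and the derived `pass_rate` field are hypothetical additions, and the example dictionary is synthetic data made up for illustration.

```python
def summarize_fastp_report(report):
    """Extract read counts and quality rates from a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    total = before["total_reads"]
    passed = after["total_reads"]
    return {
        "total_reads": total,
        "passed_reads": passed,
        # Guard against empty input files.
        "pass_rate": passed / total if total else 0.0,
        "q20_rate": after["q20_rate"],
        "q30_rate": after["q30_rate"],
    }


# Synthetic report for illustration only, not real fastp output.
example = {
    "summary": {
        "before_filtering": {"total_reads": 1000},
        "after_filtering": {"total_reads": 900,
                            "q20_rate": 0.98, "q30_rate": 0.95},
    }
}
print(summarize_fastp_report(example))
```

A low `pass_rate` for one sample relative to the others is a quick flag for inspecting that sample's HTML report.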
# Set threads (default 3)
fastp -i in.fq -o out.fq --thread 8
# Discard the HTML report by writing it to /dev/null
fastp -i in.fq -o out.fq --html /dev/null
# Process from stdin
zcat in.fq.gz | fastp --stdin -o out.fq