read-qc/contamination-screening/SKILL.md
Detect sample contamination and cross-species reads using FastQ Screen. Screen reads against multiple reference genomes to identify bacterial, viral, adapter, or sample swap contamination. Use when suspecting cross-contamination or working with samples prone to microbial contamination.
npx skillsauth add GPTomics/bioSkills bio-read-qc-contamination-screeningInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BBTools 39.0+, Bowtie2 2.5.3+, FastQ Screen 0.15+, FastQC 0.12+, MultiQC 1.21+
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Screen FASTQ files against multiple genomes to identify contamination sources using FastQ Screen.
"Check for contamination in sequencing data" -> Align a sample of reads against multiple reference genomes to identify cross-species or cross-sample contamination.
fastq_screen --conf fastq_screen.conf reads.fqFastQ Screen aligns a subset of reads against multiple reference genomes to identify:
# Screen against configured genomes
fastq_screen sample.fastq.gz
# Multiple files
fastq_screen *.fastq.gz
# Specify output directory
fastq_screen --outdir qc_results/ sample.fastq.gz
# Custom config file
fastq_screen --conf my_screen.conf sample.fastq.gz
Create fastq_screen.conf:
# Database locations
DATABASE Human /path/to/human/genome
DATABASE Mouse /path/to/mouse/genome
DATABASE Ecoli /path/to/ecoli/genome
DATABASE PhiX /path/to/phix/genome
DATABASE Adapters /path/to/adapters
DATABASE rRNA /path/to/rrna
# Aligner (bowtie2 recommended)
BOWTIE2 /path/to/bowtie2
# Or use BWA
# BWA /path/to/bwa
# Threads
THREADS 8
# Download common screening databases
fastq_screen --get_genomes
# Downloads to ~/fastq_screen_databases/
# Includes: Human, Mouse, Rat, E.coli, PhiX, Adapters, etc.
# Number of reads to sample (default 100000)
fastq_screen --subset 200000 sample.fastq.gz
# Use all reads (slow)
fastq_screen --subset 0 sample.fastq.gz
# Set threads
fastq_screen --threads 8 sample.fastq.gz
# Paired-end (screen R1 only by default)
fastq_screen sample_R1.fastq.gz
# Force screening both pairs
fastq_screen --paired sample_R1.fastq.gz sample_R2.fastq.gz
# Generate PNG plot (default)
fastq_screen sample.fastq.gz
# No plot (text only)
fastq_screen --nograph sample.fastq.gz
# Generate additional mapping statistics
fastq_screen --tag sample.fastq.gz
# Filter reads by mapping (keep unmapped to all genomes)
fastq_screen --filter 0000 sample.fastq.gz
# Keep only reads mapping to first genome (e.g., Human)
fastq_screen --filter 1--- sample.fastq.gz
Use --filter to select reads based on mapping status:
| Code | Meaning | |------|---------| | 0 | Did not map to genome | | 1 | Mapped uniquely | | 2 | Mapped more than once | | 3 | Mapped (unique or multi) | | - | Ignore this genome |
# Example: Keep reads mapping only to Human (first genome)
# Human:1, all others:0
fastq_screen --filter 10000 sample.fastq.gz
# Keep reads NOT mapping to anything (clean reads)
fastq_screen --filter 00000 sample.fastq.gz
| File | Description |
|------|-------------|
| *_screen.txt | Tab-delimited results |
| *_screen.png | Visualization |
| *_screen.html | HTML report |
#Fastq_screen version: 0.15.3
Genome #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
Human 100000 2000 2.00 95000 95.00 1000 1.00 1500 1.50 500 0.50
Mouse 100000 98000 98.00 100 0.10 50 0.05 1500 1.50 350 0.35
| Sample Type | Expected Pattern | |-------------|------------------| | Human sample | >90% Human, <1% others | | Mouse sample | >90% Mouse, <1% others | | Human + PhiX | >80% Human, ~10% PhiX | | Contaminated | Significant % to unexpected genome |
| Pattern | Likely Cause | |---------|--------------| | High adapter % | Library prep issue | | High PhiX % | Spike-in not removed | | High E.coli % | Bacterial contamination | | High rRNA % | rRNA depletion failed | | Multiple species | Sample swap or contamination |
FastQ Screen results are automatically detected by MultiQC:
# Screen all samples
for f in *.fastq.gz; do
fastq_screen --outdir screen_results/ "$f"
done
# Aggregate with MultiQC
multiqc screen_results/
# Index a FASTA file
bowtie2-build reference.fa reference
# Add to config
# DATABASE MyGenome /path/to/reference
| Genome | Purpose | |--------|---------| | Human (GRCh38) | Human samples | | Mouse (GRCm39) | Mouse samples | | E. coli | Bacterial contamination | | PhiX | Illumina spike-in | | Adapters | Library prep | | rRNA | Ribosomal RNA | | Vectors | Cloning vectors | | Mycoplasma | Cell culture contamination |
# Download databases
fastq_screen --get_genomes
# Screen samples
fastq_screen --outdir screen_results/ --threads 8 *.fastq.gz
# Check results
multiqc screen_results/
# Screen and tag reads
fastq_screen --tag sample.fastq.gz
# Filter to keep only Human reads (assuming Human is first database)
fastq_screen --filter 3----- --tag sample.fastq.gz
# Or use BBDuk for removal
bbduk.sh in=sample.fastq.gz out=clean.fastq.gz \
ref=contaminants.fa k=31 hdist=1
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.