GPTomics

568 verified skills

532,298 total stars

bio-comparative-genomics-whole-genome-duplication

Detect, date, and contextualize whole-genome duplication (WGD / paleopolyploidy) events using wgd v2 (Chen et al 2024), KsRates (Sensalari 2022 substitution-rate-corrected Ks dating), DupGen_finder (Qiao 2019), MAPS (Li 2018 phylogenomic), POInT (Conant 2008 ordered-block), SLEDGe (2024 ML-based), Whale.jl (Bayesian DL+WGD), and synteny-anchored paranome construction. Use when identifying ancient polyploidy from Ks distributions and synteny block analysis, positioning WGD events relative to speciation, distinguishing tandem from segmental from WGD duplications, dating the 2R/3R vertebrate / fish / salmonid WGDs, building paranome and Ks-age mixture models, applying KsRates substitution-rate correction across lineages, or testing alternative biased-fractionation / dosage-balance models post-WGD.

development 1,065

bio-clip-seq-ago-clip-mirna-targets

Identify direct miRNA-target interactions from AGO HITS-CLIP, AGO-CLEAR-CLIP (chimeric reads), HEAP (Halo-Ago2 mouse), chimeric eCLIP / miR-eCLIP (deep miRNA-target profiling), or CLASH using chimeric-read processing pipelines, seed-pairing analysis, and 3' auxiliary pairing rules. Use when distinguishing direct miRNA targets from indirect, integrating CLIP-derived target maps with TargetScan / miRDB / DIANA predictions, applying canonical 7mer-8mer seed matching with 3' UTR context, or recovering miRNA-mRNA chimeras at scale.

tools 1,065

bio-clip-seq-clip-motif-analysis

Discover RBP binding motifs from CLIP-seq peaks or single-nucleotide crosslink sites using HOMER, MEME/STREME, kpLogo, mCross (CL-position-registered motifs), PEKA (positional k-mer enrichment), RBPamp (affinity), and RNA Bind-n-Seq (RBNS) cross-validation. Use when characterizing RBP sequence specificity, registering motifs to crosslink positions, validating in vivo CLIP motifs against in vitro RBNS Kd, reconciling motif disagreements across tools, or correcting for the uracil crosslinking bias that contaminates raw CLIP motif logos.

tools 1,065

bio-clip-seq-binding-site-annotation

Annotate CLIP-seq peaks or crosslink sites to RNA features (5'UTR, CDS, 3'UTR, intron, splice junction, snoRNA, tRNA, ncRNA, repeat elements) with ChIPseeker, RCAS, RBP-Maps (Yeo splicing regulatory maps), and bedtools, applying feature-priority hierarchies, transcript-context resolution, and metagene aggregation. Use when characterizing where in transcripts an RBP binds, comparing peak distribution across regions, generating splicing-regulatory maps relative to alternative-splicing events, or distinguishing exonic vs intronic vs UTR binding.

tools 1,065

bio-clip-seq-clip-preprocessing

Preprocess CLIP-seq reads (eCLIP, iCLIP, iCLIP2, iCLIP3, irCLIP, PAR-CLIP, FLASH) with protocol-specific UMI extraction, adapter trimming, length filtering, and post-alignment PCR-duplicate collapse. Use when raw CLIP FASTQ must be turned into deduplicated, crosslink-preserving BAM input for peak calling; choosing between two-pass and single-pass adapter trimming; deciding minimum read length; or mapping UMI patterns to specific eCLIP/iCLIP/iCLIP2/iCLIP3 library preps.

tools 1,065

bio-comparative-genomics-ortholog-inference

Infer orthologous genes and gene families across species using OrthoFinder3 (HOG-based phylogenetic orthology), SonicParanoid2, Broccoli, ProteinOrtho, OMA / FastOMA hierarchical orthologous groups, eggNOG-mapper, JustOrthologs, and TOGA whole-genome-alignment orthology. Use when building single-copy ortholog sets for phylogenomics, classifying co-orthologs and in/out-paralogs after gene duplication, propagating functional annotation via orthology with awareness of the ortholog conjecture, distinguishing speciation from duplication via gene-tree species-tree reconciliation, computing Quest-for-Orthologs benchmark performance, or running synteny-aware ortholog detection in WGD-affected lineages.

development 1,065

bio-comparative-genomics-comparative-annotation-projection

Project gene annotations across genomes using TOGA (Kirilenko 2023 whole-genome-alignment chain-based projection with intactness classification), CESAR 2.0 (Sharma, Schwede & Hiller 2017 codon-aware exon projection), Liftoff (Shumate & Salzberg 2021 reference-based annotation transfer), liftOver (UCSC), GeMoMa (Keilwagen 2019 evidence-based projection), and Comparative Annotation Toolkit (CAT). Use when transferring annotations from a well-annotated reference to query genome(s), classifying gene-loss vs gene-intact across many genomes at scale, building Zoonomia-style comparative annotations across hundreds of mammals or birds (Kirilenko 2023), detecting pseudogenization, projecting alternative isoforms, or selecting between WGA-anchored (TOGA) vs ortholog-based (Liftoff) annotation transfer strategies.

tools 1,065

bio-clip-seq-stamp-antibody-free

Profiles RNA-binding protein targets without antibody or UV crosslinking using STAMP (APOBEC1-RBP fusion, C-to-U editing), scSTAMP (single-cell), TRIBE/HyperTRIBE (ADAR-RBP, A-to-I editing), DART-seq (APOBEC1-YTH for m6A), or Bullseye/SAILOR edit-site detection pipelines. Use when antibody is unavailable or specificity is doubtful, when single-cell RBP profiling is needed (scSTAMP), or when in vivo RBP profiling without UV is preferred.

tools 1,065

bio-comparative-genomics-gene-tree-species-tree-reconciliation

Reconcile gene trees against a species tree under probabilistic models of duplication, transfer, and loss (DTL) using ALE (Szöllősi 2013 amalgamated likelihood), GeneRax (Morel 2020 ML reconciliation), AleRax (Morel 2024 co-estimation), Whale.jl (Bayesian DL+WGD), RANGER-DTL 2 parsimony, NOTUNG, ecceTERA, and Treerecs. Use when inferring ancestral gene-family content, distinguishing duplication from horizontal transfer from differential loss, rooting deep species trees from gene-content signals (STRIDE / Williams 2017 ALE-rooting), counting DTL events per branch, refining noisy gene trees against a species tree, modeling WGD events jointly with DTL, or producing publication-grade gene-family histories for phylogenomic / comparative analyses.

testing 1,065

bio-clip-seq-m6a-clip

Map N6-methyladenosine (m6A) RNA modifications at single-nucleotide resolution using miCLIP (Linder 2015), miCLIP2 + m6Aboost machine learning (Kortel 2021), GLORI (Liu 2023, antibody-free chemical conversion), DART-seq (Meyer 2019, APOBEC1-YTH fusion), m6Anet (nanopore direct RNA), or MeRIP-seq with calibration. Use when distinguishing antibody-based from antibody-free m6A detection methods, applying the DRACH motif constraint, reconciling cross-method disagreements (DART 44% in DRACH vs GLORI), or detecting m6Am at the cap.

tools 1,065

bio-comparative-genomics-genome-distance-and-species-delineation

Compute genome-to-genome distances (ANI, AAI, dDDH, k-mer Mash) and assign taxonomic classifications using skani (Shaw 2023), FastANI (Jain 2018), pyani / pyANI ANIb / ANIm, OrthoANI (Lee 2016), AAI (amino-acid identity), dDDH via TYGS / GGDC, GTDB-Tk (Chaumeil 2020 standard prokaryote taxonomy), and Mash MinHash (Ondov 2016). Use when delineating prokaryote species (95% ANI threshold; Jain 2018 Nat Commun 9:5114), assigning genomes to GTDB taxonomy with ANI radius, computing genome similarity matrices for clustering, classifying archaea, evaluating MAG (metagenome-assembled genome) species assignment, applying skani for fast metagenomic ANI screening, or reconciling 16S rRNA-based taxonomy with whole-genome ANI.

testing 1,065
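As a rough illustration of the 95% ANI species threshold named in the entry above, the sketch below groups genomes whose pairwise ANI meets the cutoff. The function name `cluster_by_ani`, the input layout, and the use of single-linkage grouping are assumptions made for this example; tools such as GTDB-Tk or skani apply their own procedures (alignment-fraction filters, representative-based radii) that are not reproduced here.

```python
from typing import Dict, List, Tuple


def cluster_by_ani(
    genomes: List[str],
    ani_pairs: Dict[Tuple[str, str], float],
    threshold: float = 95.0,
) -> List[List[str]]:
    """Single-linkage grouping of genomes with pairwise ANI >= threshold.

    `ani_pairs` maps (genome_a, genome_b) to percent ANI (hypothetical layout).
    """
    parent = {g: g for g in genomes}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ani in ani_pairs.items():
        if ani >= threshold:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    pairs = {("g1", "g2"): 98.7, ("g2", "g3"): 95.0, ("g3", "g4"): 88.2}
    print(cluster_by_ani(["g1", "g2", "g3", "g4"], pairs))
```

Single linkage can chain distant genomes together through intermediates, so a real workflow would likely compare against designated species representatives instead.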

bio-comparative-genomics-whole-genome-alignment

Build whole-genome alignments using Progressive Cactus (Armstrong 2020 reference-free clade-level WGA), Minigraph-Cactus (Hickey 2024 pangenome-aware), LASTZ chain/net (UCSC pipeline), MUMmer4 (Marçais 2018 pairwise), minimap2 -x asm5/10/20 (Li 2018 fast pairwise), AnchorWave (Song 2022 WGD-aware), and Mauve / progressiveMauve (bacterial). Operates the HAL toolkit (Hickey 2013) for downstream extraction including halSynteny, halLiftover, halBranchMutations, and hal2maf. Use when constructing multi-species alignments for comparative-annotation projection (TOGA), synteny detection, conservation analyses (phyloP / PhastCons), or pangenome graph construction; selecting between reference-free (Cactus) and reference-anchored (LASTZ chains/nets) approaches; tuning sensitivity for closely vs distantly related genomes; or producing HAL files for genome-wide downstream tools.

tools 1,065

bio-comparative-genomics-gene-family-evolution

Model gene-family birth-death dynamics across a species tree using CAFE5 (Mendes et al 2020 Bioinformatics 36:5516 gamma-distributed rate categories), CAFE5-error (annotation-error-aware), Count (Csurös 2010 ancestral state reconstruction), BadiRate (Librado 2012 likelihood + parsimony), DupliPHY-Family, and ALE/AleRax (for per-family DTL; see [[gene-tree-species-tree-reconciliation]]). Test lineage-specific gene-family expansions and contractions, distinguish biological dynamics from annotation artifacts, account for assembly fragmentation, identify functional enrichment in expanded / contracted families. Use when correlating gene-family changes with phenotype evolution, ranking lineages by adaptive gene-family-rate shifts, post-WGD dosage-balance analysis, or building birth-death models from OrthoFinder presence/absence matrices.

development 1,065

bio-comparative-genomics-hgt-detection

Detect horizontal gene transfer (HGT / LGT) using compositional methods (GC%, codon usage, tetranucleotide z-scores via SIGI-HMM, AlienHunter, IslandViewer 4, IslandPath-DIMOB), phylogenetic-incongruence methods (AvP, HGTphyloDetect, ALE / GeneRax / AleRax reconciliation, RANGER-DTL), and BLAST-distribution methods (HGTector v2, DarkHorse, Alien Index). Use when screening prokaryote genomes for genomic islands and HGT events, distinguishing HGT from incomplete lineage sorting / differential gene loss / hybridization, mapping donor lineages via phylogenetic placement, separating eukaryotic HGT from contamination, ruling out gBGC as a false signal, or quantifying DTL rates with ALE/GeneRax on bacterial trees.

testing 1,065

bio-comparative-genomics-introgression-detection

Detect introgression and admixture between species or populations using Dsuite (Malinsky 2021 fast D-statistics), Patterson's D / ABBA-BABA test (Green 2010; Durand 2011), f4-ratio and f-branch statistic (Malinsky 2018), TreeMix (Pickrell & Pritchard 2012), HyDe (Blischak 2018), QuIBL (Edelman 2019), sprime (Browning 2018), Twisst (Martin 2017), PhyloNet (Than 2008) for explicit phylogenetic networks, and qpAdm / qpGraph (Patterson 2012). Distinguish introgression from incomplete lineage sorting (ILS), ancestral structure, ghost-lineage admixture, and rate variation. Use when testing inter-species gene flow, dating admixture events, identifying introgressed segments, building phylogenetic networks for reticulate evolution, or applying the ABBAclustering (Koppetsch-Malinsky-Matschiner 2024) framework for divergent-species gene flow.

development 1,065
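The ABBA-BABA test mentioned in the entry above reduces to a simple ratio, D = (ABBA - BABA) / (ABBA + BABA), computed over biallelic sites for the four-taxon arrangement ((P1, P2), P3), outgroup. The sketch below counts site patterns from hard allele calls and computes D. The function names and the tuple-per-site input are assumptions for illustration; Dsuite works from allele frequencies in a VCF and adds block-jackknife significance testing, which is not shown.

```python
from typing import Iterable, Tuple

Site = Tuple[str, str, str, str]  # alleles at (P1, P2, P3, outgroup)


def count_abba_baba(sites: Iterable[Site]) -> Tuple[int, int]:
    """Count ABBA and BABA patterns, treating the outgroup allele as ancestral (A)."""
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p3 == out:
            continue  # P3 carries the ancestral allele: not informative
        if p1 == out and p2 == p3:
            abba += 1  # derived allele shared by P2 and P3
        elif p2 == out and p1 == p3:
            baba += 1  # derived allele shared by P1 and P3
    return abba, baba


def patterson_d(abba: float, baba: float) -> float:
    """Patterson's D; positive values indicate excess P2-P3 allele sharing."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total


if __name__ == "__main__":
    sites = [("A", "G", "G", "A")] * 3 + [("G", "A", "G", "A")] + [("A", "A", "G", "A")]
    abba, baba = count_abba_baba(sites)
    print(abba, baba, patterson_d(abba, baba))
```

A nonzero D alone does not establish introgression; as the entry notes, ancestral structure, ghost lineages, and rate variation can produce similar asymmetries.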

bio-clip-seq-crosslink-site-detection

Detect single-nucleotide crosslink (CL) sites in CLIP-seq data using truncation patterns (iCLIP/eCLIP CITS), crosslink-induced mutations (HITS-CLIP CIMS deletions, PAR-CLIP T-to-C), or HMM/kernel-density methods (PureCLIP, PARalyzer, CTK). Use when single-nucleotide resolution is required for motif registration (mCross), allele-specific binding (BEAPR), variant-effect prediction, or comparing crosslink chemistry across CLIP variants.

tools 1,065

bio-comparative-genomics-synteny-analysis

Detect syntenic blocks and structural rearrangements between genomes using MCScanX (Wang 2012), JCVI/MCScan (Tang 2008 Python), GENESPACE (Lovell 2022) for orthology-anchored riparian visualization, SyRI for structural variation, AnchorWave for sequence-level synteny, i-ADHoRe 3.0 for highly diverged species, SynNet for synteny networks, and ntSynt for multi-genome macrosynteny. Use when identifying collinear gene blocks across species, distinguishing macrosynteny from microsynteny, detecting inversions/translocations/duplications, anchoring orthology in WGD lineages, producing publication riparian plots, computing synteny block age via Ks (cross-references whole-genome-duplication), or running synteny-aware ortholog inference in polyploids.

development 1,065

bio-clip-seq-clip-peak-calling

Call protein-RNA binding sites from CLIP-seq BAM with CLIPper, PureCLIP, Skipper, Piranha, omniCLIP, CTK, CLAM, or Paraclu. Use when choosing between coverage-based, HMM-based, beta-binomial window-based, and crosslink-site-based peak callers; applying ENCODE eCLIP thresholds (log2 IP/SMInput >= 3, -log10 p >= 3); deciding when SMInput is mandatory; or reconciling peak-set discordance between callers for the same RBP.

tools 1,065
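The ENCODE eCLIP thresholds quoted in the entry above (log2 IP/SMInput >= 3 and -log10 p >= 3) can be applied as a plain filter once each peak has those two values attached. This is a minimal sketch: the `Peak` record, its field names, and `filter_peaks` are hypothetical and do not correspond to any particular caller's output format.

```python
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    # Hypothetical record; real callers emit BED-like files with their own columns.
    name: str
    log2_fold_enrichment: float  # log2(IP / SMInput)
    neg_log10_p: float           # -log10(p-value)


def filter_peaks(
    peaks: Iterable[Peak],
    min_log2_fc: float = 3.0,
    min_neg_log10_p: float = 3.0,
) -> List[Peak]:
    """Keep peaks meeting both the enrichment and the significance threshold."""
    return [
        p for p in peaks
        if p.log2_fold_enrichment >= min_log2_fc and p.neg_log10_p >= min_neg_log10_p
    ]


if __name__ == "__main__":
    peaks = [
        Peak("peak1", 4.2, 6.0),   # passes both
        Peak("peak2", 3.5, 2.1),   # fails significance
        Peak("peak3", 1.8, 9.0),   # fails enrichment
    ]
    print([p.name for p in filter_peaks(peaks)])
```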

bio-clip-seq-clip-deep-learning

Predict RBP binding from RNA sequence using deep learning models (RBPNet sequence-to-signal, RNAProt RNN, GraphProt2 GCN with structure, DeepCLIP, DeepRiPe multi-modal CNN) for variant-effect prediction, in silico binding-site discovery, model interpretation, and transfer learning from CLIP and RBNS datasets. Use when computational prediction of RBP binding from sequence is needed, evaluating variant effects on binding without further wet-lab experiments, comparing model performance, or training a custom model on ENCODE eCLIP data.

tools 1,065

bio-comparative-genomics-ancestral-reconstruction

Reconstruct ancestral states at internal phylogenetic nodes for sequences (PAML codeml, IQ-TREE --ancestral, GRASP, FastML), discrete traits (corHMM hidden-rate Markov, ape::ace, phytools::make.simmap stochastic mapping, BayesTraits), and continuous traits (phytools::fastAnc, geiger Brownian/OU, RPANDA). Use when designing constructs for ancestral protein resurrection, tracing trait evolution along a tree, performing stochastic character mapping, testing models of trait evolution (BM vs OU vs EB), inferring ancestral genome content via Dollo or DTL reconciliation, or quantifying ancestral-state uncertainty for downstream comparative analyses.

tools 1,065

bio-comparative-genomics-pangenome-analysis

Build and analyze pangenomes for prokaryotes (Panaroo, PPanGGOLiN, PEPPAN, GET_HOMOLOGUES, anvi'o pangenomics) and eukaryotes (Minigraph-Cactus, PGGB, vg pangenome graphs). Implement Tettelin core/accessory/cloud genome decomposition (Tettelin 2005), Heaps' law open/closed pangenome modeling, gene presence/absence GWAS (Scoary, pyseer), pangenome graph variant calling (vg, PanGenie), and structural-variation graph indexing. Use when assembling species- or genus-level pan-gene catalogs, separating core from accessory/shell/cloud genes, testing gene-content associations with phenotypes, building pangenome graphs from haplotype-resolved assemblies, calling SVs from pangenome graphs, or selecting between bacterial-pangenome and eukaryotic-pangenome workflows.

development 1,065

bio-workflows-clip-pipeline

End-to-end CLIP-seq pipeline from FASTQ to ENCODE-compliant binding sites, single-nucleotide crosslink maps, annotation, motifs, and (optionally) differential binding. Use when running the full Yeo lab eCLIP / iCLIP / iCLIP2 / iCLIP3 / irCLIP / PAR-CLIP analysis with SMInput control, protocol-specific UMI extraction, ENCODE STAR parameters, CLIPper or Skipper peak calling with stringent log2 FC and -log10 p thresholds, IDR rescue and self-consistency QC, and downstream motif registration with mCross or PEKA.

tools 1,065

bio-comparative-genomics-positive-selection

Detect positive (diversifying / episodic / pervasive) selection using codon dN/dS frameworks. Implements PAML codeml site models (M0/M1a/M2a/M7/M8/M8a), branch models, branch-site model A (Zhang 2005), and HyPhy methods (BUSTED, BUSTED-S, BUSTED-MH, BUSTED-PH, MEME, FEL, FUBAR, aBSREL, SLAC, RELAX, GARD, FUBAR-MH). Includes McDonald-Kreitman framework (asymptotic alpha, impMKT, polyDFE, DFE-alpha, GRAPES) for within-species + divergence inference, RERconverge for trait-correlated rate shifts, CSUBST for convergent substitution, and PhyloAcc for accelerated noncoding evolution. Use when testing adaptive evolution at codons, branches, or full gene; running GARD recombination pre-screen; controlling alignment-error and gBGC false positives; reconciling PAML vs HyPhy results; or performing genome-scale selection scans.

development 1,065

bio-clip-seq-differential-clip

Identify differentially bound regions across CLIP-seq conditions (knockdown vs control, treatment vs vehicle, disease vs healthy) using DEWSeq (sliding-window DESeq2), Flipper (Skipper-downstream), ASpeak, edgeR, or limma-voom. Use when computing condition-level changes in RBP binding intensity, choosing peak-level vs window-level vs crosslink-level testing, designing replicate experiments, or distinguishing biological binding shifts from technical confounders.

tools 1,065

bio-clip-seq-clip-alignment

Align preprocessed CLIP-seq reads (eCLIP, iCLIP, iCLIP2, PAR-CLIP) to genome with STAR or bowtie2 using crosslink-preserving parameters, choosing between unique-mapper-only and multi-mapper-aware alignment for repeat-binding RBPs, deciding STAR vs HISAT2 memory trade-offs, and applying ENCODE-compatible filters. Use when turning preprocessed CLIP FASTQ into a deduplicated, MAPQ-filtered BAM ready for peak calling or crosslink-site detection.

tools 1,065

bio-clinical-databases-somatic-signatures

Extracts and assigns COSMIC v3.4 mutational signatures (86 SBS / 11 DBS / 18 ID / 21 CN / 16 SV) from somatic VCFs using SigProfilerSuite, MutationalPatterns, MuSiCal mvNMF, SigNet, or HRDetect. Use when characterizing DNA-damage etiology (BRCA1/2 HRD, MMR-D, POLE, APOBEC3A, UV, tobacco, aflatoxin, 5-FU/SBS17b, platinum, colibactin SBS88), routing PARP inhibitor decisions, or auditing de novo extraction vs refit choice for cohort size.

tools 1,055

bio-clinical-databases-polygenic-risk

Constructs and validates polygenic risk scores using LDpred2-auto, SBayesRC, MegaPRS, PRS-CS, PROSPER, MUSSEL, BridgePRS, JointPRS, PRSmix, or PGS Catalog Calculator with ancestry-aware reference panels (HapMap3, UKB-LD), ancestry-conditional calibration, and PRS-RS reporting standards. Use when computing PRS for cohorts, applying absolute-risk transformation, assessing cross-ancestry portability (Martin 2017 / Ding 2023 continuous ancestry), or auditing PRS manuscripts against the 22-item PRS-RS reviewer checklist.

tools 1,055

bio-workflows-neoantigen-pipeline

Orchestrates neoantigen discovery from somatic variants to ranked vaccine candidates, chaining HLA typing (OptiType/arcasHLA + LOHHLA), VEP annotation (Wildtype+Frameshift plugins) + expression/readcount annotation, proximal-variant phasing, pVACseq MHC-I/II binding, CCF/clonality, and immunogenicity/quality ranking. Use when recognizing that binding is single-digit PPV and the critical steps are downstream (full-resolution HLA + LOH gating, proximal-variant phasing, clonality from purity+CN not raw VAF, expression), sequencing normalize+annotate -> phase -> HLA -> binding -> quality in the defensible order, dropping candidates on LOH-lost alleles, supplying --phased-proximal-variants-vcf so the mutant peptide is real, or ranking WITHIN patient rather than a fixed IC50 threshold. Hands mechanism to the immunoinformatics component skills; not a re-teach of any single step.

tools 1,055

bio-clinical-databases-clinvar-lookup

Queries ClinVar for variant pathogenicity classifications, ClinGen VCEP curations, and somatic-vs-germline interpretations via REST API, weekly VCF, or bulk XML. Use when determining clinical significance, triangulating conflicting interpretations, or aggregating evidence against the ACMG/AMP framework with ClinGen SVI specifications.

tools 1,055

bio-clinical-databases-acmg-classification

Applies ACMG/AMP 2015 framework with ClinGen SVI specifications, Tavtigian 2018/2020 Bayesian point system, Abou Tayoun 2018 PVS1 decision tree, Pejaver 2022 and Bergquist 2025 calibrated PP3/BP4 thresholds for REVEL/BayesDel/AlphaMissense, Brnich 2020 PS3/BS3 OddsPath, Walker 2023 SpliceAI splicing framework, and AMP/ASCO/CAP 2017 tumor tiers. Use when classifying germline variants P / LP / VUS / LB / B, applying VCEP-specific CSpec rules, computing Whiffin BS1, or assigning cancer Tier I-IV per Li 2017.

tools 1,055

bio-clinical-databases-myvariant-queries

Queries myvariant.info BioThings aggregator for ClinVar, gnomAD, dbSNP, dbNSFP, COSMIC, CADD, and CIViC annotations in batched, version-tracked requests. Use when annotating variant lists from multiple databases simultaneously without managing per-source APIs, and when reproducibility-grade analyses require recording source data versions via _meta.

tools 1,055

bio-clinical-databases-gnomad-frequencies

Queries gnomAD v4 (807k samples), v3, v2.1.1, and constraint metrics with grpmax FAF95, bottleneck-group exclusion, LOEUF interpretation, SV/CNV/mtDNA catalogs, and Whiffin max-credible-AF framework. Use when filtering rare variants, applying ACMG BS1/BA1, ranking genes by LoF intolerance, or selecting between v2 (GRCh37 + chrX/Y constraint) and v4 (GRCh38 + 807k samples).

tools 1,055

bio-clinical-databases-hla-typing

Calls HLA class I and class II alleles at 2/4/6/8-field resolution from WGS/WES/RNA-seq/long-read data using OptiType, HLA-LA, T1K, Polysolver, HLA-HD, arcasHLA, StarPhase, or HIBAG imputation. Use when typing for HSCT, solid-organ transplant, neoantigen prediction, PGx screening (B*57:01, B*15:02, etc.), or disease-association studies, with reconciliation across tools and IPD-IMGT/HLA version mismatch handling.

tools 1,055

bio-clinical-databases-dbsnp-queries

Resolves rsIDs, navigates RsMergeArch/SNPHistory merge chains, and converts between rsID, SPDI, HGVS, and VCF representations using the dbSNP Build 156 JSON architecture. Use when normalizing variant identifiers, joining variant databases by cluster ID, or tracking deprecated rsIDs through historical merges.

tools 1,055

bio-clinical-databases-variant-prioritization

Prioritizes rare-disease variants from trio/quad WES/WGS with de novo (DeNovoGear, Triodenovo), compound-heterozygous phasing (WhatsHap), mosaic VAF tiering, phenotype-driven ranking (Exomiser, Phen2Gene, AMELIE), ClinGen gene-disease validity gating, and ACMG SF v3.2 secondary findings reporting. Use when running diagnostic exome / genome pipelines, identifying candidate Mendelian disease genes, screening for incidental findings, or auditing VUS reclassification cycles. The ACMG/AMP classification framework (PVS1 decision tree, Pejaver PP3/BP4 calibration, Tavtigian point system) is in clinical-databases/acmg-classification.

tools 1,055

bio-clinical-databases-pharmacogenomics

Queries PharmGKB / CPIC / DPWG for drug-gene interactions; calls CYP2D6/CYP2C9/CYP2C19/DPYD/TPMT/NUDT15/UGT1A1/SLCO1B1 star alleles and phenotype with PharmCAT, Cyrius (CYP2D6 structural variants), Aldy, Stargazer; applies Caudle 2020 activity-score translation. Use when implementing pharmacogenomic-guided prescribing, applying CPIC vs DPWG guidance, screening HLA risk alleles for ICI / antiepileptics / abacavir, or interpreting compound TPMT+NUDT15 thiopurine risk.

tools 1,055

bio-clinical-databases-tumor-mutational-burden

Calculates tumor mutational burden from WES/WGS/panel data with Friends of Cancer Research harmonization equations, per-assay calibration (FDA 10/Mb = 7.8 TSO500 = 8.4 OncomineTML), synonymous/indel/germline filtering, hypermutator tiering, blood TMB, and integration with HLA-LOH and neoantigen quality (Luksza 2017 fitness). Use when assessing ICI eligibility under tumor-specific cutoffs (McGrail 2021), comparing tissue vs bTMB, or auditing TMB-H reporting against ESMO 2024 and FDA pembrolizumab pan-tumor 2020.

tools 1,055

bio-clinical-databases-msi-detection

Calls microsatellite instability from WES/WGS/targeted-panel with MSIsensor, MSIsensor-pro, MSIsensor-ct (panel-aware), mSINGS, and MANTIS for FDA pembrolizumab MSI-H pan-tumor / Lynch syndrome / dMMR ICI biomarker. Use when stratifying ICI eligibility (Le 2015), pairing MSI with TMB-H (Sha 2020 / Salem 2018), screening Lynch syndrome (universal IHC + MSI), or distinguishing MSI-H tumors from POLE-exo hypermutator with overlapping signatures.

tools 1,055

bio-clinical-biostatistics-missing-data

Implements missing-data sensitivity analyses for confirmatory clinical trials including MMRM under MAR (with Kenward-Roger correction), reference-based multiple imputation (J2R, CR, CIR, LMCF per Carpenter-Roger 2013), Permutt delta-adjustment / tipping-point analysis, pattern-mixture identifying restrictions (CCMV, NCMV, ACMV), and the Cro vs Bartlett variance debate. Use when handling missing primary or secondary endpoint data in regulatory submissions following NRC 2010 and ICH E9(R1).

tools 1,033

bio-chipseq-super-enhancers

Identifies super-enhancers from H3K27ac, MED1, or BRD4 ChIP-seq using ROSE, ROSE2, LILY, HOMER -style super, and ENCODE dELS cross-referencing. Handles peak stitching parameters, ranking choices, hockey-stick inflection, marker choice (H3K27ac vs MED1/BRD4), and cross-condition comparison with spike-in normalization. Constructs core regulatory circuitry (Saint-Andre 2016) from SE-encoded TFs. Use when identifying cell-identity / cancer-associated regulatory domains, comparing super-enhancers between conditions, identifying master transcription factor networks, or predicting BET-inhibitor responsiveness.

development 1,033

bio-clinical-biostatistics-subgroup-analysis

Performs subgroup and heterogeneous treatment effect (HTE) analyses for clinical trials. Covers Mantel-Haenszel pooling, Breslow-Day, interaction tests in regression, RERI for additive interaction, modern data-adaptive HTE methods (STEPP, SIDES, causal forests, X/R-learners), Bayesian shrinkage (Dixon-Simon, MAP, EXNEX), graphical multiplicity (Bretz-Maurer), and credibility frameworks (Sun BMJ, EMA 2019). Use when analyzing treatment effects across patient subgroups for regulatory submissions or precision-medicine claims.

tools 1,033

bio-clinical-biostatistics-survival-analysis

Performs time-to-event analysis for clinical trials including Cox proportional hazards regression with PH diagnostics, restricted mean survival time (RMST) under non-PH, competing risks via Fine-Gray vs cause-specific Cox, weighted log-rank and MaxCombo for non-proportional hazards, recurrent events (Andersen-Gill, PWP, WLW), and interval-censored data. Use when analyzing time-to-event endpoints (OS, PFS, DOR, TTR, TTNT) in oncology or other clinical trials.

tools 1,033

bio-clinical-biostatistics-multiplicity-graphical

Implements multiplicity control for confirmatory clinical trials using graphical procedures (Bretz-Maurer-Hommel), gatekeeping (parallel, serial, mixed), Hochberg/Hommel/Holm with PRDS, and the closed-testing principle (Marcus-Peritz-Gabriel; Goeman 2021 admissibility). Covers FDA Multiple Endpoints Final Guidance (October 2022), graphical procedures via R gMCP, primary + key-secondary + subgroup hierarchies, and FWER vs FDR distinction. Use when designing the multiplicity strategy for confirmatory trials with multiple primary or key secondary endpoints.

tools 1,033
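Of the procedures listed in the entry above, Holm's step-down method is the simplest to show in a few lines: sort the p-values, compare the k-th smallest against alpha / (m - k + 1), and stop at the first failure. The sketch below is a plain-Python illustration with a hypothetical function name `holm_reject`; it is not the R gMCP implementation and does not cover graphical or gatekeeping procedures.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure controlling the family-wise error rate at alpha.

    Returns one rejection flag per hypothesis, in the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value (0-based) is tested at alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject


if __name__ == "__main__":
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Holm is uniformly at least as powerful as Bonferroni and needs no dependence assumption, which is one reason it often serves as a baseline before a graphical strategy is designed.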

bio-clinical-biostatistics-power-sample-size

Computes sample size and power for clinical trials including continuous, binary, and time-to-event endpoints; superiority, non-inferiority, and equivalence designs; FDA 2016 non-inferiority margin selection with M1/M2 framework; Schoenfeld 1981 and Lakatos 1988 for survival; Schuirmann TOST and 80-125% bioequivalence; minimum clinically important difference (MCID) vs δ distinction. Use when justifying trial size in protocol or SAP per CONSORT 2025 item 16a.

tools 1,033
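For the Schoenfeld 1981 approach named in the entry above, the required number of events for a two-sided log-rank test is commonly written as d = (z_{1-alpha/2} + z_{1-beta})^2 / (p (1 - p) (ln HR)^2), where p is the proportion allocated to one arm. The sketch below evaluates that formula with the standard library only. The function name and defaults are assumptions for illustration, and the result is a count of events, not patients: converting to sample size needs accrual, follow-up, and dropout assumptions that are out of scope here.

```python
import math
from statistics import NormalDist


def schoenfeld_events(
    hazard_ratio: float,
    alpha: float = 0.05,       # two-sided type I error
    power: float = 0.80,
    allocation: float = 0.5,   # proportion randomised to one arm
) -> int:
    """Required number of events under the Schoenfeld approximation (rounded up)."""
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    events = (z_alpha + z_beta) ** 2 / (
        allocation * (1 - allocation) * math.log(hazard_ratio) ** 2
    )
    return math.ceil(events)


if __name__ == "__main__":
    # HR 0.70, two-sided alpha 0.05, 80% power, 1:1 allocation
    print(schoenfeld_events(0.70))
```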

bio-clinical-biostatistics-trial-reporting

Prepares statistical reports for clinical trials following CONSORT 2025, SPIRIT 2025, ICH E9(R1) estimands, and FDA 2023 covariate adjustment guidance. Covers Table 1 generation, analysis populations (ITT/FAS/PP/Safety), the 5 ICH E9(R1) intercurrent-event strategies, MMRM under MAR (mmrm), reference-based MI (rbmi J2R/CR/CIR), Permutt tipping-point sensitivity, and Rubin's-rules vs frequentist variance debate. Use when preparing regulatory submissions, defining estimands, or implementing missing-data sensitivity analyses.

tools 1,033

bio-clinical-biostatistics-logistic-regression

Performs logistic regression for clinical trial outcomes (binary, ordinal, multinomial) with marginal-vs-conditional estimand reporting per FDA 2023 covariate adjustment guidance, g-computation/standardisation for marginal effects, modified Poisson for RR, Brant test for proportional odds, Firth penalty for separation, and Hauck-Donner detection. Use when modeling binary or ordinal endpoints in confirmatory or exploratory clinical trials.

tools 1,033

bio-workflows-clinical-trial-pipeline

End-to-end clinical trial analysis workflow from CDISC SDTM/ADaM loading through ICH E9(R1) estimand-driven primary analysis to CONSORT 2025 regulatory-compliant reporting. Covers data preparation, FDA 2023 marginal vs conditional logistic regression, categorical tests with Boschloo, modern HTE/subgroup methods, missing-data sensitivity (MMRM, reference-based MI, Permutt tipping point), graphical multiplicity (Bretz-Maurer), survival analysis (Cox/RMST/competing risks) when applicable, and Table 1. Use when performing a complete analysis of clinical trial data.

tools 1,033

bio-chipseq-allele-specific-binding

Detects allele-specific transcription factor or histone modification binding from heterozygous-variant ChIP-seq using WASP (reference-bias filter; mandatory upstream), RASQUAL (joint QTL + bias-corrected testing), BaalChIP (Bayesian beta-binomial with copy-number-aware overdispersion), and AlleleSeq (personalized diploid genome). Handles imprinted-locus awareness, X-inactivation artifacts, cancer copy-number imbalance, and integration with downstream caQTL / bQTL mapping. Use when identifying variants with allelic effects on TF binding, fine-mapping causal regulatory variants, validating deep-learning variant predictions, or characterizing cis-acting regulatory effects.

development 1,033

bio-chipseq-chip-deep-learning

Trains and applies base-resolution deep learning models on ChIP-seq / ChIP-nexus / CUT&RUN data. Uses BPNet (Avsec 2021 Nat Genet 53:354; soft motif syntax from ChIP-nexus), ChromBPNet (Pampari et al 2024 bioRxiv; bias-factorized base-resolution profiles), Enformer (Avsec 2021 Nat Methods 18:1196; 196 kb input, ~100 kb effective receptive field), DeepSEA (Zhou 2015; multi-task CNN), and JASPAR 2026 deep-learning collection (1259 BPNet ChIP models). Performs in silico mutagenesis for variant-effect prediction, DeepLIFT/Grad attribution, and TF-MoDISco motif discovery from attribution scores. Use when predicting variant effects on TF binding, discovering soft motif syntax / cooperativity, integrating ChIP-seq with sequence-only predictions, or applying precomputed JASPAR Deep Learning models to new variants.

data-ai 1,033

bio-chipseq-motif-analysis

Discovers de novo motifs and tests known motif enrichment in ChIP-seq, ATAC-seq, or other peak sequences using HOMER, MEME-ChIP (STREME, CentriMo, TOMTOM, FIMO), monaLisa, and AME. Handles background selection (GC-matched, dinucleotide-shuffled, Markov order-2, peak-flanks), motif databases (JASPAR 2024 CORE PWMs, JASPAR 2026 deep-learning collection, HOCOMOCO v12, HOMER built-in), centrally-enriched motif testing, and differential motif analysis. Use when identifying TF binding motifs in peaks, testing for known TF enrichment, scanning for motif instances, comparing motif content between conditions, or interpreting motifs from deep learning models.

testing 1,033

bio-chipseq-differential-binding

Identifies differentially bound ChIP-seq regions between conditions using DiffBind, csaw (sliding windows), DESeq2/edgeR/PyDESeq2 on count matrices, NormR (control-aware), or MAnorm2. Distinguishes three distinct normalization problems (composition bias, trended bias, global shifts) and matches each to its appropriate fix including spike-in scaling. Use when comparing ChIP-seq binding between experimental conditions, choosing normalization for global vs local changes, integrating spike-in data, or reconciling DiffBind/DESeq2 disagreement.

testing1,033

bio-chipseq-cut-and-run-tag

Analyzes CUT&RUN (Skene Henikoff 2017) and CUT&Tag (Kaya-Okur 2019) chromatin profiling data. Handles SEACR vs MACS2 peak calling (with the btaf375 2025 benchmark guidance), pA-MNase vs pA-Tn5 vs pAG-Tn5 chimera differences, E. coli spike-in carryover normalization, IgG-only control logic (no input), characteristic fragment-size signatures (25-75 bp for CUT&Tag), and lower depth requirements (5M reads typical vs 25M for ChIP). Use when calling peaks from CUT&RUN/CUT&Tag, scaling by E. coli spike-in carryover, choosing SEACR norm mode, or comparing CUT&RUN/Tag results to traditional ChIP.

data-ai1,033

bio-chipseq-peak-annotation

Annotates ChIP-seq peaks to genomic features, nearest genes, ENCODE candidate cis-regulatory elements (cCREs), and regulatory domains. Uses ChIPseeker (R), HOMER annotatePeaks.pl (CLI), pyranges (Python), GREAT/rGREAT (regulatory domain gene-set enrichment), ChIP-Enrich (locus-length-adjusted), ENCODE SCREEN cCRE classification (PLS/pELS/dELS/CA-CTCF/CA-H3K4me3), and ENCODE-rE2G for cell-type-specific enhancer-gene linking. Handles nearest-TSS vs host-gene ambiguity, promoter window definition, and feature priority. Use when assigning genomic context to peaks, linking enhancer peaks to target genes, classifying peaks against ENCODE cCRE registry, or running gene-set enrichment on peak-associated genes.

tools1,033

bio-chipseq-spike-in-normalization

Normalizes ChIP-seq data using exogenous spike-in (ChIP-Rx with Drosophila chromatin per Orlando 2014 / Egan 2016; E. coli carryover for CUT&RUN/CUT&Tag). Distinguishes RRPM from Rx-Input scaling, integrates with DiffBind / DESeq2 / edgeR / csaw via sizeFactors and DiffBind library-size vectors, applies the Patel et al 2024 Nat Biotechnol failure-mode framework, and validates that normalization is applied at the read level (not peak counts). Use when global signal shifts are expected (HDACi, BETi, EZH2i, dosage, target knockdown), when ChIPseqSpikeInFree detects post-hoc shifts, or when validating internal-control regions before publication.

development1,033
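
For the spike-in normalization entry above, the RRPM idea (reference-adjusted reads per million, Orlando 2014) can be sketched as a per-sample scale factor of one million divided by the number of reads mapped to the spike-in genome. This is a minimal illustration, not the skill's own code; the sample names and read counts are hypothetical.

```python
def rrpm_scale_factor(spikein_reads):
    """Return the RRPM scale factor: 1e6 / reads mapped to the spike-in genome."""
    if spikein_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    return 1e6 / spikein_reads

# Hypothetical spike-in (e.g. Drosophila) mapped read counts per sample.
spikein_counts = {"DMSO": 2_000_000, "inhibitor": 4_000_000}

# A sample with proportionally more spike-in reads gets a smaller factor,
# which scales its target-genome signal down.
factors = {name: rrpm_scale_factor(n) for name, n in spikein_counts.items()}
```

A factor computed this way is typically applied when generating coverage tracks from reads (for example through the `--scaleFactor` option of deepTools `bamCoverage`) rather than to peak-level counts.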

bio-chipseq-chromatin-state-segmentation

Segments the genome into chromatin states from combinatorial histone modification and chromatin factor ChIP-seq data. Uses ChromHMM (multivariate HMM on binarized signal, v1.27), Segway (Dynamic Bayesian Network on continuous signal), EpiSegMix (flexible-distribution HMM with duration modeling, 2024), EpiLogos (multi-biosample visualization), IDEAS (cell-type-aware joint), and full-stack ChromHMM (Vu Ernst 2022) for cross-cell-type segmentations. Handles state-count selection (15 vs 18 vs 25 states), binarization choice, OverlapEnrichment / NeighborhoodEnrichment downstream analysis, and cross-biosample integration. Use when learning chromatin states from a histone mark panel, characterizing learned states by genomic feature enrichment, or comparing chromatin landscapes across cell types.

development1,033

bio-chipseq-qc

Assesses ChIP-seq quality across antibody specificity, fragmentation, enrichment, replicate concordance, and library complexity. Computes FRiP, NSC/RSC (phantompeakqualtools), library complexity (NRF/PBC1/PBC2), deepTools plotFingerprint (JS distance, AUC, synthetic JS), ChIPQC, IDR with ENCODE Nself/Nt rules, and detects hyper-ChIPable artifacts. Use when validating an antibody, diagnosing failed peak calls, deciding whether to proceed with downstream analysis, grading against ENCODE thresholds, or auditing replicate concordance.

tools1,033
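
The library-complexity metrics named in the QC entry above follow the ENCODE definitions: NRF is distinct positions over total reads, PBC1 is positions seen exactly once over distinct positions, and PBC2 is positions seen once over positions seen twice. A minimal sketch, assuming reads have already been reduced to hypothetical `(chrom, start, strand)` tuples:

```python
from collections import Counter

def library_complexity(positions):
    """Compute NRF, PBC1 and PBC2 from an iterable of read positions."""
    counts = Counter(positions)
    total = sum(counts.values())          # all mapped reads
    distinct = len(counts)                # unique genomic positions
    m1 = sum(1 for c in counts.values() if c == 1)  # positions with one read
    m2 = sum(1 for c in counts.values() if c == 2)  # positions with two reads
    return {
        "NRF": distinct / total if total else float("nan"),
        "PBC1": m1 / distinct if distinct else float("nan"),
        # PBC2 is undefined when no position has exactly two reads.
        "PBC2": m1 / m2 if m2 else float("inf"),
    }

# Toy input: six reads over four distinct positions.
reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"),
    ("chr2", 40, "+"), ("chr2", 40, "+"), ("chr2", 90, "-"),
]
metrics = library_complexity(reads)
```

These metrics are meant to be computed on alignments before duplicate removal, since deduplication forces every position to a single read.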

bio-chipseq-visualization

Visualizes ChIP-seq data using deepTools (computeMatrix, plotHeatmap, plotProfile, bamCoverage, bamCompare), pyGenomeTracks (modern INI-driven track plots), Gviz (R browser-style), EnrichedHeatmap (ComplexHeatmap-based), ChIPseeker tag heatmaps, and IGV batch screenshots. Handles bigWig normalization choices (CPM, BPM, RPGC, spike-in scaled), bamCompare operations (log2 ratio, subtract) with SES scaling, k-means clustering of heatmaps for biological subgrouping, and spike-in-scaled tracks for global-shift experiments. Use when generating publication-quality ChIP-seq signal heatmaps, profile plots, genome-browser tracks, or comparing samples visually.

tools1,033

bio-clinical-biostatistics-bayesian-trials

Designs Bayesian clinical trials including Phase I dose-finding (BOIN, CRM, EWOC, mTPI-2), meta-analytic-predictive (MAP) priors with robust mixtures for external data borrowing, EXNEX for basket trials, hierarchical models for safety AE (Berry-Berry), Bayesian platform trials (I-SPY 2, GBM AGILE, REMAP-CAP), and posterior probability stopping rules. Covers FDA Bayesian Devices Guidance (2010), FDA Bayesian Methodology in Drugs Draft (January 2026), BOIN Fit-for-Purpose qualification (December 2021), and Project Optimus dose-optimisation. Use when designing dose-finding studies, platform trials, or sensitivity analyses with informative priors.

tools1,033

bio-clinical-biostatistics-adaptive-designs

Designs adaptive clinical trials including group-sequential (O'Brien-Fleming, Pocock, Lan-DeMets spending), sample-size re-estimation (blinded Friede-Kieser, unblinded Cui-Hung-Wang, Mehta-Pocock promising zone), seamless Phase 2/3 with treatment-arm selection, population enrichment, and response-adaptive randomisation. Covers FDA 2019 Final Adaptive Designs Guidance, FDA 2022 Master Protocols, and ICH E20 Step 2b/3 draft (June 2025, NOT final). Use when planning interim analyses, sample-size re-estimation, or master/platform-trial designs.

tools1,033

bio-clinical-biostatistics-categorical-tests

Tests associations between categorical variables in clinical data using chi-square, Fisher's exact, Boschloo, Cochran-Mantel-Haenszel, and modern McNemar variants with calibrated confidence intervals (Wilson, Newcombe, Miettinen-Nurminen). Use when analyzing categorical outcomes, paired binary endpoints, or testing treatment-outcome independence in confirmatory or exploratory clinical trials.

tools1,033

bio-chipseq-peak-calling

Calls ChIP-seq peaks with MACS3, MACS2, HOMER, or SPP across narrow (TF) and broad (histone) modes. Handles input control matching, fragment-size modeling vs --nomodel, effective genome size, ENCODE-style IDR vs naive overlap, hyper-ChIPable artifacts, and aligner-specific shifts. Use when calling peaks from ChIP-seq alignments, choosing between narrow vs broad mode for a histone mark, deciding model vs nomodel for low-depth data, applying ENCODE pseudoreplicate IDR, or reconciling MACS vs HOMER vs SPP results.

development1,033

bio-clinical-biostatistics-cdisc-data

Reads, validates, and prepares CDISC SDTM and ADaM clinical trial data for analysis. Covers SDTM domain joins (DM, AE, EX, VS, LB, DS), ADaM architecture (ADSL, BDS, OCCDS, ADTTE) with traceability, treatment-emergent AE conventions, baseline derivation, SUPPQUAL/NSV handling, Define-XML 2.1, and Pinnacle 21 / CORE validation. Use when working with clinical trial datasets in CDISC SDTM/ADaM format, preparing analysis-ready data, or validating for regulatory submission.

tools1,033

bio-clinical-biostatistics-effect-measures

Computes and interprets treatment effect measures (OR, RR, RD, HR, NNT) with calibrated confidence intervals (Wilson, Newcombe, Miettinen-Nurminen, MOVER, profile likelihood, Bender NNT) and reports marginal vs conditional estimands per FDA 2023 covariate adjustment guidance. Use when reporting treatment effects in confirmatory trials, comparing effect sizes across studies, or constructing forest plots.

tools1,033
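
As an illustration of the effect measures entry above, the sketch below computes RD, RR, OR, and NNT from a 2x2 table, with a Wilson score interval per arm and a Newcombe hybrid-score interval for the risk difference. The event counts are hypothetical, and the function names are illustrative rather than taken from any package.

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(events, n, level=0.95):
    """Wilson score interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = events / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def effect_measures(e1, n1, e0, n0, level=0.95):
    """Effect measures for treatment (e1/n1) versus control (e0/n0)."""
    p1, p0 = e1 / n1, e0 / n0
    rd = p1 - p0
    l1, u1 = wilson_ci(e1, n1, level)
    l0, u0 = wilson_ci(e0, n0, level)
    # Newcombe hybrid score interval for the risk difference.
    rd_lo = rd - sqrt((p1 - l1) ** 2 + (u0 - p0) ** 2)
    rd_hi = rd + sqrt((u1 - p1) ** 2 + (p0 - l0) ** 2)
    return {
        "RD": rd,
        "RD_CI": (rd_lo, rd_hi),
        "RR": p1 / p0,
        "OR": (e1 * (n0 - e0)) / ((n1 - e1) * e0),
        "NNT": 1 / abs(rd) if rd else float("inf"),
    }

# Hypothetical trial: 30/100 events on treatment, 45/100 on control.
result = effect_measures(30, 100, 45, 100)
```

The example assumes no zero cells; RR and OR need a continuity correction or an exact method when a cell count is zero.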

bio-causal-genomics-transcriptome-wide-association

Performs gene-level association from GWAS summary statistics via genetically predicted tissue expression using FUSION, PrediXcan, S-PrediXcan, S-MultiXcan, UTMOST, MOSTWAS, kTWAS, EpiXcan, TIGAR-V2, and probabilistic fine-mapping with FOCUS and MA-FOCUS. Use when running TWAS from GWAS sumstats, prioritising candidate causal genes from a GWAS lead locus, picking single-tissue vs cross-tissue models, identifying LD-induced TWAS false positives, choosing ancestry-matched prediction weights, fine-mapping co-regulated TWAS hits, or triangulating TWAS with cis-eQTL Mendelian randomization and colocalization to nominate a causal gene.

testing1,028

bio-causal-genomics-mediation-analysis

Decompose total effects into direct and indirect paths through mediators using mediation, CMAverse 4-way, HIMA/HIMA2 high-dimensional, BAMA, two-step / MVMR mediation, or double-ML medDML. Use when testing whether a molecular phenotype (expression, methylation, protein) mediates a treatment-outcome relationship, decomposing exposure-mediator interaction via VanderWeele 4-way, screening high-dimensional EWAS mediators, or running MR-based mediation when sequential ignorability is implausible.

testing1,028

bio-causal-genomics-proteome-mr-drug-target

Runs cis-pQTL Mendelian randomization for drug-target validation using UKB-PPP (Olink), deCODE (SomaScan), Fenland, INTERVAL, ARIC, and FinnGen-PPP proteomes plus colocalization triangulation, phenome-wide on-target adverse-effect scans, cross-platform Olink/SomaScan replication, and PAV (protein-altering variant) sensitivity. Use when nominating or de-risking a drug target from plasma-proteome GWAS, mimicking pharmacological inhibition via cis-pQTL instruments, separating shared-causal from LD-confounded signal under the Schmidt 2020 cis-MR framework, screening on-target adverse phenotypes pheWAS-style, or producing publication-grade STROBE-MR plus PP.H4 evidence for a target gene.

development1,028

bio-causal-genomics-mendelian-randomization

Estimate causal effects of an exposure on an outcome from GWAS summary statistics using genetic instruments. Implements IVW (fixed/random), MR-Egger, weighted median/mode, MR-RAPS, CAUSE, GSMR-HEIDI, MR-PRESSO, MVMR, MR-Clust, LCV, and LHC-MR via TwoSampleMR, MendelianRandomization, MR-PRESSO, cause, and lhcMR. Use when testing causal direction between traits, evaluating drug-target effects via cis-pQTL/cis-eQTL, performing multivariable mediation MR, distinguishing causation from correlated horizontal pleiotropy, or producing STROBE-MR-compliant sensitivity batteries.

testing1,028
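
The fixed-effect IVW estimator listed in the Mendelian randomization entry above is a weighted regression of SNP-outcome effects on SNP-exposure effects through the origin, with weights 1/se_outcome^2. A minimal sketch with made-up summary statistics (it is not the TwoSampleMR implementation, and it ignores uncertainty in the exposure effects):

```python
from math import sqrt

def ivw_fixed(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted causal estimate and its SE."""
    num = sum(bx * by / se**2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx**2 / se**2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, sqrt(1 / den)

# Hypothetical instruments whose outcome effects are exactly 0.5 x exposure effects.
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.10, 0.075]
se_y = [0.01, 0.02, 0.015]
estimate, std_err = ivw_fixed(bx, by, se_y)
```

With real data the instruments would first be clumped and harmonised, and the estimate would be reported alongside the pleiotropy-robust methods named in the entry.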

bio-causal-genomics-fine-mapping

Resolves GWAS associations to candidate causal variants and credible sets via SuSiE, susie_rss, FINEMAP, CAVIAR, DAP-G, PAINTOR, PolyFun, SuSiEx, MultiSuSiE, and FOCUS. Use when narrowing a GWAS lead SNP to a 95 percent credible set, choosing between in-sample and reference LD, calibrating non-sparse loci with SuSiE-inf or FINEMAP-inf, integrating functional priors via PolyFun, fine-mapping across ancestries with SuSiEx, diagnosing LD mismatch via estimate_s_rss and kriging_rss, handling HLA or long-range LD, or feeding credible sets into coloc.susie for colocalization.

testing1,028

bio-causal-genomics-genetic-correlation

Estimates bivariate genetic correlation (rg) between traits from GWAS summary statistics or individual-level genotypes using cross-trait LDSC, HDL, LAVA, rho-HESS, GREML-bivariate, Popcorn, and HDL-L. Use when quantifying shared genetic architecture between two traits, screening MR validity before causal inference, distinguishing global from locus-level rg, estimating trans-ancestry rg, separating partial from full causation via LCV gcp, or producing a STROBE-MR-compliant cross-trait sensitivity battery. Cross-trait LDSC intercept absorbs sample overlap and is NOT a bias; HDL is biased under sample overlap above ~5%. High rg between exposure and outcome motivates CHP-aware MR sensitivity (CAUSE, LHC-MR).

testing1,028

bio-causal-genomics-genomic-sem

Fits structural equation models to GWAS summary statistics using GenomicSEM (Grotzinger 2019), including common-factor models, confirmatory factor models, ESEM, common-factor GWAS with Q_SNP heterogeneity, multivariate Wald tests, and stratified GenomicSEM partitioned heritability. Reconciles results against MTAG multi-trait analysis. Handles sample overlap via the LDSC sampling-covariance matrix, identifies and resolves Heywood cases, and verifies model fit with CFI / RMSEA. Use when modeling latent genetic architecture across correlated traits, running multivariate GWAS on a shared factor, distinguishing factor-mediated from trait-specific SNP effects, or comparing GenomicSEM common-factor results against MTAG when both depend on accurate sampling covariance.

testing1,028

bio-workflows-causal-genomics-pipeline

End-to-end post-GWAS causal inference pipeline orchestrating heritability partitioning, genetic correlation, Mendelian randomization with CHP-aware sensitivity (CAUSE / LHC-MR), colocalization, fine-mapping with SuSiE / FOCUS, mediation, TWAS triangulation, cis-pQTL drug-target MR, effector-gene prioritization (L2G / PoPS / cS2G), and GenomicSEM common-factor GWAS. Use when triangulating causal inference across multiple complementary methods, prioritizing tissues via stratified LDSC, nominating or de-risking drug targets, mapping a lead SNP to a candidate effector gene, modeling shared genetic architecture across correlated traits, or producing a STROBE-MR-compliant publication-grade evidence battery from GWAS summary statistics.

testing1,028

bio-causal-genomics-heritability-partitioning

Estimates SNP heritability and partitions it across functional annotations, cell types, and loci from GWAS summary statistics or individual-level genotypes. Implements LDSC, stratified LDSC with the baseline-LD model, Finucane 2018 cell-type prioritization, LDAK SumHer, HDL, HESS local heritability, BOLT-REML, GCTA-GREML, graphREML, and Popcorn cross-population genetic correlation. Use when computing total h2_SNP from summary stats, partitioning heritability across functional categories, prioritizing trait-relevant tissues or cell types from ENCODE/Roadmap chromatin marks, reconciling LDSC vs LDAK enrichment estimates, computing local heritability with HESS, estimating genetic correlation between traits, or producing publication-grade enrichment with calibrated sensitivity to model assumptions.

development1,028

bio-causal-genomics-colocalization-analysis

Test whether two or more traits share a causal variant at a locus using Bayesian colocalization (coloc.abf, coloc.susie, HyPrColoc, moloc, eCAVIAR, SMR/HEIDI, PWCoCo, SharePro). Use when integrating GWAS with eQTL/sQTL/pQTL/mQTL, distinguishing shared causal variants from LD-driven coincidence, handling allelic heterogeneity, choosing between single-causal vs multi-causal methods, picking PP.H4 thresholds, running sensitivity over p12, or harmonising summary statistics for colocalization.

testing1,028

bio-causal-genomics-effector-gene-prioritization

Maps GWAS-implicated loci to candidate effector (causal) genes by integrating variant-to-gene (V2G) features via Open Targets L2G (Mountjoy 2021), MAGMA gene-based association (de Leeuw 2015), FUMA SNP2GENE, cS2G combined SNP-to-gene scores (Gazal 2022), Polygenic Priority Scores (PoPS, Weeks 2023), FLAMES, INQUISIT, DEPICT, and enhancer-gene predictors (ABC, ENCODE-rE2G). Use when narrowing a GWAS lead locus to a candidate causal gene, picking between proximity, eQTL-based, and similarity-based prioritizers, integrating multi-evidence streams (fine-mapping, colocalization, ABC enhancer-gene, distance, chromatin), reconciling discordant L2G vs PoPS calls, prioritizing tissue-specific eQTL evidence, or triangulating across at least three independent lines of evidence for a publication-grade effector-gene nomination.

development1,028

bio-causal-genomics-pleiotropy-detection

Detect and adjust for horizontal pleiotropy in two-sample Mendelian randomization by distinguishing uncorrelated (UHP) from correlated (CHP) pleiotropy and choosing among Egger, MR-PRESSO, MR-RAPS, CAUSE, LHC-MR, LCV, MR-Clust, MR-Mix, and contamination-mixture methods. Use when validating an MR causal claim, running the STROBE-MR sensitivity battery, suspecting a shared heritable confounder, working under weak-instrument or polygenic-exposure regimes, or reconciling discordant estimates across robust methods.

testing1,028

bio-admet-prediction

Predicts ADMET properties using ADMETlab 3.0 (119 platform features, including 77 prediction models with modeled-endpoint uncertainty), ADMET-AI, DeepChem MolNet, and chemprop D-MPNN with explicit handling of OECD QSAR principles, applicability domain assessment, calibration, hERG/CYP/AMES endpoints, and PAINS / Lipinski / Ro5 / Veber / BBB druglikeness filters. Use when filtering compounds for drug-likeness, prioritizing leads by predicted safety, or building an in-house ADMET QSAR model.

development1,022

bio-reaction-enumeration

Enumerates virtual chemical libraries via reaction SMARTS transformations using RDKit and reaction templates, with explicit handling of atom mapping, RDChiral template extraction, product validation, RECAP/BRICS fragmentation, R-group decomposition, matched molecular pair analysis (MMPA), and Free-Wilson analysis. Use when generating combinatorial libraries from building blocks, enumerating analog series, deriving structure-activity rules, or extracting transformations from reaction data.

development1,022

bio-conformer-generation

Generates 3D conformer ensembles using RDKit ETKDGv3 with knowledge-enhanced distance geometry, MMFF94/UFF force-field optimization, CREST + GFN2-xTB semi-empirical refinement, and macrocycle-aware torsion preferences. Provides explicit decision rules for single vs ensemble conformer use, RMSD pruning, energy windows, conformer count, and force-field choice. Use when preparing 3D ligands for docking, generating descriptor input for 3D QSAR, or sampling macrocycle/peptide conformational ensembles.

development1,022

bio-scaffold-analysis

Analyzes chemical libraries by scaffold using Bemis-Murcko scaffolds, generic frameworks, cyclic skeletons, matched molecular pair (MMP) analysis via mmpdb, R-group decomposition, Free-Wilson analysis, scaffold hopping, and chemotype-aware ML train/test splits. Use when identifying chemotype clusters in a library, deriving SAR transformation rules, decomposing series into R-groups, performing scaffold-balanced QSAR splits, or planning analog campaigns.

tools1,022

bio-covalent-design

Designs covalent inhibitors and warheads targeting cysteine, lysine, serine, threonine, tyrosine, and aspartate residues, with explicit handling of warhead reactivity (acrylamide, chloroacetamide, vinyl sulfone, sulfonyl fluoride, fluorosulfate, aldehyde, boronate, nitrile), reversibility (kinact/Ki, t_residence), glutathione (GSH) stability, intrinsic reactivity assays, and covalent docking (DOCKovalent, GOLD, HCovDock). Use when designing covalent inhibitors for targeted covalent inhibition (TCI), KRAS G12C-style approaches, or rationalizing covalent SAR.

development1,022

bio-free-energy-calculations

Performs alchemical free-energy calculations including relative binding free energy (RBFE / FEP+) and absolute binding free energy (ABFE) via OpenFE, FEP+, GROMACS, AMBER pmemd, and OpenMM with explicit lambda scheduling, soft-core potentials, MBAR/BAR analysis, cycle-closure validation, and protocol-appropriate enhanced sampling. Compares ML alternatives (Boltz-2 affinity, DeepDock). Use when ranking analogs by binding affinity beyond docking accuracy, performing prospective lead optimization, or validating SAR predictions.

testing1,022

chemoinformatics/ml-docking-rescoring

Performs ML-based protein-ligand pose prediction and scoring using DiffDock-L (diffusion-based), Boltz-1 / Boltz-2 (foundation model with affinity), Chai-1, AlphaFold3 ligand, EquiBind, TANKBind, NeuralPLexer, and hybrid workflows (DiffDock pose + GNINA rescore + PoseBusters QC). Explicit handling of when ML beats classical docking, when classical beats ML, the PB-invalid pose problem, and rescoring as the standard production hybrid. Use when moder

development1,022

bio-molecular-descriptors

Calculates molecular fingerprints (ECFP/Morgan, FCFP, MACCS, RDKit, AtomPair, TopologicalTorsion, Avalon, MAP4, MHFP6) and physicochemical descriptors (Lipinski, QED, TPSA, Crippen LogP, 3D shape) with explicit choice tables, bit vs count semantics, and partial-charge model selection. Use when featurizing molecules for similarity, QSAR, virtual screening, or ML, or selecting the correct fingerprint for a chemotype-aware task.

testing1,022

bio-shape-similarity

Performs 3D shape-based similarity searching using ROCS (OpenEye), USRCAT (ultra-fast), Open3DAlign (RDKit), ESPSim (electrostatic), and ShaEP with explicit handling of Tanimoto-Combo (shape + color), shape vs ECFP4 complementarity, conformer-ensemble searching, alignment optimization, and scaffold hopping. Use when searching for shape-mimicking compounds with different scaffolds, identifying bioisosteric replacements, prospective scaffold hopping, or expanding hit series beyond 2D similarity.

testing1,022

bio-qsar-modeling

Builds QSAR / QSPR models using chemprop D-MPNN, MolFormer, Uni-Mol, ChemBERTa, random forest baselines, and Gaussian processes with explicit handling of OECD 5 principles, applicability domain (kNN, leverage, conformal prediction, Mahalanobis), scaffold-balanced splits, ensemble uncertainty, calibration (Platt, isotonic), feature importance (SHAP, atomic attribution), and prospective validation. Use when building target-specific predictive models from in-house bioassay data, ADMET endpoints, or selectivity profiles.

development1,022

bio-pose-validation

Validates docked / generated protein-ligand poses using PoseBusters physical-validity tests, strain energy quantification, geometric checks (planarity, vdW overlap, bond/angle distortion), and pose-energy reasonableness. Use when QC-ing docking results, comparing classical vs ML docking outputs, or filtering pose lists before SAR analysis.

testing1,022

bio-molecular-io

Reads, writes, and converts molecular file formats (SMILES, InChI, SDF V2000/V3000, MOL2, PDB, and BinaryCIF) using RDKit and Open Babel with rigorous handling of aromaticity perception, stereochemistry, implicit/explicit hydrogens, kekulization, and salt/fragment separation. Use when loading chemical libraries, debugging parse failures, or preparing molecules for downstream standardization, descriptor calculation, or docking.

development1,022

bio-pharmacophore-modeling

Builds and applies 3D pharmacophore models using RDKit Pharm3D, the apo2ph4 receptor-based workflow (Heider et al. 2023), Pharmer / Pharmit for search, and PharmacoForge for protein-pocket-conditioned pharmacophore generation (Flynn et al. 2025), covering ligand-based pharmacophores from active-set alignment and receptor-based pharmacophores from binding-pocket geometry. Explicitly handles feature types, geometric tolerances, partial matching, and pharmacophore-based virtual screening. Use when identifying scaffold-hopping candidates, building shape-and-feature search queries, or transferring SAR across chemotypes.

development1,022

bio-virtual-screening

Performs structure-based virtual screening using AutoDock Vina, SMINA, GNINA (CNN scoring), and DiffDock-L hybrid workflows with explicit choice rules across rigid vs flexible docking, cross-docking vs self-docking, binding-site detection (P2Rank, fpocket), receptor preparation (PDB2PQR, PROPKA), ligand preparation (meeko, OpenBabel), and ultralarge-library screening (ZINC22, Enamine REAL). Use when screening chemical libraries against a protein target to find candidate binders, ranking docking poses, or selecting a docking workflow for a specific scenario.

development1,022

bio-protac-degraders

Designs PROTACs, molecular glues, and bivalent degraders with explicit handling of E3 ligase choice (VHL, CRBN, IAP, MDM2, KEAP1), linker design (length, composition, rigidity), ternary complex prediction (PRosettaC, DeepTernary, AlphaFold3), cooperativity (alpha), DC50 / Dmax characterization, hook effect, and prediction-experiment reconciliation. Use when designing targeted protein degraders, planning linker SAR, predicting ternary complex stability, or building generative degrader workflows.

development1,022

bio-molecular-standardization

Standardizes molecular structures using the ChEMBL structure pipeline for normalization and parent selection plus RDKit rdMolStandardize for explicit custom steps such as tautomer canonicalization, salt/solvent stripping, charge handling, stereochemistry handling, mixture selection, and isotope normalization. Explicitly compares ChEMBL, canSARchem, RDKit, and PubChem standardization choices. Use when preparing libraries for QSAR training, joining datasets across sources, deduplicating compound collections, or building canonical compound registries.

development1,022

bio-atac-seq-motif-deviation

Analyze TF motif accessibility variability across samples or single cells using chromVAR. Use when identifying TF motifs whose accessibility correlates with conditions, computing per-sample motif z-scores after matched background correction, comparing to ArchR / Signac equivalents, or distinguishing motif-accessibility signal from per-site footprinting.

development1,022

bio-generative-design

Designs novel molecules using REINVENT 4 (de novo, scaffold decoration, linker design, R-group, molecular optimization), MolMIM, diffusion-based generators (DiGress, DiffSMol), and JT-VAE with explicit handling of multi-parameter optimization (MPO), goal-directed scoring functions, transfer/reinforcement/curriculum learning, synthetic accessibility scoring, and chemical space exploration vs exploitation. Use when designing new chemical matter against a target, decorating a scaffold, linking fragments, or optimizing a hit for multiple ADMET / activity properties simultaneously.

development1,022

bio-substructure-search

Searches molecular libraries for substructure matches using SMARTS patterns with explicit handling of recursive SMARTS, ring membership, aromaticity dialect, vector binding, atom map indices, and reactive/PAINS/REOS/Brenk filter catalogs. Use when filtering compounds by pharmacophore features, functional groups, scaffold matches, or screening for assay-interference / structural alerts.

development1,022

bio-similarity-searching

Performs molecular similarity searching using Tanimoto, Tversky, Dice, and cosine coefficients on bit/count fingerprints with explicit choice rules for symmetric vs asymmetric measures, scaffold-hopping vs lead-optimization regimes, activity-cliff diagnosis, and large-library nearest-neighbor methods (BulkTanimoto, MHFP6 LSH forest, USRCAT). Use when ranking compounds by structural resemblance to a query, clustering libraries, finding analogs, or diagnosing activity cliffs.

tools1,022
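
The symmetric vs asymmetric distinction in the similarity searching entry above can be seen by writing the coefficients over sets of on-bits. Tversky with alpha = beta = 1 reduces to Tanimoto and with alpha = beta = 0.5 to Dice; unequal weights make the measure asymmetric. A small dependency-free sketch with hypothetical bit sets (real workflows would use a fingerprint library):

```python
def tversky(a, b, alpha, beta):
    """Tversky index between two sets of on-bit indices."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared bits over union of bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def dice(a, b):
    """Dice coefficient: twice the shared bits over the total bit count."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Hypothetical on-bit indices for a query and a library compound.
query = {1, 2, 3, 4}
candidate = {3, 4, 5}

sym = tanimoto(query, candidate)
# alpha=1, beta=0 penalises only query bits missing from the candidate,
# which suits substructure-like ("is the query contained?") searches.
asym = tversky(query, candidate, alpha=1.0, beta=0.0)
```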

bio-retrosynthesis

Performs retrosynthetic planning using AiZynthFinder (template-based MCTS), maintained or version-pinned template-free models, ASKCOS, and emerging RetroSynFormer with explicit handling of route scoring, configurable MCTS rewards, building-block availability, and forward-prediction checks. Use when assessing synthetic feasibility of generated or selected molecules, planning multi-step syntheses, building synthesis-aware design pipelines, or screening libraries for retro-route feasibility.

development1,022

bio-workflows-multiome-pipeline

Orchestrates the end-to-end 10x Multiome (paired scRNA + scATAC) pipeline from Cell Ranger ARC output to a jointly-embedded, annotated object, chaining per-modality QC, AMULET fragment-based ATAC doublet detection, per-modality normalization (RNA SCT/PCA; ATAC TF-IDF/LSI), WNN (or MultiVI) integration, joint clustering, RNA-based annotation, and LinkPeaks peak-to-gene linking. Use when enforcing the shared cell-barcode intersection between modalities (cellranger-ARC not -atac), keeping per-modality QC/doublets before the joint embedding, dropping the depth-correlated LSI component, annotating identity from RNA (ATAC is regulatory state), treating peak-to-gene links as correlational hypotheses, or aggregating to pseudobulk for cross-condition DE. Hands mechanism to the single-cell and atac-seq component skills; not a re-teach of any single step.

development1,011

bio-atac-seq-atac-peak-calling

Call accessible chromatin regions from ATAC-seq BAM files using MACS3, MACS2, Genrich, or HMMRATAC. Use when identifying open chromatin from aligned ATAC-seq, choosing between point-source vs HMM peak callers, applying ENCODE-style pseudoreplicate IDR, removing blacklist regions, or fixing consensus peaks at a 501 bp width for downstream differential analysis.

development1,011

bio-atac-seq-allele-specific-accessibility

Detect allele-specific chromatin accessibility from ATAC-seq using WASP, GATK ASEReadCounter, or RASQUAL. Use when mapping cis-regulatory genetic variants from heterozygous SNPs, separating cis from trans regulation, building chromatin QTL (caQTL) maps, validating GWAS variant function with allelic imbalance, or detecting reference allele mapping bias before downstream analysis.

development1,011

bio-atac-seq-atac-qc

ATAC-seq library quality control: TSS enrichment, FRiP, fragment-size periodicity, library complexity (NRF/PBC1/PBC2), mitochondrial fraction, and ENCODE 4 thresholds. Use when assessing whether an ATAC-seq library passes ENCODE acceptance criteria, diagnosing transposition artefacts, comparing Omni-ATAC vs standard prep quality, or selecting which replicates to drop before peak calling.

development1,011

bio-atac-seq-co-accessibility

Infer cis-regulatory connections (peak-to-peak co-accessibility) from scATAC-seq using Cicero, ArchR getCoAccessibility, or SCENIC+. Use when linking enhancer accessibility to promoter accessibility, identifying enhancer-gene pairs from chromatin alone (without paired RNA), running gene-regulatory inference combining ATAC + RNA, or comparing predicted regulatory contacts against Hi-C/Micro-C ground truth.

testing1,011

bio-atac-seq-consensus-peakset

Build a differential-ready consensus peakset from per-replicate ATAC-seq peaks using iterative overlap removal, fixed-width re-centering, and majority-rule overlap. Use when generating a stable peak coordinate system for downstream differential accessibility, ML feature engineering, cross-sample comparison, or fixed-width peak counts; covers Corces 2018 iterative overlap (501 bp), DiffBind summit re-centering, and ENCODE consistency rules.

development1,011
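
The iterative overlap removal named in the consensus peakset entry above can be sketched as: re-center each peak to a fixed 501 bp window around its summit, rank by score, keep the best peak, and discard anything overlapping an already kept peak. This is a simplified quadratic-time illustration with hypothetical peaks; it omits score normalization across samples and the majority-rule replicate filter.

```python
def fixed_width(chrom, summit, score, half_width=250):
    """Re-center a peak to summit +/- half_width (501 bp by default), half-open."""
    start = max(0, summit - half_width)  # clipped at the chromosome start
    return (chrom, start, summit + half_width + 1, score)

def iterative_overlap(peaks):
    """Greedy overlap removal: higher-scoring peaks win over their neighbours."""
    kept = []
    for chrom, start, end, score in sorted(peaks, key=lambda p: -p[3]):
        overlaps = any(
            chrom == k_chrom and start < k_end and k_start < end
            for k_chrom, k_start, k_end, _ in kept
        )
        if not overlaps:
            kept.append((chrom, start, end, score))
    return sorted(kept, key=lambda p: (p[0], p[1]))

# Hypothetical summits pooled from several replicates: (chrom, summit, score).
summits = [("chr1", 1000, 50), ("chr1", 1200, 80), ("chr1", 2000, 10), ("chr2", 1000, 5)]
consensus = iterative_overlap([fixed_width(*s) for s in summits])
```

Because every retained peak has the same width, downstream read counts are directly comparable across peaks without a length correction.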

bio-atac-seq-deep-learning-atac

Sequence-based deep learning for ATAC-seq using chromBPNet, BPNet, scBasset, or Enformer. Use when correcting Tn5 bias with neural networks beyond k-mer models, predicting per-base accessibility profiles, scoring in silico variant effects at GWAS or rare-variant SNPs, discovering motifs via DeepLIFT/TF-MoDISco from a trained model, or generating cell-type-specific accessibility predictions for unobserved cell states.

testing1,011

bio-atac-seq-enhancer-gene-linking

Predict enhancer-gene regulatory connections from ATAC-seq using ABC, ENCODE-rE2G, HiChIP, or Cicero. Use when linking distal enhancers to target genes, choosing between contact-aware (ABC, ENCODE-rE2G), accessibility-only (Cicero), and orthogonal (HiChIP H3K27ac, EpiMap) approaches, validating predictions against CRISPRi-FlowFISH gold-standard, or building cell-type-specific regulatory maps for fine-mapping or therapeutic target discovery.

development1,011

bio-atac-seq-single-cell-atac

Process and analyze single-cell ATAC-seq data with Signac, ArchR, SnapATAC2, or Cell Ranger ATAC. Use when handling 10x scATAC or 10x Multiome (paired RNA+ATAC) data, performing per-cell QC, choosing between ArchR/Signac/SnapATAC2 ecosystems, building per-cluster consensus peaksets, integrating with paired scRNA-seq, detecting doublets (AMULET vs ArchR vs scDblFinder), or running pseudobulk differential accessibility per cluster.

development1,011

bio-atac-seq-differential-accessibility

Identify differentially accessible chromatin regions across conditions using DiffBind, csaw, DESeq2, or edgeR. Use when comparing ATAC-seq accessibility between treatment groups, choosing between consensus-peak vs sliding-window approaches, picking the correct normalization (full library vs reads-in-peaks), correcting batch with SVA/RUVseq, or interpreting log2FC and FDR thresholds in a chromatin context.

development1,011

bio-atac-seq-footprinting

Detect transcription factor binding footprints in ATAC-seq using TOBIAS, HINT-ATAC, Wellington, or scprinter. Use when identifying bound TF sites within accessible regions, correcting Tn5 insertion bias before footprinting, choosing between cleavage-based and aggregate-based footprinters, or comparing differential TF activity between conditions.

tools1,011

bio-atac-seq-nucleosome-positioning

Map nucleosome center positions, occupancy, and fuzziness from ATAC-seq fragment-size patterns using NucleoATAC, ATACseqQC, DANPOS3, or scprinter. Use when characterizing nucleosome organization at promoters and enhancers, calling +1/-1 nucleosomes flanking NFRs, generating V-plots for chromatin structure visualization, or comparing nucleosome positioning between conditions.

data-ai1,011

bio-workflows-chipseq-pipeline

Orchestrates the end-to-end ChIP-seq pipeline from FASTQ to blacklist-filtered, annotated peaks, chaining fastp QC, Bowtie2 alignment, pre-dedup library-complexity QC (NRF/PBC), duplicate removal, chrM + ENCODE-blacklist filtering, MACS3 peak calling against a matched input, IDR/consensus reproducibility, deepTools signal tracks, and ChIPseeker annotation. Use when committing the reference build + blacklist version + effective genome size once, pairing each IP with its matched control, computing complexity metrics BEFORE dedup, choosing narrow vs broad and MACS3 vs SEACR/Genrich, keeping per-replicate peaks for IDR, or avoiding depth-normalization that erases a spike-in global shift. Hands mechanism to the chip-seq component skills; not a re-teach of any single step.

tools1,011
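The pre-dedup library-complexity metrics named in the bio-workflows-chipseq-pipeline entry (NRF/PBC) can be illustrated with a minimal sketch. It assumes reads have already been reduced to `(chrom, position, strand)` tuples; the function name and input shape are hypothetical, not part of the skill itself.

```python
from collections import Counter

def library_complexity(read_positions):
    """Compute NRF, PBC1 and PBC2 from mapped read positions (before dedup).

    read_positions: iterable of hashable (chrom, pos, strand) tuples,
    one per mapped read. This input shape is an assumption for illustration.
    """
    counts = Counter(read_positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)                                  # unique positions
    one = sum(1 for c in counts.values() if c == 1)         # seen exactly once
    two = sum(1 for c in counts.values() if c == 2)         # seen exactly twice
    return {
        "NRF": distinct / total,                            # non-redundant fraction
        "PBC1": one / distinct,                             # bottlenecking coefficient 1
        "PBC2": one / two if two else float("inf"),         # bottlenecking coefficient 2
    }

reads = [
    ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"), ("chr2", 40, "+"),
    ("chr2", 90, "-"),
]
metrics = library_complexity(reads)
```

Running this after duplicate removal would make every position unique and push all three metrics toward their maximum, which is why the entry insists on computing them before dedup.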

bio-differential-splicing

Detects differential alternative splicing between conditions using rMATS-turbo (binomial LRT on junction counts), leafcutter (Dirichlet-multinomial GLM on intron clusters), MAJIQ V3 deltapsi/HET (Bayesian posterior on LSVs), SUPPA2 (empirical-null on TPM-derived PSI), or Shiba (junction-imbalance-corrected, 2025 SOTA at low coverage). Reports FDR-corrected significance and delta PSI effect sizes. Tools differ in statistical model, annotation dependence, calibration regime, and replicate-count requirements. Use when comparing splicing patterns between treatment groups, tissues, or disease states.

tools1,011

bio-workflows-atacseq-pipeline

Orchestrates the end-to-end bulk ATAC-seq pipeline from FASTQ to differential accessibility and TF footprints, chaining Nextera-aware fastp QC, Bowtie2 alignment, chrM removal, dedup, a single Tn5 +4/-5 shift, MACS3 peak calling, Corces fixed-width consensus, DiffBind/csaw differential accessibility, and TOBIAS footprinting. Use when committing the reference build + blacklist once, recognizing ATAC has NO input control (the shift-extend model IS the background), applying the Tn5 shift exactly once (never combining -f BAMPE with --shift/--extsize), removing chrM before calling, building a fixed-width consensus so per-sample counts are comparable, or choosing MACS3 vs Genrich vs HMMRATAC. Hands mechanism to the atac-seq component skills; not a re-teach of any single step.

development1,011
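The "apply the Tn5 shift exactly once" rule in the bio-workflows-atacseq-pipeline entry can be sketched as follows. The coordinate convention (0-based, half-open intervals, with the minus-strand 5' end at `end - 1`) and the `tn5_shifted` flag are assumptions made for illustration; real tools differ in how they define the minus-strand offset.

```python
def tn5_cut_site(start, end, strand):
    """Return the Tn5-shifted 5' position of a read.

    Assumes a 0-based, half-open interval [start, end).
    Plus-strand 5' ends move +4; minus-strand 5' ends move -5.
    """
    if strand == "+":
        return start + 4
    if strand == "-":
        return (end - 1) - 5
    raise ValueError(f"unknown strand: {strand!r}")

def shift_once(record):
    """Apply the shift to a read record, refusing to apply it twice.

    record: dict with 'start', 'end', 'strand' (hypothetical shape).
    """
    if record.get("tn5_shifted"):
        raise ValueError("Tn5 shift already applied to this record")
    shifted = dict(record)
    shifted["cut_site"] = tn5_cut_site(record["start"], record["end"], record["strand"])
    shifted["tn5_shifted"] = True
    return shifted

plus_read = shift_once({"start": 100, "end": 150, "strand": "+"})
minus_read = shift_once({"start": 100, "end": 150, "strand": "-"})
```

The guard flag stands in for whatever bookkeeping a pipeline uses to avoid stacking a second shift on top of one already applied upstream.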

bio-workflows-tcr-pipeline

Orchestrates an end-to-end immune-repertoire pipeline from FASTQ to clonotypes, diversity, overlap, somatic hypermutation and lineages, routing on two forks. Use when deciding bulk vs single-cell (bulk amplicon/RNA-seq -> MiXCR analyze preset -> VDJtools/immunarch depth-normalized diversity and overlap -> figures; 10x paired VDJ -> MiXCR 10x preset or Cell Ranger -> scirpy gene-expression integration, chain QC, clonotype clusters); and TCR vs BCR (TCR -> exact CDR3-nt+V/J clonotypes, VDJtools diversity is fine; BCR -> somatic hypermutation makes exact clonotypes wrong -> Immcantation distToNearest/findThreshold clonal clustering, germline reconstruction, SHM, Dowser lineages); selecting the MiXCR 4.x preset by chemistry and activating its license; downsampling to equal depth before comparing diversity or overlap; and optionally annotating antigen specificity.

tools1,006
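The "downsampling to equal depth before comparing diversity or overlap" step in the bio-workflows-tcr-pipeline entry can be sketched in plain Python. This is an illustration under stated assumptions: clonotype counts arrive as a dict, sampling is without replacement, and the function name and seed handling are hypothetical rather than taken from VDJtools or immunarch.

```python
import random
from collections import Counter

def downsample(clone_counts, depth, seed=0):
    """Subsample a repertoire to a fixed read depth without replacement.

    clone_counts: dict mapping clonotype label -> read count.
    depth: target number of reads; must not exceed the repertoire size.
    """
    pool = [clone for clone, n in clone_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("target depth exceeds repertoire size")
    rng = random.Random(seed)          # fixed seed for reproducibility
    return Counter(rng.sample(pool, depth))

sample_a = {"CASSLGQ": 500, "CASSPDR": 300, "CASRTGE": 200}
sample_b = {"CASSLGQ": 40, "CASSQET": 60}

# Compare both repertoires at the depth of the shallower one.
common_depth = min(sum(sample_a.values()), sum(sample_b.values()))
a_ds = downsample(sample_a, common_depth)
b_ds = downsample(sample_b, common_depth)
```

Richness and overlap computed on `a_ds` and `b_ds` are then comparable, whereas the raw counts would favor the deeper sample.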

bio-splicing-quantification

Quantifies alternative splicing as PSI (percent spliced in) from RNA-seq using rMATS-turbo (BAM-based event), SUPPA2 (TPM-based event), MAJIQ V3 (LSV-based Bayesian), leafcutter (annotation-free intron clusters), VAST-TOOLS (cross-species with microexon support), Shiba (junction-imbalance-corrected, 2025 SOTA at low coverage), or IRFinder-S (intron retention coverage-aware). Distinguishes the five canonical event classes (SE, A5SS, A3SS, MXE, RI), special classes (microexons, exitrons, AFE/ALE), intron retention subtypes (canonical RI vs detained introns), and applies effective-length normalization. Use when measuring splice-site usage or isoform inclusion ratios from short-read RNA-seq.

tools1,006
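The PSI definition and effective-length normalization mentioned in the bio-splicing-quantification entry can be written out as a short sketch. It follows the general form used by event-based tools (read counts divided by the effective length of each isoform form); the function name and the example lengths are hypothetical.

```python
def psi(inc_reads, exc_reads, inc_eff_len, exc_eff_len):
    """Percent spliced in, with effective-length normalization.

    inc_reads / exc_reads: reads supporting the inclusion / exclusion form.
    inc_eff_len / exc_eff_len: effective lengths of those forms, i.e. the
    number of positions from which a supporting read could originate.
    Returns None when no read supports either form.
    """
    inc = inc_reads / inc_eff_len
    exc = exc_reads / exc_eff_len
    if inc + exc == 0:
        return None
    return inc / (inc + exc)

# The inclusion form usually offers more mappable positions (two junctions
# plus the exon body), so raw counts overstate inclusion.
raw = psi(30, 10, 1, 1)          # 0.75 without normalization
normalized = psi(30, 10, 2, 1)   # 0.60 with a 2:1 effective-length ratio
```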

bio-workflows-riboseq-pipeline

End-to-end Ribo-seq analysis from FASTQ through periodicity QC, P-site calibration, ORF detection, translation efficiency, and stalling. Use when orchestrating a full ribosome profiling pipeline and deciding harvest/dedup/alignment options and which downstream analyses the library can support.

development1,006

bio-workflows-proteomics-pipeline

Orchestrates bottom-up proteomics from a search engine's output (MaxQuant/FragPipe/DIA-NN) to differential protein abundance with limma/DEqMS/MSstats. Use when committing the search database + acquisition mode (DDA vs DIA) up front, re-controlling FDR at PSM AND peptide AND protein-group level (not just PSM), removing contaminant/reverse rows and inspecting RAW distributions before normalizing, bridging cross-plex TMT with an IRS reference channel, modeling MNAR missingness rather than downshift-imputing on/off proteins, batching as a covariate (not pre-subtracted), and testing with treat()/DEqMS. Hands mechanism to the proteomics component skills; not a re-teach of any single step.

testing1,006

bio-alignment-multiple

Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.

tools1,006

bio-workflows-somatic-variant-pipeline

Chains a somatic (tumor-normal) SNV/indel and structural-variant pipeline end to end with GATK Mutect2 (or Strelka2), wiring the somatic-specific machinery - panel-of-normals and gnomAD germline-resource priors, GetPileupSummaries/CalculateContamination, and LearnReadOrientationModel FFPE/oxoG orientation-bias filtering fed into FilterMutectCalls. Use when calling somatic mutations from a tumor-normal pair (or tumor-only with PoN caveats), deciding which artifact filter removes which class of false positive, reasoning about VAF/purity/ploidy and clonal-vs-subclonal detection, adding somatic SV/CNV or TMB/MSI/signatures, or routing variants to AMP/ASCO/CAP tier and oncogenicity interpretation (never germline ACMG).

testing1,006

bio-workflows-rnaseq-to-de

Orchestrates the end-to-end bulk RNA-seq differential-expression pipeline from FASTQ to an annotated DE gene table, chaining fastp QC/trim, Salmon (decoy-aware) or STAR+featureCounts quantification, tximport gene-level collapse, DESeq2/edgeR/limma-voom testing, apeglm shrinkage, and VST-based visualization. Use when committing the reference release and gene-ID namespace once for the whole run, sequencing steps in the defensible order (tximport before DE, raw counts into the model, VST only for viz/clustering), choosing alignment-free vs align-then-count and the DE engine, setting strandedness correctly, keeping batch in the design instead of correcting-then-testing, or handing the signed ranking statistic to downstream enrichment. Hands mechanism to the component skills; not a re-teach of any single step.

testing1,006

bio-workflows-microbiome-pipeline

End-to-end 16S/ITS amplicon workflow from demultiplexed FASTQ to a consensus differential-abundance result, orchestrating cutadapt primer removal, per-run DADA2 ASV inference (learnErrors/mergeSequenceTables/removeBimeraDenovo), region-matched taxonomy assignment, a SEPP/Greengenes2 tree, alpha/beta diversity at a declared sampling depth (phyloseq/vegan, adonis2 paired with betadisper), compositional DA as a consensus of >=2 tools (ALDEx2/ANCOM-BC2) on unrarefied counts, and optional PICRUSt2 functional prediction gated on NSTI. Covers the stage-ordering decisions (primers before truncation, per-run error model, rarefy for diversity not DA, predicted potential not activity) and defers each per-step choice to the six microbiome skills. Use when staging an amplicon study end to end or chaining ASV inference, taxonomy, diversity, and differential abundance. For shotgun reads see workflows/metagenomics-pipeline.

tools1,006

bio-workflows-multi-omics-pipeline

Orchestrates VERTICAL bulk multi-omics integration (RNA + protein + methylation on the SAME samples) from harmonization to a validated result, routing to MOFA2 (shared factors), mixOmics/DIABLO (predictive signature), or SNF (patient subtypes). Use when confirming the correspondence is vertical (not horizontal same-features-different-cohorts), joining on a stable sample primary key rather than cbind on assumed row order, normalizing each omic in its OWN space and equalizing block variance BEFORE stacking (or the widest omic hijacks every shared factor), correcting batch ONCE in one place, and validating in a HELD-OUT cohort because in-cohort CV at n<<p is optimistically biased. Hands mechanism to the multi-omics-integration component skills; not a re-teach of any single step.

testing1,006

bio-workflows-methylation-pipeline

Orchestrates the end-to-end bisulfite/EM-seq methylation pipeline from FASTQ to differentially methylated regions, chaining Trim Galore/fastp QC, Bismark alignment + deduplication, methylation calling, methylKit coverage-filtering/normalization, and selection-aware DMR detection (dmrseq/DSS). Use when gating the run on bisulfite conversion (lambda + pUC19 controls) BEFORE any beta value, committing the genome build + library directionality once, keeping mate-overlap deduplicated (--no_overlap), M-bias-trimming from the plot, filtering coverage before testing, choosing a count model (beta-binomial/DSS) over a bare-beta t-test, or using a region-selection-aware FDR (dmrseq/DSS) rather than raw methylKit tiles. Hands mechanism to the methylation-analysis component skills; not a re-teach of any single step.

development1,006

bio-alignment-validation

Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification.

testing1,006

bio-alignment-indexing

Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.

tools1,006

bio-alignment-filtering

Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.

tools1,006

bio-reference-operations

Generate consensus sequences and manage reference files using samtools. Use when creating consensus from alignments, indexing references, or creating sequence dictionaries.

tools1,006

bio-splicing-qc

Assesses RNA-seq data quality specifically for alternative splicing analysis. QC layers include experimental design audit (library prep, read length, depth, replicates), STAR 2-pass cohort-style alignment, junction saturation curves and discovery plateau detection, novel-vs-known junction ratio diagnostics, junction-overhang distribution, splice-site strength scoring (MaxEntScan intrinsic + SpliceAI context-aware), strandedness verification, GENCODE basic vs comprehensive choice, and rRNA contamination screening. Splicing analysis is more demanding than DGE on read length, depth, library prep, alignment strategy, and annotation choice — failures silently bias PSI estimates and inflate novel-junction false positives. Use when evaluating data suitability for splicing analysis, troubleshooting low event detection, or designing sequencing experiments where AS is a primary endpoint.

development1,006

bio-single-cell-splicing

Analyzes alternative splicing at single-cell resolution. The first decision is library chemistry — 10X 3' is fundamentally limited (RT primes from poly-A, R2 falls in 3' UTR, <0.1 junction read per cell per AS event). Plate-based full-length methods (Smart-seq3, FLASH-seq, VASA-seq, STORM-seq) and single-cell long-read (MAS-Iso-seq, scISOr-Seq2) are the chemistries that give per-cell isoform structure. Tools include MARVEL (R, Smart-seq integrated), BRIE2 (Bayesian PSI with regulatory features and ELBO_gain test), scQuint (junction-cluster, plate-based; not for 10X), SpliZ (annotation-free Z-score), Psix (graph-smoothness regulated AS), and Sierra (alternative polyadenylation, often confused with AS). Use when analyzing isoform usage in scRNA-seq, identifying cell-type-specific splicing, or determining whether scRNA-seq chemistry supports splicing analysis at all.

tools1,006

bio-workflows-timecourse-pipeline

End-to-end bulk time-course analysis from an expression matrix to temporal gene modules and per-cluster pathway enrichment. Orchestrates temporal DE (limma splines or DESeq2 LRT), Mfuzz/tslearn soft clustering of expression-profile shapes, GAM trajectory fitting, per-cluster GO enrichment against a temporal-gene background, and an OPTIONAL circadian rhythm-detection branch (MetaCycle/CosinorPy) that runs only when the design covers >=2 full cycles with >=6-8 evenly spaced samples per cycle. Use when analyzing a bulk time-series expression experiment from any omics platform and deciding limma-splines vs DESeq2-LRT for temporal DE, soft vs hard clustering, whether the sampling design even licenses rhythm detection, and which background to use for enrichment. Not for single-cell pseudotime (see temporal-genomics/trajectory-modeling for the bulk-vs-pseudotime boundary) or unknown-period discovery (see temporal-genomics/periodicity-detection).

development1,006

bio-outlier-splicing-detection

Detects aberrant splicing in single rare-disease patients vs a control panel using FRASER 2.0 (Bioconductor; Beta-binomial autoencoder on Intron Jaccard Index, default delta cutoff 0.1, q hyperparameter), OUTRIDER (gene-level outlier expression via autoencoder denoising), LeafcutterMD (Dirichlet-multinomial outlier mode of LeafCutter for annotation-free junctions), and DROP (Snakemake pipeline integrating FRASER2 + OUTRIDER + monoallelic expression for clinical diagnostics). The statistical model is fundamentally different from differential splicing: it is single-sample-vs-cohort outlier detection rather than two-group comparison. These tools are standard in EU rare-disease (Solve-RD) and NIH UDN programs. Use when applying RNA-seq to undiagnosed Mendelian disease, validating predicted splice variants in clinical samples, or detecting cryptic splicing in disease tissue.

tools1,006

bio-alignment-structural

Align protein structures using Foldseek 3Di, TM-align, US-align, DALI, or Foldmason for structural MSA. Predict, score, and superpose backbone coordinates when sequence identity is below the twilight zone or remote-homology detection is required. Use when sequence MSA fails (<25% identity), when the dark proteome is the target, when AlphaFoldDB / ESM Atlas search is needed, or when structural superposition is the goal.

development1,006

bio-workflows-spatial-pipeline

Orchestrates the end-to-end spatial transcriptomics pipeline from Space Ranger / vendor output to spatial domains and statistics, branching FIRST on platform class (imaging in-situ Xenium/MERFISH/CosMx vs sequencing/capture Visium/Visium HD/Slide-seq). Use when deciding segmentation-vs-deconvolution and the QC floors from the platform class, committing the coordinate/image-registration frame and panel identity, deconvolving multi-cell spots against an annotated scRNA reference (never relabeling spot clusters as cell types), building the spatial neighbor graph on PHYSICAL not expression space, gating spatially-variable genes on FDR, or using a real domain method (BANKSY/BayesSpace/STAGATE) rather than clustering the spatial graph alone. Hands off deconvolution and cell-cell communication to the component skills; not a re-teach of any single step.

development1,006

bio-alignment-amplicon-clipping

Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.

tools1,006

bio-sashimi-plots

Creates sashimi-style plots showing RNA-seq read coverage and splice junction counts using ggsashimi (general-purpose, condition-grouped overlays), rmats2sashimiplot (rMATS-output-aware), MAJIQ-VOILA (LSV posteriors interactive HTML), leafviz (leafcutter clusters Shiny), Jutils (tool-agnostic heatmaps and sashimi for rMATS/leafcutter/MntJULiP/MAJIQ output), or pyGenomeTracks (multi-track publication figures). Tool choice depends on the upstream differential-splicing tool's output format and the publication vs interactive use case. Use when visualizing specific splicing events, validating differential splicing calls, or producing publication-quality figures.

tools1,006

bio-bam-statistics

Generate alignment statistics using samtools flagstat, stats, depth, coverage, and mosdepth. Use when assessing alignment quality, calculating coverage, or generating QC reports.

tools1,006

bio-isoform-switching

Analyzes differential transcript usage (DTU) and isoform switches with functional consequence prediction (NMD via 50nt rule, ORF disruption, protein domain loss/gain, signal peptide changes, IDR alterations, coding-potential shifts). Tools include IsoformSwitchAnalyzeR v2 (auto-selects satuRn for >5 reps else DEXSeq), the manual DRIMSeq -> DEXSeq/satuRn -> stageR DTU pipeline, and fishpond/swish for inferential-uncertainty-aware DTE. Distinguishes DTU from DGE and DTE; integrates external annotators (CPC2, Pfam, SignalP, IUPred2A or DeepTMHMM). Use when investigating how splicing differences alter protein function or trigger NMD-mediated degradation.

tools1,006
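The "NMD via 50nt rule" check in the bio-isoform-switching entry can be sketched on transcript coordinates. The exact distance convention varies between tools; this version assumes the distance is measured from the end of the stop codon to the last exon-exon junction, and the function name is hypothetical.

```python
def is_nmd_candidate(exon_lengths, stop_codon_end, threshold=50):
    """Apply the 50 nt rule to one spliced transcript.

    exon_lengths: exon lengths in 5'->3' transcript order.
    stop_codon_end: transcript coordinate (0-based, exclusive) where the
        stop codon ends.
    A transcript is flagged when the stop codon ends more than `threshold`
    nucleotides upstream of the last exon-exon junction.
    """
    if len(exon_lengths) < 2:
        return False                           # no junction, rule does not apply
    last_junction = sum(exon_lengths[:-1])     # transcript position of final junction
    return (last_junction - stop_codon_end) > threshold

exons = [200, 150, 300]                        # last junction at position 350
premature = is_nmd_candidate(exons, stop_codon_end=250)   # 100 nt upstream
near_junction = is_nmd_candidate(exons, stop_codon_end=320)  # 30 nt upstream
last_exon = is_nmd_candidate(exons, stop_codon_end=500)   # stop in final exon
```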

bio-workflows-smrna-pipeline

Orchestrates the end-to-end small RNA-seq pipeline from FASTQ to differential miRNAs and expression-filtered targets, chaining kit-aware cutadapt trimming (adapter on every read, UMI/4N handling), miRge3 known+isomiR quantification or miRDeep2 novel discovery, compositionally-aware DESeq2, and miRanda target prediction. Use when committing the library-kit adapter/UMI handling once, choosing the NORMALIZER (which drives which miRNAs are called DE more than the DE model does), deciding known quantification vs novel discovery, handling biofluid/plasma libraries that lack a trustworthy endogenous normalizer, routing tRF/piRNA reads to their own profiling, or feeding RAW (not RPM) counts with size-factor inspection into DE. Hands mechanism to the small-rna-seq component skills; not a re-teach of any single step.

development1,006

alternative-splicing/long-read-splicing

Analyzes alternative splicing from PacBio Iso-Seq (HiFi, Kinnex/MAS-Iso-seq) and Oxford Nanopore (direct cDNA, direct RNA, R10.4.1+) long-read RNA-seq with full-isoform resolution. Tools include FLAIR (correct/collapse/quantify/diffSplice for PacBio + ONT), IsoQuant (de-novo or annotation-guided isoform discovery 2024 SOTA), Bambu (annotation-aware Bayesian discovery + quantification with Novel Discovery Rate), SQANTI3 (isoform classification: FSM/IS

tools1,006

bio-workflows-outbreak-pipeline

Orchestrates genomic-epidemiology outbreak investigation from pathogen isolates to transmission networks, forking bacterial (snippy -> Gubbins recombination-masking -> IQ-TREE -> TreeTime -> TransPhylo) vs viral (Nextstrain/augur), with parallel MLST typing (cgMLST delegated to epidemiological-genomics/pathogen-typing) and AMR surveillance. Use when committing ONE reference genome for SNP calling (every isolate and distance inherits its coordinates), applying MANDATORY Gubbins recombination-masking on core.full.aln before the tree for recombining bacteria (skipping it inflates the clock 2-5x), gating time-scaling on a temporal-signal test (TempEst R2 >= 0.3), using a pathogen- AND population-specific cluster threshold rather than a universal SNP cutoff, or pinning pangolin-data/Nextclade/Freyja versions for the viral route. Hands mechanism to the epidemiological-genomics component skills; not a re-teach of any single step.

development1,006

bio-workflows-scrnaseq-pipeline

Orchestrates the end-to-end single-cell RNA-seq pipeline from 10x Cell Ranger output to annotated cell types, chaining ambient-RNA removal, doublet detection, MAD-adaptive QC, normalization, integration, clustering, marker annotation, and (separately) pseudobulk DE + differential abundance. Use when honoring the made-once counting commitments (reference build/Ensembl vintage, --include-introns, cell-calling, feature namespace), ordering correct-then-detect-then-normalize (ambient before doublet before normalize), running per-sample QC before merge, integrating-then-clustering (never testing on integrated values), aggregating to pseudobulk for condition DE instead of cells-as-replicates, or pairing DE with differential abundance. Hands mechanism to the single-cell component skills; not a re-teach of any single step.

development1,006

bio-sam-bam-basics

View, convert, and understand SAM/BAM/CRAM alignment files using samtools and pysam. Use when inspecting alignments, converting between formats, or understanding alignment file structure.

tools1,006

bio-alignment-msa-parsing

Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.

development1,006

bio-duplicate-handling

Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.

tools1,006

bio-alignment-sorting

Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.

tools1,006

bio-alignment-trimming

Trim multiple sequence alignments using ClipKIT, trimAl, BMGE, Divvier, or HMMcleaner with mode selection guidance per downstream goal. Use when removing unreliable columns or contaminating residues before phylogenetic inference, HMM building, or selection analysis.

tools1,006

bio-splice-variant-prediction

Predicts whether a DNA variant alters mRNA splicing using sequence-based deep-learning tools — SpliceAI (10kb context dilated CNN, clinical default), Pangolin (multi-tissue), MMSplice (modular per-region CNN with calibrated ΔPSI), SpliceTransformer/TrASPr (tissue-aware transformers), SpliceVault (empirical 300K-RNA lookup of likely mis-splicing outcomes), CADD-Splice (composite score). Applies the ClinGen SVI 2023 framework for ACMG/AMP variant interpretation (PVS1, PP3, BP4 evidence codes), HGVS splicing nomenclature (c.123+1G>A, c.123-3T>G, r.spl?), extended-window scoring for deep-intronic pseudoexons, tissue-specific predictions, branchpoint variant detection (BPHunter, LaBranchoR), and splice-switching ASO design. Use when interpreting splice impact of clinical variants, prioritizing VUS, identifying deep-intronic pathogenic variants, or designing ASOs.

tools1,006

bio-pileup-generation

Generate pileup data for variant calling using samtools mpileup and pysam. Use when preparing data for variant calling, analyzing per-position read data, or calculating allele frequencies.

tools1,006

bio-alignment-msa-statistics

Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, or analyzing evolutionary patterns.

testing1,006

bio-workflows-splicing-pipeline

Orchestrates the end-to-end bulk short-read alternative-splicing pipeline from FASTQ to differential splicing, chaining fastp QC, cohort-consistent STAR 2-pass alignment (one shared junction DB), junction QC, event-level differential splicing (rMATS-turbo + leafcutter, optional MAJIQ V3), parallel isoform-level DTU (Salmon -> tximport dtuScaledTPM -> DRIMSeq/DEXSeq -> stageR), and sashimi visualization. Use when committing the annotation GTF and a shared 2-pass junction database for the whole cohort, keeping the analysis at splice-aware resolution (never collapsing to gene), choosing event-level vs isoform-level DTU and reconciling them, applying the stageR two-stage gene->transcript FDR, or off-ramping to splice-variant / outlier / long-read / single-cell splicing. Hands mechanism to the alternative-splicing component skills; not a re-teach of any single step.

testing1,006

bio-alignment-pairwise

Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, or identifying local or global matches between DNA, RNA, or protein sequences.

development1,006

bio-workflows-crispr-editing-pipeline

Orchestrates an end-to-end CRISPR editing experiment design from target gene to delivery-ready, validatable constructs. Sequences guide design, off-target assessment, edit-modality selection (knockout, base editing, prime editing, HDR knock-in), and template/donor design, with a QC checkpoint at each handoff. Use when designing a complete CRISPR experiment for knockout, point correction, or tagging and the order of operations, the modality decision, and the cross-cutting traps are needed rather than a single step. Defers each step's mechanics to the genome-engineering skills.

testing1,004

bio-workflows-edna-pipeline

End-to-end eDNA metabarcoding from raw amplicons to community ecology. Covers QC, primer removal (mandatory before DADA2 filterAndTrim), denoising with OBITools3 v3 (obi stats plural; DMS-based) or DADA2 ASVs (Callahan 2017), decontam combined method as screening-not-classifier (Davis 2018), tag-jumping (Schnell 2015) with a platform-dependent baseline (NovaSeq patterned flow cells ~10x MiSeq), Hill-number effective species counts with coverage-based rarefaction (Jost 2006; Chao & Jost 2012; doubling rule), beta-diversity decomposition with MANDATORY PERMANOVA + PERMDISP pair (Anderson & Walsh 2013), constrained ordination, and the read-counts-not-abundance critique (Lamb 2019). Use when processing eDNA samples for biodiversity assessment, deciding ASV vs OTU, configuring OBITools3 v3, interpreting decontam screening, or reporting community comparisons with the dispersion confound check.

tools1,004
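The Hill-number effective species counts named in the bio-workflows-edna-pipeline entry can be computed directly from a vector of counts. This sketch covers only the basic formula (no coverage-based rarefaction), and the function name is hypothetical. As the entry itself cautions, read counts are not abundances, so the inputs here are illustrative.

```python
import math

def hill_number(counts, q):
    """Effective number of species of order q.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index. Zero counts are ignored.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit of the general formula as q -> 1.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

even = [50, 50, 0]       # two equally common taxa
uneven = [90, 10]        # same richness, one dominant taxon

even_profile = [hill_number(even, q) for q in (0, 1, 2)]
uneven_profile = [hill_number(uneven, q) for q in (0, 1, 2)]
```

For the even community all three orders return 2, while for the uneven one the value falls as q rises, because higher orders weight dominant taxa more heavily.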

bio-workflows-fastq-to-variants

Orchestrates the end-to-end germline short-variant pipeline from FASTQ to a filtered, normalized, benchmarked VCF, chaining QC/trim, BWA-MEM2 alignment, duplicate marking, optional BQSR, calling (bcftools/GATK HaplotypeCaller/DeepVariant/DRAGEN), normalization, site+genotype filtering, annotation, and hap.py/vcfeval benchmarking. Use when deciding the pipeline-wide reference-genome commitment (GRCh38 analysis set vs T2T, ALT/decoy handling), sequencing the steps in the defensible order (normalize BEFORE annotate, filter site- then genotype-level), choosing the calling engine and single-sample vs cohort joint-calling, picking a filtering strategy by cohort size, or benchmarking stratified within GIAB confident regions. Hands off mechanism to the variant-calling and read-alignment component skills; not a re-teach of any single step.

tools1,004

bio-workflows-cytometry-pipeline

End-to-end flow, spectral, and mass cytometry (CyTOF) pipeline from raw FCS files to differentially abundant/expressed cell populations. Orchestrates the read -> compensate/unmix -> transform -> QC -> doublet-removal -> cluster-or-gate -> annotate -> diffcyt DA/DS chain with flowCore/CATALYST/diffcyt, branching on instrument type and on clustering-vs-gating. Use when processing a cytometry experiment end-to-end, deciding the pipeline path for an instrument, or wiring the flow-cytometry component skills into one analysis with valid sample-level statistics.

testing1,004

workflows/cnv-pipeline

Orchestrates the copy-number pipeline from BAM to segmented, integer-called, annotated CNVs, forking on germline-vs-somatic - CNVkit (somatic exome/panel: coverage -> assay-matched reference/PoN -> fix -> segment -> purity/ploidy-aware call), GATK gCNV (germline rare-CNV cohort), and allele-specific callers (ASCAT/FACETS/PURPLE) for purity/ploidy. Use when committing the build + target/access BED + PoN once (assay-matched), building the reference

development1,004

bio-workflows-metabolomics-pipeline

Orchestrates the untargeted LC-MS metabolomics pipeline end-to-end (xcms 4.x feature extraction, QC/drift/normalization, confidence-stratified annotation, permutation-validated statistics, background-aware pathway mapping), naming what each stage decides and where it silently fails. Use when running a full LC-MS metabolomics study from raw mzML to enriched pathways and needing the honest handoffs between stages. Each stage defers to its component skill for parameters and traps; for stable-isotope flux (a separate branch, not this untargeted flow) see metabolomics/isotope-tracing.

testing1,004

bio-workflows-hic-pipeline

End-to-end Hi-C analysis workflow from FASTQ to compartments, TADs, and loops, with the decision of WHICH features the sequencing depth can support. Covers pairtools read-pair processing and library QC, cooler matrices, ICE balancing and distance-decay expected, A/B compartments, TAD boundaries, loop calling, and the routing of HiChIP/PLAC-seq/Capture Hi-C to protein-directed loop callers. Use when processing Hi-C data end to end, deciding a resolution for a given depth, or choosing between bulk-Hi-C and protein-directed loop calling.

tools1,004

bio-workflows-genome-annotation-pipeline

Orchestrates genome annotation from assembled contigs to functional annotation, forking prokaryotic (Bakta one-step, genetic-code table from GTDB-Tk) vs eukaryotic (RepeatMask -> BRAKER3 -> functional -> ncRNA), then eggNOG/InterProScan functional assignment and Infernal/tRNAscan ncRNA. Use when committing the pro-vs-eukaryotic path and the genetic-code table from taxonomy (never guessing), annotating ONLY a decontaminated QC-passed assembly (CheckM2 before prokaryotic annotation is non-negotiable), committing the evidence set (RNA-seq + protein drives BRAKER3 training), soft-masking with a curated repeat library before gene prediction, or pinning the tool + DB version for any pangenome comparison. Hands mechanism to the genome-annotation component skills; not a re-teach of any single step.

tools1,004

bio-gatk-variant-calling

Call germline SNPs and indels with GATK HaplotypeCaller and the GVCF joint-genotyping workflow. Covers the local-reassembly + PairHMM mechanism (why HC beats pileup callers on indels), the -ERC GVCF reference-confidence model and <NON_REF> allele, BQSR-vs-DRAGSTR and --dragen-mode error modeling, allele-specific (AS_) annotations, and edge cases (ploidy, Mutect2 mitochondria mode, sex chromosomes/PAR, contamination gating). Use when deciding whether to use HaplotypeCaller vs a pileup or DRAGEN caller, whether BQSR still earns its place, whether to call per-sample GVCFs for a cohort, or how to handle non-diploid, mitochondrial, sex-chromosome, or contaminated samples. Not for post-calling filtering depth (see variant-calling/filtering-best-practices) or cohort joint-genotyping scaling (see variant-calling/joint-calling).

testing1,004

bio-phylo-tree-manipulation

Edit phylogenetic tree structure with Biopython Bio.Phylo, and treat rooting as a separate statistical inference rather than a display choice. Covers why most inference returns an unrooted tree so placing the root creates every ancestor/descendant and basal claim; why a distant or lonely outgroup misroots inside the ingroup via long-branch attraction; the outgroup/midpoint/MAD/MinVar/non-reversible-likelihood rooting tradeoffs; why pruning must suppress degree-2 nodes and sum their branch lengths or all patristic distances silently corrupt; and why collapsing by support makes SOFT (uncertainty) polytomies, not HARD (radiation) ones. Use when rooting, re-rooting, pruning or subsetting taxa, extracting a clade or induced subtree, collapsing low-support branches, resolving polytomies, or ladderizing. Routes clock-based rooting to divergence-dating, inference to modern-tree-inference, and reading/plotting to tree-io and tree-visualization.

development1,004
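
The pruning rule in the bio-phylo-tree-manipulation entry above (suppress degree-2 nodes and sum their branch lengths) can be illustrated with a small stdlib-only sketch. The `Node` class and the `prune` and `patristic` helpers are hypothetical names invented for this illustration; they are not Bio.Phylo API, and the toy tree is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    # Hypothetical minimal tree node; `length` is the branch leading to it.
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, keep: Set[str]) -> Optional[Node]:
    """Copy the subtree restricted to tips in `keep`, suppressing unary nodes."""
    if not node.children:
        return Node(node.name, node.length) if node.name in keep else None
    kept = [c for c in (prune(ch, keep) for ch in node.children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # Degree-2 node: remove it and add its branch length to the survivor,
        # otherwise every distance through this node would shrink.
        only = kept[0]
        only.length += node.length
        return only
    return Node(node.name, node.length, kept)


def _path(node: Node, tip: str, depth: float = 0.0):
    """List of (node id, cumulative depth) from the root down to `tip`."""
    depth += node.length
    if not node.children:
        return [(id(node), depth)] if node.name == tip else None
    for child in node.children:
        sub = _path(child, tip, depth)
        if sub is not None:
            return [(id(node), depth)] + sub
    return None


def patristic(root: Node, a: str, b: str) -> float:
    """Sum of branch lengths on the path between tips `a` and `b`."""
    pa, pb = _path(root, a), _path(root, b)
    lca_depth = 0.0
    for (ia, da), (ib, _) in zip(pa, pb):
        if ia != ib:
            break
        lca_depth = da
    return pa[-1][1] + pb[-1][1] - 2 * lca_depth


# Toy tree ((A:1,B:2):3,(C:4,D:5):6)
tree = Node(children=[
    Node(length=3.0, children=[Node("A", 1.0), Node("B", 2.0)]),
    Node(length=6.0, children=[Node("C", 4.0), Node("D", 5.0)]),
])
pruned = prune(tree, {"A", "C", "D"})
```

Dropping tip B leaves A's former parent with a single child; the sketch folds that parent's branch (3) into A's branch (1), so the A-to-C distance is the same before and after pruning.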

bio-phasing-imputation-haplotype-phasing

Estimates haplotype phase from population linkage disequilibrium with SHAPEIT5, SHAPEIT4, Eagle2, or Beagle - turning unphased genotypes (0/1) into phased haplotypes (0|1) for imputation input, compound-heterozygote calls, HLA typing, or population genetics. Covers why statistical phase is an INFERENCE (not a measurement) whose error concentrates at rare variants, why a genome-wide switch-error rate hides catastrophic rare-variant error and must be reported MAC-stratified, the SHAPEIT5 common-scaffold-then-rare design (phase_common, ligate, phase_rare, switch), reference-based vs within-cohort phasing, the build-matched genetic map, chrX male-haploid handling, and the switch-vs-flip-vs-Hamming distinction. Use when phasing genotypes before imputation, for compound-het/ASE/HLA, or benchmarking against trios. Read-backed / molecular phasing (long reads, Hi-C) is long-read-sequencing/haplotype-phasing; panel choice is reference-panels; imputation is genotype-imputation.

development1,004
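
The switch-vs-Hamming distinction named in the haplotype-phasing entry above can be made concrete with a short sketch. It assumes both haplotypes are given as strings of 0/1 alleles at the same heterozygous sites; `phase_errors` is a hypothetical helper, not a function from SHAPEIT5 or any other tool listed.

```python
def phase_errors(truth: str, inferred: str) -> dict:
    """Compare one inferred haplotype to the truth at heterozygous sites.

    A switch is counted wherever the agreement state (match vs mismatch)
    changes between consecutive sites. Hamming error is the fraction of
    sites on the wrong haplotype, taking the better of the two labelings.
    """
    if len(truth) != len(inferred) or len(truth) < 2:
        raise ValueError("need two equal-length haplotypes with >= 2 sites")
    mismatch = [t != i for t, i in zip(truth, inferred)]
    switches = sum(a != b for a, b in zip(mismatch, mismatch[1:]))
    n = len(truth)
    wrong = sum(mismatch)
    return {
        "switches": switches,
        "switch_error_rate": switches / (n - 1),
        "hamming": min(wrong, n - wrong) / n,
    }


truth = "00000000"
one_long_switch = phase_errors(truth, "00001111")    # one switch mid-block
single_site_flip = phase_errors(truth, "00010000")   # two adjacent switches
```

In this toy case a single switch puts half the sites on the wrong haplotype (high Hamming, low switch count), while a single-site flip costs two switches but barely moves Hamming, which is why the two metrics are reported separately.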

bio-phylo-tree-visualization

Draw and export phylogenetic trees with Bio.Phylo plus matplotlib, and route rich figures to ggtree, ETE4, or iTOL. Covers why a tree figure is an argument not a neutral picture, the cladogram-vs-phylogram-vs-chronogram choice that hides or reveals rate and time, how ladderization manufactures a false arrow of progress, why an unlabeled support number always flatters the result (bootstrap vs posterior vs SH-aLRT vs UFBoot are different scales), why Bio.Phylo silently drops BEAST HPD bars so annotated Bayesian trees must go through treeio plus ggtree, and the tip-count and raster-vs-vector thresholds for legible publication figures. Use when drawing a tree, choosing a layout, coloring branches, showing or labeling support, exporting vector figures, or deciding a drawing tool. Routes annotation-preserving reads to tree-io, and rooting and ladderizing to tree-manipulation.

tools1,004

bio-clip-seq-clip-qc

Comprehensive quality control for CLIP-seq libraries (eCLIP, iCLIP, iCLIP2, PAR-CLIP) covering library complexity (preseq), FRiP, IDR replicate reproducibility, read-distribution metagene, SMInput vs IgG control rationale, rRNA / snoRNA contamination, fragment-length distribution, and ENCODE-compliance thresholds. Use when assessing whether a CLIP library passed, deciding lenient vs stringent peak thresholds, comparing replicates with IDR rescue and self-consistency ratios, or distinguishing failed IP from over-amplified library.

tools1,004

bio-workflows-expression-to-pathways

Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when the input is a full ranking for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.

development1,004

bio-workflows-imc-pipeline

Orchestrates imaging mass cytometry from raw MCD acquisitions to patient-level spatial analysis, chaining steinbock preprocessing, Mesmer/Cellpose segmentation, single-cell quantification, phenotyping, and squidpy spatial statistics. Use when committing the panel + segmentation frame + pixel size (every per-cell number is a mask-bounded pixel average), compensating channel spillover on PIXELS before segmentation but running REDSEA lateral-spillover on the per-cell table AFTER segmentation, using arcsinh cofactor 1 (not the suspension-CyTOF 5), and aggregating to the PATIENT before any cross-condition test (cells and ROIs from one patient are not independent replicates). Hands mechanism to the imaging-mass-cytometry component skills; not a re-teach of any single step.

testing1,004

bio-imaging-mass-cytometry-spatial-analysis

Analyze spatial cell-cell interactions, neighborhoods, and niches in IMC/MIBI data with squidpy and imcRtools, covering neighborhood-enrichment permutation nulls, the abundance-vs-density confound, inhomogeneous Ripley's K, cellular-neighborhood discovery, graph-construction (contact vs proximity), and edge effects. Use when testing whether cell types co-locate, choosing a spatial null, building a neighbor graph, discovering tissue niches, or deciding whether a spatial pattern is real or a density/segmentation artifact.

tools1,004

bio-workflows-merip-pipeline

Orchestrates an end-to-end MeRIP-seq / m6A-seq analysis from raw FASTQ to differential m6A peak calls and metagene plots, chaining fastp adapter trimming, STAR splice-aware alignment (NO deduplication for non-UMI MeRIP), deepTools replicate-concordance + IP-enrichment QC, PreSeq saturation curves, exomePeak2 (transcript-aware, GC-bias-aware negative-binomial GLM) peak calling, optional MACS3 broad-peak cross-check, DRACH motif confirmation as a sanity check (NOT a per-peak filter), exomePeak2 differential calling via the four-BAM-vector interface (bam_ip + bam_input control; bam_treated_ip + bam_treated_input treatment), ChIPseeker annotation, and the canonical Guitar metagene with stop-codon enrichment as the biological QC anchor. Use when running a complete MeRIP analysis from raw reads, when chaining the constituent epitranscriptomics skills (merip-preprocessing -> m6a-peak-calling -> m6a-differential -> modification-visualization), or when wrapping the pipeline in Snakemake / Nextflow.

tools1,004

bio-workflows-metabolic-modeling-pipeline

Orchestrates genome-scale metabolic modeling from a protein FASTA to flux predictions, chaining CarveMe/gapseq reconstruction, memote QC, gap-filling, media-constrained FBA/FVA, gene essentiality, and context-specific models. Use when committing the reconstruction tool (which locks the identifier NAMESPACE forever - BiGG vs ModelSEED vs KEGG, no automatic translation), setting the medium BEFORE FBA (the exchange bounds ARE the medium; essentiality and gap-fill are computed relative to it), curating iteratively (stoichiometric-consistency first, then mass/charge, then directionality, then GPR) with energy-generating-cycle removal, and reading a MEMOTE score as well-formedness NOT correctness. Hands mechanism to the systems-biology component skills; not a re-teach of any single step.

tools1,004

bio-workflows-metagenomics-pipeline

End-to-end shotgun metagenomics workflow from FASTQ to taxonomic and functional profiles, orchestrating controls/host depletion, Kraken2+Bracken classification, MetaPhlAn marker profiling, and HUMAnN functional profiling. Covers the controls-first ordering, why Kraken2 read counts are not abundances and MetaPhlAn cell fractions do not equal Bracken read fractions, and the consistent-pipeline framing. Use when profiling shotgun metagenomic samples end to end, or chaining classification, abundance, and function. For resistome see metagenomics/amr-detection; for strains see metagenomics/strain-tracking; for assembly see genome-assembly/metagenome-assembly.

devops1,004

population-genetics/association-testing

Single-variant common-variant GWAS with plink2 --glm (linear/logistic, Firth) and the linear mixed models GEMMA, BOLT-LMM, SAIGE, regenie (SPA). A GWAS statistic is valid only when genotype is independent of unmodeled phenotype drivers after the chosen covariates and random effects, so the engine follows sample structure and case:control imbalance, not taste: PC covariates absorb continuous ancestry but cannot remove relatedness

testing1,004

methylation-analysis/methylkit-analysis

Imports Bismark coverage or cytosine-report files into the methylKit object model, then runs the import-to-results spine - filterByCoverage, normalizeCoverage, unite/destrand, calculateDiffMeth, getMethylDiff - for both per-CpG (DMC) and fixed-tile (DMR) differential methylation, plus tileMethylCounts, PCA/correlation/clustering QC, and assocComp/removeComp batch handling. Covers the silent default traps that shape the false-positive rate: overdis

data-ai1,004

bio-workflows-longread-sv-pipeline

Orchestrates an end-to-end long-read structural-variant pipeline - basecalling to minimap2 alignment (platform-matched preset) to Sniffles2/cuteSV/pbsv calling to optional assembly-based calling (dipcall/PAV) to two-step .snf cohort merging to Truvari benchmarking - chaining ONT and PacBio HiFi runs while handing the SV signal mechanism off to the component skills. Use when running a long-read SV workflow from reads to a benchmarked callset, choosing the minimap2 preset and SV caller by platform and goal, deciding when long reads are worth it for the insertions and repeat-mediated SVs short reads physically miss, building a joint-genotyped cohort with the two-step .snf design, or parameterizing a Truvari benchmark against GIAB HG002 Tier 1 plus CMRG. Not for the SV signal mechanism itself (see variant-calling/structural-variant-calling) or short-read SV.

development1,004

bio-data-visualization-matplotlib-fundamentals

Build publication-quality figures with matplotlib using the object-oriented Figure/Axes API, constrained_layout, rcParams customization, TrueType (Type-42) font embedding for journal submission, and CVD-safe palettes. Covers seaborn integration, common chart types, axis formatting, and the small gotchas that distinguish reproducible matplotlib from notebook scratch. Use when producing publication figures in Python, such as RNA-seq scatter plots, single-cell embeddings, or generic biological plotting.

development1,004

bio-genome-annotation-repeat-annotation

Discovers, classifies, and masks repetitive elements and transposable elements with RepeatModeler2 (de novo family library), RepeatMasker (masking against a library), EDTA (plant/structural TEs), or EarlGrey (auto-curating wrapper), and quantifies TE expression from RNA-seq with TEtranscripts/SQuIRE. Covers de-novo-library-as-curation-project, soft-vs-hard masking, the domesticated-gene over-masking massacre, Dfam-vs-RepBase, TE classification (Class I/II, family-vs-copy), Kimura repeat landscapes, LAI, and the RNA-seq multimapping problem. Use when masking repeats before gene prediction, building a TE library for a non-model genome, or analyzing transposable-element content or expression.

development1,004

bio-methylation-bismark-alignment

Aligns bisulfite-converted (WGBS, RRBS, PBAT) and enzymatic (EM-seq) short reads to an in-silico C->T/G->A-converted reference with Bismark (Bowtie2 or HISAT2), preparing the genome index, choosing the directional vs non-directional vs PBAT strand flag, deduplicating WGBS/EM-seq (never RRBS), and bounding bisulfite conversion efficiency with unmethylated lambda and methylated pUC19 spike-ins. Covers why the library protocol (not the aligner) decides whether calls are meaningful, why incomplete conversion masquerades as methylation, the 3-letter reduced-complexity mapping bias (50-70% efficiency is normal), and M-bias end-clipping. Use when aligning bisulfite or EM-seq reads, preparing a bisulfite genome, choosing the strand flag, or diagnosing low mapping efficiency. For methylation extraction see methylation-calling; for long-read MM/ML modification calling see long-read-sequencing/nanopore-methylation.

tools1,004

bio-long-read-sequencing-nanopore-methylation

Calls DNA base modifications (5mC, 5hmC, 6mA, 4mC) directly from Oxford Nanopore and PacBio HiFi long reads encoded as MM/ML SAM tags, piles them into per-site bedMethyl with modkit (or pb-CpG-tools for PacBio), and produces phased allele-specific methylation. Covers why methylation is a basecalling decision that cannot be recovered later, the MM/ML tag-drop failure that silently zeroes methylation through alignment, the MM ? vs . no-call semantics, 5mC/5hmC resolution vs bisulfite, modkit's 10th-percentile auto-threshold, and the haplotagged ASM workflow. Use when calling 5mC/5hmC/6mA from a modBAM, generating bedMethyl, preserving methylation tags through alignment, doing allele-specific or differential methylation, or QC-ing a modification BAM.

tools1,004

bio-phylo-bayesian-inference

Frames Bayesian phylogenetics as approximating a posterior distribution over trees conditioned on data AND priors via an MCMC that must be proven to have converged, using MrBayes, BEAST2, RevBayes, and PhyloBayes-MPI. Covers why convergence (ESS, PSRF, ASDSF, topology vs scalar) is the load-bearing claim, why posterior probabilities are systematically higher than bootstrap and overconfident under model misspecification, why the default branch-length prior inflates tree length, why the harmonic-mean estimator must never select models (use stepping-stone), and when site-heterogeneous CAT-GTR is required at depth. Use when needing posterior clade support, model averaging, marginal-likelihood model comparison, or CAT models for deep phylogeny. Routes topology-only ML to modern-tree-inference, divergence times to divergence-dating, and tree summarization to tree-io.

testing1,004

bio-phylo-distance-calculations

Build model-corrected evolutionary distance matrices and distance trees (NJ, BIONJ, FastME, UPGMA) with Biopython Bio.Phylo plus R ape/phangorn/FastME. Covers why a distance is a model-corrected estimate of substitutions per site, needed because the raw proportion of differing sites undercounts substitutions under multiple/back/parallel hits (saturation); why the matrix discards the per-site information ML keeps; the LogDet/paralinear fix for compositional heterogeneity; the UPGMA molecular-clock trap; and the Bio.Phylo landmine that DistanceCalculator offers only identity/matrix distances, not JC/K80/TN93. Use when computing a distance matrix, building a fast NJ/FastME tree, seeding an ML search, barcoding, or testing substitution saturation before a deep tree. Routes ML and starting-tree work to modern-tree-inference, alignment quality to alignment/alignment-io, and tree I/O to tree-io.

development1,004
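
As a worked example of the model correction described in the distance-calculations entry above, the Jukes-Cantor (JC69) distance converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function names below are hypothetical and the sketch assumes gap-free, equal-length, aligned DNA sequences; it is not a substitute for the R or FastME implementations the entry points to.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Raw proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed proportion p.

    Returns infinity at or beyond p = 0.75, where the correction is
    undefined and the pair is saturated under this model.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


raw = p_distance("ACGTACGTAC", "ACGTTCGAAC")   # 2 differences in 10 sites
corrected = jc69_distance(raw)
```

The corrected value is always at least as large as the raw proportion, and the gap widens as p approaches 0.75, which is the saturation behaviour the entry warns about.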

bio-flow-cytometry-cytometry-qc

Quality control for flow, spectral, and mass cytometry - time-based anomaly cleaning (flowAI, flowCut, PeacoQC, flowClean), margin/boundary event removal, signal-drift detection, dead-cell exclusion, CyTOF Gaussian/DNA/event-length checks, instrument calibration/standardization (MESF, CS&T, peak-2), and batch-level outlier flagging. Use when assessing acquisition quality, choosing a cleaning tool, ordering QC relative to compensation, deciding margin removal before density-based steps, or flagging problematic samples before clustering or differential analysis.

tools1,004

bio-data-visualization-dimensionality-reduction-plots

Produce and interpret PCA, t-SNE, UMAP, and PHATE plots for high-dimensional omics data with rigor about which method preserves what (variance, local structure, manifold, transitions), hyperparameter sensitivity, and the well-documented limits of 2D embeddings. Covers PCA biplot/scree/loadings, t-SNE PCA initialization (Kobak-Berens 2019), UMAP n_neighbors/min_dist trade-offs, and the Chari-Pachter 2023 critique. Use when visualizing high-dimensional data, such as bulk PCA, single-cell embeddings, or multi-omics integration projections.

development1,004

bio-temporal-genomics-differential-rhythmicity

Compares how a rhythm CHANGES between conditions, genotypes, treatments, tissues, or ages (differential rhythmicity), classifying each feature as gain-of-rhythm, loss-of-rhythm, phase change, amplitude change, unchanged-rhythmic, or arrhythmic-in-both, and distinguishing differential EXPRESSION (condition main effect) from differential RHYTHMICITY (condition x time interaction). Uses model-based approaches that borrow strength across conditions - LimoRhyde (sin/cos interaction terms in a limma/edgeR/DESeq2 design), dryR (BIC model selection across >=2 conditions), compareRhythms (direct gain/loss/change/same classification), DODR, CircaCompare - instead of the detect-then-Venn anti-pattern that overestimates reprogramming. Use when testing whether rhythms differ between conditions/genotypes/tissues/ages, classifying gain/loss/phase/amplitude change, or separating differential expression from differential rhythmicity. Not for detecting rhythms in one condition (see temporal-genomics/circadian-rhythms).

testing1,004

bio-tcr-bcr-analysis-immcantation-analysis

Reconstructs B-cell clonal families, quantifies somatic hypermutation and selection, and builds antibody lineage trees with the Immcantation R suite (alakazam, shazam, scoper, dowser, tigger) on AIRR-format BCR data. Use when deriving the clonal-clustering threshold from the distToNearest bimodal valley (never a hardcoded 0.15); choosing hierarchicalClones vs spectralClones (vj vs novj) for SHM-diverged repertoires; personalizing the germline with TIGGER before mutation counting; reconstructing D-masked germlines with createGermlines; measuring R/S mutation frequency by CDR and FWR region; testing antigen-driven selection with BASELINe; comparing Hill-number diversity at equal sampling depth; and inferring IgPhyML lineage trees for affinity maturation, class-switch, and ancestral-antibody analysis.

development1,004

bio-proteomics-spectral-libraries

Builds and manages DIA spectral libraries as peptide query parameters (precursor m/z, a few fragment m/z plus relative intensities, normalized RT, optional CCS), covering experimental DDA, chromatogram, and in-silico predicted libraries via Koina-served Prosit, AlphaPeptDeep, MS2PIP, and DeepLC, with iRT/CiRT RT calibration, NCE tuning, format conversion (DIA-NN tsv/speclib/parquet, OpenSWATH pqp/TraML, Spectronaut, blib/dlib/elib), and library QC/merge. Use when generating, calibrating, converting, or merging a spectral library to drive a DIA search. Running the actual DIA search is dia-analysis; building from DDA identifications depends on peptide-identification; modified-peptide libraries route to ptm-analysis; quantifying the result is quantification.

tools1,004

bio-workflows-gwas-pipeline

Orchestrates the GWAS pipeline from genotypes to association results, chaining PLINK2 QC (variant-then-sample missingness, controls-only HWE, KING relatedness), panel harmonization + joint phasing/imputation to dosages, long-range-LD-excluded PCA, and an engine chosen by sample structure (PLINK2-GLM / regenie / SAIGE / BOLT-LMM), with LDSC-intercept diagnostics. Use when committing the genome build + ancestry-matched imputation panel once (ancestry match > panel size), running the strand/allele harmonization gate (drop intermediate-frequency palindromes), imputing cases+controls TOGETHER on dosages, excluding long-range-LD regions before PCA, choosing an LMM when relatedness/structure is present (PCs cannot remove a covariance), or separating polygenicity from confounding via the LDSC intercept. Hands mechanism to the population-genetics and phasing-imputation component skills; not a re-teach of any single step.

development1,004

bio-workflows-genome-assembly-pipeline

Orchestrates an end-to-end de novo genome assembly project, routing each step to the right genome-assembly skill rather than restating it. Profiles the genome first (k-mer spectrum -> size, heterozygosity, ploidy), QCs reads, chooses an assembly path by data type (SPAdes for Illumina, Flye for noisy long reads, hifiasm for HiFi, metaFlye for communities), polishes only when needed, decontaminates, scaffolds with Hi-C, and finishes with three-axis QC (contiguity + completeness + correctness). Use when assembling a genome from raw reads and deciding which assembler, whether to polish, and how to prove the result is good.

development1,004

bio-genome-intervals-overlap-significance

Tests whether two genomic interval sets overlap (colocalize) more than expected by chance using a permutation test against a structured-genome null model. Covers bedtools fisher (analytic 2x2 screen), bedtools shuffle + jaccard permutation, GAT (isochore/GC-conditioned simulation with FDR), regioneR (flexible permutation, randomizeRegions vs circularRandomizeRegions, localZScore), LOLA (universe-relative Fisher against a region database), and GREAT/rGREAT (regulatory-domain binomial + hypergeometric for ontology-from-regions). Stresses the universe/background choice, matched background, blacklist exclusion, and multiple-testing control. Use when asking whether peaks/regions are enriched at enhancers/TFBS/features, scoring region-set colocalization or region-set enrichment, comparing CNV/SV concordance, or turning an overlap count into a defensible p-value.

tools1,004
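
The shuffle-plus-Jaccard permutation idea in the overlap-significance entry above can be sketched on a toy single-chromosome genome. This is a deliberately naive null (uniform random placement, no GC matching, no blacklist, no chromosome structure), which is exactly the kind of unstructured background the entry cautions against; all function names are hypothetical and none of this reproduces bedtools, GAT, or regioneR behaviour.

```python
import random


def covered(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def jaccard(a, b):
    """Base-pair Jaccard index: |intersection| / |union|."""
    ca, cb = covered(a), covered(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def shuffle_intervals(intervals, genome_len, rng):
    """Re-place each interval uniformly at random, preserving its length."""
    out = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_len - length + 1)
        out.append((new_start, new_start + length))
    return out


def permutation_pvalue(query, reference, genome_len, n_perm=199, seed=0):
    """Observed Jaccard and empirical p-value, (1 + hits) / (1 + n_perm)."""
    rng = random.Random(seed)
    observed = jaccard(query, reference)
    hits = sum(
        jaccard(shuffle_intervals(query, genome_len, rng), reference) >= observed
        for _ in range(n_perm)
    )
    return observed, (1 + hits) / (1 + n_perm)


query = [(100, 200), (500, 600)]
reference = [(120, 220), (480, 580)]
observed, pvalue = permutation_pvalue(query, reference, genome_len=10_000)
```

The add-one form of the empirical p-value keeps it strictly above zero, so the smallest reportable value is 1 / (1 + n_perm); more permutations are needed before multiple-testing correction becomes meaningful.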

bio-gene-regulatory-networks-grn-inference

Infer gene regulatory networks from bulk or general expression data with mutual-information (ARACNe) and tree-ensemble (GENIE3, GRNBoost2) methods, and infer transcription-factor protein activity from regulons with VIPER and msVIPER. Covers the activity-not-edges paradigm, the undirected-association caveat, the DREAM5 wisdom-of-crowds and method-complementarity result, AUPRC-over-AUROC evaluation, and gold-standard incompleteness. Use when inferring a regulatory network from a bulk expression matrix, finding master regulators, or scoring TF activity from a signature. For single-cell motif-pruned regulons see scenic-regulons; for co-expression modules see coexpression-networks.

development1,004

bio-flow-cytometry-compensation-transformation

Corrects fluorophore spillover (conventional compensation) or spectral overlap (spectral unmixing) and applies variance-stabilizing transforms (logicle/biexponential, arcsinh, log) for flow and mass cytometry. Covers spillover-matrix estimation from single-stain controls, AutoSpill, the spillover spreading matrix and why panel design (not compensation) bounds resolution, compensate-then-transform ordering, and arcsinh cofactor choice (5 for CyTOF, ~150 for fluorescence, per-channel via flowVS). Use when correcting spectral overlap, preparing data for gating/clustering, choosing logicle vs arcsinh, deciding a cofactor, or distinguishing compensation from spectral unmixing.

development1,004
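
The arcsinh cofactor choice mentioned in the compensation-transformation entry above comes down to one formula, y = asinh(x / cofactor). The sketch below is a minimal stdlib illustration with a hypothetical function name and made-up intensities; the cofactor values 5 and 150 are the ones quoted in the entry.

```python
import math


def arcsinh_transform(values, cofactor):
    """Apply y = asinh(x / cofactor) to each intensity.

    The cofactor sets the width of the near-linear region around zero;
    well above it the transform behaves like a logarithm.
    """
    if cofactor <= 0:
        raise ValueError("cofactor must be positive")
    return [math.asinh(v / cofactor) for v in values]


intensities = [-10.0, 0.0, 5.0, 150.0, 10_000.0]   # made-up example values
cytof_scale = arcsinh_transform(intensities, cofactor=5)
fluor_scale = arcsinh_transform(intensities, cofactor=150)
```

With cofactor 5 a raw value of 150 is already deep in the log-like region, whereas with cofactor 150 it sits at the edge of the linear region, so the same data can look compressed or spread near zero depending on this single parameter.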

bio-gene-regulatory-networks-coexpression-networks

Build weighted gene co-expression networks to identify modules of co-regulated genes, relate them to phenotypes, and find hub genes using WGCNA, hdWGCNA, MEGENA, CEMiTool, and Gaussian graphical models. Covers signed-network choice, soft-threshold selection, module preservation, and the marginal-vs-partial-correlation distinction. Use when finding co-expression modules, identifying hub genes, relating gene networks to clinical or experimental traits, or building single-cell co-expression networks. For directed TF-target inference see scenic-regulons and grn-inference; for condition rewiring see differential-networks.

tools1,004

bio-flow-cytometry-bead-normalization

Bead-based signal normalization and cross-batch harmonization for CyTOF and high-parameter cytometry - EQ four-element bead normalization of instrument sensitivity drift (CATALYST normCytof, premessa), and reference-anchor cross-batch normalization (CytoNorm, per-cluster quantile splines). Covers the distinction between within-run drift correction and between-batch correction, the mandatory anchor/reference sample, why normalization is per-cluster with many quantiles, and the over-correction risk. Use when correcting CyTOF signal drift, harmonizing multi-batch or multi-site studies, or deciding whether to normalize data versus model batch in the design.

testing1,004

bio-workflows-grn-pipeline

Orchestrates gene regulatory network inference from processed single-cell data to regulons and in-silico perturbation, via pySCENIC (RNA-only GRNBoost2 -> cisTarget -> AUCell), SCENIC+ (multiome cisTopic -> pycistarget -> eGRN), and CellOracle perturbation. Use when recognizing that an inferred GRN is UNDIRECTED by default and reporting only the evidence tier delivered (co-expression vs motif-pruned vs enhancer-resolved vs perturbation), matching species/assembly/namespace across the TF-list + cisTarget DB + motif2TF annotation, feeding RAW counts of the cleaned/doublet-free/batch-controlled cells (never imputed/batch-corrected values), running the cisTarget pruning that buys directionality (modules are not regulons without it), or choosing the RNA-only vs multiome path. Hands mechanism to the gene-regulatory-networks component skills; not a re-teach of any single step.

testing1,004

bio-workflows-liquid-biopsy-pipeline

Orchestrates the cell-free DNA / liquid-biopsy pipeline from plasma sequencing to tumor monitoring, forking tumor-naive (screening) vs tumor-informed (MRD), and chaining pre-analytic QC, UMI/duplex error-suppression (fgbio), fragment QC, ichorCNA tumor fraction (sWGS) or VarDict low-VAF calling (panel), CHIP subtraction against matched WBC, optional fragmentomics/methylation, and longitudinal tracking. Use when treating pre-analytics as the irreversible sensitivity ceiling (tube/time-to-plasma/hemolysis), running error-suppression BEFORE calling (single-strand consensus does not remove deamination; only duplex does), reporting a VAF only with input genome-equivalents (TF ~ 2x VAF only for clonal-het-diploid), subtracting CHIP before reporting somatic, or keeping tube/panel/pipeline identical across a longitudinal MRD series. Hands mechanism to the liquid-biopsy component skills; not a re-teach of any single step.

testing1,004
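
The caveat in the liquid-biopsy entry above that TF ~ 2x VAF holds only for clonal heterozygous diploid variants follows from a simple mixture formula. The sketch below assumes a clonal variant (present in every tumor cell) and a uniform copy number at the locus; `expected_vaf` and its parameter names are hypothetical.

```python
def expected_vaf(tumor_fraction, mutant_copies=1, tumor_cn=2, normal_cn=2):
    """Expected variant allele fraction for a clonal somatic variant.

    VAF = TF * m / (TF * CN_tumor + (1 - TF) * CN_normal), where m is the
    number of mutated copies per tumor cell.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be in [0, 1]")
    mutant = tumor_fraction * mutant_copies
    total = tumor_fraction * tumor_cn + (1.0 - tumor_fraction) * normal_cn
    return mutant / total


het_diploid = expected_vaf(0.10)                                 # TF / 2
loh_one_copy = expected_vaf(0.10, mutant_copies=1, tumor_cn=1)   # after a deletion
amplified = expected_vaf(0.10, mutant_copies=3, tumor_cn=4)      # gained mutant allele
```

Only the first case gives VAF = TF / 2; copy-number loss or gain at the locus shifts the ratio, so doubling a VAF overstates or understates tumor fraction unless the copy-number state is known.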

bio-methylation-cell-type-deconvolution

Estimates cell-type composition from bulk DNA methylation and uses it to defuse the single biggest EWAS confounder. Covers reference-based deconvolution (Houseman constrained-projection, minfi estimateCellCounts2 with FlowSorted.Blood.EPIC + IDOL-optimized libraries, EpiDISH RPC/CBS/CP, 12-cell extended, cord-blood nRBC references, EpiSCORE/hepidish for solid tissue), reference-free correction (ReFACTor, RefFreeEWAS, SVA), using fractions as covariates vs the compositionality/collinearity trap, and cell-type-resolved EWAS (CellDMC, TCA, TOAST, omicwas, HIRE). Use when estimating blood/tissue cell fractions, adjusting an EWAS for composition, choosing a deconvolution reference, or attributing a methylation signal to a cell type. For the EWAS confounder-vs-mediator decision see ewas-design; for the IEAA cell-count adjustment of DNAm age see epigenetic-clocks; for clean beta input see array-preprocessing.

development1,004

bio-hi-c-analysis-loop-calling

Detects focal chromatin loops (point interactions / corner-dots) in balanced Hi-C and Micro-C contact maps and aggregates/validates a loop set. Covers de-novo calling with cooltools dots (HiCCUPS-style 4-background local enrichment with lambda-chunked FDR), chromosight (template-correlation), and Mustache (scale-space blob detection); aggregate peak analysis (APA) via cooltools pileup for confirmation; the depth/resolution prerequisite (de-novo needs ~5-10kb resolution = hundreds of millions to billions of valid pairs); consensus across callers and convergent-CTCF support as validation; and differential loops via union anchors plus chromosight quantify. Use when calling chromatin loops or dots from a cooler, deciding whether a map is deep enough to call de-novo vs running APA on known CTCF/cohesin anchors, building an aggregate peak pileup, comparing loops across conditions, or validating loop calls. For HiChIP/PLAC-seq/PCHi-C protein-anchored data use FitHiChIP/MAPS, not dots.

tools1,004

bio-gene-regulatory-networks-differential-networks

Compare gene co-expression and regulatory networks between biological conditions to find rewired relationships using DiffCorr, DiffCoEx, DINGO/iDINGO, and CoDiNA. Covers the differential-connectivity-is-not-differential-expression distinction, the pairwise multiple-testing explosion, marginal vs partial (direct) rewiring, and the underpowered-rewiring failure mode. Use when comparing co-expression networks between disease vs control, treatment, or developmental stages, or finding hub genes that rewire without changing mean expression. For single-condition modules see coexpression-networks; for differential expression of means see differential-expression/de-results.

development1,004

bio-imaging-mass-cytometry-quality-metrics

Quality control for IMC/MIBI data across pixel, channel, image, slide, and batch levels, covering Poisson-count SNR (cell-level Gaussian-mixture and empty-channel comparison), spillover-matrix QC (the three physical sources), drift and the missing EQ-bead analog, acquisition artifacts, and sample-of-origin batch effects. Use when deciding whether to keep or drop a channel, ROI, or slide, distinguishing a dim antibody from a failed one, reading a spillover matrix, or diagnosing batch-driven clustering before analysis.

testing1,004

bio-imaging-mass-cytometry-data-preprocessing

Load and preprocess imaging mass cytometry (IMC) and MIBI data from raw MCD/TXT through hot-pixel removal, spillover compensation, and variance-stabilizing transformation, covering readimc/steinbock ingestion, NNLS spillover compensation (CATALYST), IMC-Denoise, and the IMC arcsinh-cofactor question. Use when starting analysis from raw MCD files, building per-channel TIFF stacks, compensating channel spillover, choosing an arcsinh cofactor, or preparing single-cell intensities for phenotyping.

development1,004

bio-genome-annotation-prokaryotic-annotation

Annotates bacterial and archaeal genomes (isolates, MAGs, plasmids) with Bakta (active versioned databases, NCBI-compliant output) or Prokka (legacy), producing GFF3/GenBank/EMBL/FASTA with INSDC locus tags. Covers Bakta-vs-Prokka-vs-PGAP-vs-DFAST choice, light-vs-full database tiers, translation-table selection (11/4/25), archaeal and leaderless-gene caveats, the small-ORF blind spot, pseudogene-vs-phase-variation, the pangenome re-annotation trap, and submission compliance. Use when annotating a newly assembled prokaryotic genome, choosing an annotation tool, re-annotating a collection for pangenomics, or preparing annotations for NCBI/DDBJ submission.

tools1,004

bio-genome-annotation-eukaryotic-gene-prediction

Predicts protein-coding gene structures (exons, introns, UTRs) in eukaryotic genomes with BRAKER3 (RNA-seq + protein evidence), BRAKER1/BRAKER2, GALBA (protein-only), Funannotate (fungi), GeMoMa (homology projection), or Helixer/Tiberius (deep-learning ab initio). Covers the evidence-first tool decision, mandatory soft-masking, the training-set-quality-dominates principle, OrthoDB clade-partition selection, the one-isoform-per-locus and missing-UTR traps, merge/split errors, and reference bias against orphan genes. Use when annotating a newly assembled eukaryotic genome, choosing a gene-prediction pipeline based on available evidence, or diagnosing a poor annotation.

tools1,004

bio-methylation-ewas-design

Designs and defends an epigenome-wide association study (EWAS) on 450K/EPIC array or bisulfite methylation - the layer deciding whether a hit is credible. Covers the confounding hierarchy (cell composition covariates as the dominant confounder, batch/Sentrix chip/array position, age/sex, smoking AHRR cg05575921, ancestry/mQTL, reverse causation), chip randomization (no-rescue theorem), surrogate variable analysis sva/SmartSVA, ComBat, RUVm, over-correction, genomic inflation lambda vs GWAS genomic control, BACON bias/inflation, genome-wide significance threshold 450K/EPIC, FWER vs FDR, pwrEWAS power, meta-analysis, EWAS Catalog/Atlas, methylation risk scores. Use when designing an EWAS, choosing a covariate set, randomizing a plate layout, interpreting lambda, applying BACON, setting a threshold, powering a study, or using an MRS. For the per-site test see differential-cpg-testing; for cell fractions see cell-type-deconvolution; for causal mQTL orientation see causal-genomics/mendelian-randomization.

testing1,004
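
The genomic inflation factor lambda mentioned in bio-methylation-ewas-design is conventionally the median observed 1-df chi-square statistic divided by the median expected under the null (about 0.4549). The sketch below assumes per-CpG z-scores as input; the function name is hypothetical and this is not the skill's own implementation.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(z_scores):
    """Return lambda = median(z^2) / median(chi-square, 1 df)."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN

# Hypothetical test statistics for a handful of CpG sites.
print(round(genomic_inflation([0.2, -0.5, 0.7, 1.1, -1.4]), 3))
```

A value near 1 is what a well-calibrated null would give; values well above 1 indicate inflation, though lambda alone does not distinguish confounding from genuine widespread signal.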

bio-phylo-divergence-dating

Estimate divergence times under molecular-clock models with BEAST2, MCMCTree/PAML, TreePL, and LSD2, framing a date as a product of the calibration prior and the clock model far more than of the sequence data. Covers why branch length = rate x time is nonidentifiable so only calibrations convert relative rate-time into absolute age; why the effective (marginal) prior on a calibrated node differs from the density specified, mandating a sample-from-prior run; the fossil-as-minimum rule, soft bounds, tip-dating, and the fossilized birth-death process; the temporal-signal check (TempEst root-to-tip regression + date-randomization) required before dating viruses or ancient DNA; and clock-model choice via the coefficient of variation. Use when dating nodes, calibrating with fossils or sampling dates, choosing a clock or dating engine, or routing topology to modern-tree-inference, posteriors to bayesian-inference, and rooting to tree-manipulation.

testing1,004

bio-hi-c-analysis-hic-data-io

Loads, converts, and manipulates Hi-C contact matrices in cooler format (.cool/.mcool/.scool) and Juicer .hic, using cooler (Python + CLI), hic2cool, and hictk. Covers the single-resolution mcool URI (file.mcool::/resolutions/<bp>), the load-bearing divisive-vs-multiplicative weight-naming rule (KR/VC/VC_SQRT auto-divisive vs cooler's multiplicative weight), what survives .hic<->.cool conversion (FRAG matrices and norm vectors do not), raw-vs-balanced coarsening, the .pairs upper-triangle/chromsize-order contract, and chrom-naming/bin-table provenance. Use when loading a cooler, converting .hic to .mcool, selecting a resolution, building a cooler from pairs or a matrix, coarsening/zoomifying, importing Juicer norm vectors, or debugging all-NaN balanced matrices and chr1-vs-1 empty fetches.

tools1,004

bio-imaging-mass-cytometry-interactive-annotation

Interactive cell annotation and image QC for IMC/MIBI using napari, napari-imc, Mantis Viewer, and cytomapper, covering the pixels-to-cell-table bridge, overlaying masks to catch segmentation/spillover artifacts, inter-annotator variability as the accuracy ceiling, contrast-as-threshold, and building class-balanced ground-truth label sets. Use when manually labeling cells, generating training data for a classifier, QC-ing segmentation on the image, confirming clusters are spatially real, or choosing an annotation viewer.

development1,004

bio-imaging-mass-cytometry-differential-analysis

Compare cell-type composition and spatial features across conditions in IMC/MIBI cohorts with the patient as the experimental unit, covering pseudoreplication, per-patient aggregation, mixed models, compositional (Dirichlet/scCODA) differential abundance, diffcyt, per-image-to-patient spatial differential testing (SpaceANOVA), batch covariates, and FDR. Use when testing whether a cell type or spatial niche differs between groups, avoiding cell-level pseudoreplication, choosing a differential-abundance method, or correctly powering an IMC cohort comparison.

testing1,004

bio-gene-regulatory-networks-perturbation-simulation

Simulate transcription factor perturbation effects on cell state in silico with CellOracle and Dynamo, and predict transcriptional responses to genetic perturbations with GEARS, scGen, and CPA. Covers the direction-not-magnitude principle, local-linear validity, the GRN/velocity error it inherits, baseline discipline (mean and additive baselines), and the validation gap. Use when predicting TF knockout or overexpression effects, ranking driver TFs for fate transitions, or planning perturbation experiments. For GRN construction see multiomics-grn; for experimental Perturb-seq see single-cell/perturb-seq.

testing1,004

bio-phasing-imputation-genotype-imputation

Imputes untyped genotypes against a phased reference panel with Beagle, Minimac4, or IMPUTE5 (array data) or from genotype likelihoods with GLIMPSE2, QUILT2, or STITCH (low-coverage WGS), producing per-variant dosages (DS) with a self-estimated quality (Beagle DR2, Minimac R2, IMPUTE INFO). Covers why the honest output is a dosage posterior not a hard call, why GWAS regresses on DS, why the quality metric is an ESTIMATE of r2 from posterior spread (not validation against truth), the DS/GP/HDS fields, the phasing prerequisite, chunking, chrX ploidy, the Michigan/TOPMed servers (the only access to HRC/TOPMed), and low-coverage WGS as the modern array replacement. Use when increasing variant density for GWAS, harmonizing arrays, inferring untyped variants, or imputing low-coverage sequence. Phase first with haplotype-phasing; prepare the panel with reference-panels; filter with imputation-qc; the GWAS test is population-genetics/association-testing; end-to-end orchestration is workflows/gwas-pipeline.

testing1,004
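
To make the dosage-versus-hard-call distinction in bio-phasing-imputation-genotype-imputation concrete, here is a minimal sketch for a diploid biallelic site. It assumes genotype probabilities are ordered (P(0/0), P(0/1), P(1/1)); the function names and the example values are hypothetical.

```python
def dosage_from_gp(gp):
    """Expected ALT-allele count from diploid genotype probabilities."""
    p_ref, p_het, p_alt = gp
    if abs(p_ref + p_het + p_alt - 1.0) > 1e-3:
        raise ValueError("genotype probabilities should sum to 1")
    return p_het + 2.0 * p_alt

def hard_call(gp):
    """Most probable genotype, coded as ALT-allele count (0, 1, or 2)."""
    return max(range(3), key=lambda g: gp[g])

gp = (0.05, 0.55, 0.40)                 # hypothetical uncertain genotype
print(round(dosage_from_gp(gp), 2))     # posterior-mean dosage
print(hard_call(gp))                    # best-guess genotype discards the uncertainty
```

The hard call collapses an uncertain posterior to a single genotype, whereas the dosage keeps the posterior mean, which is the quantity a regression on DS uses.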

bio-imaging-mass-cytometry-cell-segmentation

Segment single cells from multiplexed IMC/MIBI tissue images using Mesmer/DeepCell, Cellpose, or ilastik+CellProfiler, covering whole-cell vs nuclear segmentation, the summed-membrane-channel decision, nuclear-expansion bias, lateral spillover, resolution-floor parameters, and downstream-proxy evaluation. Use when delineating cells after preprocessing, choosing a segmentation model, building a cell mask for quantification, diagnosing impossible double-positive populations, or troubleshooting over/under-segmentation.

development1,004

bio-pathway-wikipathways

Tests a gene list (ORA, enrichWP) or a ranked gene vector (GSEA, gseWP) against the WikiPathways community-curated pathway collection with clusterProfiler and rWikiPathways. Covers why a WikiPathways result is a snapshot of a live, monthly-updated database (enrichWP/gseWP/gson_WP silently pull data.wikipathways.org/current/), why reproducibility requires pinning a dated GMT via downloadPathwayArchive(date=, format='gmt'), why the WP GMT is Entrez-keyed so symbols and Ensembl silently overlap nothing, why universe=NULL gives a biased all-WP-genes background, how to split the name%version%wpid%org term, and why WikiPathways (CC0, no peer review) complements KEGG/Reactome. Use when running open community-pathway enrichment, covering a non-model WP species, catching disease/drug pathways missing from KEGG/Reactome, or needing a reproducible dated analysis. The gene list comes from differential-expression/de-results; visualize with enrichment-visualization.

development1,004
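
The name%version%wpid%org term mentioned in bio-pathway-wikipathways can be split on the percent sign. The sketch below is in Python rather than R, purely for illustration; the helper name and the example term string are hypothetical, and splitting from the right is an assumption made so that a stray percent sign in a pathway name would not shift the other fields.

```python
def split_wp_term(term):
    """Split a 'name%version%wpid%org' term into its four fields."""
    parts = term.rsplit("%", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '%'-separated fields, got {len(parts)}")
    name, version, wpid, org = parts
    return {"name": name, "version": version, "wpid": wpid, "org": org}

# Hypothetical term in the documented field order.
example = "Cell cycle%WikiPathways_20240410%WP179%Homo sapiens"
print(split_wp_term(example))
```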

bio-phasing-imputation-reference-panels

Selects and prepares the reference panel that phasing/imputation copies haplotypes from (1000 Genomes, HRC, TOPMed, HGDP+1kGP/gnomAD, CAAPA), matching panel ancestry to the target, reconciling genome build and chromosome naming, and running the strand/allele harmonization gate. Covers why ancestry-match beats panel size (imputation can only copy haplotypes the panel contains), why palindromic A/T and C/G SNPs flip strand without erroring, why liftover is a strand-flip generator in between-build inverted regions, that HRC is SNP-only and TOPMed is never downloadable (governance can override accuracy), and panel formats (msav, bref3, imp5). Use when choosing a panel for a target ancestry, preparing or converting a panel, aligning study data, or deciding between downloadable and server-only panels. Phasing is haplotype-phasing; imputation is genotype-imputation; PCA for ancestry is population-genetics/population-structure; HLA panels are clinical-databases/hla-typing.

tools1,004
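
The palindromic-SNP point in bio-phasing-imputation-reference-panels can be shown in a few lines: for A/T and C/G sites the reverse-complement allele pair is the same pair, so a strand mismatch cannot be detected from the alleles alone. This is a minimal sketch with a hypothetical helper name, not the skill's harmonization gate.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_palindromic(ref, alt):
    """True for biallelic SNPs whose alleles are each other's complement."""
    ref, alt = ref.upper(), alt.upper()
    return len(ref) == 1 and len(alt) == 1 and COMPLEMENT.get(ref) == alt

for ref, alt in [("A", "T"), ("C", "G"), ("A", "G")]:
    flipped = (COMPLEMENT[ref], COMPLEMENT[alt])
    print(ref, alt, "flipped:", flipped, "palindromic:", is_palindromic(ref, alt))
```

For A/G the flipped pair (T/C) is visibly different, so a strand error is detectable; for A/T the flipped pair is T/A, which looks like a plain allele swap and therefore passes silently.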

bio-imaging-mass-cytometry-phenotyping

Assign cell types from marker expression in IMC/MIBI data using clustering (PhenoGraph/FlowSOM/Leiden/Pixie), marker-based probabilistic classifiers (Astir), or image-context CNNs (CellSighter), covering the double-positive segmentation artifact, lineage-vs-state markers, the two spillover types, and why a "cell type" in imaging is conditioned on a segmentation guess. Use when phenotyping segmented IMC cells, choosing clustering vs classification, diagnosing implausible double-positive populations, separating lineage from functional markers, or transferring labels across a cohort.

data-ai1,004

bio-phylo-modern-tree-inference

Infers maximum-likelihood phylogenetic trees with IQ-TREE2 and RAxML-NG -- model selection (ModelFinder), branch support (UFBoot2, SH-aLRT), concordance factors (gCF/sCF), partitioning, topology tests, and long-branch-attraction control. Covers why an ML tree inherits every flaw of the assumed model and the fixed alignment, why reported support measures repeatability under resampling and not correctness, why UFBoot uses a >=95 cutoff and not the bootstrap-70 rule, and why a node with UFBoot 100 but gCF ~35 is essentially unresolved ILS rather than a clade. Use when inferring an ML tree, selecting a substitution or partition model, choosing or interpreting support measures, testing an a-priori topology, or diagnosing LBA. Routes model-free distance trees to distance-calculations, posteriors to bayesian-inference, and species trees under ILS to species-trees.

testing1,004

bio-machine-learning-atlas-mapping

Maps query single-cell data onto reference atlases and transfers cell-type labels using scArches surgery (scVI/scANVI), Symphony, Azimuth, CellTypist, scPoli, popV, and foundation models, with explicit out-of-distribution and label-transfer uncertainty. Use when annotating new single-cell datasets against a pre-trained reference, deciding which mapping method fits, or judging whether transferred labels are trustworthy. For de novo clustering and manual annotation see single-cell/cell-annotation; for batch integration without a reference see single-cell/batch-integration.

development1,004

bio-phasing-imputation-imputation-qc

Assesses and filters phasing/imputation output - the quality metrics (Beagle DR2, Minimac R2 and EmpRsq, IMPUTE/GLIMPSE INFO), MAF-stratified filtering, true accuracy by masking, the differential-imputation confound, dosage-based downstream usage, and phasing switch-error QC. Covers why every routine quality score is an ESTIMATE of r2 from the posterior spread (not validation against truth), why it is confounded with MAF so a flat INFO>=0.3 cutoff is a hidden rare-variant filter, why concordance lies for rare variants while masked dosage-r2 by MAF is the gold standard, why separate case/control imputation manufactures false GWAS hits, and that the field name tells the tool (DR2=Beagle, R2=Minimac, INFO=GLIMPSE/IMPUTE). Use when filtering imputed variants before GWAS, validating accuracy, benchmarking phasing against trios, or diagnosing inflated association. Imputation is genotype-imputation; phasing is haplotype-phasing; panel ancestry is reference-panels; the test is population-genetics/association-testing.

tools1,004

bio-workflows-crispr-screen-pipeline

End-to-end pooled and single-cell CRISPR screen analysis from FASTQ to hit genes. Orchestrates library design QC, guide counting, six-stage screen QC (plasmid Gini, replicate Pearson, CEGv2 PR-AUC, copy-number artifact), method-appropriate hit calling across MAGeCK RRA/MLE, BAGEL2, drugZ, JACKS, and Chronos, cancer-cell-line copy-number correction (CRISPRcleanR / Chronos), batch correction for multi-batch screens, and the specialized branches for combinatorial paralog screens, single-cell Perturb-seq, base-editor variant-function screens, prime-editor screens, and in vivo bottleneck-aware screens. Use when analyzing any pooled CRISPR screen end-to-end, matching the hit-calling method to the experimental design, integrating copy-number correction into the pipeline, or branching the workflow for single-cell, combinatorial, base-editor, prime-editor, or in vivo variants.

development1,004

bio-pathway-kegg-pathways

Tests gene lists, ranked vectors, and fold-change vectors against KEGG pathways and modules with clusterProfiler enrichKEGG/enrichMKEGG (ORA), gseKEGG (GSEA), and SPIA/graphite (signed-topology perturbation) in R. Owns the third pathway-analysis generation because KEGG ships signed directed signaling topology (KGML). Covers why a KEGG result is a timestamped join against a live REST API (irreproducible unless pinned with a gson snapshot, not the stale 2012 KEGG.db), why enrichKEGG keyType is kegg/ncbi-geneid not OrgDb ENSEMBL/SYMBOL (zero hits), why organism is a KEGG code (hsa, pae) with prokaryotic locus tags, and why SPIA works only on signaling maps. Use when finding enriched KEGG pathways or modules, scoring signed pathway perturbation, analyzing prokaryotes or non-model organisms via locus tags or KO, comparing conditions with compareCluster, or overlaying data with pathview. The hypergeometric universe lives in go-enrichment; the GSEA engine in gsea.

development1,004

bio-temporal-genomics-circadian-rhythms

Tests and estimates rhythmicity at a PRE-SPECIFIED period (canonically 24h) in time-series omics using cosinor regression (CosinorPy), JTK_CYCLE/ARSER/Lomb-Scargle meta-analysis (MetaCycle meta2d), and non-parametric tests for asymmetric waveforms (RAIN, DiscoRhythm); estimates phase (acrophase), amplitude, and MESOR, and controls FDR with an effect-size (rAMP) filter against over-detection. Use when testing for 24-hour or other known-period oscillations in a single condition (circadian, feeding-fasting, or light-dark experiments) and estimating their phase/amplitude. Not for unknown-period discovery (see temporal-genomics/periodicity-detection) or comparing rhythms between conditions (see temporal-genomics/differential-rhythmicity).

testing1,004

bio-gene-regulatory-networks-scenic-regulons

Infer transcription factor regulons from single-cell RNA-seq with pySCENIC by combining GRNBoost2 co-expression, cisTarget motif-enrichment pruning, and AUCell per-cell activity scoring. Covers the motif-pruning-as-directionality principle, regulon specificity scoring, run-to-run stability, and database/species matching. Use when identifying TF regulons, scoring TF activity per cell, finding master regulators of cell identity, or comparing regulon activity across conditions. For enhancer-driven multiomic GRNs see multiomics-grn; for bulk inference and VIPER protein-activity see grn-inference.

testing1,004

bio-spatial-transcriptomics-spatial-domains

Identify spatially coherent tissue domains (regions like cortical layers, tumor vs stroma) in Visium, Visium HD, Xenium, MERFISH, Slide-seq, and Stereo-seq data with Squidpy, BANKSY, BayesSpace, STAGATE, and GraphST. Use when distinguishing a domain (a region with many cell types) from a cell type (one cell's identity) and a niche (local cell-type composition); choosing a domain method by tissue geometry (laminar/continuous vs high-resolution imaging vs non-contiguous); tuning the spatial-weight knob (BANKSY lambda, BayesSpace smoothing, SpaGCN histology weight, GNN graph radius) to avoid over-smoothing into blobs or under-smoothing into salt-and-pepper; choosing the number of domains k as a biological decision with k±1 sensitivity; and reading the Yuan 2024 benchmark with the DLPFC continuous-laminar caveat.

testing1,004

bio-hi-c-analysis-tad-detection

Detects TAD boundaries from balanced Hi-C contact matrices via the diamond-window insulation score (cooltools insulation) and HiCExplorer hicFindTADs, returning a continuous log2 insulation track, valley-prominence boundary_strength, and Li/Otsu-thresholded is_boundary flags across a list of window sizes. Covers the multi-scale window sweep (sub-TAD to compartment-domain), why the boundary is reproducible but the domain partition is not, cross-condition comparison via differential SCORE not differential partition, and the insulation-vs-compartment orthogonality. Use when calling TADs or domain boundaries, computing insulation scores, choosing a window size, ranking boundary strength, comparing boundaries across conditions, or annotating CTCF-backed boundaries; route domain rendering to hic-visualization and boundary-feature overlap to genome-intervals.

tools1,004

bio-spatial-transcriptomics-spatial-data-io

Loads spatial transcriptomics data from Visium, Visium HD, Xenium, MERFISH/MERSCOPE, CosMx, Slide-seq/Curio, and Stereo-seq into AnnData or SpatialData using spatialdata-io and Squidpy. Use when deciding which platform class is in hand (imaging/in-situ vs sequencing/capture), which reader matches the platform (spatialdata_io.xenium/merscope/cosmx vs squidpy.read.visium/vizgen/nanostring), whether to work from the per-transcript molecule table (the re-segmentable source of truth) or the segmentation-derived per-cell matrix (quality-filtered, inherits all segmentation error), whether a molecule table even exists (spot platforms have none), and how to keep coordinate frames and units (pixel vs micron) registered to histology.

testing1,004

bio-single-cell-hashing-demultiplexing

Assign cells to their sample of origin from cell or nucleus hashing (CITE-seq HTOs, MULTI-seq lipid/cholesterol tags, CellPlex CMOs) and call cross-sample doublets using Seurat HTODemux/MULTIseqDemux, hashsolo, demuxEM, GMM-Demux, and demuxmix. Use when assigning pooled hashed cells back to their sample, calling cross-sample doublets from HTO counts, choosing a demultiplexing method, deciding between hashtag and genetic demultiplexing, or rescuing an oversized Negative pile from weak HTO staining or ambient spillover.

testing1,004

bio-spatial-transcriptomics-spatial-neighbors

Build the spatial neighbor graph that every downstream spatial statistic (Moran's I, neighborhood enrichment, co-occurrence, spatial domains) inherits, using Squidpy. Use when choosing the graph type (kNN vs Delaunay vs fixed-radius vs Visium hex grid) and understanding why it silently changes every downstream result; handling variable cell density (kNN fixes neighbor COUNT, fixed-radius fixes physical DISTANCE -- each distorts the other); getting coordinate units right (pixels vs microns; Visium array coords are not distance); pruning Delaunay long edges across tissue gaps; running the graph sensitivity analysis almost nobody runs; and knowing when planar section neighbors misrepresent a 3D tissue.

development1,004

bio-variant-annotation

Annotates VCF variants with functional consequences, population frequencies, and pathogenicity scores using bcftools annotate/csq, Ensembl VEP, SnpEff, and ANNOVAR. Use when deciding which annotation engine and version to pin, which transcript set to report on (RefSeq vs Ensembl vs MANE Select/Plus Clinical, and why VEP --pick is dangerous clinically), how to reconcile HGVS 3'-shifting with VCF left-alignment, which consequence plus NMD status governs PVS1 eligibility, which single calibrated predictor to use for PP3/BP4 (REVEL, AlphaMissense, CADD, SpliceAI deltas), or how to read gnomAD v2/v3/v4 grpmax filtering allele frequency instead of one global AF cutoff. Not for ACMG combining rules or final classification (see variant-calling/clinical-interpretation).

tools1,004

bio-workflow-management-cwl-workflows

Authors portable, strongly-typed bioinformatics pipelines in the Common Workflow Language (CWL v1.2) as CommandLineTool/Workflow/ExpressionTool documents, validated with cwltool and run at scale on Toil/Arvados/Calrissian. Use when deciding CWL (portability/provenance/regulated) vs Nextflow/WDL/Snakemake; declaring secondaryFiles for indexed companions (.bai/.fai/.dict/.tbi and the caret rule); putting resources/containers under requirements (must-hold) vs hints (advisory) to avoid silent OOM; choosing scatterMethod (dotproduct vs flat_/nested_crossproduct); preferring $(...) parameter refs over ${...} JavaScript for portability; pinning DockerRequirement images; or emitting a CWLProv provenance object for audited/clinical settings.

tools1,004

bio-workflows-biomarker-pipeline

End-to-end biomarker discovery workflow from expression data to validated biomarker panels. Covers feature selection with Boruta/LASSO, leakage-safe cross-validation, calibration, and SHAP interpretation. Use when building and validating diagnostic or prognostic biomarker signatures from omics data.

development1,004

bio-variant-calling-deepvariant

Calls germline SNPs and indels with Google DeepVariant, which reframes variant calling as CNN image classification over multi-channel pileup tensors. Covers platform-specific model selection (WGS, WES, PACBIO, ONT_R104, HYBRID_PACBIO_ILLUMINA), one-shot run_deepvariant vs the three-stage make_examples/call_variants/postprocess_variants pipeline, GPU acceleration of call_variants, DeepTrio for family/trio and de-novo calling, and joint genotyping of gVCFs with GLnexus (not GenotypeGVCFs). Use when deciding DeepVariant vs GATK vs DRAGEN, picking the right --model_type for a sequencing platform, avoiding post-hoc GATK hard filters or BQSR that degrade CNN calls, calling de-novo variants in a trio, merging a DeepVariant cohort, or weighing GIAB-trained benchmark accuracy before clinical deployment.

tools986

bio-consensus-sequences

Generate consensus FASTA sequences by applying VCF variants onto a reference with bcftools consensus, or build viral/amplicon consensus with iVar. Use when reconstructing a sample-specific reference or haplotype, deciding -H haplotype vs IUPAC vs all-ALT projection, masking no-coverage sites so a consensus does not manufacture false reference calls, or setting iVar min-depth/min-frequency policy for surveillance genomes.

tools986

bio-vcf-basics

View, query, and interpret VCF/BCF variant files with bcftools and cyvcf2. Use when inspecting variants, extracting fields with query format strings, converting VCF/BCF, or correctly reading a field -- QUAL (site) vs GQ (genotype) vs PL/GL likelihoods, AD vs DP and allele balance, GT phasing/ploidy/PS and missing-vs-hom-ref, INFO/FORMAT Number A/R/G semantics, symbolic alleles (<DEL>, <NON_REF>, spanning *) and END, or telling a raw gVCF apart from a filtered callset.

tools986

bio-variant-calling-filtering-best-practices

Filters germline and somatic variant callsets at the site and genotype level with GATK VQSR (VQSLOD, truth-sensitivity tranches), VETS/ScoreVariantAnnotations, NVScoreVariants, hard filters with per-annotation thresholds, and bcftools/cyvcf2 expressions, plus Ti/Tv-based QC. Use when deciding between VQSR, hard filtering, and ML recalibration by cohort size and platform, setting SNP vs indel thresholds, replicating the missing-annotation-passes rule so hom-alt sites survive, applying genotype-level GQ/DP filters, or validating filter impact. Not for VCF normalization (see variant-calling/variant-normalization) or summary statistics (see variant-calling/vcf-statistics).

tools986

bio-vcf-manipulation

Combine, split, sort, intersect, and subset VCF/BCF files with bcftools merge, concat, isec, sort, view, and reheader. Use when merging different samples into a cohort VCF, concatenating per-chromosome or per-region call sets for the same samples, intersecting or complementing call sets from different callers, subsetting samples/regions, harmonizing sample names and

tools986

bio-vcf-statistics

Compute and interpret VCF quality-control metrics (Ti/Tv, het/hom, novel/known, missingness, HWE, contamination, relatedness) with bcftools stats, vcftools, plot-vcfstats, and identity tools (somalier, peddy, KING). Use when judging whether a callset is trustworthy, diagnosing a low Ti/Tv or outlier het/hom sample, deciding whether an HWE deviation is error or biology, screening a cohort for sample swaps/contamination/wrong-sex before analysis, or comparing call sets before and after filtering. Not for applying filters (see variant-calling/filtering-best-practices) or normalizing representation (see variant-calling/variant-normalization).

tools986
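
For the Ti/Tv metric listed under bio-vcf-statistics, a minimal pure-Python sketch of the counting rule follows (transitions are A<->G and C<->T; every other single-base change is a transversion). It assumes REF/ALT pairs have already been extracted from the VCF; the function name and example records are hypothetical, and real callsets would normally go through bcftools stats instead.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def titv_ratio(snvs):
    """Return transitions / transversions over (ref, alt) pairs, or None if no transversions."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        bases = PURINES | PYRIMIDINES
        # Skip indels, symbolic alleles, and non-variant records.
        if ref not in bases or alt not in bases or ref == alt:
            continue
        same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
        if same_class:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None

# Hypothetical records: two transitions, one transversion, one deletion (ignored).
print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]))
```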

bio-variant-calling-structural-variant-calling

Call structural variants (>=50 bp deletions, insertions, inversions, duplications, translocations) from short- or long-read data by reconstructing four orthogonal signals (discordant pairs, split reads via the SA tag, read depth, local assembly). Covers Manta, DELLY, LUMPY/smoove, GRIDSS2, SvABA for short reads and Sniffles2, cuteSV, pbsv, dipcall/PAV for long reads, each mapped to the signals it fuses and the blind spots that follow. Use when choosing an SV caller from its signal set and failure modes, decoding the SVLEN-sign / symbolic-vs-BND / CIPOS VCF representation minefield, force-genotyping a cohort matrix instead of unioning discovery VCFs, merging populations with sequence-aware Truvari vs position-only SURVIVOR, parameterizing a Truvari benchmark, or deciding when short-read insertion recall forces a switch to long reads. Not for pure copy-number dosage (see copy-number/cnvkit-analysis).

development986

bio-variant-calling

Call germline SNPs and indels from a BAM/CRAM with bcftools mpileup and call, and select the right calling engine for the job. Use when generating a VCF from aligned reads, choosing between bcftools, GATK HaplotypeCaller, DeepVariant, and DRAGEN, setting ploidy for haploid/organelle/polyploid/sex-chromosome calling, or deciding whether pileup-based calling is good enough versus a local-reassembly caller for indels and difficult regions. Not for cohort joint genotyping (see variant-calling/joint-calling), GATK-specific workflows (see variant-calling/gatk-variant-calling), deep-learning calling (see variant-calling/deepvariant), or somatic/low-VAF detection.

tools986

bio-variant-calling-joint-calling

Joint genotype a cohort of per-sample gVCFs with GATK (HaplotypeCaller -ERC GVCF -> GenomicsDBImport or CombineGVCFs -> GenotypeGVCFs) or GLnexus for DeepVariant gVCFs, producing a squared-off sample-by-site genotype matrix. Use when deciding between joint genotyping and merging single-sample callsets (never bcftools merge as absent==hom-ref), choosing GenomicsDBImport vs CombineGVCFs by cohort size and memory, solving the N+1 problem so a new sample does not force re-calling everyone, understanding cohort rescue of low-coverage het sites, handling the spanning-deletion star allele and GQ/PL recomputation at the joint step, scaling to biobank cohorts by interval sharding, or picking DeepVariant+GLnexus over the GATK path on throughput. Not for single-sample calling (see variant-calling/gatk-variant-calling) or VQSR/hard-filter mechanism (see variant-calling/filtering-best-practices).

tools986

bio-variant-normalization

Left-align and trim indels to parsimonious canonical form, decompose MNPs (atomize), and split multiallelic variants with bcftools norm. Use when comparing variants across callers or cohorts, preparing a VCF for database annotation or ClinVar/dbSNP matching, merging VCFs, reconciling vt-vs-bcftools representation discordance, or resolving the VCF-left-align vs HGVS-3'-rule clash.

tools986

bio-variant-calling-clinical-interpretation

Classify variant clinical significance with the ACMG/AMP germline framework and its 2018-2025 ClinGen refinements (graded PVS1 decision tree, PM2 downgraded to Supporting, PP5/BP6 retired, calibrated PP3/BP4, Bayesian points), the AMP/ASCO/CAP somatic tiers and ClinGen oncogenicity system, ClinVar star-rating and gnomAD grpmax filtering-AF interpretation. Use when deciding germline-vs-somatic framework, applying current (not flat-2015) ACMG points, checking for a gene-specific VCEP specification, judging whether a ClinVar assertion or gnomAD frequency is usable evidence, calibrating a pathogenicity predictor, evaluating PVS1 on the MANE Select transcript, or building a VUS reanalysis loop. Not for functional annotation itself (see variant-calling/variant-annotation).

tools986

bio-temporal-genomics-temporal-grn

Infers directed, time-delayed gene regulatory edges from BULK time-series expression using Granger causality (statsmodels VAR F-test), dynGENIE3 (tree ensembles regressing ODE-derived derivatives; Random Forests by default, Extra-Trees optional), and dynamic Bayesian networks (bnlearn). Use when the output is a RANKED HYPOTHESIS list for perturbation validation, not validated causal edges; deciding Granger vs dynGENIE3 vs DBN by timepoint count and linearity; sizing maxlag against the n>3*maxlag+1 degrees-of-freedom floor; handling stationarity/differencing before Granger; restricting regulators to known TFs; and comparing network rewiring across conditions at matched edge density. Not for single-cell pseudotime GRNs (see gene-regulatory-networks/scenic-regulons) or static co-expression (see gene-regulatory-networks/coexpression-networks).

testing982
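
The n > 3*maxlag + 1 floor quoted in bio-temporal-genomics-temporal-grn can be turned into a quick check on how many lags a time course supports. The helper below is a hypothetical sketch that only rearranges that inequality; it does not fit a VAR model.

```python
def max_supported_lag(n_timepoints):
    """Largest integer maxlag satisfying n_timepoints > 3 * maxlag + 1 (0 if none)."""
    # n > 3*m + 1  <=>  m < (n - 1) / 3  <=>  m <= (n - 2) // 3 for integers.
    return max(0, (n_timepoints - 2) // 3)

for n in (6, 10, 14, 24):   # hypothetical time-course lengths
    print(n, "timepoints ->", "maxlag <=", max_supported_lag(n))
```

A result of 0 means the series is too short for even a single lag under this rule.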

bio-temporal-genomics-periodicity-detection

Discovers a periodic signal of UNKNOWN period in time-series omics data and puts a defensible significance on it, especially when sampling is IRREGULAR (dropped timepoints, pooled harvests) so FFT/Welch/JTK are invalid. Estimates the dominant period with Lomb-Scargle / generalized Lomb-Scargle (scipy, astropy), corroborates with autocorrelation, resolves transient/time-varying periodicity with the wavelet CWT (pywt), and screens genome-wide with false-alarm probabilities under BH FDR. Use when finding an oscillation whose period is not known a priori, analyzing cell-cycle or ultradian rhythms, or handling unevenly sampled time courses. Not for testing a KNOWN 24-hour rhythm (see temporal-genomics/circadian-rhythms).

testing982

bio-temporal-genomics-trajectory-modeling

Models continuous temporal trajectories from BULK or time-resolved omics where the x-axis is measured experimental time: penalized GAMs (mgcv) for smooth trends and changepoint detection (segmented, ruptures) for abrupt regime shifts. Use when deciding between a smooth GAM and a changepoint model; choosing the GAM distribution (nb() plus a library-size offset for raw counts vs Gaussian on vst/log-CPM); setting the basis-dimension c

development982

bio-tcr-bcr-analysis-scirpy-analysis

Integrates single-cell paired TCR/BCR (10x VDJ, AIRR, dandelion, BD Rhapsody) with gene expression in an AnnData/MuData object using scirpy - chain-pairing QC, clonotype definition, clonal expansion, diversity, repertoire overlap, V(D)J usage, and VDJdb specificity. Operates on the awkward-array AIRR model (adata.obsm['airr'], accessed via get.airr after pp.index_chains), not legacy per-chain obs columns. Use when deciding clonotype definition for TCR (exact CDR3-nt identity via define_clonotypes) versus BCR (nucleotide distance clustering via define_clonotype_clusters with normalized_hamming plus same_v_gene/same_j_gene, because somatic hypermutation shatters identity clonotypes); tuning receptor_arms (all vs any), dual_ir, and within_group; filtering chain_qc categories (multichain doublets, orphan dropout, extra-VJ dual-TCR) without biasing clonal-expansion and diversity estimates; and overlaying clonality onto the transcriptomic UMAP.

development982

bio-tcr-bcr-analysis-vdjtools-analysis

Computes immune-repertoire diversity, clonal structure, overlap, and segment usage from TCR/BCR clonotype tables with VDJtools (immunarch as the modern R alternative). Use when deciding which diversity estimator answers a question (q=0 observed richness/chao1/chaoE, q=1 shannonWienerIndex, q=2 inverseSimpson as a Hill profile); normalizing sequencing depth before any cross-sample claim (DownSample or the resampled CalcDiversityStats table); choosing an overlap metric (depth-robust MorisitaHorn/F2 vs depth-biased Jaccard/public counts) and a clonotype match key (-i nt/aa, +/-V/J); summarizing clonality as 1 - normalizedShannonWienerIndex; reading spectratype and V-J usage under primer bias; interpreting public clonotypes; and choosing VDJtools (stable Java CLI) vs immunarch (active tidy R).

tools982
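The Hill profile and the clonality summary named in this entry can be sketched in a few lines of plain Python. This is a minimal illustration of the underlying arithmetic, not VDJtools' own implementation: the function names `hill_number` and `clonality` are hypothetical, no depth normalization is applied, and treating a single-clonotype sample as fully clonal is an assumption.

```python
import math

def hill_number(counts, q):
    """Hill diversity of order q from raw clonotype counts (hypothetical helper)."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in freqs))
    return sum(p ** q for p in freqs) ** (1.0 / (1.0 - q))

def clonality(counts):
    """1 - Shannon entropy normalized by log(number of observed clonotypes)."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    if len(freqs) < 2:
        return 1.0  # assumption: one clonotype is treated as fully clonal
    entropy = -sum(p * math.log(p) for p in freqs)
    return 1.0 - entropy / math.log(len(freqs))

even = [10, 10, 10, 10]    # four equally abundant clonotypes
skewed = [97, 1, 1, 1]     # one dominant clone
for name, counts in (("even", even), ("skewed", skewed)):
    profile = [round(hill_number(counts, q), 3) for q in (0, 1, 2)]
    print(name, profile, round(clonality(counts), 3))
```

Higher orders of q weight abundant clonotypes more heavily, so the profile drops quickly for the skewed sample while staying flat for the even one.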

bio-temporal-genomics-temporal-clustering

Clusters temporally variable genes by expression-profile SHAPE (not significance) using Mfuzz fuzzy c-means, TCseq, DEGreport degPatterns, and tslearn DTW/soft-DTW. Use when grouping pre-selected time-course genes into shared trajectory programs (co-expression modules), choosing between soft vs hard clustering, picking k, selecting a distance metric (Euclidean/correlation/DTW), or interpreting clusters with per-cluster enrichment. Requires temporally variable genes selected FIRST (differential-expression/timeseries-de or a variance filter); clustering is descriptive and downstream of selection, never a test of which genes are dynamic.

tools982

bio-tcr-bcr-analysis-repertoire-visualization

Draws TCR/BCR repertoire figures - V-J chord/circos, CDR3 spectratype, clonal-space stratification, clonal tracking across timepoints, rarefaction/extrapolation curves, overlap heatmaps, and clonotype-similarity networks - and encodes how to read them. Use when choosing between a raw Shannon bar and a rarefaction curve for a diversity comparison; deciding a depth-robust overlap metric (Morisita-Horn) vs a set metric (Jaccard) for a heatmap; setting the distance threshold that defines a clonotype-similarity network; interpreting a Gaussian vs skewed spectratype as polyclonal vs clonally expanded; or laying out clonal-space and clone-tracking plots. Covers VDJtools PlotFancyVJUsage/RarefactionPlot, R circlize and iNEXT, and matplotlib/seaborn recipes.

tools982
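To illustrate why this entry contrasts a depth-robust overlap metric with a set metric, here is a small sketch comparing Morisita-Horn and Jaccard on the same repertoire sampled at two depths. The clonotype labels and counts are made-up illustration data, and both functions are hypothetical helpers rather than calls from VDJtools or any plotting library.

```python
def morisita_horn(a, b):
    """Morisita-Horn overlap for two dicts mapping clonotype -> count."""
    total_a, total_b = sum(a.values()), sum(b.values())
    fa = {k: v / total_a for k, v in a.items()}
    fb = {k: v / total_b for k, v in b.items()}
    shared = sum(fa[k] * fb[k] for k in fa.keys() & fb.keys())
    denom = sum(p * p for p in fa.values()) + sum(p * p for p in fb.values())
    return 2.0 * shared / denom

def jaccard(a, b):
    """Set overlap: shared clonotypes over all observed clonotypes."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Made-up example: the shallow sample misses the two rarest clonotypes.
deep = {"c1": 500, "c2": 300, "c3": 150, "c4": 30, "c5": 20}
shallow = {"c1": 50, "c2": 30, "c3": 15}

print(round(morisita_horn(deep, shallow), 3), round(jaccard(deep, shallow), 3))
```

Morisita-Horn is driven by the abundant clonotypes, which both depths capture, whereas Jaccard penalizes every rare clonotype lost to shallow sequencing.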

tcr-bcr-analysis/mixcr-analysis

--- name: bio-tcr-bcr-analysis-mixcr-analysis description: Align V(D)J reads and assemble TCR/BCR clonotypes with MiXCR, driven by a chemistry-matched preset. Use when choosing/auditing the preset for a library (5'RACE/template-switch vs multiplex-primer amplicon -> rigid vs floating boundaries; RNA vs gDNA -> --rna/--dna; bulk vs 10x single-cell; UMI vs no-UMI -> tag pattern and barcode collapse; kit presets Takara/NEBNext/QIAseq/BD/MiLaboratory); assembling clonotypes by CDR3 vs VDJRegion; set

development982

bio-tcr-bcr-analysis-specificity-annotation

Maps TCR/BCR receptor sequences toward candidate antigen specificity and clusters repertoires by shared-specificity signal, while enforcing that a database match or a cluster label is a HYPOTHESIS, not a specificity call. Use when deciding among database annotation (VDJdb/McPAS/IEDB+TCRMatch, requiring V-gene and HLA concordance plus a confidence score) versus sequence clustering (tcrdist3 meta-clonotypes, GLIPH2, GIANA, clusTCR, which find enrichment not per-receptor labels) versus generation-probability nulls (OLGA Pgen, IGoR, SONIA Ppost) for testing public/convergent/shared claims; and when guarding against overclaiming specificity, base-rate false positives from bare CDR3 matches, unpaired beta-only annotation, ML predictor failure on unseen epitopes, and ignored MHC restriction. TCR-focused with a BCR/antibody note (SHM, conformational epitopes, IGHV3-53/3-66 public clonotypes). Keywords CDR3, pMHC, HLA restriction, cross-reactivity, meta-clonotype, Pgen, public clonotype, convergent recombination.

tools982

bio-systems-biology-model-curation

Validates, gap-fills, and standardizes genome-scale metabolic models using memote for consistency and annotation scoring and COBRApy for manual curation, including mass/charge balance, energy-generating-cycle detection, dead-end resolution, GPR fixes, and SBML/SBO/MIRIAM annotation. Use when improving a draft model, gap-filling to a target medium, detecting erroneous ATP-from-nothing cycles, interpreting a memote score correctly (consistency vs biological validity), validating predictions against measured growth/essentiality, or preparing a model for publication.

testing979

bio-systems-biology-gene-essentiality

Performs in-silico single and double gene deletions, condition-dependent essentiality, and synthetic-lethality screens on genome-scale metabolic models with COBRApy, evaluating gene-protein-reaction rules and comparing FBA re-optimization against MOMA/ROOM minimal-adjustment. Use when predicting essential genes, finding synthetic-lethal pairs for drug targets, choosing a growth cutoff, deciding FBA vs MOMA vs ROOM for a knockout, making essentiality medium-specific to match an experiment, or validating predictions against Keio/Tn-seq/CRISPR screens with MCC.

development979
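Validating essentiality predictions with MCC, as this entry mentions, reduces to a confusion matrix over the gene set. The sketch below assumes predictions and screen results are available as Python sets of gene IDs; the gene names are placeholders, and returning 0.0 when the denominator vanishes is a common convention rather than something the entry specifies.

```python
import math

def confusion(predicted, observed, all_genes):
    """Return (tp, tn, fp, fn) for predicted vs observed essential gene sets."""
    tp = len(predicted & observed)
    fp = len(predicted - observed)
    fn = len(observed - predicted)
    tn = len(all_genes) - tp - fp - fn
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when the coefficient is undefined
    return (tp * tn - fp * fn) / denom

# Placeholder gene IDs for illustration only.
all_genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
predicted_essential = {"g1", "g2", "g3"}
observed_essential = {"g1", "g2", "g4"}

counts = confusion(predicted_essential, observed_essential, all_genes)
print(counts, round(mcc(*counts), 3))
```

MCC is preferred over raw accuracy here because essential genes are usually a small minority, so a model predicting "non-essential" everywhere would still score high accuracy.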

bio-systems-biology-metabolic-reconstruction

Builds draft genome-scale metabolic models from an annotated genome using CarveMe (top-down carving of a BiGG universal model) or gapseq (bottom-up pathway-evidence reconstruction), then loads and sanity-checks the draft in COBRApy. Use when creating a model for an organism without one, choosing between CarveMe and gapseq, gap-filling to a target medium, understanding why a draft that grows is still only a hypothesis, handling BiGG-vs-ModelSEED namespace mismatch, or preparing a draft for curation and community modeling.

development979

bio-systems-biology-context-specific-models

Builds tissue-, cell-type-, and condition-specific metabolic models by integrating transcriptomic or proteomic data into a generic genome-scale model, using extraction algorithms (GIMME, iMAT, INIT/tINIT, MADE, E-Flux, CORDA, FASTCORE) via troppo and corda in Python or the COBRA Toolbox/RAVEN in MATLAB. Use when pruning a generic model to a context, choosing an extraction method and expression threshold, mapping expression through GPR rules to reactions, deciding whether an objective is required (GIMME vs iMAT), avoiding the growth-objective trap for non-proliferating tissue, or judging how much of a context-specific model is real signal versus an artifact of the threshold and method.

tools979

bio-systems-biology-flux-balance-analysis

Performs flux balance analysis (FBA), flux variability analysis (FVA), parsimonious FBA (pFBA), loopless FBA, flux sampling, and production envelopes on genome-scale metabolic models with COBRApy, solving the biomass-maximization linear program under a defined medium. Use when predicting growth rate on a carbon source, computing flux ranges and alternative optima (FVA), setting exchange bounds and minimal media, distinguishing a real growth phenotype from an under-constrained model, sampling the flux solution space, or choosing between FBA, pFBA, loopless FBA, and sampling for a flux distribution.

data-ai979

bio-workflow-management-wdl-workflows

Authors bioinformatics pipelines in WDL (Workflow Description Language) run by Cromwell or miniwdl, targeting the GATK/Broad and Terra/AnVIL/BioData Catalyst cloud ecosystem, with tasks, workflows, scatter-gather parallelism, structs, and a runtime block that sizes the cloud VM. Use when deciding to target Terra/AnVIL/GATK/WARP (chosen for the ecosystem, not the language); sizing runtime disks dynamically for a fresh-per-task cloud VM (ceil(size(f)*factor)+buffer); choosing preemptible vs on-demand VMs by task length and idempotency; picking Cromwell (production, cloud, call-caching) vs miniwdl (local dev, miniwdl check linting, readable errors); enabling and debugging call-caching silent-miss modes; pinning Docker by digest for reproducibility and cache stability; or scattering an array for parallel fan-out.

development979
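The disk-sizing rule quoted in this entry, ceil(size(f)*factor)+buffer, is simple arithmetic that is worth checking by hand before a task lands on a fresh VM. The following Python sketch mirrors that expression outside of WDL; the function name, the default factor, and the default buffer are illustrative assumptions, and input sizes are assumed to be already expressed in GB.

```python
import math

def disk_size_gb(input_sizes_gb, factor=2.0, buffer_gb=20):
    """Mirror of ceil(size(f) * factor) + buffer for a list of input sizes in GB."""
    return math.ceil(sum(input_sizes_gb) * factor) + buffer_gb

# One 12.3 GB input, outputs assumed to roughly double the footprint.
print(disk_size_gb([12.3]))
# Several inputs with a larger expansion factor and a smaller fixed buffer.
print(disk_size_gb([1.5, 0.2], factor=3.0, buffer_gb=10))
```

The multiplicative factor covers intermediate and output files that scale with the input, while the additive buffer covers fixed overhead that does not.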

bio-workflow-management-nf-core-pipelines

Runs and configures curated nf-core community Nextflow pipelines (rnaseq, sarek, atacseq, methylseq, ampliseq, taxprofiler, fetchngs) reproducibly, pinning the pipeline revision with -r and selecting a container engine and institutional config via -profile. Use when deciding to adopt a community pipeline versus author one from scratch; picking a pipeline and pinning its -r revision; selecting -profile test/docker/singularity/conda plus an institutional config from nf-core/configs; building and validating a samplesheet CSV against the pipeline schema (nf-schema); choosing --genome/iGenomes versus custom references; configuring resources and max_memory for SLURM/AWS Batch; using -resume and -stub; and reading MultiQC outputs.

development979

bio-workflow-management-snakemake-workflows

Authors reproducible bioinformatics pipelines with Snakemake - rules wired by output-file pattern, wildcards and expand() for sample fan-out, checkpoints for runtime-unknown outputs, resource/retry escalation, and conda/container software deployment on HPC and cloud. Use when deciding rule-based (Snakemake) vs channel/dataflow (Nextflow) authoring; wiring rules by OUTPUT-file pattern rather than imperative order; using wildcards + expand() for sample fan-out and constraining them to stop silent mis-routing; adding checkpoints when the set of outputs is unknown until a step runs (dynamic DAG); diagnosing why a job reran (or did not) under the mtime-plus-provenance trigger set; escalating memory on retry for OOM-killed jobs; and porting a Snakemake 7 `--cluster`/remote-provider command to the Snakemake 8+ executor-plugin and storage-plugin model (snakemake-executor-plugin-slurm) with `--software-deployment-method`.

tools979
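Memory escalation on retry can be expressed as a plain Python callable, since Snakemake resource values may be functions of `wildcards` and `attempt`. The sketch below is one possible policy under stated assumptions: the base value, the doubling schedule, and the cap are illustrative choices, and the function would still need to be referenced from a rule's `resources:` block in a workflow run with retries enabled.

```python
BASE_MEM_MB = 4000    # assumed starting request
MAX_MEM_MB = 64000    # assumed ceiling so retries cannot grow without bound

def escalating_mem_mb(wildcards, attempt):
    """Double the memory request on each retry, up to MAX_MEM_MB.

    `attempt` starts at 1 for the first execution of a job.
    """
    return min(BASE_MEM_MB * 2 ** (attempt - 1), MAX_MEM_MB)

# Show the schedule for the first six attempts (wildcards unused here).
print([escalating_mem_mb(None, attempt) for attempt in range(1, 7)])
```

Doubling rather than adding a fixed increment keeps the number of wasted retries small for jobs whose memory need is far above the initial guess.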

bio-systems-biology-community-metabolic-modeling

Builds and simulates multi-species metabolic community models from member genome-scale models, using MICOM for abundance-weighted steady-state community FBA and cooperative tradeoff, SMETANA for cross-feeding and competition scoring, and SteadyCom/COMETS for common-growth-rate and dynamic simulation. Use when modeling a microbiome or co-culture, predicting cross-feeding and competition, abundance-weighting members from metagenomics, choosing steady-state vs dynamic community modeling, avoiding the compartment-pooling artifact, or judging how member-model quality and namespace propagate into community predictions.

development979

bio-systems-biology-strain-design

Computes metabolic-engineering strain designs on genome-scale models with StrainDesign (OptKnock, RobustKnock, minimal cut sets, OptCouple) and cameo (heuristic knockout and FSEOF over/under-expression targets), finding gene/reaction interventions that couple product formation to growth. Use when designing knockouts to overproduce a target chemical, choosing between OptKnock and RobustKnock, growth-coupling a product so evolution maintains it, computing minimal cut sets, finding amplification targets with FSEOF, or understanding why MILP strain design needs a strong solver and why a design is only a hypothesis.

development979

bio-workflow-management-nextflow-pipelines

Authors reproducible Nextflow DSL2 pipelines built on reactive dataflow, where processes communicate only through channels and execution order is not guaranteed. Use when deciding channel/dataflow (Nextflow) vs rule-based (Snakemake) authoring; wiring queue vs value channels and fixing shared-reference exhaustion with .first(); composing DSL2 modules and subworkflows with take/main/emit; selecting container/conda profiles and pinning images by digest for portability across local/SLURM/LSF/AWS Batch/Google Batch/Kubernetes executors; diagnosing why -resume misses the cache (nondeterministic input order, mtime on network filesystems, mutable :latest tags) with cache 'lenient' and -dump-hashes; managing work/ vs publishDir and dynamic retry escalation; and choosing whether to adopt an nf-core community pipeline or author from scratch.

development979

bio-structural-biology-structure-modification

Modifies protein structures in place with Biopython Bio.PDB - transforms coordinates, strips waters/heteroatoms, overloads the B-factor column, renumbers, and builds entities. Use when applying a rotation matrix and needing to know whether it is row-convention (Entity.transform, Superimposer) or column-convention (REMARK 350 / _pdbx_struct_oper_list assembly operators) so geometry is not silently mirrored; when overloading B-factors with pLDDT/conservation for coloring and needing to preserve the destroyed originals; when stripping solvent by HETFLAG (r.id[0]) rather than residue name so catalytic metals and cofactors survive; and when building or copying entities through StructureBuilder/Select without breaking SMCRA parent-child links or the (hetflag, resseq, icode) id tuple. Keywords transform, rotation matrix, occupancy, assembly operators.

development974

bio-epitranscriptomics-m6a-peak-calling

Calls m6A peaks from MeRIP-seq / m6A-seq paired IP-vs-input data using exomePeak2 (transcript-aware, GC-bias-corrected Poisson GLM), MeTPeak (HMM over sliding windows), MACS3/MACS2 with --nomodel --broad --keep-dup all (genome-wide broad alternative), and DRACH motif enrichment via HOMER or ggseqlogo as a sanity check (NOT a filter). Covers BED12 vs narrowPeak output, exonic vs intronic peak handling, multi-tool reconciliation (intersection vs union), the m6A-vs-m6Am ambiguity at 5'UTR peaks that antibody methods cannot resolve, and orthogonal validation (miCLIP/GLORI/m6A-SAC-seq/m6Anet). Use when calling peaks from paired IP/input genome BAMs, choosing exomePeak2 (transcript-aware default) vs MACS3 (broad genomic) vs MeTPeak (HMM-smoothed low-coverage), confirming DRACH enrichment as a sanity check on the peak set, reconciling differing peak sets across tools, validating MeRIP peaks against single-base methods, interpreting 5' peaks where m6Am contamination is possible, or recommending a consensus strategy.

tools974

bio-epitranscriptomics-modification-visualization

Visualises RNA-modification data with transcript-feature metagene plots (Guitar GuitarPlot; MetaPlotR; deepTools computeMatrix scale-regions), peak-centred heatmaps (ComplexHeatmap; deepTools plotHeatmap), IP-vs-input paired browser tracks (log2 IP/input bigWig via deepTools bamCompare; pyGenomeTracks; Gviz; IGV/UCSC track hubs), DRACH sequence-logo plots (ggseqlogo; MEME), feature-distribution stacked bars, and volcano/MA plots for differential modification. Establishes stop-codon enrichment in the metagene plot as the biological QC anchor for any MeRIP dataset (Dominissini 2012; Meyer 2012). Use when producing the canonical metagene plot with stop-codon enrichment as a QC anchor, building paired IP/input genome-browser tracks at single-locus resolution, plotting peak-centred heatmaps clustered by condition, summarising peak distribution across transcript features, generating DRACH motif logos as sanity checks, rendering volcano plots of differential m6A, or reproducing the stop-codon enrichment plot.

tools974

bio-crispr-screens-prime-editing-screens

Designs and analyzes pooled prime-editor (PE) screens for installing precise genetic variants without bystander confounding. Covers pegRNA design with PRIDICT and PRIDICT2 for predicting per-pegRNA editing efficiency, pegRNA architecture (spacer + scaffold + PBS + RTT), PE2/PE3/PE3b/PEmax/PEAR variants, MOSAIC in situ saturation mutagenesis, the PRIME pooled-screen methodology (Erwood/Doman 2023; ~3,699 ClinVar variant screens), chromatin context as a primary determinant of PE efficiency, scaffold-incorporation and indel byproduct quantification with CRISPResso2, and the cross-modal validation strategy of PE + base-editor screens for variant function. Use when designing a pegRNA library for variant installation, choosing between BE and PE for a specific edit, predicting pegRNA efficiency before library synthesis, analyzing PE screen output, distinguishing intended-edit from scaffold-incorporation, or scaling PE screens to thousands of variants.

tools974

bio-epidemiological-genomics-transmission-inference

Infers person-to-person transmission from pathogen genomes using outbreaker2, TransPhylo, phybreak, BadTrIP, SCOTTI, BEASTLIER, and SNP-distance / cluster-picker approaches (HIV-TRACE for HIV; transcluster). Defines outbreak clusters using pathogen-specific SNP thresholds (NOT a universal cutoff -- TB <=12 SNPs; MRSA <=15; C. difficile <=2; Klebsiella <=21), models within-host diversity and transmission bottlenecks, integrates contact-tracing data, distinguishes generation from serial interval, and attributes source via Bayesian source attribution (islandR). Use when investigating outbreaks for who-infected-whom, defining SNP-cluster outbreak definitions, accounting for unsampled intermediates, choosing between outbreaker2 (rich epi data) and TransPhylo (genomic-only after a dated phylogeny), running source attribution between host populations, calling HIV-TRACE thresholds appropriate to the local subtype, or distinguishing recent transmission from reactivation in TB or chronic HIV.

development974

bio-epitranscriptomics-m6anet-analysis

Detects m6A modifications from Oxford Nanopore direct-RNA-seq (DRS) signal using m6Anet (multiple-instance-learning over DRACH 5-mer signal). Covers the upstream pipeline (Dorado/Guppy basecalling -> minimap2 map-ont -> nanopolish eventalign -> m6anet dataprep -> m6anet inference), per-site vs per-read probability including the mod_ratio stoichiometry column, the DRACH-only constraint, minimum-coverage thresholds (20-50 reads/site), multi-condition comparison via xPore/Nanocompore/ELIGOS, Dorado native modification calling (RNA004, 2024+), and the cDNA-vs-DRS distinction (cDNA Nanopore CANNOT detect modifications). Use when calling m6A from ONT DRS without immunoprecipitation, choosing m6Anet vs xPore vs Nanocompore vs ELIGOS vs Dorado native, interpreting probability_modified vs mod_ratio vs per-read probabilities, deciding between m6Anet (known DRACH sites) and Dorado/Remora (genome-wide screening), pinning RNA002 vs RNA004 chemistry and basecaller versions, or troubleshooting eventalign/dataprep failures.

development974

bio-structural-biology-binding-site-detection

Detects putative ligand-binding pockets and druggable cavities de novo on an apo protein structure with fpocket, P2Rank, CASTp, and DoGSiteScorer, ranking them by druggability/ligandability score. Use when detecting cavities on an apo structure with no bound ligand; choosing geometric pocket enumeration (fpocket alpha-spheres, CASTp) vs ML ligandability scoring (P2Rank, DoGSiteScorer); recognizing that a geometric cavity is a hypothesis, not automatically a functional or druggable site (may be a crystal-additive or non-functional cleft); knowing druggability scores were trained on holo sets and under-detect apo, shallow, and cryptic pockets; detecting cryptic or transient pockets over an MD or conformational ensemble (mdpocket); and detecting on a predicted model whose pocket-lining rotamers are the least reliable atoms. Keywords binding site, pocket, cavity, druggability, ligandability, fpocket, P2Rank, CASTp, DoGSiteScorer, apo, cryptic pocket, alpha sphere, mdpocket.

development974

bio-epitranscriptomics-m6a-differential

Identifies differential m6A methylation between conditions from MeRIP-seq paired IP/input data using exomePeak2 (GC-bias-aware differential via its bam_ip/bam_input control + bam_treated_ip/bam_treated_input treatment arms), QNB beta-binomial, MeTDiff HMM, and RADAR, plus the paired-symmetric edgeR/DESeq2-on-peak-counts route when batch/lot covariates need fixed-effect handling that exomePeak2's API does not accept. Covers paired vs unpaired vs interaction designs, batch confounding and per-lot meta-analysis, the stoichiometry-vs-expression-vs-IP-efficiency confound, and effect-size filtering against under-powered N=2 designs. Use when comparing m6A across two or more conditions, choosing between exomePeak2/QNB/RADAR/MeTDiff for a design, handling batch confounding when exomePeak2's API is too rigid, distinguishing real hyper/hypo-methylation from expression shifts, applying effect-size thresholds, or planning orthogonal stoichiometry validation (GLORI/SAC-seq/m6Anet mod_ratio).

development974

bio-epitranscriptomics-merip-preprocessing

Aligns and QCs methylated-RNA-immunoprecipitation (MeRIP / m6A-seq) IP and input libraries using STAR or HISAT2 splice-aware mapping, samtools sort/index, IP/input matched-pair tracking, antibody-lot metadata recording, replicate concordance via deepTools multiBamSummary + plotCorrelation, IP enrichment QC via plotFingerprint and per-transcript IP/input ratio distributions, library-complexity saturation curves via PreSeq, and the explicit do-NOT-deduplicate convention for standard non-UMI MeRIP. Use when preparing paired IP and input BAM files for exomePeak2 / MeTPeak / MACS3 peak calling, evaluating MeRIP replicate concordance and IP enrichment, deciding whether to deduplicate (standard MeRIP typically NOT), choosing genome-vs-transcriptome alignment for downstream peak vs m6Anet workflows, recording antibody clone and lot metadata for cross-batch reconciliation, detecting failed IPs via saturation curves and IP/input distribution shape, or generating IP-over-Input bigWig tracks for visualisation.

tools974

bio-structural-biology-modern-structure-prediction

Predicts protein and complex structures with deep-learning models (ESMFold, AlphaFold2/ColabFold, AlphaFold3, Chai-1, Boltz-1/2) and reconciles them with confidence metrics. Use when choosing a predictor by input and question rather than novelty (ESMFold single-chain, no-MSA, fast, metagenomic-scale vs AlphaFold3/Chai-1/Boltz for complexes, ligands, nucleic acids, ions, PTMs); recognizing that MSA depth is the dominant accuracy determinant so ESMFold trades accuracy for speed and degrades on orphan proteins; gating a complex on ipTM plus inter-chain PAE, not per-chain pLDDT; reading pLDDT as local confidence, PAE as inter-domain/inter-chain positioning, pTM as global fold; knowing a single prediction is one dominant conformer not an ensemble (no apo/holo, allosteric, or fold-switch states), that these are not variant-effect/ddG/affinity engines, and that a confident prediction is a hypothesis, not an experiment. Keywords ESMFold, AlphaFold3, Chai-1, Boltz-1, ColabFold, ipTM, PAE, pLDDT, MSA depth.

testing974

bio-structural-biology-structure-navigation

Navigates the Bio.PDB SMCRA hierarchy (Structure-Model-Chain-Residue-Atom) safely, surfacing the heterogeneity it hides by default. Use when deciding how to handle altloc/DisorderedAtom conformers before a distance or RMSD, indexing residues insertion-code-safe with the full (hetflag, resseq, icode) tuple, choosing the ATOM/observed vs SEQRES/canonical vs UniProt sequence, selecting the right Model for an NMR ensemble, filtering waters/hetero/metals correctly, and reconciling auth vs label numbering. Keywords SMCRA, altloc, DisorderedAtom, insertion code, SEQRES, PPBuilder, auth_seq_id.

development974

bio-structural-biology-alphafold-predictions

Retrieves and interprets AlphaFold Protein Structure Database (AFDB) models by UniProt accession, reading pLDDT and PAE confidence correctly. Use when treating pLDDT as PER-RESIDUE confidence (not global accuracy) and recognizing a long low-pLDDT stretch as an intrinsically disordered region rather than a modeling error; reading PAE to segment confident domains and judge inter-domain/relative-position confidence that high mean pLDDT cannot certify; recognizing a static AFDB model carries NO ligands, ions, cofactors, PTMs, quaternary assembly, or alternative conformations (pLDDT sits in the B-factor column with opposite polarity to thermal motion); and deciding an AFDB entry vs re-running prediction. Keywords AlphaFold DB, pLDDT, PAE, B-factor column, intrinsic disorder, UniProt, Foldseek.

development974
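Reading pLDDT as a per-residue quantity, as this entry stresses, can be sketched by banding each score and locating long low-confidence stretches. The band cutoffs follow the commonly used AlphaFold DB coloring scheme (90 / 70 / 50); the minimum run length of 10 residues is an arbitrary illustrative assumption, and the score list is made-up data rather than values parsed from a real model file.

```python
def plddt_band(score):
    """Map a per-residue pLDDT score to its confidence band."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"

def low_confidence_runs(plddt, threshold=50.0, min_length=10):
    """Return (start, end) 0-based inclusive index pairs of long sub-threshold runs."""
    runs, start = [], None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(plddt) - start >= min_length:
        runs.append((start, len(plddt) - 1))
    return runs

# Made-up profile: folded core, a 12-residue candidate disordered stretch, a short dip.
scores = [92.0] * 5 + [35.0] * 12 + [80.0] * 4 + [40.0] * 3
print([plddt_band(s) for s in (95, 85, 60, 35)])
print(low_confidence_runs(scores))
```

A long run flagged this way is a candidate intrinsically disordered region; the short three-residue dip is ignored because it falls below the assumed minimum length.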

bio-structural-biology-geometric-analysis

Measures geometric properties of protein structures with Biopython Bio.PDB - interatomic distances, distance matrices, bond and dihedral angles (phi/psi/chi, Ramachandran), superposition and RMSD, center of mass, radius of gyration, and solvent accessible surface area (SASA). Use when deciding that RMSD depends on BOTH the superposition and the atom selection (a global all-atom RMSD is dominated by flexible loops and hinge motion and is NOT a cross-protein similarity metric); choosing the metric that matches the question (RMSD for same-molecule displacement, TM-score for same-fold, lDDT for superposition-free local model quality - the quantity pLDDT predicts); recognizing Superimposer needs an equal-length ordered atom-to-atom correspondence; and reporting SASA only alongside its probe radius (1.4A water, Shrake-Rupley) with a preference for relative SASA. Keywords RMSD, TM-score, lDDT, SASA, Shrake-Rupley, superposition, Kabsch, dihedral, Ramachandran, radius of gyration.

development974
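Of the quantities listed in this entry, the radius of gyration has the most compact definition: the mass-weighted root-mean-square distance of atoms from the center of mass. The sketch below works on plain coordinate tuples instead of Bio.PDB objects so it stays self-contained; the function names are hypothetical, and unit masses are assumed when none are supplied.

```python
import math

def center_of_mass(coords, masses):
    """Mass-weighted mean position of a list of (x, y, z) tuples."""
    total = sum(masses)
    return tuple(
        sum(m * c[axis] for c, m in zip(coords, masses)) / total
        for axis in range(3)
    )

def radius_of_gyration(coords, masses=None):
    """sqrt( sum(m_i * |r_i - r_com|^2) / sum(m_i) )."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal weighting
    com = center_of_mass(coords, masses)
    weighted_sq = sum(
        m * sum((c[axis] - com[axis]) ** 2 for axis in range(3))
        for c, m in zip(coords, masses)
    )
    return math.sqrt(weighted_sq / sum(masses))

# Two points 2 A apart: equal masses, then a 3:1 mass ratio.
pair = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(radius_of_gyration(pair), round(radius_of_gyration(pair, [3.0, 1.0]), 4))
```

Unlike RMSD, this quantity needs no superposition or atom correspondence, which is why it can be compared across unrelated structures.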

bio-structural-biology-interface-analysis

Maps protein-protein and protein-ligand interfaces with Bio.PDB, computing contact residues and buried surface area (BSA). Use when choosing a contact cutoff and stating its rationale (heavy-atom 4-5A vs CA-CA 8A vs a SASA-based definition); deciding a contact list is not an interface and computing buried surface area (dSASA/BSA) instead; distinguishing a genuine biological interface from a crystal-packing artifact; identifying ligand-contact or epitope residues; and computing on the biological assembly rather than the asymmetric unit. Keywords interface, buried surface area, BSA, contacts, NeighborSearch, PISA, crystal packing, epitope, binding site, ShrakeRupley.

testing974

bio-structural-biology-structure-validation

Judges whether a macromolecular model (or a region of it) is reliable enough to build on, using resolution, R-free, B-factors, MolProbity geometry, and predicted-model confidence with Bio.PDB. Use when deciding if a structure or a specific region is trustworthy before docking/mechanism/measurement; reading resolution, R-work vs R-free and the R-free-minus-R-work overfitting gap; sanity-checking per-residue and mean B-factors; flagging clashscore, Ramachandran and rotamer outliers and cis non-proline peptides; validating a PREDICTED (AlphaFold/ESMFold) model via pLDDT bands and PAE before docking or molecular replacement; and interpreting cryo-EM global-vs-local resolution (FSC 0.143 half-map vs 0.5 map-model) or an NMR ensemble spread. Keywords validation, resolution, R-free, B-factor, MolProbity, clashscore, Ramachandran, rotamer, pLDDT, PAE, wwPDB, cryo-EM local resolution.

development974

bio-structural-biology-structure-io

Reads, writes, downloads, and converts macromolecular structures with Biopython Bio.PDB. Use when choosing a format (mmCIF/PDBx vs legacy PDB vs BinaryCIF) for a structure that may exceed PDB's ~62-chain / 99,999-atom limits; when residue numbers do not match the paper because of auth_* vs label_* numbering (MMCIFParser defaults auth_residues=True); when metadata (resolution, method, R-free) is missing because Bio.PDB drops it and MMCIF2Dict is needed; when the deposited coordinates are the asymmetric unit and the biological assembly must be downloaded separately; when downloading from RCSB (files.rcsb.org, PDBList); and when a legacy MMTF path is dead (RCSB retired MMTF July 2024, use BinaryCIF).

development974

bio-structural-biology-structure-preparation

Prepares a deposited or predicted structure for docking, molecular dynamics, or electrostatics by adding hydrogens, assigning protonation and tautomer states, and filling missing atoms and short loops with PDBFixer, reduce, PROPKA, and PDB2PQR. Use when adding hydrogens an X-ray model never resolved; assigning His HID/HIE/HIP tautomers, Asn/Gln/His 180-degree flips, and Cys/Lys/Asp/Glu pKa-shifted protonation at a stated pH and microenvironment rather than assuming standard pKa values at pH 7; filling missing side-chain atoms and modeling short missing loops as disorder hypotheses; making a receptor docking- or MD-ready and recording what was built; preparing a predicted model after trimming low-pLDDT regions; and writing a PQR for Poisson-Boltzmann electrostatics. Keywords PDBFixer, reduce, PROPKA, PDB2PQR, protonation, tautomer, missing atoms, hydrogens, pKa, docking prep, MD prep.

development974

bio-epidemiological-genomics-pathogen-typing

Assigns isolate identity at the right resolution for the question -- ANI/Mash species triage, 7-locus MLST historical comparability, cgMLST/wgMLST outbreak resolution (chewBBACA, BIGSdb, Ridom SeqSphere, EnteroBase HierCC), in-silico serotyping (SISTR/SeqSero2 Salmonella, SerotypeFinder E. coli, Kaptive Klebsiella, SeroBA pneumococcus, spa+SCCmec S. aureus), and lineage callers (TB-Profiler/Mykrobe barcode for MTBC, Pangolin + Nextclade for SARS-CoV-2, PopPUNK GPSC for S. pneumoniae). Use when typing bacterial isolates for surveillance or outbreak investigation, choosing between cgMLST allele distance and core-SNP distance for cluster definition, harmonising calls across schemas/database versions, assigning MTBC lineage with the Napier 90-SNP barcode, calling Salmonella serovar via SISTR with monophasic Typhimurium awareness, running Pangolin UShER mode with pangolin-data version pinning, or selecting a typing resolution to match the surveillance question.

development974

bio-epidemiological-genomics-phylodynamics

Estimates time-scaled phylogenies, molecular-clock rates, effective reproduction number R_e, and population dynamics from dated pathogen genomes using TreeTime (maximum-likelihood) and BEAST2 (Bayesian; strict/relaxed clocks; coalescent, Bayesian-Skyline, Skygrid, Birth-Death-Skyline, and sampled-ancestor priors; structured coalescent via MASCOT). Covers root-to-tip clock QC via TempEst, date-randomisation tests, recombination masking via Gubbins/ClonalFrameML before clock inference for recombining bacteria, BDSKY origin-vs-rootHeight pitfalls, sampling-bias correction, multi-chain convergence diagnostics, and reconciling phylodynamic R_e with case-based R_t. Use when dating outbreak origins, estimating substitution rates, inferring R_e through time, building time-calibrated Nextstrain Augur trees, choosing between strict and relaxed clocks, fitting Birth-Death-Skyline models, diagnosing temporal-signal failure, running MASCOT for structured-population analyses, or using UShER for pandemic-scale placement.

development974

bio-epidemiological-genomics-variant-surveillance

Assigns pathogen lineages (SARS-CoV-2 Pangolin UShER mode; Nextclade clade + QC; pango-designation alias resolution) and tracks variant frequencies over time using Nextstrain (Augur + Auspice), wastewater deconvolution (Freyja, COJAC, alcov, lineagespot), lineage-fitness modelling (multinomial logistic), and recombinant detection (3SEQ, RDP4, Bolotie). Covers Pangolin pangolin-data and Nextclade dataset version pinning (mandatory; lineage-defining mutations change with dataset), Freyja barcode forward-only date constraint, ARTIC primer scheme churn (V3/V4/V4.1/V5.3.2/Midnight) with dropout regions, and recombinant X-prefix designation lag. Use when assigning Pango lineages and Nextclade clades to viral consensus sequences, building Nextstrain Augur surveillance pipelines, deconvolving wastewater into lineage frequencies with Freyja, tracking lineage frequencies over time, handling ARTIC primer dropouts, or running surveillance for SARS-CoV-2/influenza/Mpox/RSV/H5N1/measles.

development974

bio-spatial-transcriptomics-spatial-deconvolution

Estimates per-spot cell type composition of spatial transcriptomics mixtures (Visium, Slide-seq, Stereo-seq) from an scRNA-seq reference with cell2location, RCTD, SPOTlight, stereoscope, SpatialDWLS, or reference-free STdeconvolve. Use when deciding whether a platform even needs deconvolution (the resolution fork -- a 55um Visium spot is a 1-10-cell MIXTURE -> deconvolve, but a Xenium/MERFISH/CosMx cell is already single -> segment instead, and running deconvolution there invents fractions that do not exist); choosing cell2location (absolute abundance) vs RCTD/SPOTlight/stereoscope/SpatialDWLS (proportions only) by output and runtime; matching the scRNA reference to tissue and condition (the reference IS the result -- a missing cell type is silently misassigned to its nearest neighbor with no error flag); and handling compositional outputs that sum to 1 with CLR/ILR rather than naive per-type t-tests.

testing957
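The centered log-ratio (CLR) transform mentioned in this entry maps proportions that sum to 1 into an unconstrained space where ordinary statistics behave better. The sketch below is a minimal stdlib version; the pseudocount used to handle zero fractions is an assumption (its value affects results and should be chosen deliberately), and the spot composition is made-up data.

```python
import math

def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio: log of each part minus the mean log across parts."""
    logs = [math.log(p + pseudocount) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Made-up per-spot cell type fractions, including one zero.
spot = [0.6, 0.3, 0.1, 0.0]
transformed = clr(spot)
print([round(v, 3) for v in transformed])
```

CLR values sum to zero within each spot, so per-type comparisons across spots are no longer forced into spurious negative correlation by the sum-to-1 constraint.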

bio-spatial-transcriptomics-spatial-preprocessing

Quality control, filtering, and normalization for spatial transcriptomics (Visium, Visium HD, Xenium, MERFISH/MERSCOPE, CosMx, Slide-seq) with Squidpy and Scanpy. Use when setting QC floors that do NOT delete real low-count imaging cells (an scRNA min_counts=500 floor deletes nearly every Xenium cell, whose vector is tens-to-low-hundreds of transcripts); deciding whether to normalize at all when library size carries spatial biology rather than pure technical depth; choosing cell-volume/area normalization over Pearson residuals for skewed targeted panels; reading negative-control-probe / blank-barcode false-discovery rates; and inspecting QC spatially on the tissue rather than only in violins.

development957

bio-spatial-transcriptomics-spatial-statistics

Detects spatially variable genes, spatial autocorrelation, and cell-type colocalization for spatial transcriptomics using Squidpy with PySAL/esda for local statistics. Use when choosing an SVG method by its null and scaling (SpatialDE/SPARK GP variance-component vs SPARK-X/nnSVG linear vs Moran/Geary graph autocorrelation); separating genes that are spatially variable because of cell-type composition from genes regulated within a cell type; choosing the right autocorrelation statistic (global Moran/Geary vs Getis-Ord hot/cold spots vs local LISA and its FDR trap); and choosing a colocalization null strong enough to defeat the abundance/compartment confound (conditional or toroidal vs the weak Squidpy default permutation).

content-media957
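The spatial-statistics entry above contrasts graph autocorrelation statistics such as global Moran's I. The statistic itself can be sketched in plain Python, assuming a dense, symmetric, binary neighbor matrix; the function name and toy data are illustrative, and this is not the Squidpy or PySAL implementation:

```python
def morans_i(values, weights):
    """Global Moran's I for values on a graph.

    `weights[i][j]` is the spatial weight between observations i and j
    (here assumed binary adjacency with a zero diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / w_total) * (cross / denom)

# Four spots on a line: 0-1, 1-2, 2-3 are neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
clustered = morans_i([1, 1, 5, 5], chain)    # similar values adjacent
alternating = morans_i([1, 5, 1, 5], chain)  # dissimilar values adjacent
```

Positive values indicate that neighbors resemble each other and negative values that they differ. Significance still requires a null (permutation or analytic), which this sketch does not compute.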

bio-spatial-transcriptomics-spatial-visualization

Plots spatial transcriptomics expression, clusters, and annotations on tissue using Squidpy and Scanpy. Use when choosing the plotter and spot size by platform fork (sc.pl.spatial / sq.pl.spatial_scatter with real scalefactors and capture diameter for spot/capture data like Visium and Slide-seq, versus molecule/segmentation overlays for imaging/FOV data like Xenium, MERFISH, and CosMx); getting the histology coordinate-frame transform right (micron<->pixel, scalefactors) so points land on the image; and avoiding the honest-visualization traps where interpolation/KDE manufactures spatial pattern not in the data, oversized markers fake tissue coverage, jet and other non-uniform colormaps distort structure, and non-metric UMAP/tSNE distances are misread as spatial conclusions.

data-ai957

bio-spatial-transcriptomics-high-resolution-binning

Reconstructs single cells from sub-cellular spatial capture units (Visium HD 2um bins, Stereo-seq DNB spots, Slide-seqV2 beads) by aggregating bins UP into cells rather than deconvolving a mixture DOWN. Use when choosing a bin size and recognizing the sparsity-vs-mixture dilemma (2um bins are too sparse to cluster, but binning to 8/16um re-creates the multi-cell mixture deconvolution was meant to escape); deciding between morphology-driven cell reconstruction (Bin2cell -- StarDist/Cellpose nuclei on a registered H&E/DAPI image, then assign 2um bins to nuclei) and fixed-bin aggregation by whether a co-registered cell image exists; recognizing this as the INVERSE of deconvolution (bin UP, not mix DOWN -- this is the AMBIGUOUS regime of the resolution fork); and handling each platform (Visium HD has an image so reconstruct, Slide-seqV2 has no per-bead image so aggregate or deconvolve, Stereo-seq depends on a registered stain).

development957

bio-spatial-transcriptomics-image-analysis

Segments cells/nuclei and extracts image features from imaging spatial transcriptomics (Xenium, MERFISH/MERSCOPE, CosMx) and H&E/IF tissue images using Cellpose, StarDist, Baysor, and Squidpy. Use when choosing a segmentation strategy (DAPI nucleus + expansion vs membrane-stain whole-cell vs transcript-aware Baysor/proseg vs segmentation-free SSAM) given the available stain; judging whether transcript spillover is fabricating false co-expression and short-range cell-cell signal; and deciding whether the derived cell-by-gene matrix is trustworthy before downstream typing, DE, or ligand-receptor analysis.

development957

bio-spatial-transcriptomics-spatial-proteomics

Analyzes multiplexed antibody-imaging data (CODEX/PhenoCycler, MIBI-TOF, IMC, CyCIF, Opal/Vectra mIF) as continuous protein intensity rather than transcript counts, using scimap and squidpy. Use when choosing an intensity transform/normalization (arcsinh cofactor vs z-score vs percentile -- NOT log1p-of-counts) and correcting channel spillover and antibody-batch effects; deciding whether to phenotype by gating or by clustering on intensities; recognizing that a bounded antibody panel makes marker absence uninformative; treating whole-cell segmentation (Mesmer) as the dominant error source; and knowing which platform applies and when to defer to the imaging-mass-cytometry skills for the IMC pipeline.

development957
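The spatial-proteomics entry above names the arcsinh cofactor transform as an alternative to log1p for continuous intensities. A minimal sketch, assuming per-cell mean intensities for one channel in a list; the default `cofactor=5.0` is a placeholder assumption and would normally be tuned per platform and channel:

```python
import math

def arcsinh_transform(intensities, cofactor=5.0):
    """Apply asinh(x / cofactor) to each intensity.

    The transform is roughly linear near zero and logarithmic for
    large values, and it is defined at zero and for negative
    (background-subtracted) intensities, unlike log.
    """
    return [math.asinh(x / cofactor) for x in intensities]

# Hypothetical intensities, including a negative background-subtracted value.
channel = [-1.0, 0.0, 5.0, 500.0]
transformed = arcsinh_transform(channel)
```

A smaller cofactor stretches the low-intensity range and a larger one compresses it, so the cofactor effectively sets where the linear regime ends.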

bio-spatial-transcriptomics-spatial-multiomics

Integrates spatial RNA with a second modality (protein, ATAC, or histone marks) on spatial CITE-seq, DBiT-seq, spatial-ATAC, or Visium CytAssist data. Use when deciding vertical (same-pixel co-profiling -> WNN/MOFA joint factors) versus diagonal (serial adjacent sections -> registration via PASTE/STalign) integration; recognizing that modalities from serial sections are DIFFERENT cells so joint same-cell methods do not apply; handling a bounded antibody/feature panel where absence is uninformative; or treating a pixel/spot as a multi-cell mixture rather than a single cell.

development957

bio-spatial-transcriptomics-spatial-communication

Maps cell-cell communication and ligand-receptor co-expression in spatial transcriptomics (Visium, Xenium, MERFISH, CosMx, Slide-seq) with Squidpy ligrec, COMMOT, stLearn, CellChat-spatial, and NicheNet. Use when choosing a method by whether spatial distance is actually modeled (squidpy ligrec is space-blind cluster-permutation vs COMMOT optimal-transport is distance-aware vs stLearn neighborhood vs CellChat-spatial filter) and by secreted-vs-contact-dependent range; choosing the ligand-receptor database knowingly because it drives the result as much as the algorithm; guarding against segmentation-spillover circularity that fabricates short-range hits; treating every ligand-receptor score as a co-expression hypothesis on a confidence ladder, not validated signaling; correcting for thousands of pair-by-cell-type-pair permutation tests; and recognizing that a targeted imaging panel rarely contains the relevant ligands and receptors so a "no communication" call is uninformative.

development957

bio-single-cell-markers-annotation

Detect cluster marker genes and assign manual cell type labels in single-cell RNA-seq using Scanpy (Python) and Seurat (R). Use when finding genes that distinguish clusters, ranking markers for annotation, scoring gene signatures, hand-labeling clusters, or deciding between Wilcoxon marker ranking and pseudobulk condition DE.

development954

bio-single-cell-differential-abundance

Test whether cell-type proportions or composition changed between conditions in single-cell data using Milo (miloR), scCODA, sccomp, and propeller. Use when comparing cell-type proportions / composition between conditions, asking which populations expanded or contracted with treatment or disease, running neighborhood-level (cluster-free) abundance testing, or guarding against compositional shifts that masquerade as differential expression.

testing954

bio-small-rna-seq-smrna-preprocessing

Trims kit-specific 3' adapters, strips UMIs or 4N degenerate ends, size-selects, and collapses small RNA-seq reads (miRNA, piRNA, tRF) with cutadapt or fastp. Use when choosing the kit's 3' adapter; setting the size window (18-26 nt miRNA vs 24-32 nt piRNA); deciding whether a library carries a true UMI (QIAseq) versus a 4N debiasing spacer (NEXTflex); reading the read-length histogram to judge library quality; or deciding whether to collapse identical reads before mapping.

development954

bio-single-cell-lineage-tracing

Reconstructs single-cell lineage trees and clonal relationships from CRISPR/Cas9 scars, static expressed barcodes (LARRY/CellTag), or somatic mtDNA mutations using Cassiopeia, Startle, and CoSpar. Use when building a phylogeny from barcode scars, choosing a tree-reconstruction solver, handling homoplasy and dropout, grouping clones from mtDNA, integrating clone with transcriptomic state, or judging whether a state-based fate call is trustworthy.

development954

bio-single-cell-scatac-analysis

Analyze single-cell ATAC-seq with Signac/ArchR (R) and SnapATAC2 (Python alternative). Use when processing scATAC fragments, choosing a framework, calling consensus peaks, running TF-IDF/LSI while diagnosing the depth component, scoring chromVAR motif deviations against GC-matched backgrounds, detecting homotypic vs heterotypic doublets, or deciding whether to binarize the count matrix.

development954

bio-single-cell-multimodal-integration

Integrate multimodal single-cell data (CITE-seq RNA+protein, 10x Multiome RNA+ATAC, unpaired/diagonal RNA+ATAC) and choose the right joint method. Use when classifying an integration task by anchor structure (paired vs unpaired), denoising CITE-seq ADT background before joint embedding, picking between WNN, totalVI, MultiVI, MOFA+, GLUE, or Seurat v5 bridge integration, or diagnosing why a modality dominates a joint clustering.

development954

bio-single-cell-preprocessing

Quality control, ambient-RNA handling, normalization, and feature selection for single-cell RNA-seq using Scanpy (Python) and Seurat (R). Use when filtering low-quality cells with MAD-adaptive thresholds, setting tissue-aware mito cutoffs, removing ambient RNA (SoupX/CellBender/DecontX), choosing a normalization (shifted-log vs scran vs sctransform vs Pearson residuals), selecting highly variable genes, or deciding whether to scale and regress out covariates.

development954
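The single-cell preprocessing entry above mentions MAD-adaptive thresholds for filtering low-quality cells. One way to sketch the idea, assuming a per-cell QC metric (for example log total counts) and a hypothetical cutoff of 5 median absolute deviations; both the function name and the `n_mads` default are illustrative rather than taken from the skill:

```python
import statistics

def mad_outliers(values, n_mads=5.0):
    """Flag values more than `n_mads` MADs away from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    lower = med - n_mads * mad
    upper = med + n_mads * mad
    return [v < lower or v > upper for v in values]

# Hypothetical log total counts for seven cells; the last is very low.
log_counts = [8.1, 8.3, 7.9, 8.0, 8.2, 8.4, 3.0]
flags = mad_outliers(log_counts)
```

Because the threshold is derived from each sample's own distribution, it adapts to tissue and depth in a way a fixed cutoff cannot. It is usually applied per sample and per metric.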

bio-single-cell-perturb-seq

Analyze Perturb-seq / CROP-seq single-cell CRISPR screens. Use when assigning guides as a mixture problem, removing non-perturbed escaper cells with Mixscape, choosing a calibrated test (SCEPTRE conditional resampling) over naive DE, quantifying effect size with E-distance, separating compositional shifts from within-state expression change, or judging whether a perturbation-prediction foundation model actually beats a baseline.

testing954

bio-single-cell-doublet-detection

Detect and remove doublets (two or more cells in one droplet) from single-cell RNA-seq using scDblFinder (R), Scrublet (Python), and DoubletFinder (R). Use when flagging artificial intermediate populations before clustering, setting the expected doublet rate from recovered-cell counts, running detection per sample before integration, choosing between simulate-and-score methods, or interpreting a non-bimodal score histogram.

development954

bio-single-cell-cell-communication

Infers ligand-receptor cell-cell communication from scRNA-seq with a consensus-first workflow (LIANA), plus CellPhoneDB specificity tests, CellChat pathway probabilities, and NicheNet downstream ligand-activity. Use when ranking ligand-receptor interactions between cell types, comparing communication across conditions, asking which ligand drives a receiver response, or deciding which CCC method and resource to trust.

development954

bio-single-cell-clustering

Dimensionality reduction and graph-based clustering for single-cell RNA-seq with Scanpy (Python) and Seurat (R). Resolves which algorithm to use (Leiden vs Louvain), how many PCs and neighbors to set, how to sweep and validate resolution, when a split is over-clustering, and why post-clustering marker p-values are not valid inference. Use when clustering cells, choosing a clustering resolution, deciding whether two clusters are one population, building a UMAP/tSNE, or judging whether clusters are real.

development954

bio-single-cell-batch-integration

Integrate multiple scRNA-seq samples or batches with Harmony, scVI/scANVI, Seurat (CCA/RPCA), fastMNN, Scanorama, or BBKNN. Resolves which method to use for the dataset size and design, how strongly to correct, when integration is the wrong move (confounded batch/biology), how to score integration with scIB metrics without gaming them, and why corrected expression must not be used for differential expression. Use when integrating batches or datasets, choosing an integration method, diagnosing over-correction, or judging integration quality.

testing954

bio-single-cell-metabolite-communication

Infers metabolite-mediated cell-cell communication from scRNA-seq by scoring enzyme-to-sensor pairs (MEBOCOST), with metabolic flux (scFEA), FBA state (Compass), and neurotransmitter (NeuronChat) alternatives. Use when studying metabolic crosstalk between cell types, predicting metabolite secretion and sensing, or deciding which metabolic-communication method fits and how speculative the result is.

testing954

bio-small-rna-seq-trf-pirna-profiling

Profiles non-miRNA small RNAs - tRNA-derived fragments (tRFs/tsRNAs), piRNAs, and rRNA/snoRNA-derived species - with MINTmap, unitas, SPORTS, and proTRAC. Use when annotating all small-RNA classes in a library; quantifying tRFs at locus resolution where tRNA loci are redundant (exclusive vs ambiguous); testing the piRNA ping-pong signature; deciding whether a species is a processed functional RNA or a degradation fragment; or judging whether the prep could even capture 5'-OH/cyclic-phosphate classes.

tools954

bio-small-rna-seq-mirge3-analysis

Rapidly quantifies known miRNAs, isomiRs, tRFs, and A-to-I editing with miRge3.0 by aligning collapsed reads to curated miRBase or MirGeneDB libraries. Use when choosing miRBase versus MirGeneDB as the reference; deciding whether to collapse isomiRs to the parent miRNA or keep 5'-isomiRs separate (they shift the seed and retarget); confirming the organism is among the six supported species; or remembering that RPM output is for display only and raw counts go to DESeq2/edgeR.

development954

bio-single-cell-data-io

Read, write, create, and convert single-cell objects across AnnData (Python), Seurat (R), and SingleCellExperiment (R). Use when loading 10X Cell Ranger output (raw vs filtered), importing or exporting h5ad/RDS/h5mu/zarr, building AnnData or Seurat objects from matrices, moving objects between Python and R, or debugging lost layers, transposed matrices, or mangled gene names during conversion.

development954

bio-single-cell-cnv-inference

Infer large-scale copy-number alterations from tumor single-cell or single-nucleus RNA-seq to separate malignant from normal cells and call subclones, using inferCNV, copyKAT, Numbat, and SCEVAN. Use when separating malignant from normal cells in a tumor scRNA-seq dataset, inferring chromosome-arm CNVs or aneuploidy from expression, calling tumor subclones from single cells, choosing a CNV-inference method (reference-based vs reference-free, expression-only vs allele-aware), or deciding which cells are tumor before downstream analysis.

testing954

bio-single-cell-trajectory-inference

Infers developmental trajectories, pseudotime, RNA velocity, and directed fate probabilities from single-cell data using PAGA, Slingshot, Monocle3, DPT, Palantir, scVelo, and CellRank 2. Use when ordering cells along a differentiation continuum, choosing a trajectory method by topology, rooting pseudotime, estimating RNA velocity direction, computing fate probabilities near a bifurcation, or judging whether an inferred trajectory is real.

development954

bio-small-rna-seq-target-prediction

Predicts and prioritizes miRNA target genes with seed-based tools (miRanda, TargetScan, miRDB) and experimentally validated databases (miRTarBase, multiMiR). Use when deciding that a predicted target is a hypothesis not a finding; ranking by the right score (weighted context++, mirSVR, miRDB); raising confidence by intersecting predictions with inversely-correlated mRNA DE; weighing validated (CLIP/reporter) over predicted evidence; or avoiding the circular enrichment of unfiltered target lists.

tools954

bio-small-rna-seq-differential-mirna

Tests miRNAs for differential expression with DESeq2 or edgeR using small-RNA-aware normalization and filtering. Use when deciding which normalization survives a library dominated by a few hyper-abundant miRNAs (compositional fragility); choosing DESeq2 vs edgeR vs a compositional method; setting a lower prefilter than mRNA; handling biofluid data with no endogenous normalizer; or remembering that RPM is for display and TDMD can make a miRNA drop without transcriptional repression.

development954

bio-single-cell-cell-annotation

Automated reference-based cell type annotation for single-cell RNA-seq using CellTypist, SingleR, Azimuth, scANVI, and scmap to transfer labels from a reference. Use when annotating cell types from a reference atlas or pretrained model, transferring labels onto a query, assessing prediction confidence and rejection, or triaging whether an unexpected cluster is a novel type versus a doublet, low-quality, or batch artifact.

testing954

bio-small-rna-seq-mirdeep2-analysis

Discovers novel miRNAs and quantifies known miRNAs with miRDeep2 by scoring genome-mapped read stacks against the Dicer/Drosha biogenesis signature. Use when deciding whether a study needs de novo discovery at all versus known-miRNA quantification; choosing the species and related-species miRBase references; reading the miRDeep2 score as a signal-to-noise hypothesis rather than a fixed cutoff; or filtering novel candidates against tRNA/rRNA loci to reject the classic false positives.

testing954

bio-seq-objects

Create and manipulate Seq, MutableSeq, and SeqRecord objects using Biopython. Use when creating sequences from strings, modifying sequence data in-place, building annotated records for file output, or debugging post-1.78 Bio.Alphabet and immutability errors.

development952

bio-paired-end-fastq

Handle paired-end FASTQ files (R1/R2) using Biopython while keeping mates synchronized. Use when working with Illumina paired reads, synchronizing pairs, filtering both mates together with orphan routing, interleaving/deinterleaving, or matching mates by read name.

development952

bio-batch-processing

Process many sequence files in batch (count, merge, split, convert, summarize) with memory-safe streaming and on-disk indexing using Biopython, pysam, or pyfastx. Use when iterating over a directory of FASTA/FASTQ files, merging or splitting datasets, building random access across many or huge files, or automating per-file operations without exhausting RAM.

development952

bio-sequence-slicing

Slice, extract, and concatenate biological sequences and annotated records using Biopython. Use when extracting subsequences by position, splicing exons into a transcript, joining sequences, or carrying a sub-region of an annotated record (with quality scores and features) into a new record.

development952

bio-transcription-translation

Transcribe DNA to RNA and translate to protein using Biopython, with NCBI codon-table selection, CDS validation, and six-frame ORF finding. Use when converting a CDS or ORF to its amino-acid sequence, selecting a non-standard (mitochondrial, bacterial, ciliate) genetic code, validating a coding sequence, or scanning all reading frames.

development952

bio-reverse-complement

Generate reverse complements and complements of DNA/RNA sequences using Biopython, including IUPAC ambiguity codes, gapped alignments, and minus-strand features. Use when working with the opposite strand, building reverse primers, normalizing strand orientation before alignment, or extracting a coding sequence from a minus-strand feature.

development952
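The reverse-complement entry above covers IUPAC ambiguity codes and gapped alignments. The underlying mapping can be sketched without Biopython, assuming DNA input; the table and function names below are illustrative, and Biopython's `Seq.reverse_complement()` remains the route the skill describes:

```python
# Complement table for the IUPAC nucleotide codes, upper and lower case.
# Gap characters such as "-" are not in the table and pass through unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

ambiguous = reverse_complement("ATGCRN")   # "NYGCAT"
gapped = reverse_complement("AAC-GT")      # "AC-GTT"
```

Each ambiguity code maps to the code for the complementary base set (R = A/G becomes Y = C/T), while S, W, and N are their own complements.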

bio-codon-usage

Analyze codon usage and calculate CAI (Codon Adaptation Index), RSCU, and Nc with Biopython, and produce naive max-CAI codon-optimized sequences. Use when scoring a gene's codon bias against a host, optimizing a CDS for heterologous expression, or studying synonymous codon selection.

development952

bio-read-sequences

Read biological sequence files (FASTA, FASTQ, GenBank, EMBL, ABI, SFF) with Biopython Bio.SeqIO, choosing between streaming, in-memory, and on-disk-indexed access. Use when parsing sequence files, iterating multi-record files, randomly accessing records by ID in large files, or maximizing parse throughput.

development952

bio-sequence-statistics

Calculate assembly and sequence statistics (N50/L50, auN, NG50/NGA50, length distribution, GC content with ambiguity handling, summary reports) using Biopython. Use when analyzing sequence datasets, generating QC reports, or comparing genome assemblies.

development952
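The sequence-statistics entry above lists N50/L50 and auN among the assembly metrics. As a small sketch of how these are defined, assuming contig lengths in a plain list (the function names are illustrative, not the skill's API):

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, sorted longest first, reaches half the assembly size;
    L50 is how many contigs that takes.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, rank
    raise ValueError("lengths must be non-empty")

def au_n(lengths):
    """Area under the Nx curve: length-weighted mean contig length."""
    return sum(l * l for l in lengths) / sum(lengths)

contigs = [100, 200, 300, 400, 500]
n50, l50 = n50_l50(contigs)   # (400, 2)
aun = au_n(contigs)
```

Unlike N50, auN changes smoothly when a single contig is split or joined, which makes it less sensitive to where the 50% boundary happens to fall.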

bio-motif-search

Find sequence motifs, degenerate IUPAC patterns, and transcription-factor binding sites in DNA/RNA using Biopython and regex, including position weight matrix (PWM/PSSM) scoring. Use when locating regulatory elements, counting overlapping motif occurrences, scanning for binding-site matches above a significance threshold, or reading motif matrices from JASPAR/MEME/TRANSFAC files. For restriction enzyme sites, use restriction-analysis/restriction-sites.

development952

bio-write-sequences

Write biological sequences to files (FASTA, FASTQ, GenBank, EMBL) using Biopython Bio.SeqIO. Use when saving sequences, creating new sequence files, or outputting modified records.

development952

bio-format-conversion

Convert between sequence file formats (FASTA, FASTQ, GenBank, EMBL, Stockholm) and re-encode FASTQ quality offsets using Biopython Bio.SeqIO. Use when changing a file format for a downstream tool, fixing FASTQ quality encoding (Phred+33 vs Phred+64 vs Solexa), or when a conversion risks silently dropping annotations or quality scores.

tools952

bio-sequence-properties

Calculate nucleotide and protein sequence properties (GC content, GC skew, molecular weight, melting temperature, isoelectric point, instability, hydropathy) with Biopython. Use when analyzing sequence composition, computing primer Tm, estimating DNA or protein mass, or profiling protein biophysical properties.

development952

bio-compressed-files

Read, write, and index compressed sequence files (gzip, bzip2, xz, BGZF) with Biopython and bgzip/samtools. Use when working with .gz, .bz2, or .bgz sequence files, when random access into a compressed FASTA/FASTQ is needed, or when SeqIO.index/faidx/tabix rejects a plain .gz. Covers the BGZF-vs-gzip seekability asymmetry, the 'rt'-not-'rb' handle trap, virtual offsets, and gzip-to-BGZF conversion.

tools952

bio-filter-sequences

Filter and select sequences by criteria (length, ID, GC content, N content, motifs, patterns, description) using Biopython, streaming so large files never load into RAM. Use when subsetting a FASTA/FASTQ file, removing unwanted or low-quality records, or selecting records by specific criteria. Use the paired-end-fastq skill instead whenever the input is paired R1/R2 reads.

development952

bio-fastq-quality

Work with FASTQ quality scores using Biopython - access Phred scores, filter and trim by quality, compute per-position profiles, and convert between Sanger/Phred+33, Solexa, and Illumina/Phred+64 encodings. Use when analyzing read quality, filtering or trimming low-quality bases, generating quality reports, or deciding which FASTQ quality encoding a file uses before parsing.

development952
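The FASTQ-quality entry above refers to converting between Phred+33 and Phred+64 encodings. The offset arithmetic can be sketched directly, assuming the input really is Phred+64 (legacy Illumina) and not Solexa, whose scores use a different formula and are not handled here; the function name is illustrative:

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    scores = [ord(ch) - 64 for ch in quality]
    if any(q < 0 for q in scores):
        # Characters below '@' cannot be Phred+64; the file is probably
        # already Phred+33 or uses Solexa scoring.
        raise ValueError("quality string is not valid Phred+64")
    return "".join(chr(q + 33) for q in scores)

converted = phred64_to_phred33("hhh@")   # Q40, Q40, Q40, Q0 -> "III!"
```

The guard matters because applying the wrong offset does not fail on its own: it silently shifts every score by 31 and corrupts any downstream quality filter.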

bio-rna-structure-structure-probing

Processes experimental RNA structure probing data (SHAPE-MaP, DMS-MaPseq) into per-nucleotide reactivity profiles with ShapeMapper2, then uses them as soft restraints on thermodynamic folding. Covers reagent and readout choice (SHAPE vs DMS, mutational-profiling vs RT-stop), the three control samples, per-transcript normalization, the Deigan vs Zarringhalam pseudo-energy models, in-cell versus in-vitro interpretation, and multi-conformation deconvolution. Use when converting probing reads to reactivities; deciding SHAPE versus DMS parameters; judging whether low reactivity means base-paired or protein-bound; or detecting whether an RNA populates more than one structure.

development950

bio-rna-structure-secondary-structure-prediction

Predicts RNA secondary structure with ViennaRNA, treating the Boltzmann ensemble (partition function, base-pair probabilities, centroid, MEA, stochastic samples) as the object rather than a single MFE fold. Covers consensus folding from alignments (RNAalifold), SHAPE-constrained folding, RNA-RNA interaction (RNAcofold/RNAduplex/RNAup), local and linear-time methods for long RNA, and pseudoknot-aware tools. Use when folding an RNA and choosing between MFE, centroid, MEA, or ensemble sampling; judging whether a single structure is well-defined; folding long RNAs where a global MFE is meaningless; handling suspected pseudoknots; or weighing thermodynamic versus comparative versus deep-learning prediction.

tools950

bio-rna-structure-ncrna-search

Searches for non-coding RNA homologs and classifies RNA families with Infernal covariance models against Rfam, scoring sequence AND secondary-structure conservation jointly. Use when deciding whether a covariance model is the right tool versus BLAST/nhmmer (structured ncRNA versus lncRNA or mature miRNA); choosing the Rfam gathering threshold over a flat E-value; resolving clan overlaps; building and calibrating a custom CM from a structure-annotated alignment; or preferring a family-specialized tool (tRNAscan-SE, barrnap) over a generic Rfam scan.

tools950

bio-rna-structure-covariation-analysis

Tests whether a proposed or predicted RNA secondary structure is supported by evolutionary covariation using R-scape, which scores compensatory substitutions against a phylogeny-aware null and estimates the statistical power of the alignment. Use when validating a conserved-structure claim before trusting it (the test that found no support for HOTAIR/Xist/SRA lncRNA structures); separating real covariation from phylogenetic correlation; deciding whether an alignment even has the power to test structure; or building a covariation-supported consensus (CaCoFold) to seed a covariance model or folding.

development950

bio-rna-quantification-count-matrix-qc

Quality control and exploration of RNA-seq count matrices before differential expression. Use when checking library sizes and composition, choosing VST vs rlog for visualization, running PCA and sample correlation, detecting outliers with Cook's distance, deciding how to handle known vs unknown batch effects, screening for sample swaps, or judging whether a sample or design is too compromised to test.

development945

bio-rna-quantification-featurecounts-counting

Count reads per gene from aligned BAM files using Subread featureCounts. Use when turning STAR/HISAT2 BAMs into a gene-level count matrix for DESeq2/edgeR, deciding library strandedness, handling paired-end fragment counting, choosing how to treat multi-mapping and multi-overlapping reads, or diagnosing a low assignment rate from the summary file.

development945

bio-rna-quantification-alignment-free-quant

Quantify transcript expression from FASTQ with Salmon (selective alignment) or kallisto (pseudoalignment), bypassing genome mapping. Use when quantifying RNA-seq without alignment, deciding whether a decoy-aware index is required, detecting and verifying library strandedness, enabling GC and sequence bias correction, or choosing whether to generate inferential replicates (bootstraps/Gibbs) for transcript-level downstream testing.

development945

bio-rna-quantification-tximport-workflow

Import transcript-level quantifications from Salmon/kallisto/RSEM into R for gene-level analysis with DESeq2/edgeR using tximport or tximeta. Use when summarizing transcript abundances to gene counts with the correct length offset, choosing a countsFromAbundance mode (full-length vs 3'-tag vs DTU), resolving transcript-ID version mismatches, or handing off to DESeq2/edgeR without double-applying the offset.

research945

bio-read-qc-quality-filtering

Filters reads by quality, length, N content, and complexity with Trimmomatic, fastp, and Cutadapt, including sliding-window trimming, per-read unqualified-base filtering, and 2-color poly-G removal. Use when reads have poor-quality tails, when an assembly or k-mer workflow needs clean input, or when a junk read subpopulation must be dropped. For adapter removal use adapter-trimming; for all-in-one preprocessing use fastp-workflow.

testing929

bio-reporting-automated-qc-reports

Aggregates per-tool QC metrics (FastQC, fastp, alignment, quantification, variant calling, single-cell) into one interactive MultiQC report, and guides module scoping, sample-name resolution, large-cohort behavior, and turning the report into an actual QC gate. Use when summarizing QC across many samples, building a shareable quality report, or wiring automated QC into a pipeline.

tools929

bio-restriction-enzyme-selection

Select restriction enzymes for cloning or diagnostics using Biopython Bio.Restriction. Finds enzymes by cut frequency, overhang type, recognition-site length, commercial availability, compatible ends, and methylation sensitivity, and identifies isoschizomers and compatible pairs. Use when choosing which enzymes to use to linearize a vector, drop in an insert, set up a diagnostic digest, or pick a methylation-insensitive enzyme.

development929

bio-reporting-quarto-reports

Builds reproducible Quarto reports, presentations, and websites across R, Python, and Julia, with correct engine selection, cache-vs-freeze semantics, native cross-references, parameters, and environment pinning. Use when creating a Quarto report of an analysis, setting up freeze for CI, or debugging cross-references, caching, or working-directory issues.

development929

bio-ribo-seq-riboseq-preprocessing

Preprocess ribosome profiling reads with UMI handling, adapter trimming, contaminant/rRNA depletion, and footprint-aware alignment. Use when preparing Ribo-seq FASTQ for periodicity QC, ORF detection, translation efficiency, or stalling analysis, or when deciding how to deduplicate, which aligner to use, or how to size-select ribosome-protected fragments.

testing929

bio-read-qc-adapter-trimming

Removes sequencing adapters from FASTQ reads with Cutadapt and Trimmomatic, including paired-end read-through, small-RNA 3' adapters, amplicon primers, and anchored/linked adapters. Use when FastQC shows adapter content climbing toward the 3' end, when inserts are shorter than the read length (small-RNA, cfDNA, FFPE), or before assembly/k-mer analysis. For all-in-one trimming use fastp-workflow; for quality/length filtering use quality-filtering.

tools929

bio-restriction-golden-gate-assembly

Design and validate Type IIS scarless DNA assembly (Golden Gate, MoClo) using Biopython Bio.Restriction. Screens parts for internal BsaI/BsmBI/BbsI/SapI sites (domestication), previews the fusion overhangs a digest exposes, and validates a fusion-overhang set for distinctness and fidelity. Use when designing a Golden Gate or MoClo assembly, domesticating a part by removing internal Type IIS sites, or choosing and checking fusion overhangs for one-pot assembly.

development929

bio-read-qc-umi-processing

Extracts UMIs and collapses reads to original molecules with umi_tools (directional dedup) or builds error-corrected single-strand/duplex consensus reads with fgbio. Use when the library has UMIs and accurate molecule counting or below-sequencer-floor error correction is needed - single-cell, low-input RNA-seq, targeted panels, and ctDNA/liquid-biopsy rare-variant detection. For UMI extraction during QC use fastp-workflow; do not dedup non-UMI bulk RNA-seq.

tools929

bio-restriction-sites

Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Searches single enzymes, batches, or commercial enzyme sets and returns cut positions for linear or circular DNA. Use when locating where one or more restriction enzymes cut a sequence, screening a sequence for the presence or absence of a site, or counting how often an enzyme cuts.

development929

bio-ribo-seq-ribosome-periodicity

Validate Ribo-seq library quality by measuring 3-nucleotide periodicity and calibrating read-length-specific P-site offsets. Use when checking whether footprints capture genuine translation, determining P-site offsets for downstream ORF/TE/stalling analysis, or deciding which read lengths to keep.

development929

bio-reporting-jupyter-reports

Runs parameterized Jupyter notebooks as reproducible batch report generators with papermill, renders them to HTML/PDF with nbconvert, aggregates results across samples, and makes notebook outputs trustworthy. Use when generating per-sample analysis reports, executing a notebook template across many datasets, or fixing notebooks that do not reproduce.

development929

bio-read-qc-contamination-screening

Detects contamination in sequencing reads - cross-species (FastQ Screen, Kraken2), vector/PhiX/adapter, rRNA, and same-species cross-sample/index-hopping and sample swaps (SNP fingerprints via verifyBamID2/NGSCheckMate/somalier). Use when suspecting cross-contamination, PDX host reads, microbial carry-over, or sample swaps, and to decide whether to report, filter, or align to a combined reference. For deep taxonomic profiling use metagenomics/kraken-classification.

testing929

bio-restriction-fragment-analysis

Predict restriction digest fragment sizes and gel patterns using Biopython Bio.Restriction. Computes fragment lengths and sequences for single and double digests on linear or circular DNA, and interprets them against an agarose gel. Use when predicting the fragments from a digest, planning a diagnostic digest to verify a clone, or matching observed gel bands to an expected pattern.

development929
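
The linear-versus-circular distinction in the entry above comes down to how the end pieces are handled. A minimal sketch, assuming cut positions are given as the number of bases upstream of each cut (strictly between 0 and the molecule length); `fragment_sizes` is an illustrative name, not a Bio.Restriction call.

```python
def fragment_sizes(length, cuts, circular=False):
    """Fragment lengths, largest first, for a digest of one molecule."""
    cuts = sorted(set(cuts))
    if not cuts:
        return [length]  # uncut molecule
    if circular:
        # n cuts give n fragments; the last one wraps around the origin.
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(length - cuts[-1] + cuts[0])
    else:
        # n cuts give n + 1 fragments, including both end pieces.
        bounds = [0] + cuts + [length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(sizes, reverse=True)

print(fragment_sizes(5000, [1000, 3500]))                 # [2500, 1500, 1000]
print(fragment_sizes(5000, [1000, 3500], circular=True))  # [2500, 2500]
```

The circular case shows why a two-cutter can produce a single gel band: both fragments are 2500 bp and co-migrate.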

bio-reporting-figure-export

Exports publication-ready figures with the correct vector/raster split, embedded editable fonts, color-space-robust palettes, and journal-correct sizing and resolution in matplotlib and ggplot2. Use when preparing figures for journal submission, exporting a dense single-cell or GWAS plot without producing an unopenable vector file, or fixing fonts and colors that break in print.

testing929

bio-reporting-publication-tables

Builds publication-ready tables - descriptive Table 1, regression and differential-expression result tables, and supplementary tables - with gtsummary, gt, flextable, and kableExtra (R) or great_tables, pandas, and tableone (Python), choosing the right statistics and the right export format. Use when making a Table 1, exporting a formatted results table for a paper, or writing a gene-symbol-safe supplementary table.

development929

bio-restriction-mapping

Build restriction maps showing enzyme cut positions and inter-site distances along DNA using Biopython Bio.Restriction. Produces text or graphical maps for linear and circular molecules, orders sites from single and double digests, and overlays GenBank features. Use when creating a restriction map of a sequence, ordering cut sites along a plasmid, or relating sites to annotated features.

development929

bio-read-qc-quality-reports

Generates and interprets per-file and cross-sample QC reports from FASTQ data with FastQC, falco, and MultiQC, covering Phred quality, per-base composition, GC, duplication, overrepresented sequences, and adapter content. Use when performing initial QC on raw sequencing reads, validating preprocessing, or judging a multi-sample cohort for outliers and batch effects. For long reads use NanoPlot; for adapter/quality remediation route to adapter-trimming, quality-filtering, or fastp-workflow.

testing929

bio-read-qc-fastp-workflow

Runs all-in-one FASTQ preprocessing with fastp in a single pass - adapter trimming via paired-end overlap analysis, quality/length filtering, 2-color poly-G removal, base correction, optional dedup/UMI/merge, and HTML/JSON reports. Use when preprocessing bulk Illumina data and wanting one fast tool instead of separate Cutadapt, Trimmomatic, and FastQC steps. For precise small-RNA/amplicon adapters use adapter-trimming; for molecule-accurate UMI dedup use umi-processing.

tools929

bio-ribo-seq-translation-efficiency

Quantify translation efficiency (TE) as ribosome occupancy relative to mRNA abundance and test for differential TE between conditions. Use when separating translational from transcriptional regulation, distinguishing genuine translational control from buffering, or choosing between riborex, Xtail, anota2seq, and DESeq2 interaction models.

testing929

bio-ribo-seq-ribosome-stalling

Detect ribosome pausing and stalling at codon resolution from Ribo-seq, using local-relative occupancy metrics and A-site assignment. Use when studying elongation dynamics, codon dwell times, pause motifs, or ribosome collisions, and when judging whether a pause is real biology or a cycloheximide artifact.

research929

bio-ribo-seq-orf-detection

Detect and quantify translated ORFs from Ribo-seq using 3-nucleotide periodicity, including uORFs, internal ORFs, dORFs, and novel ORFs. Use when finding actively translated regions beyond annotated CDS, classifying ORFs by the 2022 community standard, quantifying ORF-level translation, or choosing between periodicity-based callers.

testing929

bio-ribo-seq-initiation-site-mapping

Map translation initiation sites, including non-AUG and alternative starts, from initiation-drug ribosome profiling (TI-seq). Use when locating start codons, detecting near-cognate or upstream initiation, or analyzing harringtonine, lactimidomycin (GTI-seq/QTI-seq), or retapamulin (Ribo-RET) data.

testing929

bio-reporting-rmarkdown-reports

Creates reproducible R Markdown analysis reports (HTML, PDF, Word) with knitr, covering the render pipeline, the interactive-vs-knit session trap, cache invalidation, bookdown cross-references, parameterization, and environment pinning. Use when generating an R-based analysis report, debugging a report that knits differently than it runs interactively, or fixing caching or cross-references.

development929

bio-read-qc-rnaseq-qc

Runs RNA-seq-specific post-alignment QC - strandedness inference, gene-body 5'-3' coverage, read distribution (exonic/intronic/intergenic), rRNA/globin/mitochondrial rate, transcript integrity (TIN), and saturation - with RSeQC, Qualimap, RNA-SeQC, and Picard. Use when validating RNA-seq libraries before quantification or differential expression, diagnosing degradation or gDNA contamination, or determining library strandedness. For raw-FASTQ QC use quality-reports; for UMI dedup use umi-processing.

development929

bio-read-alignment-star-alignment

Aligns RNA-seq reads to a genome with STAR, the fast splice-aware aligner whose splice-junction database (built from a GTF at sjdbOverhang = readlength-1) and two-pass mode set junction sensitivity, whose 255-for-unique MAPQ breaks GATK, and whose GeneCounts output reveals library strandedness. Use when RNA reads must be placed on the genome for novel-isoform discovery, fusion detection, RNA variant calling, coverage tracks, splicing QC, or single-cell (STARsolo). Memory-constrained RNA alignment is hisat2-alignment; DE on known transcripts only should skip alignment for rna-quantification/alignment-free-quant; the QC gate and contig-naming reconciliation are alignment-files; counting is rna-quantification; DNA is bwa-alignment/bowtie2-alignment.

development910

bio-read-alignment-bowtie2-alignment

Aligns DNA short reads to a reference with Bowtie2, choosing end-to-end (whole read must align) vs local (soft-clip read ends) mode and a sensitivity preset; the de-facto aligner for ChIP-seq, ATAC-seq, and CUT&RUN, where fragment-geometry flags (--no-mixed, --no-discordant, --dovetail, -X) and a tool-appropriate MAPQ filter feed the peak caller. Use when aligning ChIP/ATAC/CUT&RUN reads, when read ends are adapter-contaminated and need soft-clipping, or when a tunable sensitivity/speed preset is wanted. DNA variant calling prefers bwa-alignment; RNA spliced alignment is star-alignment/hisat2-alignment; the QC gate and cross-tool MAPQ scale are alignment-files; peak calling is chip-seq/atac-seq; bisulfite uses methylation-analysis/bismark-alignment.

tools910

bio-pathway-go-enrichment

Runs Gene Ontology over-representation analysis (ORA) on a gene LIST with clusterProfiler enrichGO, the one-sided hypergeometric/Fisher 2x2 test phyper(k-1, M, N-M, n, lower.tail=FALSE). Covers why the BACKGROUND universe (not the gene list) is the null and decides significance, why omitting universe= is a bug, why enrichGO defaults to ont='MF' not 'BP', why pvalueCutoff filters p.adjust not raw p, why ORA discards effect magnitude and inherits GO-DAG true-path redundancy (simplify, topGO), why RNA-seq gene-length bias inflates long-gene terms (GOseq Wallenius), plus GeneRatio/BgRatio, bitr ID mapping, minGSSize/maxGSSize, groupGO. Use when a pre-selected gene list (DE hits, co-expression module, screen, GWAS-mapped) needs GO annotation. For a ranked no-cutoff analysis see gsea; for other databases see kegg-pathways, reactome-pathways, wikipathways; DE source is differential-expression/de-results; plots in enrichment-visualization.

development910
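
The one-sided test quoted in the entry above, `phyper(k-1, M, N-M, n, lower.tail=FALSE)`, is the probability of seeing at least k annotated genes in a list of n drawn from a universe of N that contains M annotated genes. A plain-Python sketch using exact binomial coefficients follows; `ora_pvalue` is a hypothetical name, and the numbers are made up for illustration.

```python
from math import comb

def ora_pvalue(k, M, N, n):
    """P(X >= k) for a hypergeometric draw.

    k: annotated genes in the list, M: annotated genes in the universe,
    N: universe size, n: gene-list size.
    """
    upper = min(M, n)
    numerator = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

# Same list, same term, two different backgrounds (illustrative numbers).
p_whole_genome = ora_pvalue(k=10, M=200, N=20000, n=100)
p_expressed_only = ora_pvalue(k=10, M=200, N=10000, n=100)
print(p_whole_genome, p_expressed_only)
```

Shrinking the universe from 20,000 to 10,000 genes doubles the expected overlap, so the same ten hits become less surprising. This is the sense in which the background, not the gene list, decides significance.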

bio-read-alignment-bwa-alignment

Aligns DNA short reads (paired- or single-end) to a reference genome with bwa-mem2, the maintained successor to BWA-MEM, for WGS/WES and germline/somatic variant-calling pipelines; covers index build, read-group injection, the collate/fixmate/sort/markdup ordering, soft-clipping for SV split reads, ALT/decoy-aware mapping on GRCh38, -K determinism, and streaming straight to a sorted BAM. Use when mapping DNA short reads to a reference for variant calling, coverage, ChIP/ATAC (alongside bowtie2-alignment), or SV detection. RNA-seq spliced alignment is star-alignment/hisat2-alignment; BAM sort/dedup/stats, the QC gate, and the cross-tool MAPQ scale are alignment-files; read trimming is read-qc; counting reads over features is rna-quantification.

tools910

bio-pathway-reactome

Tests a gene list or ranked gene vector for over-representation or coordinated shifts in Reactome's curated, peer-reviewed, reaction-level pathways using ReactomePA's enrichPathway (ORA) and gsePathway (GSEA), reading the local reactome.db so a run is reproducible given the Bioconductor release. Covers why Reactome's atomic unit is the REACTION and pathways are nested containers so a parent and child enrich on the same genes and double-count one signal, why only human is curated and every other species is orthology-inferred, why enrichPathway has NO keyType argument and returns nothing unless genes are ENTREZ (bitr first), and why viewPathway draws a LOCAL reaction network from a pathway NAME. Use when reaction-level granularity, peer-reviewed curation, or an offline-reproducible database is wanted; for comparative multi-sample or multi-omics analysis use ReactomeGSA. The DE list comes from differential-expression; plots from enrichment-visualization.

development910

bio-pathway-enrichment-visualization

Turns an enrichResult or gseaResult from clusterProfiler/enrichplot into a figure that collapses or shows gene-set redundancy, using dotplot, barplot, cnetplot, emapplot, treeplot, ridgeplot, gseaplot2, and upsetplot. Covers why a default top-20 GO dotplot is one biological theme drawn twenty times (the DAG/nesting guarantees redundant overlapping terms), so the figure is a modeling choice between SHOWING redundancy (pairwise_termsim -> emapplot/treeplot) and DELETING it (simplify/REVIGO); why cnetplot/emapplot/treeplot need pairwise_termsim first; why enrichplot ships no barplot for gseaResult (a bar cannot carry a signed NES); why GeneRatio is not fold enrichment; and why showCategory silently truncates. Use when plotting ORA or GSEA results, collapsing redundant GO terms visually, encoding a dotplot, or building a publication enrichment figure. Statistics come from go-enrichment and gsea; generic ggplot -> data-visualization/ggplot2-fundamentals.

development910

bio-read-alignment-hisat2-alignment

Aligns RNA-seq reads to a genome with HISAT2, the splice-aware aligner whose hierarchical graph FM-index runs at roughly a quarter of STAR's memory (~7 GB for human), whose SNP/haplotype graph index reduces reference bias in the index itself, and whose MAPQ is GATK-friendly (60 for unique, no 255 problem). Use when RNA alignment must fit a memory-constrained machine, when feeding StringTie/Cufflinks transcript assembly via --dta, or when a SNP-aware graph index is wanted for allele-robust mapping. Feature-rich/high-RAM RNA alignment and fusion detection are star-alignment; DE on known transcripts only should skip alignment for rna-quantification/alignment-free-quant; the QC gate and contig-naming reconciliation are alignment-files; counting is rna-quantification.

testing910

bio-pathway-gsea

Tests a ranked gene vector for coordinated expression shifts in GO, KEGG, Reactome, or MSigDB gene sets with clusterProfiler's gseGO, gseKEGG, gsePathway, and GSEA (fgseaMultilevel engine), and scores per-sample pathway activity with ssGSEA and GSVA. Covers why a GSEA result is a deterministic function of three implicit choices (the ranking STATISTIC, the weight exponent p, and which LABELS are permuted), why the input must be a NAMED vector sorted DECREASING by a signed variance-calibrated metric (DESeq2 stat, limma t) not a raw p-value that erases direction, why preranked gene-permutation is anti-conservative for correlated sets (CAMERA is the fix), why nPerm is gone (eps governs tiny p), and why set.seed is required. Use when every gene carries a DE statistic, when a hard cutoff is arbitrary, or when ORA finds nothing. For gene-list ORA see go-enrichment; the ranking statistic comes from differential-expression/de-results.

development910

bio-proteomics-proteomics-qc

Quality control for bottom-up proteomics across three levels -- instrument/raw-signal (mass accuracy, RT/iRT fit, FWHM, TIC vs injection time, % MS2 identified), identification/run (missed cleavages, charge states, PTM handling artifacts, contaminants), and experiment/quantitative (replicate correlation on log2, CV on the linear scale, completeness, MNAR-vs-MCAR missingness, PCA/batch, TMT channel balance, DIA q-values). Frames QC as a control chart against a per-instrument rolling baseline, not fixed cutoffs, and mandates inspecting raw boxplots, per-sample ID counts, total signal, and contaminant removal BEFORE normalizing -- because median normalization erases loading failures. Use when assessing proteomics data quality, diagnosing outlier samples, or deciding which samples to exclude before differential testing. The statistical test itself is differential-abundance; normalization mechanics are quantification; DIA q-value internals are dia-analysis.

testing904

bio-proteomics-quantification

Quantifies protein abundance from mass spectrometry using label-free (LFQ/MaxLFQ, DIA fragment-level), isobaric (TMT/iTRAQ reporter ions, MS2 vs SPS-MS3), and metabolic (SILAC) approaches, including peptide-to-protein summarization (Tukey median polish, MaxLFQ, msqrob), sample-loading and IRS cross-plex normalization, and isotopic impurity correction. Use when turning peptide/PSM/reporter signal into a protein-by-sample abundance matrix for downstream analysis. Statistical testing of that matrix is differential-abundance; DIA quant mechanics and DIA-NN runs are dia-analysis; reading search-engine outputs is data-import; razor/shared-peptide group assignment is protein-inference.

testing904

bio-proteomics-dia-analysis

Analyzes data-independent acquisition (DIA) proteomics by scoring reconstructed fragment-chromatogram peak groups against a decoy null with DIA-NN (library-free directDIA, library-based, or deep-learning predicted-library routes), Spectronaut, OpenSWATH, and EncyclopeDIA. Frames the deliverable around q-value LEVEL (precursor/peptide/protein-group) and CONTEXT (run vs experiment-wide/global) rather than a bare "1% FDR", and around the duty-cycle-vs-selectivity acquisition tradeoff (window design, staggered demultiplexing, diaPASEF, narrow-window Astral). Use when identifying and quantifying proteins from DIA mass spectrometry runs and filtering DIA-NN report.parquet/matrix output. Building the spectral library itself is spectral-libraries; normalization and protein roll-up is quantification; statistical testing of the matrix is differential-abundance.

development904

bio-proteomics-differential-abundance

Tests for differentially abundant proteins between conditions with limma/DEqMS empirical-Bayes moderation, proDA/msqrob2/MSstats missingness modeling, and Python Welch+BH alternatives. Frames missing values as left-censored MNAR (model, do not impute), makes variance moderation the load-bearing step at n=3-5, and prefers feature/peptide-level testing. Use when identifying proteins with significant abundance changes between experimental groups. Summarization and normalization mechanics are proteomics/quantification; volcano and MA plots are data-visualization/volcano-and-ma-plots; pathway enrichment of the hit list is pathway-analysis/go-enrichment.

development904
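
The "BH" half of the Python Welch+BH route mentioned in the entry above is the Benjamini-Hochberg step-up adjustment. A dependency-free sketch, assuming a plain list of raw p-values (one per protein); `benjamini_hochberg` is an illustrative name rather than a function from the listed packages.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

The adjustment only controls the false discovery rate. It does not supply the variance moderation that the entry calls the load-bearing step at small n.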

bio-proteomics-ptm-analysis

Frames PTM/phosphoproteomics analysis as three stacked inference layers on a biased enrichment - chemistry selection, site localization (FLR), and protein-level-adjusted quantification with MSstatsPTM - plus kinase-activity and functional triage. Covers MaxQuant Phospho (STY)Sites multiplicity expansion, localization-probability filtering (class I, Ascore, ptmRS, DIA EG.PTMLocalizationProbabilities, DIA-NN PTM.Site.Confidence), false localization rate (LuciPHOr/DeepFLR), motif analysis with experiment-matched backgrounds, diGly/K-GG ubiquitin specificity, acetyl/glyco traps, and KSEA/PTM-SEA. Use when localizing and quantifying phosphorylation, acetylation, ubiquitination, or glycosylation sites from enrichment-based runs and deciding whether an apparent site change is real after subtracting protein abundance. Peptide ID and open/variable-mod search is peptide-identification; underlying protein-level quant is quantification and differential-abundance; DIA acquisition mechanics is dia-analysis.

testing904

bio-proteomics-data-import

Loads mass-spectrometry data into Python/R and strips the search engine's bookkeeping before any number is trusted -- removes decoys (REV__/Reverse), contaminants (CON__/Potential contaminant), Only-identified-by-site groups, and resolves semicolon razor/leading protein-ID ambiguity in MaxQuant proteinGroups.txt, DIA-NN report.parquet, and mzML/mzXML. Distinguishes Intensity (raw) vs LFQ intensity (MaxLFQ) vs iBAQ, treats a MaxQuant zero as missing (NaN, not log2(0) = -inf), and inherits the acquisition mode's missingness contract (DDA MNAR vs DIA MCAR). Use when starting an analysis from raw spectra or a search engine output. Downstream normalization and stats are differential-abundance; reporter-ion/MaxLFQ quant is quantification; protein grouping is protein-inference.

development904

bio-proteomics-protein-inference

Groups proteins from peptide identifications and controls protein-level FDR, framing inference as a chosen explanation (parsimony or a probability model) of underdetermined peptide evidence rather than a measurement. Reports protein GROUPS (proteins indistinguishable by observed peptides) with a leading protein, not flat lists. Covers shared-vs-unique peptides, indistinguishable/subsumable proteins, parsimony vs probabilistic (ProteinProphet, EPIFANY) vs razor inference, picked-protein and picked-group FDR, and why the two-peptide rule is wrong. Use when resolving which proteins are present from a peptide list, building protein groups, or estimating protein-level FDR. PSM/peptide FDR and search engines are peptide-identification; razor-vs-unique quant consequences are quantification; isoform/proteoform resolution is top-down and out of scope.

development904

bio-proteomics-peptide-identification

Peptide-spectrum matching from MS/MS with target-decoy FDR control, framing identification confidence as a property of a ranked list (q-value/PEP) rather than a raw engine score (XCorr, hyperscore, Andromeda, SpecEValue). Covers sequence-database search engines (Comet, MS-GF+, MSFragger, Sage, MaxQuant, MetaMorpheus), concatenated vs separate target-decoy competition, PEP vs q-value, the multi-level FDR cascade, open/mass-tolerant search, rescoring (Percolator, mokapot, MS2Rescore), and pyOpenMS SimpleSearchEngineAlgorithm + FalseDiscoveryRate. Use when identifying peptides from tandem mass spectra and deciding what FDR threshold to act on. Protein grouping and protein-level FDR are protein-inference; PTM site localization is ptm-analysis; DIA peptide-centric scoring is dia-analysis; intensity quant is quantification.

development904
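
To make the "confidence is a property of a ranked list" framing in the entry above concrete, here is a simplified target-decoy q-value calculation. It assumes a concatenated search where each PSM is a `(score, is_decoy)` pair with higher scores better, and estimates FDR as decoys/targets at each threshold. Real tools add corrections this sketch omits.

```python
def target_decoy_qvalues(psms):
    """Return [(score, is_decoy, q_value), ...] sorted best score first."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at which this PSM would still be accepted.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    qvals.reverse()
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvals)]

toy = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
for row in target_decoy_qvalues(toy):
    print(row)
```

The same raw score can map to very different q-values in different runs, which is why a threshold is set on the q-value and not on XCorr or hyperscore directly.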

population-genetics/plink-basics

Manages PLINK genotype filesets - format conversion (VCF, BED/BIM/FAM, PED/MAP, pgen/pvar/psam) and sample/variant QC (missingness, MAF, HWE, sex check, heterozygosity, KING relatedness) with PLINK 1.9 and 2.0. PLINK rewrites allele bookkeeping: PLINK 1.x A1 defaults to the minor allele and is recomputed every load, silently flipping effect-allele meaning unless --keep-allele-order, while PLINK 2.0 tracks explicit REF/ALT. QC order matt

development901

bio-primer-design-primer-validation

Validates chosen PCR/qPCR oligos for intra- and intermolecular thermodynamic liabilities with primer3-py - hairpins, self-dimers, cross-dimers (calc_hairpin/homodimer/heterodimer), and 3'-end stability (calc_end_stability) - returning ThermoResult dG/Tm and ASCII structures. Covers why a "dimer-free" verdict is a PREDICTION at the supplied salt/Mg/dNTP/oligo conditions and temp_c (so the same primer is fine or dimer-prone depending on conditions), why a 3'-END dimer or hairpin is the lethal class (polymerase-extendable into primer-dimer) so structures are ranked by dG at the annealing temperature and 3'-end involvement rather than global Tm, that ThermoResult dG is in cal/mol not kcal/mol, and that .structure_found must gate the numbers. Use when checking primer pairs before ordering, troubleshooting primer-dimers or smears, or screening oligos for secondary structure. Genome off-target/mispriming is primer-specificity; design is primer-basics; probe assays are qpcr-primers.

development901

bio-genome-engineering-hdr-template-design

Designs donor/repair templates for precise CRISPR knock-ins -- choosing the format (ssODN, long-ssDNA/Easi-CRISPR, dsDNA/plasmid, AAV6), sizing homology arms, placing the cut within ~10 bp of the edit, and adding a mandatory codon-checked blocking (PAM/seed) mutation so the edited allele is not re-cut. Frames the HDR-vs-NHEJ-vs-MMEJ pathway competition, the MMEJ (PITCh) and homology-independent (HITI/HMEJ) alternatives for post-mitotic cells, ssODN strand/asymmetry choice, phosphorothioate end-protection, and ranked HDR enhancers. Use when designing a donor for a point mutation, epitope/fluorophore tag, allele replacement, or knock-in, or when HDR efficiency is low. Guide design and base/prime editing are separate skills.

development901

bio-population-genetics-linkage-disequilibrium

Computes linkage disequilibrium (r2, D', composite Rogers-Huff r2), prunes correlated variants, clumps GWAS summary statistics to lead SNPs, and defines haplotype blocks with PLINK 1.9/2.0 and scikit-allel. r2 and D' answer different questions - r2 (= chi2/N) is the tagging and GWAS-power currency, D' marks observed recombination and is upward-biased for rare variants. PLINK 2.0 has no bare --r2 (split into --r2-phased and --r2-unphased); pruning (--indep-pairwise, genotype-blind) and clumping (--clump, p-value-aware) are distinct operations that are constantly confused. The clumping or fine-mapping LD reference must be ancestry-matched or it fails silently into false credible sets. Use when calculating LD, pruning variants for PCA or structure, clumping GWAS hits, or selecting tag SNPs. For QC see plink-basics; for PCA see population-structure; fine-mapping is causal-genomics/fine-mapping.

testing901
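
The claim in the entry above that r2 and D' answer different questions is easiest to see on a rare variant. The sketch below assumes phased two-locus haplotype counts and uses the standard definitions of D, r2 and D'. `ld_stats` is a hypothetical helper, not a PLINK or scikit-allel call.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (r2, D') from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")
    D = n_AB / n - p_A * p_B
    r2 = D * D / denom
    # D' rescales |D| by the largest value the allele frequencies allow.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return r2, abs(D) / d_max

# A 5% allele that always rides on the B background of a 50% allele.
r2, d_prime = ld_stats(n_AB=5, n_Ab=0, n_aB=45, n_ab=50)
print(round(r2, 4), round(d_prime, 4))
```

Here D' is at its maximum because one of the four haplotypes is never observed, yet r2 is only about 0.05. The rare variant is in "complete" LD by D' but is a poor tag for the common one.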

bio-population-genetics-rare-variant-association

Gene and region-based rare-variant aggregation - burden/collapsing, SKAT, SKAT-O, ACAT-V/ACAT-O, annotation-weighted STAAR - with regenie (--vc-tests), SAIGE-GENE+, and the SKAT R package. Single-variant tests are powerless at low minor allele count, so rare variants are aggregated across a gene or region under an explicit mask (functional class plus a MAF cutoff). A burden test collapses variants into one score assuming a single effect direction (powerful when true, near-zero power when risk and protective variants cancel); SKAT is a variance-component test robust to mixed directions; SKAT-O blends the two; ACAT/STAAR are dependence-robust and annotation-weighted. The mask is the hypothesis, imbalance needs SPA or Firth, and testing burden is per-gene-per-mask. Use when aggregating rare coding or regulatory variants into gene or region tests, choosing burden vs SKAT vs SKAT-O, or building masks. For single-variant GWAS see association-testing; for mask annotations see variant-calling/variant-annotation.

development901

bio-primer-design-primer-basics

Designs and ranks PCR primer pairs for a target template with primer3-py (design_primers), returning pairs with nearest-neighbor Tm, GC, product size, and complementarity scores. Covers why primer3 is a LOCAL weighted-penalty minimizer over the single template supplied (so PRIMER_PAIR_0 is the lowest-penalty pair under the given bounds, never a genome-specificity guarantee), why Tm is a salt/concentration-dependent SantaLucia prediction not a fixed property, why the two primers must be Tm-matched, the seq_args/global_args tag semantics (SEQUENCE_TARGET/INCLUDED/EXCLUDED/OVERLAP_JUNCTION/FORCE_*, 0-based [start,length]), 3'-end and GC-clamp mechanism, 5'-tail handling, masking SNPs under 3' ends, and diagnosing zero-pair runs. Use when designing standard PCR, cloning, genotyping, or sequencing primers, flanking a target, or screening pairs by Tm/size/GC. Genome off-target checking is primer-specificity; dimers/hairpins primer-validation; qPCR and probes qpcr-primers.

testing901

bio-genome-engineering-grna-design

Designs and ranks guide RNAs (sgRNAs) for CRISPR-Cas9/Cas12a gene knockout by scanning a target for PAM sites (NGG SpCas9, NNGRRT SaCas9, TTTV Cas12a, NG SpCas9-NG, near-PAMless SpRY), enumerating candidate spacers, applying hard filters (Pol-III TTTT terminator, 5' G, GC), ranking on-target activity with the context-appropriate model (Rule Set 2/Azimuth for U6/lentiviral, CRISPRscan for T7/embryo, DeepHF for high-fidelity variants, DeepCpf1 for Cas12a), and predicting the indel/frameshift outcome (Bae out-of-frame score, inDelphi, FORECasT, Lindel). Use when selecting sgRNAs to knock out a gene, choosing a nuclease/PAM for a constrained locus, picking which exon to target, or shortlisting guides before an off-target check. Off-target specificity, base/prime editing, and HDR donors are separate skills.

testing901
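
The first stages named in the entry above (PAM scan, spacer enumeration, hard filters) can be sketched for SpCas9 as follows. The 20-nt spacer length, the function name `spcas9_spacers`, and reporting `start` on the scanned strand are assumptions of this sketch. On-target scoring models are out of its scope.

```python
import re

def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spcas9_spacers(target, spacer_len=20):
    """List candidate spacers followed by an NGG PAM on either strand."""
    target = target.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % spacer_len)
    guides = []
    for strand, seq in (("+", target), ("-", revcomp(target))):
        # The lookahead lets overlapping candidates all be reported.
        for match in pattern.finditer(seq):
            spacer = match.group(1)
            if "TTTT" in spacer:  # Pol III terminator, hard filter
                continue
            gc = (spacer.count("G") + spacer.count("C")) / spacer_len
            guides.append({"strand": strand, "start": match.start(),
                           "spacer": spacer, "gc": gc})
    return guides

print(spcas9_spacers("ACGTACGTACGTACGTACGTAGG"))
```

The output is a shortlist to rank, not a design: each surviving spacer still needs an activity score and an off-target check.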

bio-population-genetics-selection-statistics

Scans genomes for natural selection with SFS tests (Tajima's D, Fay & Wu H, Zeng E, SweepFinder2 CLR), haplotype tests (iHS, nSL, XP-EHH, Rsb, H12), and differentiation (FST, PBS) using scikit-allel, selscan, and SweepFinder2. No single statistic separates selection from demography at one locus, so the deliverable is empirical genome-wide outliers plus multiple orthogonal signals, not an absolute cutoff. iHS detects incomplete sweeps and collapses to zero at fixation while XP-EHH catches fixed sweeps; iHS/nSL standardize within derived-allele-frequency bins but XP-EHH gets a genome-wide z-score; derived-allele tests need substitution-model polarization; background selection mimics FST and CLR. Use when computing selection statistics like FST, Tajima's D, iHS, or XP-EHH, or scanning for selective sweeps. For phasing inputs see phasing-imputation/haplotype-phasing; for dN/dS see comparative-genomics/positive-selection.

development901

bio-primer-design-qpcr-primers

Co-designs qPCR/RT-qPCR primers and hydrolysis (TaqMan) or molecular-beacon probes with primer3-py (PRIMER_PICK_INTERNAL_OLIGO, PRIMER_INTERNAL_* tags), for assays whose deliverable is a quantitative measurement device. Covers why amplification efficiency (90-110%, slope -3.6 to -3.1) and single-product specificity make the 2^-ddCq / Pfaffl math valid, why the short amplicon (70-150 bp), tight Tm, and zero-dimer requirement exist, the coupled probe rules (probe Tm 8-10 C above primers so it is bound when Taq's exonuclease cleaves it; no 5' G as it quenches the reporter; C-rich strand; primer3 has NO no-5'-G tag so enforce PRIMER_INTERNAL_MUST_MATCH_FIVE_PRIME=HNNNN), gDNA exclusion by exon-junction spanning AND why pseudogenes defeat it, SYBR melt-curve QC, and reference-gene validation (geNorm/NormFinder). Use when designing TaqMan/SYBR assays, exon-spanning primers, probes, or matched-efficiency multiplex panels. Genome specificity is primer-specificity; dimers primer-validation; standard PCR primer-basics.

development901
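
The efficiency window quoted in the entry above (90-110%, slope -3.6 to -3.1) follows from the standard-curve relation E = 10^(-1/slope) - 1, where the slope is Cq regressed on log10 of input quantity. A short sketch of that arithmetic, with `efficiency_from_slope` as an illustrative name:

```python
def efficiency_from_slope(slope):
    """Amplification efficiency (1.0 = 100%) from a standard-curve slope."""
    return 10 ** (-1.0 / slope) - 1.0

def within_window(slope, low=0.90, high=1.10):
    """True when the implied efficiency lies in the 90-110% window."""
    return low <= efficiency_from_slope(slope) <= high

for slope in (-3.6, -3.32, -3.1):
    print(slope, round(efficiency_from_slope(slope) * 100, 1))
```

A slope near -3.32 corresponds to perfect doubling per cycle. The two quoted slope limits land at roughly 89.6% and 110.2%, so the slope range and the percentage range are rounded versions of each other.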

population-genetics/population-structure

Infers and describes population structure with PCA (plink2 --pca, smartpca/EIGENSOFT, FlashPCA2), model-based clustering (ADMIXTURE, fastSTRUCTURE), FST estimators (Weir-Cockerham vs Hudson), and f-statistics (f3/f4/D via AdmixTools/admixr), plus Python plotting of PCs and Q barplots. Every output is a model-conditioned description of variance, not truth: PCs conflate ancestry with LD/inversions/relatedness/batch, ADMIXTURE Q-va

tools901

population-genetics/scikit-allel-analysis

In-memory Python population genetics with scikit-allel - GenotypeArray/HaplotypeArray/AlleleCountsArray, diversity (pi, theta, Tajima's D), SFS, FST (Weir-Cockerham, Hudson, Patterson), f3/D admixture stats, LD pruning, PCA, and selection scans (iHS, XP-EHH, nSL, Garud H). Nearly every statistic is a ratio or density with one silent denominator bug in two faces: omit is_accessible= and per-base pi/theta divide by total span not

development901

bio-primer-design-primer-specificity

Checks whether a PCR primer PAIR amplifies only the intended target genome-wide, using pair-aware in-silico PCR (MFEprimer-3.0, UCSC isPcr, NCBI Primer-BLAST) plus a primer3-py 3'-end-stability prefilter, against the correct database. Covers why plain BLAST is the wrong tool (it scores per-primer similarity, blind to 3'-terminal anchoring and to whether the two primers form a convergent amplicon in range), why a single 3'-terminal mismatch suppresses amplification while internal mismatches are tolerated, why intron-spanning RT-qPCR is defeated by processed pseudogenes that force a GENOME search not transcriptome-only, how to read a Primer-BLAST report (empty unintended-products means none passed its filter, not none exist), and that in-silico checking reduces but never replaces empirical validation. Use when confirming specificity, screening off-target amplicons, avoiding paralog/pseudogene hits, or checking SNPs under the 3' end. Design is primer-basics; dimers primer-validation; alignment read-alignment.

tools901

phasing-imputation/foundations

Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil

tools894

bio-pathway-enrichment-foundations

Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.

tools894
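
The bio-pathway-enrichment-foundations card states that the background, not the gene list, decides ORA significance. A minimal sketch of that point, assuming the usual upper-tail hypergeometric test; the function name `ora_pvalue` and all counts are hypothetical illustration values, not output from any tool named in the card:

```python
from math import comb

def ora_pvalue(universe, gene_set_size, list_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe      -- number of genes in the background (assumed testable set)
    gene_set_size -- genes in the pathway that fall inside the background
    list_size     -- genes in the pre-selected list
    overlap       -- genes shared by the list and the pathway
    """
    upper = min(gene_set_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(gene_set_size, k) * comb(universe - gene_set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: identical gene list and overlap, two backgrounds.
p_whole_genome = ora_pvalue(universe=20000, gene_set_size=100, list_size=200, overlap=5)
p_testable = ora_pvalue(universe=8000, gene_set_size=100, list_size=200, overlap=5)

print(f"whole-genome background: p = {p_whole_genome:.4g}")
print(f"testable-gene background: p = {p_testable:.4g}")
```

With the same list and the same overlap, the larger background gives the smaller p-value, which is why an inflated universe can manufacture significance.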

bio-phylo-tree-io

Read, write, and convert phylogenetic tree files with Biopython Bio.Phylo, and choose an annotation-preserving parser (treeio, DendroPy) when metadata matters. Covers why a tree file is a lossy serialization, why format conversion silently drops BEAST/MrBayes node annotations (posteriors, HPD intervals, rates), the Newick support-vs-label ambiguity that mislabels bootstrap values, and the Nexus TRANSLATE and rooted/unrooted traps. Use when parsing Newick, Nexus, NHX, phyloXML, or NeXML, converting between formats, handling posterior tree sets, or moving annotated BEAST trees without losing the credible intervals. Routes annotation-critical reads to DendroPy or treeio and orthology/alignment context to sibling skills.

development894

bio-phylo-species-trees

Estimates species trees under the multispecies coalescent from per-locus gene trees with the modern ASTER astral binary (ASTRAL-III/wASTRAL/ASTRAL-Pro), plus SVDQuartets, BPP, and StarBEAST2. Covers why a species tree is not a gene tree, why each locus has its own genealogy that disagrees by incomplete lineage sorting (ILS) even with zero error, why concatenation is statistically inconsistent and positively misleading in the anomaly zone where more loci converge on the wrong tree with full support, why gene-tree estimation error biases summary methods, that localPP is not bootstrap and ASTRAL branch lengths are coalescent units, and how minority-quartet symmetry separates ILS from introgression. Use when multi-locus discordance, rapid radiations, anomaly-zone risk, concordance-factor interpretation, or concatenation-vs-coalescent choice arise. Routes per-locus gene-tree inference and gCF/sCF to modern-tree-inference, dating to divergence-dating, orthology to comparative-genomics/ortholog-inference.

development894

bio-multi-omics-data-harmonization

Harmonizes already-normalized per-omic matrices onto a common footing before joint integration - assembling a MultiAssayExperiment, choosing the per-omic variance-stabilizing transform, deciding per-view versus per-feature scaling, picking a cross-omic batch strategy, and triaging missing data (feature, value, or whole sample; MAR versus MNAR). Covers why a shared-latent integrator is blind to what an omic is so scaling silently decides which block dominates, why batch confounded with biology is irrecoverable and should be modeled as a covariate not scrubbed, and why stacking blocks and running one ComBat erases cross-omic signal. Use when preparing two or more omics for MOFA2, mixOmics, or SNF, deciding a transform or scaling, correcting batch across modalities, or handling missing omics per sample. For deep per-omic normalization see differential-expression, methylation-analysis, proteomics, metabolomics; for the method decision see integration-design; for fusion see mofa-integration, mixomics-analysis.

testing890

bio-multi-omics-mixomics-analysis

Builds supervised and unsupervised multivariate integration across bulk omics blocks with mixOmics - sPLS for sparse pairwise correlation, DIABLO (block.splsda) for a multi-block discriminant signature, rCCA for regularized canonical correlation, and MINT for multi-study integration. Covers why these projection methods maximize covariance or correlation and not truth, why DIABLO's design matrix is the central correlation-versus-discrimination decision, why cross-validation must wrap keepX selection or the reported error is leaked, why balanced error rate is required under class imbalance, and why DIABLO needs matched samples while MINT handles multiple cohorts. Use when finding a cross-omic discriminant signature for a known outcome, selecting correlated features between two omics, tuning keepX, or integrating one omic across studies. For unsupervised factors see mofa-integration; for the method decision see integration-design; for cross-validation theory see machine-learning/model-validation.

development890

bio-multi-omics-mofa-integration

Discovers shared and view-specific latent factors across bulk multi-omics blocks (RNA-seq, proteomics, methylation) on a common sample axis with MOFA2's unsupervised Bayesian group factor model, then attributes per-view variance explained and interprets signed factor weights. Covers why a factor is an unsupervised axis of variance and not a pathway, why a factor that correlates with batch is a batch factor, why the per-view variance-explained table is the primary read-out rather than p-values, why raw counts in a Gaussian view make factor 1 the library-size factor, and why MOFA2 handles missing omics-per-sample natively. Use when integrating two or more bulk omics to find joint axes of variation, choosing factor count, labeling factors against metadata, or running enrichment on factor weights. For supervised discriminant integration see mixomics-analysis; for the method decision see integration-design; for single-cell see single-cell/multimodal-integration; for enrichment see pathway-analysis/gsea.

development890

bio-multi-omics-similarity-network

Stratifies patients into multi-omics subtypes by building one patient-by-patient similarity network per omic, fusing them with SNF's cross-network diffusion, and spectral-clustering the fused graph - then defending the clusters with stability, survival separation, and replication. Covers why spectral clustering always returns the requested cluster count so a subtype is a claim not a discovery, why the eigengap is a graph property not a biological truth, why fusion is not automatically better than the best single omic, why SNF needs complete data while NEMO handles mosaic cohorts, and the SNFtool API gotchas (dist2 returns squared distance, affinityMatrix width is sigma, spectralClustering K is the cluster count). Use when discovering patient subtypes from multiple omics, choosing a cluster number, validating subtypes, or handling partial multi-omic data. For feature-space factors see mofa-integration; for supervised signatures see mixomics-analysis; for survival see clinical-biostatistics/survival-analysis.

tools890

bio-multi-omics-integration-design

Chooses a bulk multi-omics integration strategy before any tool runs by mapping the biological question (subtype discovery, shared axis of variation, predictive signature, pairwise correlation) to a method class, naming the sample correspondence (paired-vertical, horizontal, mosaic, diagonal), enforcing the n<<p discipline that makes a held-out cohort the endpoint instead of in-cohort cross-validation, and running the per-view variance-imbalance diagnostic. Covers the early/mixed/intermediate/late taxonomy, why vertical and horizontal integration are different problems, and why a shared factor dominated by one omic is not integration. Use when deciding which integration method fits a question, whether data is paired or mosaic, supervised or unsupervised, or how to validate an integrated result. For unsupervised factors see mofa-integration; for supervised signatures see mixomics-analysis; for stratification see similarity-network; for single-cell see single-cell/multimodal-integration.

tools890

bio-microbiome-amplicon-processing

Infers exact amplicon sequence variants (ASVs) from demultiplexed 16S rRNA or ITS amplicon FASTQ with DADA2 - removing primers with cutadapt (--discard-untrimmed), learning a per-run error model (filterAndTrim -> learnErrors -> dada -> mergePairs), merging run-level tables with mergeSequenceTables, then one removeBimeraDenovo. Covers why primers come OFF before truncation, why the error model is per-run, truncLen as a merge-overlap detection budget (V4 vs V3-V4), DADA2 vs Deblur and q2-dada2 (denoise-paired/single/pyro/ccs), ASV vs OTU, NovaSeq binned-quality error-fit breakage, ITSxpress for variable-length ITS, and decontam removal of reagent/kit contaminants. Use when turning demultiplexed amplicon reads into an ASV/feature table, choosing truncation lengths, handling multi-run studies, or ITS. For shotgun reads see metagenomics/kraken-classification; for QIIME2 CLI mechanics see qiime2-workflow; for primer trimming theory see read-qc/adapter-trimming.

tools882

bio-microbiome-taxonomy-assignment

Assigns taxonomy to amplicon ASVs/OTUs (16S, ITS, 18S) with a classifier conditioned on a reference database and primer region - DADA2 assignTaxonomy + addSpecies (RDP naive Bayes), DECIPHER IDTAXA, and QIIME2 q2-feature-classifier (classify-sklearn naive Bayes, classify-consensus-vsearch alignment-consensus). Covers region-specific training (extract-reads, fit-classifier-naive-bayes), why a full-length classifier fabricates calls on a V4 read, the scikit-learn version-pinning trap on pre-trained .qza classifiers, confidence thresholds (classify-sklearn 0.7, assignTaxonomy minBoot 50), and choosing SILVA/GTDB/Greengenes2/UNITE/PR2/RDP. Use when classifying ASVs after DADA2, picking a reference database, training a region-matched classifier, setting a confidence threshold, or deciding whether a 16S species call is defensible (usually not - genus at best). For shotgun read classification see metagenomics/kraken-classification and metagenomics/metaphlan-profiling.

testing882

bio-microbiome-diversity-analysis

Alpha and beta diversity of an amplicon (16S/ITS) ASV/OTU community table - observed features, Shannon, Pielou evenness, Faith PD, Bray-Curtis, Jaccard, weighted/unweighted/generalized UniFrac, Aitchison/RPCA - via QIIME2 core-metrics-phylogenetic, phyloseq/vegan, and scikit-bio. Covers the three knobs that set the answer before it is seen (rarefaction sampling depth, the tree, the metric), why core-metrics silently deletes samples below the sampling depth, why de novo trees lose to SEPP fragment-insertion and Greengenes2, why unweighted and weighted UniFrac can flip the story, why observed features is an ASV count not a species count, the QIIME2-log2 vs R-ln Shannon mismatch, and pairing PERMANOVA (adonis2) with betadisper. Use when summarizing whole-community richness/evenness or testing group differences in community structure. Per-taxon testing -> differential-abundance. Shotgun tables -> metagenomics/metagenome-visualization. Shared CoDA/rarefaction theory -> metagenomics/abundance-estimation.

testing882
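
The QIIME2-log2 versus R-ln Shannon mismatch named in the bio-microbiome-diversity-analysis card comes down to the logarithm base. A small sketch, assuming a plain Shannon index on a hypothetical count vector; the helper `shannon` is illustrative and is not the QIIME2 or vegan implementation:

```python
import math

def shannon(counts, base):
    """Shannon index H = -sum(p * log_base(p)) over non-zero features."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)

# Hypothetical ASV counts for one sample.
counts = [50, 30, 15, 5]

h_bits = shannon(counts, 2)       # log2 convention
h_nats = shannon(counts, math.e)  # natural-log convention

print(f"log2: {h_bits:.4f}  ln: {h_nats:.4f}  ratio: {h_nats / h_bits:.4f}")
```

The two values differ by the constant factor ln(2), so numbers from the two conventions must be converted before they are compared or pooled.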

bio-microbiome-differential-abundance

Tests which individual taxa differ between groups on an amplicon ASV/feature table (phyloseq) using compositionally-aware methods - ALDEx2 (Dirichlet-MC CLR, conservative), ANCOM-BC2/ANCOMBC (sampling-fraction bias correction, structural zeros, passed_ss, default p_adj_method=holm), MaAsLin2/MaAsLin3 (multivariable GLM, random effects, prevalence/abundance split), LinDA (CLR mixed-model regression), ZicoSeq (permutation FDR), LEfSe, and q2-composition ancombc. Covers why the hit list depends more on the DA tool than the biology (Nearing benchmark) so the deliverable is a CONSENSUS of >=2 tools, why a relative change is not absolute without a load anchor, the prevalence-filter knob, BH/FDR plus an effect-size floor, and why DESeq2/edgeR misfire here. Use when finding differentially abundant taxa, handling covariates or longitudinal designs, or choosing a method. Whole-community diversity -> diversity-analysis; shotgun DA -> metagenomics/metagenome-visualization; CoDA theory -> metagenomics/abundance-estimation

tools882

bio-microbiome-qiime2-workflow

Operates the QIIME2 framework as the glue for an amplicon analysis - the .qza/.qzv artifact model, semantic types (FeatureTable[Frequency], SampleData[PairedEndSequencesWithQuality], Phylogeny[Rooted], FeatureData[Taxonomy]), embedded provenance plus provenance replay, import (Casava/manifest/EMP/BIOM), export, the Metadata object, and the q2cli vs Artifact API interfaces. Covers why a .qza is data-plus-executable-history not a file, why export drops provenance, why a .qzv is terminal, why classifier .qza are version-pinned, and the 2026 distribution/rachis rename. Use when importing reads, choosing a manifest/Casava/EMP/BIOM path, reading or replaying provenance, exporting to BIOM/phyloseq, fixing semantic-type or Phred or sklearn-version errors, or orchestrating the pipeline. Denoising -> amplicon-processing; classifier/DB -> taxonomy-assignment; diversity metric/depth -> diversity-analysis; DA tool -> differential-abundance; PICRUSt2 -> functional-prediction; shotgun moshpit -> metagenomics.

tools882

bio-microbiome-functional-prediction

Predicts community functional POTENTIAL from 16S/ITS amplicon ASVs with PICRUSt2 (or q2-picrust2) by phylogenetic interpolation of reference-genome gene content - EPA-ng placement, gappa, castor hidden-state prediction of KO/EC/Pfam copy number, 16S copy-number normalization, and MinPath MetaCyc/KEGG pathways - gated by the NSTI quality index. Covers why predicted function is taxonomy re-encoded (never measured gene content and never activity), the mandatory NSTI report (--max_nsti 2 silently drops novel ASVs), why accuracy IS reference coverage (gut Spearman ~0.8, soil/marine collapse), the circularity trap, and Tax4Fun2/FAPROTAX/BugBase alternatives. Use when inferring KO/EC/MetaCyc potential from an ASV table, gating on NSTI, or choosing a prediction method. For MEASURED shotgun function see metagenomics/functional-profiling; for enrichment of KO lists see pathway-analysis/go-enrichment; for DA of predicted tables see differential-abundance.

development882

bio-metagenomics-amr-detection

Profiles the antimicrobial-resistance gene content (resistome) of shotgun metagenomes - read-based quantification with RGI bwt, AMR++/MEGARes, ARGs-OAP/SARG, deepARG, or GROOT, and presence calling with AMRFinderPlus/ABRicate on assembled contigs or MAGs. Covers why an ARG hit is a sequence match not a phenotype, why a metagenomic ARG has no host and no genomic context until assembly (and assembly breaks at ARGs), per-gene curated thresholds vs a flat 80/80, gene-fraction false-positive control, and cross-study normalization pitfalls. Use when quantifying a community resistome, normalizing ARG abundance, or calling ARGs from metagenome contigs. For pure-culture isolate AMR, point mutations, and phenotype/MIC prediction see epidemiological-genomics/amr-surveillance.

content-media878

bio-methylation-differential-cpg

Tests individual CpG sites for differential methylation (DMC/DMP) from bisulfite sequencing counts or array/continuous beta-value matrices. Covers the count-vs-continuous fork that dictates the model, beta-value vs M-value logit (Du 2010), beta-binomial overdispersion count models (DSS, methylKit, MOABS, RADMeth) for sequencing, limma moderated-t on M-values (eBayes trend/robust) for arrays, the bare-beta Welch t-test caveat, coverage-as-precision coupling, delta-beta effect size, BH-FDR with the neighboring-CpG dependence problem, EWAS genome-wide thresholds, and differential variability (DiffVar/iEVORA). Use when comparing per-CpG methylation between groups from WGBS/RRBS/targeted bisulfite or 450K/EPIC arrays, choosing a per-site test, or scanning for variance (not just mean) differences. For region-level aggregation see dmr-detection; for covariate/cell-fraction strategy and genomic inflation see ewas-design.

testing878

bio-metagenomics-visualization

Turns a shotgun profiler table (MetaPhlAn relative abundance, Bracken counts, HUMAnN function tables) into honest figures and defensible community statistics with phyloseq, vegan, microViz, and Python. Covers why an ordination/bar/diversity number is a modeling choice that can manufacture a result, the MetaPhlAn-percent-vs-Bracken-counts fork that decides everything, CLR/Aitchison vs Bray-Curtis, Hill numbers and why shotgun richness is a database readout, pairing PERMANOVA with betadisper, and the multi-tool differential-abundance consensus. Use when plotting taxonomic/functional profiles, computing alpha/beta diversity, running ordination/PERMANOVA, or testing differential abundance. For amplicon/QIIME2 stats see the microbiome category; for compositional theory see abundance-estimation.

tools878

bio-metagenomics-contamination-controls

Cleans a shotgun metagenome of everything that is not the target community before profiling - host-read depletion (Hostile, bowtie2/T2T-CHM13), reagent/kitome contamination control with blanks and decontam, mock-community validation, and depth-adequacy checks (Nonpareil). Covers why a metagenomic result is a position in a choice-chain rather than a direct observation, why extraction is the experiment, why a low-biomass community can be entirely kitome, why absence means not-detectable-by-this-chain, and why a confident classifier call can still be wrong when the reference is contaminated. Use when designing controls, removing host reads, identifying reagent contaminants, validating with mocks, or judging whether a low-biomass result is real. For adapter/quality trimming see read-qc; for MAG-level decontamination see genome-assembly/metagenome-assembly.

testing878

bio-metagenomics-abundance

Turns shotgun classifier output into a defensible abundance table with Bracken Bayesian re-estimation, then compositional treatment (CLR, zero handling), library-size normalization, reference-frame differential abundance, and optional absolute quantification. Covers why a relative-abundance change is not a change, why Bracken read fractions and MetaPhlAn percentages are different physical quantities, the silent -r read-length bias, the genome-size confound no library-size method fixes, and the rarefaction debate. Use when estimating species abundance from a Kraken2 report, normalizing a community count table, choosing a compositional transform, or converting relative to absolute load. For classification see kraken-classification; for diversity/ordination/DA mechanics see metagenome-visualization.

development878
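
The bio-metagenomics-abundance card lists CLR and zero handling as the compositional treatment. A minimal sketch of the centered log-ratio transform, assuming a simple additive pseudocount for zeros; the default of 0.5 and the function name `clr` are illustrative assumptions, not a recommendation from the card:

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    The pseudocount is an assumed, simplistic zero-handling choice.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical zero-free counts; the second sample is sequenced 10x deeper.
shallow = clr([10, 20, 70], pseudocount=0)
deep = clr([100, 200, 700], pseudocount=0)

print([round(v, 4) for v in shallow])
print([round(v, 4) for v in deep])
```

Without a pseudocount the transform is unchanged by sequencing depth and its values sum to zero, which is the sense in which CLR works on ratios rather than raw counts. Adding a pseudocount breaks exact depth invariance, one reason zero handling is a modeling decision.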

bio-metagenomics-functional-profiling

Profiles the functional potential of shotgun metagenomes with HUMAnN 3's tiered search (MetaPhlAn prescreen, Bowtie2 pangenome, translated DIAMOND vs UniRef), giving gene-family (RPK) and MetaCyc pathway abundances stratified by species. Covers why a metagenome measures potential not activity, why dropping UNMAPPED/UNINTEGRATED biases everything, why stratification is an estimate, coverage-vs-abundance and MinPath/gap-fill, UniRef90-vs-50 and biome database bias, and the assembly/eggNOG/dbCAN/antiSMASH alternatives. Use when obtaining pathway or gene-family abundances, regrouping to KO/EC/GO, normalizing functional tables, or choosing read-based vs assembly-based functional profiling. For AMR genes see amr-detection; for host-gene enrichment see pathway-analysis.

development878

bio-metagenomics-kraken

Classifies shotgun metagenomic reads to taxa with Kraken2's minimizer/LCA matching against a chosen reference database, then hands off to Bracken for abundance re-estimation. Covers why the database (not the algorithm) decides what can be detected, the --confidence and --minimum-hit-groups precision levers, unique-minimizer false-positive control, host-read removal, and why raw Kraken2 read counts are not abundances. Use when profiling who-is-there from shotgun reads, choosing a Kraken2 database, setting a confidence threshold, controlling false positives, or feeding reports to Bracken. For marker-gene profiling see metaphlan-profiling; for abundance mechanics see abundance-estimation; for assembly/MAG recovery see genome-assembly/metagenome-assembly.

development878

bio-metagenomics-strain-tracking

Resolves and compares bacterial strains below the species level from shotgun metagenomes with inStrain (popANI/conANI microdiversity), StrainPhlAn (marker-SNV consensus phylogeny and nGD), MIDAS2, metaSNV, and StrainGE, plus genome-vs-genome ANI (skani/fastANI/MASH) for isolate/MAG comparison. Covers why a strain is a threshold not a thing, why ANI answers same-genome while popANI/nGD answer same-population-in-situ, the 99.999% popANI and per-species nGD definitions, the coverage detection limit (absence is not absence), why sharing is not transmission direction, and mapping to the dataset's own dRep MAGs. Use when detecting shared strains, tracking transmission, resolving within-host strain dynamics, or deconvoluting co-occurring strains. For pure-culture isolate outbreak SNP trees see epidemiological-genomics; for MAG assembly see genome-assembly/metagenome-assembly.

testing878

bio-methylation-array-preprocessing

Turns raw Illumina Infinium methylation BeadChip IDATs (450K, EPIC, EPICv2) into a defensible beta/M matrix with sesame (openSesame/SigDF) or minfi (RGChannelSet -> MethylSet -> GenomicRatioSet). Covers Type I vs Type II probe chemistry and why raw Type II beta is compressed, the signal-to-beta math (beta = M/(M+U+100)) and M-value logit, detection-p / pOOBAH masking including the out-of-band deletion-artifact catch, dye-bias correction, and the normalization decision (noob, funnorm, quantile, SWAN, BMIQ, dasen, sesame QCDPB). Use when reading IDATs, choosing a normalization for a 450K/EPIC/EPICv2 cohort, deciding beta vs M, masking failed probes, or producing the corrected matrix before testing. For probe/sample filtering, EPICv2 replicate collapse, and sample-identity QC see array-qc-filtering; for native long-read 5mC see long-read-sequencing/nanopore-methylation (a different platform).

testing878
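
The signal-to-beta math quoted in the bio-methylation-array-preprocessing card, beta = M/(M+U+100), and the M-value logit can be sketched as follows. Note the name clash: in the beta formula M and U are methylated and unmethylated intensities, while the M-value is the base-2 logit of beta. Function names and intensities are hypothetical:

```python
import math

def beta_value(meth, unmeth, offset=100):
    """beta = M / (M + U + offset), with offset 100 as in the card's formula."""
    return meth / (meth + unmeth + offset)

def m_value(beta):
    """Base-2 logit of beta; undefined at exactly 0 or 1."""
    return math.log2(beta / (1 - beta))

def beta_from_m(m):
    """Inverse of m_value."""
    return 2 ** m / (2 ** m + 1)

# Hypothetical probe intensities.
b = beta_value(meth=3000, unmeth=900)
m = m_value(b)

print(f"beta = {b:.3f}, M-value = {m:.3f}, round trip = {beta_from_m(m):.3f}")
```

The offset keeps beta away from 0/0 at low intensity, and the logit stretches the compressed ends of the beta scale.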

bio-methylation-dmr-detection

Detects differentially methylated regions (DMRs) from short-read bisulfite (WGBS/RRBS), array, and long-read methylation count tables using dmrseq (permutation region-FDR over the region selection), DSS callDMR (beta-binomial), methylKit tiles, bsseq BSmooth, DMRcate Gaussian-kernel smoothing, metilene, and comb-p. Covers why a DMR is DEFINED by arbitrary thresholds (min-CpGs, max-gap, delta-beta, q) and a smoothing bandwidth, why selecting extreme runs of CpGs then testing them on the same data is post-selection inference, why region q-values are not comparable across tools, and a single-sample domain-segmentation section (PMD, UMR/LMR, MethylSeekR, solo-WCGW) that must run before focal calling on cancer/aging genomes. Use when calling region-level methylation differences, choosing a DMR caller, controlling region-level FDR, or segmenting megabase methylation domains. For per-site testing see differential-cpg-testing; for the methylKit object model see methylkit-analysis.

tools878

bio-methylation-epigenetic-clocks

Computes DNA methylation age (DNAm age) and pace of aging by applying frozen elastic-net epigenetic clocks to a clean beta matrix with methylclock, dnaMethyAge, or methylCIPHER. Covers the clock menu by question (chronological Horvath/Hannum/skin&blood; health-mortality PhenoAge/GrimAge; DunedinPACE pace; pediatric/gestational; mitotic epiTOC), age acceleration (EAA/IEAA/EEAA) as the real endpoint, the principal-component (PC) clock fix fo

testing878

bio-methylation-array-qc-filtering

Performs probe filtering and sample-level QC on Illumina Infinium methylation arrays (450K / EPIC / EPICv2) to decide which probes and samples to trust. Drops detection-p-failed and low-bead-count probes, removes cross-reactive/non-specific probes (Chen 2013 / Pidsley 2016 lists via maxprobes), excludes SNP-overlapping probes with dropLociWithSnps, and handles sex-chromosome probes. Collapses EPICv2 replicate probes with betasCollapseToPf

development878

bio-metagenomics-metaphlan

Profiles shotgun metagenomes to species/SGB relative abundance with MetaPhlAn 4's clade-specific marker genes (bowtie2 short reads, minimap2 long reads). Covers why a MetaPhlAn percentage is a cell fraction (genome-size-normalized taxonomic abundance) and must never be merged with Kraken/Bracken read fractions, kSGB vs uSGB units for quantifying database-absent taxa, the unknown-fraction rescaling and its version-default flip, --index pinning as a batch variable, and when mOTUs3 or sourmash gather beat marker profiling. Use when profiling who-is-there with high precision, needing HMP-comparable species abundances, quantifying novel taxa, or deciding marker-gene vs k-mer profiling. For k-mer classification see kraken-classification; for strains see strain-tracking; for 16S amplicon see the microbiome category.

development878

bio-methylation-calling

Extracts per-cytosine methylation calls from aligned bisulfite/EM-seq reads with bismark_methylation_extractor (Bismark BAM) or the aligner-agnostic MethylDackel/BISCUIT (bwa-meth BAM), producing the beta value M/(M+U) as a coverage file, bedGraph, or genome-wide cytosine report across CpG/CHG/CHH context. Covers conversion-rate QC as the first gate, the 5mC vs 5hmC summed caveat, variant-aware calling so a C/T SNP does not masquerade as unmethylation, paired-end --no_overlap double-counting, symmetric CpG dyad collapse, and the 0-based vs 1-based coordinate trap. Use when extracting methylation levels from a bisulfite/EM-seq alignment, choosing an extractor for a non-Bismark BAM, QC-ing conversion efficiency, or producing coverage/cytosine-report input for testing. For long-read MM/ML modification calling see long-read-sequencing/nanopore-methylation; for the upstream BAM see bismark-alignment; for per-CpG statistics see differential-cpg-testing.

testing878

bio-metabolomics-metabolite-annotation

Turns untargeted LC-MS/MS features (m/z, RT, MS/MS) into confidence-stratified metabolite annotations using spectral-library matching (matchms), in-silico tools (SIRIUS/CSI:FingerID, MetFrag) and molecular networking, and assigns a defensible MSI/Schymanski confidence level to each. Use when naming detected features, scoring MS/MS against a reference library, running SIRIUS, or deciding what confidence level an evidence set actually supports. For upstream feature extraction see metabolomics/xcms-preprocessing and metabolomics/msdial-preprocessing; for downstream enrichment that must respect these levels see metabolomics/pathway-mapping; for lipid-specific structural annotation see metabolomics/lipidomics.

tools860

bio-machine-learning-omics-classifiers

Builds diagnostic and prognostic classifiers on omics feature matrices with regularized logistic regression, random forest, and gradient-boosted trees, handling the p>>n regime, batch shortcut learning, class imbalance, and probability calibration. Use when building a classifier from expression, methylation, or variant data, choosing an algorithm for high-dimensional small-n data, or diagnosing a suspiciously perfect AUC. For unbiased evaluation see machine-learning/model-validation; for feature selection see machine-learning/biomarker-discovery; for time-to-event outcomes see machine-learning/survival-analysis.

development860

bio-long-read-sequencing-long-read-alignment

Aligns Oxford Nanopore and PacBio long reads (and assemblies) to a reference with minimap2 using the error-rate-matched preset (map-ont, lr:hq, map-hifi, map-pb, splice/splice:hq, asm5/10/20, ava), producing a sorted/indexed BAM for variant, SV, methylation, or isoform analysis. Covers why the preset rewrites the scoring/chaining model, why SV calling rides on supplementary not secondary alignments, carrying MM/ML methylation tags through with -y, the multi-part-index MAPQ trap, and when to swap in Winnowmap/VACmap/lra/pbmm2. Use when mapping ONT or PacBio reads, choosing a minimap2 preset by platform/chemistry, preparing input for Clair3/medaka/Sniffles/modkit, aligning into repeats/centromeres, or spliced-aligning cDNA/Iso-Seq.

testing860

bio-metabolomics-targeted-analysis

Designs and validates quantitative targeted metabolomics assays (MRM/SRM on triple-quadrupole, PRM on high-resolution instruments) to report absolute concentrations. Covers the internal-standard strategy (external cal -> global IS -> standard addition -> stable-isotope-labeled IS), weighted calibration judged by back-calculated %RE not R-squared, ion-ratio quantifier/qualifier confirmation, matrix-effect/recovery characterization, and ICH M10 method validation. Use when quantifying a closed panel of known metabolites with units, building or validating an LC-MS/MS assay, choosing an IS or calibration weighting, or judging whether a reported concentration is trustworthy. For untargeted feature detection see metabolomics/xcms-preprocessing; for group statistics see metabolomics/statistical-analysis; for flux/MID/tracing see metabolomics/isotope-tracing.

development860

bio-machine-learning-prediction-explanation

Explains ML predictions on omics data with SHAP, LIME, and permutation importance, handling the correlated-feature trap, the conditional-vs-interventional Shapley choice, and the attribution-is-not-causation boundary. Use when interpreting an omics classifier, debugging shortcut/batch learning, or deciding whether an attribution ranking can be trusted as biology. For validated feature selection see machine-learning/biomarker-discovery; explanations are not a selection method.

development860

bio-metabolomics-statistical-analysis

Decision-grade statistical analysis for metabolomics intensity tables. Covers transformation and scaling (Pareto vs unit-variance as a hidden hypothesis), unsupervised structure (PCA/HCA for QC), permutation-validated PLS-DA/OPLS-DA (R2 vs Q2, double CV, VIP as heuristic), univariate testing (Welch/Mann-Whitney/ANOVA/LMM with covariate adjustment), and dependence-aware multiple testing. Use when testing which metabolites differ, building or validating a discriminant model, choosing a scaling, or correcting many correlated tests. For sample-wise normalization/drift correction see metabolomics/normalization-qc; for ML classifiers and selection-inside-CV leakage see machine-learning/biomarker-discovery and machine-learning/model-validation; for pathway interpretation see metabolomics/pathway-mapping; for design/power/multiplicity regime see experimental-design/multiple-testing.

development860

bio-long-read-sequencing-isoseq-analysis

Discovers, classifies, filters, and quantifies full-length transcript isoforms from PacBio Iso-Seq/Kinnex (HiFi) and Oxford Nanopore (cDNA/direct-RNA) long reads, using the isoseq+pigeon pipeline, SQANTI3, and ONT tools (IsoQuant, FLAIR, Bambu, StringTie2). Covers why a novel isoform is an artifact until proven otherwise (RT template-switching, intra-priming, and 5' degradation manufacture junctions and truncations), the SQANTI3 structural categories and their trust order, the Kinnex skera-split step, orthogonal CAGE/poly-A/short-read-junction validation, and why long-read isoform quantification needs EM. Use when building a full-length isoform catalog, classifying/filtering long-read transcripts, running Iso-Seq or ONT cDNA/dRNA analysis, or judging novel-isoform reliability.

tools860

bio-long-read-sequencing-basecalling

Basecalls raw Oxford Nanopore signal (POD5/FAST5) into reads with Dorado, choosing the chemistry-matched model and accuracy tier (fast/hac/sup), requesting modified bases (5mCG_5hmCG, 6mA, m6A) at basecall time, and handling duplex, demultiplexing, trimming, and HERRO read correction. Covers why the model+version is an irreversible analysis decision, why methylation cannot be recovered later, and why downstream polish/variant models must match the basecaller. Use when converting POD5/FAST5 to reads, picking a Dorado model for R9/R10 or RNA004, enabling methylation calling, basecalling duplex, demultiplexing barcoded runs, or correcting reads for assembly.

development860

bio-long-read-sequencing-haplotype-phasing

Phases small variants, SVs, and methylation from Oxford Nanopore and PacBio long reads (read-backed/physical phasing) with WhatsHap, LongPhase, or HiPhase, and haplotags the BAM (HP/PS tags) for allele-resolved downstream analysis. Covers why phase blocks break at het-sparse gaps (read length x heterozygosity), why phasing the VCF is useless until the BAM is haplotagged, the GT-pipe/PS and read HP/PS tag spec, reporting block N50 with switch error, the diploid-assumption/CNV/haploid-region traps, trio phasing as the gold standard, and the boundary to statistical panel phasing. Use when phasing long-read variants, haplotagging reads for allele-specific methylation/expression or phased SVs, choosing WhatsHap vs LongPhase vs HiPhase, trio phasing, or assessing phasing quality.

development860

bio-long-read-sequencing-clair3-variants

Calls germline small variants (SNPs and indels) from Oxford Nanopore and PacBio HiFi long reads with Clair3, a two-stage (pileup + full-alignment) deep-learning caller, selecting the chemistry- and basecaller-version-matched model, enabling read-based phasing, and benchmarking against GIAB with stratification. Covers why the model string is the experiment (no auto-detection, silent degradation on mismatch), why ONT homopolymer/STR indels are the residual error whole-genome F1 hides, and the somatic/trio/RNA boundary to the ClairS/Clair3-Trio family. Use when calling germline SNVs/indels from ONT or HiFi BAMs, choosing a Clair3 model, phasing variants, or benchmarking long-read calls.

testing860

bio-machine-learning-model-validation

Validates predictive models on omics and biomedical data with nested cross-validation, group/batch/temporal-aware splits, the full data-leakage taxonomy, probability calibration, decision-curve net benefit, optimism correction, sample-size planning, and TRIPOD+AI reporting. Use when estimating model performance honestly, choosing a CV scheme, detecting leakage, or judging whether reported discrimination means the model is actually useful. For feature selection itself see machine-learning/biomarker-discovery; for confirmatory-trial inference see clinical-biostatistics/trial-reporting.

tools860

bio-metabolomics-msdial-preprocessing

Runs the MS-DIAL preprocessing workflow (peak picking, MS2Dec spectral deconvolution, alignment, gap-filling) and imports the alignment-result table into R or Python with honest filtering. Use when preprocessing LC-MS DDA/DIA (SWATH) raw data with MS-DIAL, deciding MS-DIAL vs XCMS, configuring the MsdialConsoleApp console run, or parsing an MS-DIAL export into a clean feature matrix. For programmatic R peak detection and the feature-table-as-artifact framing see metabolomics/xcms-preprocessing; for lipid annotation mode see metabolomics/lipidomics; for MSI-level confidence honesty see metabolomics/metabolite-annotation; for drift correction and QC see metabolomics/normalization-qc.

development860

bio-metabolomics-xcms-preprocessing

Programmatic untargeted LC-MS feature extraction in R with the modern xcms 4.x MsExperiment/XcmsExperiment API, taking raw mzML to a feature table via centWave peak detection, retention-time alignment, peak-density correspondence, gap-filling, CAMERA redundancy collapse, and built-in QC feature filtering. Use when converting centroided LC-MS runs into a features-by-samples matrix and deciding centWave/grouping/alignment parameters. For drift correction and QC/CV filtering execution see metabolomics/normalization-qc; for metabolite identification see metabolomics/metabolite-annotation; for the MS-DIAL GUI alternative with MS2Dec deconvolution see metabolomics/msdial-preprocessing; for downstream statistics see metabolomics/statistical-analysis.

development860

bio-metabolomics-isotope-tracing

Designs and analyzes stable-isotope-resolved metabolomics (SIRM / isotope tracing / fluxomics) experiments that measure metabolic ACTIVITY via 13C/15N/2H tracers, distinct from steady-state pool profiling. Covers tracer choice, isotopologue vs isotopomer, mass-isotopomer distributions (MID), fractional enrichment, the mandatory natural-abundance + tracer-purity correction (IsoCor, AccuCor), and the metabolic/isotopic steady-state vs non-stationary (INST-MFA) distinction. Use when feeding a labeled tracer and interpreting labeling patterns, correcting raw isotopologue intensities, computing or plotting an MID, or deciding tracing vs abundance profiling. For absolute pool concentration and MRM mechanics see metabolomics/targeted-analysis; for constraint-based genome-scale flux (FBA, not empirical tracing) see systems-biology/flux-balance-analysis; for feature detection see metabolomics/xcms-preprocessing; for pathway enrichment that ignores the pool-vs-flux caveat see metabolomics/pathway-mapping.

testing860

bio-metabolomics-lipidomics

Assigns honest lipid annotation levels, designs class-based internal-standard quantification, and runs lipid-aware differential and enrichment analysis with lipidr, guarding against in-source-fragment phantoms, sn-position over-claims, and invalid cross-class quantification. Use when naming or canonicalizing lipid species (shorthand separators, Goslin), deciding shotgun vs RP vs HILIC LC-MS, picking internal standards (SPLASH/EquiSPLASH), interpreting MS-DIAL/LipidSearch output, or comparing lipid classes. For general feature detection see metabolomics/xcms-preprocessing and metabolomics/msdial-preprocessing; for non-lipid annotation confidence see metabolomics/metabolite-annotation; for normalization/QC see metabolomics/normalization-qc; for multivariate stats see metabolomics/statistical-analysis.

development860

bio-machine-learning-survival-analysis

Builds and validates predictive time-to-event models on clinical and omics data with penalized Cox, random survival forests, gradient-boosted and deep survival models, and prediction-grade evaluation (Uno's C, time-dependent AUC, integrated Brier, calibration, competing risks). Use when building an individualized risk predictor or prognostic omics signature, choosing a survival model, or evaluating one beyond the C-index. For Kaplan-Meier, log-rank, and classical Cox hazard-ratio inference in a trial see clinical-biostatistics/survival-analysis.

tools860

bio-metabolomics-normalization-qc

Designs QC, corrects signal drift, removes batch effects, filters features, normalizes samples, and imputes missing values for untargeted LC-MS/GC-MS metabolomics, framing each step as a measurement model that can create or erase biological signal. Use when processing a peak/feature table before statistical analysis, choosing a drift-correction or sample-normalization method, deciding QC RSD vs D-ratio filtering, or handling left-censored missing values. The feature table is produced by metabolomics/xcms-preprocessing or metabolomics/msdial-preprocessing; transformation/scaling for modeling defers to metabolomics/statistical-analysis; cross-study design issues link to experimental-design/batch-design.

development860

bio-machine-learning-biomarker-discovery

Selects biomarker features from high-dimensional omics data using Boruta all-relevant selection, mRMR, LASSO/elastic-net, and stability selection, while controlling the leakage, irreproducibility, and correlated-feature traps that make most published signatures fail to replicate. Use when identifying candidate biomarkers, deciding between an all-relevant and a minimal-optimal selector, or judging whether a selected gene set is reproducible. For unbiased performance estimation of the resulting model see machine-learning/model-validation; for interpreting a trained model see machine-learning/prediction-explanation.

testing860

bio-long-read-sequencing-medaka-polishing

Polishes Oxford Nanopore draft assemblies to higher consensus accuracy with medaka, a basecaller-model-specific neural consensus net, produces haploid variant calls (VCF) for microbial, mitochondrial, or viral samples, and generates amplicon/viral consensus sequences. Covers the model-matching footgun that silently degrades output, why Racon-first is obsolete and medaka runs directly on Flye output as a single pass, why HiFi must never be fed to medaka, the v1->v2 subcommand renames, and the precise medaka_variant deprecation. Use when polishing an ONT-only assembly, generating an amplicon/viral consensus, calling a haploid ONT consensus, or deciding whether medaka, dorado polish, or Clair3 is the right tool.

tools860

bio-long-read-sequencing-long-read-qc

Assesses Oxford Nanopore and PacBio long-read quality with NanoPlot, cramino, NanoComp, pycoQC/toulligQC, and seqkit, and filters reads with chopper/Filtlong for the downstream goal. Covers why read-only Qscore is an uncalibrated posterior (real accuracy needs a reference BAM), why the sequencing_summary.txt is required for run-health metrics, intent-conditioned filtering (preserve long reads and small replicons for assembly, filter almost nothing for variant calling), the chimera/internal-adapter trap that fabricates SVs, and PacBio rq-based HiFi QC. Use when judging a long-read run, computing read N50 or percent identity, filtering reads before assembly or variant calling, comparing barcodes/runs, or reading run-health red flags.
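As a rough illustration of the read N50 mentioned in the entry above, the sketch below computes it from a plain list of read lengths. The function name `read_n50` and the example lengths are hypothetical; the tools named in the entry (NanoPlot, cramino, seqkit) report this statistic themselves, and this is not their implementation.

```python
def read_n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("read_n50 needs at least one read length")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length


# Hypothetical read lengths (bases); total is 20,000, so half is 10,000.
example_lengths = [2000, 3000, 4000, 5000, 6000]
print(read_n50(example_lengths))  # 6000 + 5000 = 11,000 >= 10,000, so N50 is 5000
```

N50 is weighted by bases rather than by read count, so a handful of very long reads can raise it even when most reads are short.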

development860

bio-metabolomics-pathway-mapping

Maps metabolomics results to biological pathways via over-representation (ORA), metabolite-set enrichment (MSEA/QEA), mummichog/PSEA on raw m/z peaks, and network-diffusion enrichment (FELLA), with correct background-set construction and honest interpretive ceilings. Use when interpreting differential metabolites or an untargeted LC-MS feature table in pathway context, choosing ORA vs MSEA vs mummichog vs topology, or setting the reference/background set. For annotation confidence levels feeding ORA see metabolomics/metabolite-annotation; for gene-set concepts see pathway-analysis/go-enrichment and pathway-analysis/gsea; for joint gene+metabolite pathways see multi-omics-integration/mofa-integration.

development860

bio-long-read-sequencing-structural-variants

Detects structural variants (deletions, insertions, inversions, duplications, translocations) from Oxford Nanopore and PacBio long-read alignments with Sniffles2, cuteSV, SVIM, and assembly-based callers, joint-genotypes cohorts via the Sniffles2 .snf workflow, and benchmarks with Truvari against GIAB. Covers why an SV call is a representation artifact (the tandem-repeat BED, aligner, and Truvari params set precision/recall as much as the caller), the cuteSV per-platform parameter trap, soft-clipped supplementary alignments as the SV substrate, and the somatic/mosaic boundary to Severus/nanomonsv. Use when calling germline or somatic SVs from ONT/HiFi reads, joint-genotyping a cohort, choosing or tuning an SV caller, or benchmarking SV calls.

tools860

bio-fragment-analysis

Extracts cfDNA fragmentomics features (DELFI genome-wide short/long ratios, WPS nucleosome positioning, Griffin GC-corrected accessibility profiles, end-motifs/MDS, OCF) for cancer detection and tissue-of-origin from plasma WGS. Centers on the nuclease-footprint reframe (every feature re-reads one nucleosome object), the mandatory GC correction, and the cross-protocol non-comparability that breaks naive classifiers. Runs FinaleToolkit (real CLI/Python, MIT) and the Griffin Snakemake pipeline; DELFI is a method, not a package. Use when deriving fragment-based signal from cfDNA, choosing a feature family for detection vs subtyping, or diagnosing why a fragmentomic model failed validation.

tools856

bio-methylation-based-detection

Detects cancer and infers tissue-of-origin from cfDNA methylation by choosing conversion chemistry (bisulfite vs EM-seq vs TAPS vs cfMeDIP), calling read-level methylation haplotypes rather than averaged beta values, and deconvolving a hematopoietic-dominated cfDNA mixture against a methylation atlas via NNLS/quadratic programming. Encodes the GRAIL/CCGA thesis that thousands of tissue-specific markers make methylation outperform sparse mutations for multi-cancer early detection (MCED) and localization, and that single concordantly-methylated fragments give ppm-level sensitivity. Uses MethylDackel for extraction (mbias-then-extract), MEDIPS/QSEA for enrichment data, scipy.optimize.nnls for deconvolution. Use when building an MCED or methylation-MRD assay, picking a conversion chemistry for low-input plasma, or deconvolving tissue-of-origin from cfDNA.

development856

bio-immunoinformatics-immunogenicity-scoring

Rank and prioritize neoantigen/epitope candidates by likely T-cell response using NeoFox feature annotation, PRIME2.0, BigMHC-IM, the Łuksza/Balachandran fitness model (agretopicity + foreignness), and pVACtools tiering. Encodes the field's hard truths that immunogenicity is the least-solved layer (dedicated scores ~AUROC 0.6-0.7, modest PPV), that scores are valid only for RANKING within one patient (never absolute go/no-go or cross-patient), that DAI has anchor-inflation and WT-denominator traps, and that stacking weak correlated scores into one number is a red flag. Use when ordering a candidate list for a vaccine. Binding lives in mhc-binding-prediction; calling in neoantigen-prediction.

tools856

bio-cfdna-preprocessing

Decides how to preprocess plasma cfDNA sequencing data so the recoverable signal survives - library-prep-aware fragment expectations (dsDNA vs ssDNA/adaptase prep), UMI/duplex consensus with fgbio (ExtractUmisFromBam, GroupReadsByUmi --strategy paired for duplex, CallMolecularConsensusReads vs CallDuplexConsensusReads, FilterConsensusReads min-reads "total s1 s2"), the align->group->consensus->RE-align ordering, and the cfDNA dedup trap where naive coordinate dedup collapses nucleosome-coincident independent molecules. Covers when single-strand consensus suffices vs when duplex is mandatory, the singleton/sensitivity tax at low input, and reading the insert-size histogram as a pre-analytical QC instrument. Use when processing plasma cfDNA reads before fragmentomics, ctDNA mutation calling, or tumor-fraction estimation.

development856

bio-immunoinformatics-mhc-binding-prediction

Predict peptide-MHC class I binding and natural presentation with MHCflurry, NetMHCpan-4.1, and MixMHCpred to nominate candidate CD8 T-cell epitopes. Covers the binding-affinity (BA) vs eluted-ligand (EL/presentation) distinction, why %Rank beats raw nM for cross-allele work, the MS abundance bias that misranks low-expression neoantigens, allele-coverage inequity, and length bias. Use when scanning a protein or peptide set for class I epitopes, scoring neoantigen candidates, or choosing a binding predictor. For CD4/HLA class II see mhc-class-ii-prediction.

development856

bio-immunoinformatics-neoantigen-prediction

Identify tumor neoantigens from somatic variants with pVACtools (pVACseq/pVACfuse/pVACbind/pVACvector/pVACview) for personalized cancer vaccines and checkpoint biomarkers. Encodes the field's hard truth that binding prediction is the easy, near-solved part and single-digit-percent PPV lives downstream — so it centers clonality/CCF, HLA LOH (the silent invalidator), expression, proximal-variant phasing, agretopicity/foreignness quality, and the predicted->presented->immunogenic validation tiers. Use when nominating vaccine targets, ranking neoantigens, or building a tumor-to-candidate pipeline. Binding details in mhc-binding-prediction; ranking in immunogenicity-scoring.

tools856

bio-immunoinformatics-epitope-prediction

Predict B-cell and T-cell epitopes for vaccine antigen design and epitope mapping with BepiPred-3.0, DiscoTope-3.0, the IEDB tools, and EL-mode MHC presentation. Encodes the load-bearing asymmetry that T-cell epitope prediction is mature (it reduces to MHC presentation, AUC>0.9) while B-cell prediction is unreliable (linear predictors ~AUC 0.6 because ~90% of real epitopes are conformational) — so structure-based DiscoTope-3.0 on AlphaFold models is the only defensible B-cell path, propensity scales are obsolete, and NetChop is largely redundant on EL-trained models. Use when mapping epitopes or selecting vaccine antigens. MHC binding lives in mhc-binding-prediction.

tools856

bio-longitudinal-monitoring

Tracks ctDNA across serial liquid-biopsy timepoints for molecular residual disease (MRD) and treatment-response monitoring, treating MRD as a binary integrated detection call across the patient's full variant set (with a defined LoD95 and per-sample specificity) rather than a per-timepoint VAF threshold, and handling undetectable samples as left-censored at the per-sample limit of detection rather than true zeros. Covers tumor-informed bespoke vs tumor-naive design, landmark vs surveillance sampling, molecular-response definitions and their non-standardization, censoring-aware clearance kinetics, and the multiple-testing structure of repeated surveillance. Use when monitoring ctDNA during therapy, calling molecular relapse before imaging, or estimating clearance half-life from serial samples.

testing856

bio-analytical-validation

Treats a ctDNA assay as a molecule-counting experiment at the Poisson edge and builds its analytical-validation case the measurement-science way. Covers the genome-equivalent currency (~330 haploid copies/ng), the lambda = input_GE x VAF sampling ceiling (lambda>=3 for ~95% detection), the error-suppression ladder (raw NGS ~1e-3 -> single-strand UMI ~1e-4/1e-5 -> duplex <1e-7), the CLSI EP17 LoB/LoD/LoD95/LoQ framework, the per-locus-vs-panel-integrated LoD distinction that lets bespoke MRD reach ppm, contrived/SEQC2 reference standards, and honest LoD reporting conditioned on input mass + consensus depth + replicate detection rate. Use when stating or trusting a sensitivity claim, designing a dilution-series validation, deciding how many genome equivalents are needed at a target VAF, choosing a single-locus vs panel-integrated LoD, or auditing a "detects 0.1% VAF" claim.
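The sampling ceiling in the entry above can be worked through numerically. The sketch below assumes an idealized single-locus assay in which every input molecule is recovered and one mutant molecule is enough for detection, so it gives an upper bound rather than a real assay's sensitivity. The function names are hypothetical; the ~330 copies/ng figure and the lambda >= 3 rule come from the entry.

```python
import math

GE_PER_NG = 330  # approximate haploid genome copies per ng of cfDNA (from the entry)


def expected_mutant_molecules(input_ng, vaf):
    """lambda = input genome equivalents x VAF."""
    return input_ng * GE_PER_NG * vaf


def detection_probability(lam):
    """Poisson probability of sampling at least one mutant molecule."""
    return 1.0 - math.exp(-lam)


def min_input_ng(vaf, target_lambda=3.0):
    """Input mass needed so the expected mutant count reaches target_lambda."""
    return target_lambda / (GE_PER_NG * vaf)


lam = expected_mutant_molecules(input_ng=10, vaf=0.001)  # 10 ng at 0.1% VAF
print(round(lam, 2))                          # 3.3 expected mutant molecules
print(round(detection_probability(3.0), 3))   # 0.95, the lambda >= 3 rule
print(round(min_input_ng(0.001), 1))          # 9.1 ng needed at 0.1% VAF
```

Under these assumptions, a "detects 0.1% VAF" claim at a single locus needs roughly 9 ng of input before any error suppression is considered. Real recovery losses push the requirement higher.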

development856

bio-immunoinformatics-mhc-class-ii-prediction

Predict peptide-MHC class II (HLA-DR/DQ/DP) binding and presentation for CD4 T-cell epitopes with NetMHCIIpan-4.3 and MixMHC2pred-2.0. Covers why class II is far less reliable than class I (open binding groove, 9-mer register ambiguity, sparse noisy training data, DR>DP>DQ accuracy asymmetry), the DQ/DP heterodimer alpha/beta pairing trap, and the looser 1%/5% %Rank thresholds. Use when predicting CD4 epitopes for vaccine help, mapping class II neoantigens, or scoring long peptides against DR/DQ/DP. For CD8/class I see mhc-binding-prediction.

development856

bio-ctdna-mutation-detection

Detects somatic mutations in circulating tumor DNA, treating low-VAF detection as a signal-versus-noise problem set by error suppression and molecules sampled, not by the choice of caller. Distinguishes de novo CALLING (scanning a panel for unknown variants, bounded by per-locus error and multiple testing) from tumor-informed DETECTION (tracking a pre-specified variant set, where panel integration reaches single-ppm). Covers VarDict and Mutect2 for de novo calling, UMI-aware callers, and a pysam-based known-variant VAF tracker, with matched-WBC subtraction as the mandatory defense against clonal hematopoiesis (the dominant false positive). Use when calling or tracking tumor mutations from plasma cfDNA, setting a VAF threshold, or deciding whether a low-VAF call is tumor versus CHIP.

testing856

bio-tumor-fraction-estimation

Estimates tumor fraction (the genome-wide proportion of cfDNA molecules that are tumor-derived, the cfDNA analogue of bulk-tumor purity) from shallow whole-genome sequencing with ichorCNA, an HMM over 1 Mb bins that jointly EM-estimates tumor fraction, ploidy, and subclonal prevalence over a normal/ploidy grid. Encodes the load-bearing reframes: tumor fraction is the quantity that travels across assays and is NOT mutation VAF (clonal-het VAF a

development856

bio-immunoinformatics-tcr-epitope-binding

Infer or annotate TCR antigen specificity by unsupervised clustering (TCRdist/tcrdist3, GLIPH2, clusTCR, GIANA) and database lookup (VDJdb, IEDB, McPAS-TCR), and rank candidates with supervised predictors (ERGO-II, NetTCR-2.x, pMTnet) under explicit caveats. Encodes the central truth that general TCR-epitope prediction for UNSEEN epitopes essentially does not work (collapses to near-random; IMMREP22, Grazioli 2022) because labeled data is dominated by a few immunodominant epitopes and there is no true negative set — so clustering for discovery is the honest task and de-novo binding needs wet-lab validation. Use when annotating TCR specificity or grouping a repertoire. Epitope/MHC context lives in mhc-binding-prediction.

tools856

bio-genome-intervals-proximity-operations

Performs proximity operations on genomic intervals with bedtools (closest, window, flank, slop) and pybedtools - nearest-feature queries with signed/strand-aware distance, fixed-radius window searches, strand-aware promoter construction, and interval extension. Covers the closest -d/-D a/b/ref/-t/-k/-io/-iu/-id flags, the -D ref strand sign-flip, silent chromosome-end clipping in slop/flank, -t all tie double-counting, and the critical distinction between a geometry answer (nearest TSS) and a biology answer (which gene an element regulates). Use when assigning peaks or variants to genes, defining promoters from a gene model, building distance-to-TSS distributions, finding features within a window, or extending intervals - and when deciding whether nearest-gene is a fair prior (GWAS locus) or a trap (distal enhancer).

tools840

bio-genome-intervals-bigwig-tracks

Reads, queries, and writes bigWig indexed binary signal tracks (coverage, fold-change, conservation, methylation-rate) with pyBigWig (Python) and the UCSC Kent tools (bedGraphToBigWig, bigWigToBedGraph, bigWigInfo, bigWigSummary, bigWigAverageOverBed) and deepTools (multiBigwigSummary, computeMatrix, bigwigCompare). Covers the central trap that a wide query returns a precomputed zoom-level summary (by default the mean, which annihilates narrow peaks) not per-base data, when exact=True/values() is mandatory, the NaN-not-zero gap-handling fork, choosing mean vs max vs sum vs coverage by biological question, and the sorted-bedGraph plus chrom.sizes build requirement. Use when extracting signal at regions, computing mean signal per gene/peak, building a browser track from bedGraph, comparing tracks, or building TSS/gene-body metaprofiles.

tools840

bio-genome-intervals-coverage-analysis

Computes and interprets sequencing read depth and coverage over a genome, windows, or target regions with mosdepth (windowed depth, cumulative distribution, --quantize callable BEDs), bedtools genomecov/coverage (bedGraph tracks, per-target stats), samtools depth/coverage (per-base depth, per-contig depth+breadth). Covers the breadth-vs-mean distinction, the cumulative-coverage curve, evenness (CV/Fano/fold-80/Gini), what each tool silently counts (duplicates, secondary/supplementary, MAPQ, read span vs fragment, mate-overlap), the samtools-depth 8000-cap version trap, and the bedtools coverage -a/-b orientation flip. Use when assessing sequencing adequacy, building coverage tracks, computing breadth at a depth threshold, defining callable regions, or QCing target-capture uniformity.
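The breadth-vs-mean distinction in the entry above can be shown with two toy per-base depth profiles that share the same mean. This is a minimal sketch on hypothetical in-memory lists; in practice the depths would come from mosdepth, bedtools, or samtools output, and the function names here are illustrative only.

```python
def mean_depth(depths):
    """Average depth across all positions."""
    return sum(depths) / len(depths)


def breadth_at(depths, min_depth):
    """Fraction of positions covered at or above min_depth."""
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Two hypothetical 10-base targets with the same mean depth.
even_profile = [30] * 10            # every base at 30x
skewed_profile = [300] + [0] * 9    # one pile-up, nine uncovered bases

for name, profile in [("even", even_profile), ("skewed", skewed_profile)]:
    print(name, mean_depth(profile), breadth_at(profile, 10))
# even 30.0 1.0
# skewed 30.0 0.1
```

Both profiles report a 30x mean, but only the even one is usable at a 10x threshold. This is why breadth at a depth threshold is reported alongside the mean.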

tools840

bio-genome-engineering-off-target-prediction

Nominates and assesses CRISPR off-target sites genome-wide. Enumerates candidate sites by mismatch and bulge tolerance with Cas-OFFinder/CRISPRitz, ranks them with the published CFD score (SpCas9-only, relative ranker) or MIT/CRISTA/energy models, runs variant-aware screening against gnomAD/individual genomes (CRISPRme), and frames the empirical genome-wide discovery assays (GUIDE-seq, CIRCLE-seq, CHANGE-seq, DISCOVER-seq, Digenome-seq) and high-fidelity nuclease choice (HiFi Cas9, Sniper-Cas9, eSpCas9, SpCas9-HF1). Use when assessing guide RNA specificity, choosing among candidate guides, screening a therapeutic guide against population variation, or planning empirical off-target validation. Distinguishes predicted vs detected vs validated. On-target activity scoring and deaminase (Cas-independent) base/prime-editor off-targets are separate skills.

testing840

bio-genome-intervals-bed-file-basics

Handles BED-format genomic intervals (BED3 through BED12, narrowPeak/broadPeak) and the coordinate-system substrate the whole interval category rests on, with bedtools (CLI) and pybedtools/pyranges/pandas (Python). Covers the 0-based half-open vs 1-based-closed convention boundary and the start-1/end-unchanged conversion, the silent failures (chrom-name mismatch, CRLF, lexicographic-vs-version sort under -sorted), genome/chrom.sizes generation, sorting contracts, BED12 block invariants, validation, makewindows, cross-assembly liftover (liftOver/CrossMap), and BED<->VCF/BAM/FASTA conversion. Use when reading, creating, validating, sorting, lifting between genome builds, or converting interval files, preparing inputs for bedtools/tabix/bigBed, or debugging an off-by-one or empty-overlap result.
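The start-1/end-unchanged conversion named in the entry above is small enough to sketch directly. The helper names below are hypothetical, and the example coordinates are made up. Zero-length BED intervals are rejected because they have no 1-based closed equivalent.

```python
def one_based_closed_to_bed(start, end):
    """Convert 1-based closed (GTF/VCF-style) to 0-based half-open (BED)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end  # only the start shifts; the end is unchanged


def bed_to_one_based_closed(start, end):
    """Convert 0-based half-open (BED) back to 1-based closed."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# Hypothetical exon annotated as 100..200 in 1-based closed coordinates.
bed_start, bed_end = one_based_closed_to_bed(100, 200)
print(bed_start, bed_end)     # 99 200
print(bed_end - bed_start)    # 101 bases, same as 200 - 100 + 1
print(bed_to_one_based_closed(bed_start, bed_end))  # (100, 200)
```

The interval length is `end - start` in BED and `end - start + 1` in 1-based closed coordinates. Comparing the two is a quick check when an off-by-one is suspected.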

tools840

bio-hi-c-analysis-hic-visualization

Renders Hi-C contact matrices honestly and reproducibly with matplotlib, cooltools, HiCExplorer, pyGenomeTracks, FAN-C, CoolBox, and plotgardener. Covers the raw/ICE-balanced/observed-over-expected transform choice, LogNorm vs symmetric-diverging colormaps with vmax/percentile clipping, resolution-to-feature matching (compartments 100-500kb, TADs 10-40kb, loops 5-10kb), square vs rotated-triangle track-stacking, NaN/white-stripe handling, virtual 4C, APA/saddle/on-diagonal pileups, two-condition side-by-side and log2-ratio maps, and interactive (HiGlass) vs scripted-static publication figures. Use when plotting a contact matrix, choosing a normalization or color scale, building a multi-track Hi-C figure, making a virtual 4C profile, piling up loops/boundaries, or comparing two conditions.

tools840

bio-genome-intervals-bedgraph-handling

Generates, normalizes, and converts bedGraph signal tracks (4-column chrom/start/end/value, 0-based half-open) with bedtools genomecov, deepTools bamCoverage/bamCompare/bigwigCompare, bedtools unionbedg, and UCSC bedGraphToBigWig. Covers why a raw coverage bedGraph is not comparable across samples until normalized, the CPM/RPKM/BPM/RPGC normalization menu and the conserved-total assumption that makes them wrong under a global perturbation, the strict sorted-non-overlapping-chrom.sizes bedGraphToBigWig contract that silently corrupts a bigWig, effective-genome-size selection, and bin-size aliasing. Use when building or normalizing a coverage/signal track from a BAM, comparing tracks across samples or conditions, converting bedGraph to a browser-ready bigWig, or diagnosing a track that looks plausible but reports wrong heights.

tools840

bio-hi-c-analysis-contact-pairs

Turns Hi-C/Micro-C FASTQ into a deduplicated, filtered .pairs file with pairtools and decides whether the library worked. Covers the bwa mem -SP5M / bwa-mem2 / chromap --preset hic alignment idiom (mates mapped as independent single-end reads), pairtools parse vs parse2 and the walks-policy choice (5unique pairwise vs all for Pore-C/Micro-C concatemers), pair-type classification (keep UU and rescued UC), dedup (PCR vs optical/by-tile), select by pair_type/MAPQ/distance, restriction-fragment handling (restrict, Arima dual-enzyme, Micro-C/DNase fragment-free), and allele-specific phasing (pairtools phase to two coolers). The library-QC decision uses % long-range cis as the one-number quality metric, trans as the noise floor, orientation balance as fragment-map-free dangling-end/self-circle QC, and % duplicates as a complexity proxy. Use when processing Hi-C/Micro-C/Omni-C reads into pairs, judging library quality, handling multi-enzyme or restriction-agnostic protocols, or generating allele-specific contacts.

tools840

bio-genome-intervals-interval-arithmetic

Performs set operations on genomic intervals - intersect (-wa/-wb/-wo/-wao/-loj/-c/-v/-u), subtract (-A), merge (-d, -c/-o), complement, cluster, multiinter, unionbedg, map, and groupby - with bedtools (CLI) and pybedtools/pyranges/bioframe (Python). Covers the sorted-input contract and the -sorted chromosome-order footgun, reciprocal/fractional overlap (-f/-F/-r/-e) and the A-vs-B asymmetry, -split for spliced/BED12/BAM features, and jaccard/fisher as mechanics only. Use when finding overlapping or unique regions between BED/peak/feature files, building consensus peaksets, removing blacklisted regions, transferring annotation values onto intervals, or computing interval-set similarity; route overlap-significance testing to overlap-significance.

tools840

bio-genome-intervals-gtf-gff-handling

Parses, queries, converts, and extracts from GTF and GFF3 gene-model annotation files - walking the gene/transcript/exon/CDS hierarchy with gffutils (queryable SQLite DB), converting formats and extracting transcript/CDS/protein FASTA with gffread, slurping to dataframes with gtfparse/pyranges, and sanitizing malformed files with AGAT. Covers the 1-based-inclusive vs 0-based BED coordinate conversion (start-1 only), deriving implicit features (introns/UTRs/TSS), phase-not-frame, the stop-codon-in-or-out-of-CDS convention, and the chr1-vs-1 seqid and gene-ID-version mismatches that silently produce all-zero count matrices and dropped joins. Use when extracting features or sequences from an annotation, converting GTF<->GFF3 or GTF->BED, traversing the gene tree, or diagnosing a coordinate/provenance mismatch upstream of counting or DE.

development840

bio-genome-engineering-prime-editing-design

Designs pegRNAs and nicking guides for prime editing (PE) -- choosing the nick/strand, tuning the primer-binding site (PBS) and reverse-transcription template (RTT) as a per-locus panel, selecting the PE system (PE2/PE3/PE3b/PE4/PE5/PEmax/PE7), adding MMR-evading and PAM-disrupting silent edits, appending epegRNA 3' motifs (tevopreQ1/mpknot), and ranking with PRIDICT/DeepPrime. Covers twinPE/PASTE for large insertions and the prime-vs-base-editing decision. Use when designing a scarless point mutation, small insertion/deletion, or any of the 12 base conversions without a double-strand break, when efficiency is low and MMR inhibition or pegRNA stabilization is needed, or when routing a large insertion to an integrase method. Generic guide scoring and base editing are separate skills.

testing840

bio-hi-c-analysis-matrix-operations

Balances Hi-C contact matrices (ICE via cooler.balance_cooler, KR/SCALE/VC context), computes distance-decay expected with cooltools (expected_cis per-diagonal P(s), expected_trans scalar), builds observed/expected (O/E) matrices, and diagnoses polymer state from the P(s) log-derivative. Covers the within-matrix-vs-cross-sample distinction (balancing is NOT a normalizer), the equal-visibility assumption that CNV/aneuploidy violates (use raw counts for copy-number), cis-only balancing, mad_max/blacklist masking before balancing, multiplicative cooler weights vs divisive juicer weights, and the resolution-vs-depth budget. Use when balancing a .cool/.mcool, computing expected or P(s), making O/E matrices for compartments/loops, deciding ICE vs KR vs SCALE, choosing a resolution for a given depth, or troubleshooting NaN/all-NaN balanced matrices; route cross-sample comparison to hic-differential.

tools840

bio-hi-c-analysis-hic-differential

Compares Hi-C contact maps between conditions across the right scale -- differential bin-pair contacts (multiHiCcompare, diffHic), differential A/B compartments (dcHiC), differential TAD boundaries (delta insulation), and differential loops (diffloop, DiffHiChIP) -- with distance-stratified between-sample normalization, replicate-aware NB-GLM FDR, HiCRep SCC reproducibility gating, and CNV correction for cancer/aneuploid samples. Use when comparing Hi-C between treatment and control, finding differential contacts/compartments/boundaries/loops, normalizing two maps of unequal depth, choosing a replicate-aware test, gating replicates with SCC, or correcting copy-number artifacts before a tumor-vs-normal comparison.

testing840

bio-hi-c-analysis-compartment-analysis

Detects A/B chromatin compartments from balanced Hi-C contact matrices via eigenvector decomposition of the distance-normalized, Pearson-correlated cis matrix with cooltools (eigs_cis), then orients (phases) the compartment eigenvector against a GC or gene-density track so the active (A) sign is not arbitrary. Covers the eigenvector-is-a-choice problem (per-arm view_df to remove the centromere gradient; picking the eigenvector by max correlation with activity, not by eigenvalue), GC phasing with bioframe.frac_gc, resolution choice (100kb-1Mb), saddle plots and saddle_strength for compartmentalization strength, the cohesin-loss-strengthens-compartments result, subcompartments (SNIPER/Calder/dcHiC), and cross-condition compartment switching. Use when calling A/B compartments, computing E1/eigenvectors, phasing the eigenvector, building saddle plots, choosing a compartment resolution, quantifying compartment strength, or comparing compartmentalization across conditions.

tools840

bio-hi-c-analysis-hichip-plac-loops

Calls significant loops from protein-directed and targeted 3C assays (HiChIP, PLAC-seq, Capture Hi-C/PCHi-C, ChIA-PET) where the contact background is peak-anchored and coverage-biased, so generic Hi-C loop callers (cooltools dots, Juicer HiCCUPS) use the wrong null. Covers FitHiChIP (config-driven coverage+distance-decay spline regression, peak-to-peak vs peak-to-all foreground, loose vs stringent background, coverage vs ICE bias), MAPS (positive Poisson regression on bias factors for PLAC-seq/HiChIP), hichipper (restriction-site-distance bias model + library QC), CHiCAGO (Delaporte two-component Brownian+technical background for asymmetric bait x other-end Capture Hi-C), the with/without separate-ChIP anchor decision, and differential loops via diffloop. Use when calling loops from HiChIP/PLAC-seq/Capture Hi-C, choosing FitHiChIP/MAPS/CHiCAGO, picking peak-to-all vs peak-to-peak, setting the loop FDR, supplying ChIP peaks as anchors, QCing a HiChIP library, or comparing loops between conditions.

tools840

bio-differential-expression-batch-correction

Handles batch effects in bulk RNA-seq via design-matrix inclusion (the correct path for DE), ComBat/ComBat-seq for visualization, SVA for unknown latent factors, RUVSeq for negative-control-gene-anchored unwanted variation, and limma::removeBatchEffect for plotting only. Encodes the Nygaard 2016 cardinal sin against testing on a batch-corrected matrix, the choice between SVA/RUVg/RUVs/RUVr, the confounding non-identifiability problem, the single-cell boundary (Harmony/MNN are NOT for bulk), and the Goh 2017 harmonization critique. Use when designing a DE analysis with batch structure, troubleshooting batch-dominated PCA, choosing ComBat vs ComBat-seq, handling unknown batch via SVA, integrating across studies, or deciding when (rarely) to subtract batch.

development817

bio-uniprot-access

Query UniProt's REST API (post-2022 endpoint at rest.uniprot.org) for protein sequences, annotations, GO terms, cross-references, ID mappings, and proteomes. Use when fetching UniProtKB entries, navigating the JSON schema, choosing between UniProtKB/UniRef/UniParc/Proteomes resources, deciding stream vs search endpoint for batch retrieval, running ID-mapping jobs with the async pattern, handling isoform suffixes, or filtering reviewed Swiss-Prot vs auto-annotated TrEMBL. Encodes the legacy URL migration (2022), the new JSON schema layout, and bulk-pull patterns.

development817

bio-experimental-design-sample-size

Estimates the minimum biological replicates (or cells/events) for a target power at a target FDR in genomics experiments using ssizeRNA, PROPER, powsimR for scRNA-seq, and pilot-data dispersion estimation from DESeq2/edgeR. Covers the biological-versus-technical replication distinction (technical replicates do not add degrees of freedom for biological inference), replicate-number-versus-sequencing-depth budgeting, scRNA-seq sample-versus-cell allocation under a pseudobulk model, and the critique that "n=3" is a publication convention rather than a power calculation. Use when budgeting a sequencing experiment, writing the sample-size justification in a grant, estimating replicates from pilot data, allocating a fixed budget between samples and depth, or planning scRNA-seq cohort size. For clinical-trial sample size see clinical-biostatistics/power-and-sample-size; for the power-given-n direction see experimental-design/power-analysis.

tools817

bio-local-blast

Build local BLAST databases and run searches using NCBI BLAST+ command-line tools. Use when running >50 queries, building custom databases with -parse_seqids and -taxid, downloading prebuilt NCBI databases via update_blastdb.pl, choosing -task variants (megablast/dc-megablast/blastn/blastn-short), tuning soft/hard masking, scaling threads, or extracting hits with blastdbcmd. Encodes BLAST v5 vs v4 database format, taxonomy filtering, makeblastdb pitfalls.

tools817

bio-flow-cytometry-clustering-phenotyping

Unsupervised clustering and cell-type identification for high-dimensional flow, spectral, and mass cytometry - FlowSOM, PhenoGraph, FlowSOM-via-CATALYST, with UMAP/tSNE for visualization. Covers the type-vs-state marker distinction (cluster on lineage, test state within clusters), over-provision-then-metacluster, the Weber-Robinson benchmark, seed dependence and metacluster stability, why embeddings are for looking not measuring, and median-heatmap annotation/merging. Use when discovering populations without predefined gates, choosing a clustering algorithm, selecting the number of metaclusters, or annotating clusters into cell types.

development817

bio-flow-cytometry-differential-analysis

Differential abundance (DA) and differential state (DS) analysis for flow and mass cytometry - tests which cell populations change in frequency or marker expression between conditions using diffcyt (edgeR/voom/GLMM for DA, limma/LMM for DS), with cydar, CITRUS, and compositional methods (sccomp, scCODA, DCATS) as alternatives. Covers the sample-is-the-experimental-unit principle, design/contrast and mixed-model formulas, compositionality of cluster proportions, and FDR across clusters. Use when comparing populations between groups, choosing a DA method, handling paired/batch designs, or deciding whether compositional correction is needed.

testing817

bio-flow-cytometry-gating-analysis

Defines cell populations in flow and spectral cytometry through manual gates (rectangle, polygon, quadrant, boolean) and reproducible automated gating (openCyto gating templates, flowDensity data-driven thresholds, flowClust model-based gates), organized as a hierarchical GatingSet (flowWorkspace) and round-tripped with FlowJo via CytoML. Covers the canonical gate order (time -> debris -> singlets -> live -> lineage), FMO-vs-isotype boundary setting, gate-order dependence and recompute semantics, rare-event/MRD gating, and per-population statistics. Use when building a gating strategy, automating a manual FlowJo scheme across samples, choosing manual vs data-driven gates, or extracting population frequencies.

development817

bio-flow-cytometry-fcs-handling

Reads, inspects, and writes Flow Cytometry Standard (FCS) files from conventional, spectral, and mass cytometry (CyTOF), and parses FlowJo/Cytobank/Diva workspaces. Covers FCS 2.0/3.0/3.1/3.2 internals ($PnE linear-vs-log, $DATATYPE, $SPILLOVER vs SPILL vs $COMP, $TIMESTEP), channel/parameter metadata, the silent linearize/truncate defaults, and R (flowCore, flowWorkspace, CytoML) plus Python (FlowKit, readfcs) readers. Use when loading flow or mass cytometry data, mapping detector channels to antibodies, extracting the event matrix, choosing a reader, or bridging FCS to the scanpy/AnnData ecosystem before preprocessing.

development817

bio-gene-regulatory-networks-multiomics-grn

Build enhancer-driven gene regulatory networks (eGRNs) by integrating single-cell RNA-seq and ATAC-seq using SCENIC+, CellOracle base GRNs, Pando, FigR, DIRECT-NET, TRIPOD, and scMEGA. Covers the accessibility-defines-enhancers principle, peak-to-gene linking and its cell-composition confound, the paired-vs-unpaired decision, and TF-region-gene eRegulon triplets. Use when analyzing 10x multiome or paired/unpaired scRNA+scATAC to infer cis-regulatory GRNs. For RNA-only regulons see scenic-regulons; for in silico TF perturbation see perturbation-simulation.

development817

bio-genome-annotation-functional-annotation

Assigns GO terms, Pfam/InterPro domains, KEGG orthologs, EC numbers, and product names to predicted proteins using eggNOG-mapper (orthology), InterProScan (domain signatures), and KofamScan (KEGG), routing specialized functions to dbCAN/antiSMASH/AMRFinderPlus/SignalP. Covers the orthology-vs-domain-vs-homology paradigms, the annotation-error percolation cascade, domain-presence-is-not-function, GO IEA circularity in enrichment, evidence tiering, and bit-score/coverage thresholds. Use when adding functional annotation to predicted genes, choosing between eggNOG-mapper and InterProScan, or judging how much to trust a functional label.

development817

bio-genome-annotation-ncrna-annotation

Identifies non-coding RNAs (tRNA, rRNA, snoRNA, snRNA, riboswitches, sRNAs) using Infernal covariance-model search against Rfam, tRNAscan-SE 2.0 for tRNA, barrnap for rRNA, and ARAGORN for tmRNA, plus the small-RNA-seq boundary for miRNA and the transcript-assembly boundary for lncRNA. Covers the structure-conserved-not-sequence-conserved principle (why BLAST fails), GA-threshold and clan-competition correctness, tRNAscan-SE domain modes and pseudogene flags, rDNA copy-number collapse, and why homology annotation is a recall floor. Use when performing genome-wide ncRNA annotation, choosing the right tool for an RNA class, or interpreting ncRNA counts.

tools817

bio-genome-assembly-assembly-qc

Evaluates genome assembly quality across the three orthogonal axes - contiguity (QUAST auN/NG50/NGx, not bare N50), completeness (BUSCO/compleasm gene-space plus Merqury k-mer completeness), and correctness (reference-free Merqury QV, Inspector/CRAQ structural errors, asmgene false-duplication/collapse). Covers why N50 is the most-gamed metric, why QV measured on the polishing reads is circular, distinguishing uncollapsed haplotigs from real WGD, and the EBP/VGP 6.C.Q40 standard. Use when judging whether an assembly is good enough to annotate or publish, comparing assemblers, diagnosing a fragmented or duplicated assembly, or assessing a phased diploid assembly.

development817

bio-genome-assembly-metagenome-assembly

Assembles microbial-community sequencing into metagenome-assembled genomes (MAGs) with metaFlye (ONT), metaSPAdes/MEGAHIT (Illumina), and hifiasm-meta/metaMDBG (PacBio HiFi), then recovers genomes via multi-binner consolidation (MetaBAT2, MaxBin2, CONCOCT, SemiBin2, VAMB -> DAS_Tool) and QCs them against MIMAG with CheckM2, GUNC, and GTDB-Tk. Covers why a metagenome is not a genome (uneven coverage, micro-diversity, strain collapse to consensus), differential-coverage binning, co-assembly vs per-sample, the rRNA-operon collapse that fails short-read MAGs, and strain resolution with inStrain. Use when reconstructing genomes from a microbiome, soil, ocean, or gut community, recovering MAGs, or resolving strain-level variation.

tools817

bio-genome-assembly-short-read-assembly

Assembles a genome de novo from Illumina short reads with SPAdes (isolate/careful/sc/meta/plasmid/rna modes), MEGAHIT (low-memory, huge datasets), Unicycler (bacterial finishing/hybrid), MaSuRCA (large hybrid), ABySS (Bloom-filter), and Platanus (heterozygous diploids), using multi-k de Bruijn graphs. Covers the repeat-resolution limit, why N50 plateaus at a ceiling set by the genome's repeat structure rather than by sequencing depth, GenomeScope2 k-mer profiling first, the heterozygosity/haplotig trap, error-correction erasing rare alleles, GC dropout, and NG50/auN/BUSCO reporting. Use when assembling a bacterial isolate, fungal, small-eukaryotic, single-cell, or metagenome genome from Illumina reads, or when deciding whether short reads can even produce the assembly being asked for.

development817

bio-crispr-screens-in-vivo-screens

Designs and analyzes in vivo CRISPR screens in animal tumor models, organoids, and immune-cell adoptive transfers. Covers bottleneck math (250x cells/sgRNA requires ~25M cells implanted; impossible for most syngeneic models, forcing focused libraries), focused library design (Manguso 2017 Nature 547:413 immune screen; Chen 2015 tumor screens), CRISPR-StAR intrinsic-control screening (Uijttewaal 2025 Nat Biotechnol 43:1848), clonal-dynamics-limited detection, tumor-explant DNA recovery, syngeneic vs xenograft vs PDX considerations, and the relationship to downstream MAGeCK / drugZ analysis. Use when designing in vivo CRISPR screens for tumor / immune / metastasis biology, choosing focused vs genome-wide for animal models, addressing bottleneck-induced clonal collapse, picking the syngeneic / xenograft / PDX model, integrating in vivo with in vitro results, or applying CRISPR-StAR for animal experiments.

development817
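The bottleneck arithmetic in the in vivo screens entry above can be reproduced in a few lines. This is a minimal sketch: the ~100,000-guide library size is an assumption inferred from the quoted numbers (25M cells at 250x), and the function names and the focused-library figures are hypothetical.

```python
def cells_required(n_sgrnas: int, coverage: int = 250) -> int:
    """Cells that must survive the bottleneck to keep `coverage` cells per sgRNA."""
    return n_sgrnas * coverage


def achieved_coverage(surviving_cells: int, n_sgrnas: int) -> float:
    """Average cells per sgRNA actually represented after the bottleneck."""
    return surviving_cells / n_sgrnas


# Assumed genome-wide library of ~100,000 guides at 250x coverage.
genome_wide_cells = cells_required(100_000, coverage=250)

# Hypothetical focused library: 2,000 guides with 1,000,000 surviving cells.
focused_coverage = achieved_coverage(1_000_000, 2_000)
```

Under these assumptions the genome-wide library needs 25 million cells, while the hypothetical focused library reaches 500x with one million, which is the reasoning behind focused libraries in animal models.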

bio-crispr-screens-screen-qc

Quality control for pooled CRISPR screens covering library representation, Gini index, log-skew, replicate Pearson and Spearman concordance, essentialome precision-recall AUC against CEGv2 (Hart 2017), Cas9 cut-toxicity diagnostics, copy-number amplicon detection (Aguirre 2016 / Munoz 2016), bottleneck propagation through plasmid pool, infection, selection, and endpoint stages, MOI verification, and DepMap-style screen-quality scoring. Use when assessing screen quality before hit calling, deciding whether to repeat or rescue a screen, diagnosing low-confidence hits, choosing between MAGeCK / BAGEL2 / Chronos based on quality grade, picking a normalization strategy from QC signatures, or evaluating whether an in-vivo screen retained adequate library complexity.

development817
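For the Gini index mentioned in the screen QC entry above, a minimal NumPy sketch follows. The function name `gini_index` is hypothetical, and whether to compute it on raw or log-transformed counts is left as an option because conventions differ between tools; check the convention of whichever pipeline the value will be compared against.

```python
import numpy as np


def gini_index(counts, log_transform: bool = False) -> float:
    """Gini index of sgRNA read counts (0 = perfectly even, near 1 = highly skewed).

    Hypothetical helper; `log_transform=True` applies log(count + 1) first.
    """
    x = np.asarray(counts, dtype=float)
    if log_transform:
        x = np.log1p(x)
    x = np.sort(x)
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0  # undefined for empty or all-zero input; report 0 by convention here
    ranks = np.arange(1, n + 1)
    # Standard rank-based formula for sorted values.
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


even_library = gini_index([10, 10, 10, 10])
skewed_library = gini_index([0, 0, 0, 40])
```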

bio-crispr-screens-bagel-essentiality

Identifies essential genes from CRISPR-Cas9 fitness screens using BAGEL2 (Kim & Hart 2021 Genome Med), a Bayesian classifier scoring per-gene Bayes Factors via log-likelihood ratios over per-sgRNA fold changes, calibrated against CEGv2 core-essentials (Hart 2017 G3, ~684 genes) and NEGv1 non-essentials (Hart 2014, ~927 genes). Covers the fc + bf + pr workflow, the linear-extrapolation improvement over BAGEL1 truncation, multi-target off-target correction, tumor-suppressor sensitivity (BAGEL2 detects enrichment), and BF-to-FDR calibration (BF >6 ≈ FDR 0.05 from Hart 2017). Use when classifying essential vs non-essential genes, calibrating BAGEL2 thresholds against PR curves, identifying tumor suppressors alongside essentials, comparing BAGEL2 hits to MAGeCK / drugZ, or generating publication-quality essentiality calls.

testing817

bio-data-visualization-volcano-and-ma-plots

Build volcano and MA plots from differential-expression / association results with LFC shrinkage, FDR-adjusted thresholds, sensible label placement, and axis-truncation conventions. Covers EnhancedVolcano, ggplot2, matplotlib, and the apeglm/ashr/normal shrinkage decision. Use when visualizing differential-expression results (RNA-seq, ChIP-seq, ATAC-seq, proteomics) or any per-feature effect-size + p-value table.

development817

bio-differential-expression-de-results

Extracts, filters, annotates, and exports differential expression results from DESeq2 or edgeR with proper handling of padj=NA (independent filtering, Cook's outliers, all-zero), multiple-testing correction choice (BH vs Storey q-value vs IHW vs lfsr), TREAT vs post-hoc fold-change filtering, p-value histogram diagnostics, gene annotation via org.db/biomaRt/mygene, GSEA preranked input, ORA background construction, replication reality (Schurch 2016 small-n result), and SABV/sex-stratified reporting. Use when extracting and interpreting DE results, troubleshooting padj=NA, choosing FDR method, preparing ranked lists for pathway analysis, annotating gene IDs, or comparing DESeq2 vs edgeR outputs.

testing817

bio-experimental-design-power-analysis

Calculates statistical power for high-dimensional genomics experiments (bulk RNA-seq, scRNA-seq, ATAC-seq, ChIP-seq, methylation, proteomics) under negative-binomial count models using RNASeqPower, PROPER, and simulation via powsimR, distinguishing per-gene from marginal (transcriptome-wide) power, the role of mean expression and dispersion, and the sequencing-depth-versus-replicate tradeoff. Covers simulation as the honest default for overdispersed counts, FDR-aware average power versus single-test power, observed/post-hoc power as an anti-pattern, and the winner's-curse / Type-S / Type-M consequences of underpowering. Use when planning replicate number for a sequencing experiment, deciding whether to add depth or samples, choosing closed-form versus simulation power, estimating power from pilot dispersions, or justifying replication in a grant. For clinical-trial power see clinical-biostatistics/power-and-sample-size; for the inverse sample-size question see experimental-design/sample-size.

tools817

bio-genome-annotation-annotation-qc

Assesses the quality and completeness of a genome annotation with BUSCO (conserved single-copy ortholog recovery), OMArk (proteome completeness, consistency, and contamination), CheckM2 (prokaryotic completeness/contamination), and a gene-set sanity panel (gene count, mono-exonic fraction, protein-length distribution, mRNA:gene ratio, coding density). Covers the assembly-BUSCO-vs-proteome-BUSCO diagnostic, what BUSCO-Duplicated really means, why gene count is a vanity metric, and the QC of transferred annotations. Use when judging whether an annotation is good enough to publish or submit, diagnosing a suspect annotation, or comparing annotation completeness across pipelines.

development817

bio-remote-homology

Detect distant homologs using profile and structure-aware methods that go beyond standard BLAST. Use when sequence identity falls into the twilight zone (<35% pairwise), when BLAST fails to find homologs that should exist, when working at metagenomic scale (DIAMOND, MMseqs2), or when structure beats sequence (Foldseek). Covers PSI-BLAST (iterative PSSM), jackhmmer (iterative HMM), HHblits/HHsearch (profile-profile), DIAMOND, MMseqs2, and Foldseek (3Di structural alphabet, van Kempen 2024).

development817

bio-ortholog-inference

Pull pre-computed ortholog calls from public databases (OrthoDB, Ensembl Compara, OMA browser, eggNOG, PANTHER, KEGG Orthology, HomoloGene) via their REST APIs. Use when orthologs are already curated upstream, when the question is "what is the X ortholog of Y" rather than "how to infer orthology de novo", when batch-mapping gene IDs across species, or when comparing the resources for consensus calls. Encodes confidence-level semantics, 1:1 vs 1:many vs many:many, HomoloGene deprecation, and when to switch to de novo computation.

development817

bio-data-visualization-flow-and-transition-plots

Build Sankey, alluvial, river, and CONSORT-style flow diagrams to visualize cohort transitions, cell-state changes, or pipeline filtering using ggalluvial, networkD3, plotly, and consort. Use when showing how entities move between categories across timepoints (cell states, drug response classes, patient flow through a trial) or filtering pipelines (variants filtered through QC stages).

development817

bio-data-visualization-interactive-visualization

Build interactive HTML/web visualizations with plotly (Python/R), bokeh (Python), and gganimate/plotly frames for animation, with awareness of the current Kaleido static-export model (post-orca-EOL), HTML file-size bloat, and the limits of interactive-only output for journal submission. Use when producing zoomable/hoverable plots for notebook EDA, supplementary HTML, dashboards, or animated time-course / iteration visualizations.

development817

bio-data-visualization-forest-funnel-plots

Build forest plots (HR, OR, RR, beta-coefficient summaries with CIs) and funnel plots (meta-analysis publication-bias diagnostics) using forestplot, metafor, ggforest, and MendelianRandomization with proper axis-scaling, summary-diamond placement, subgroup nesting, and Egger / trim-and-fill asymmetry tests. Use when summarizing effects across subgroups, trials, or instruments — meta-analysis, Mendelian randomization, subgroup HRs.

development817

bio-differential-expression-deseq2-basics

Performs differential expression on bulk RNA-seq count data with DESeq2's negative-binomial GLM, Wald and LRT testing, apeglm/ashr/normal LFC shrinkage, independent filtering, Cook's outlier handling, VST/rlog transforms, and design formulas including paired, batch, and interaction terms. Use when running bulk DE, choosing DESeq2 over edgeR or limma-voom, building a paired or interaction design, applying LFC shrinkage for ranking or GSEA, choosing Wald vs LRT, troubleshooting padj=NA, picking VST vs rlog, importing salmon/kallisto via tximport, or analyzing prokaryotic RNA-seq.

development817

bio-copy-number-subclonal-copy-number

Resolve subclonal copy number, whole-genome doubling, and copy-number tumor evolution from bulk sequencing with Battenberg, TITAN, and MEDICC2. Covers clonal versus subclonal copy-number states, haplotype phasing for subclonal resolution, cancer cell fraction, whole-genome-doubling detection and timing relative to mutations, mirrored subclonal allelic imbalance, and copy-number phylogenies. Use when a tumor is heterogeneous and bulk data shows non-integer copy number, when calling subclonal CNAs, detecting or timing whole-genome doubling, reconstructing copy-number evolution, or deciding between Battenberg and TITAN.

testing817

bio-copy-number-hrd-scoring

Quantify homologous recombination deficiency (HRD) from tumor copy number using the three genomic-scar metrics — loss of heterozygosity (LOH), large-scale state transitions (LST), and telomeric allelic imbalance (TAI) — with scarHRD, and via the whole-genome HRDetect and CHORD models. Covers the genomic instability score, the PARP-inhibitor clinical context, whole-genome-doubling correction, and the scar-versus-state distinction. Use when computing an HRD score for PARP-inhibitor eligibility, deriving LOH/LST/TAI scars from allele-specific copy number, deciding between scar-based and mutational-signature HRD methods, or interpreting an HRD result in a BRCA-reverted or low-purity tumor.

tools817

bio-copy-number-cnvkit-analysis

Detect somatic and germline copy number variants from targeted, exome, and whole-genome sequencing with CNVkit, a read-depth caller that combines on-target and off-target (antitarget) coverage. Covers panel-of-normals construction, flat-reference tumor-only calling, hybrid/amplicon/WGS modes, CBS vs HMM segmentation selection, purity-aware integer calling, and reconciliation against GATK and allele-specific callers. Use when calling CNVs from hybrid-capture panels or exomes, deciding whether CNVkit (depth-only) is the right tool versus an allele-specific caller, building a panel of normals, diagnosing flat-reference false positives, or interpreting log2 ratios into copy-number states.

tools817

bio-copy-number-focal-amplification-ecdna

Resolve the architecture of focal oncogene amplifications — extrachromosomal DNA (ecDNA), breakage-fusion-bridge (BFB) cycles, homogeneously staining regions (HSR), and linear amplification — from whole-genome sequencing with AmpliconArchitect, the AmpliconSuite pipeline, and AmpliconClassifier. Covers copy-number seed selection, breakpoint-graph reconstruction, balanced-flow optimization, ecDNA classification, and the limits of depth-only amplification calls. Use when a focal amplification needs structural characterization, when distinguishing ecDNA from chromosomal amplification, suspecting ecDNA-driven oncogene amplification or therapy resistance, or selecting copy-number seeds for amplicon reconstruction.

testing817

bio-copy-number-copy-ratio-segmentation

Normalize read-depth copy-ratio profiles and segment them into copy-number regions using circular binary segmentation (CBS, DNAcopy), hidden Markov models, HaarSeg, and fused-lasso methods. Covers GC-content, mappability, and replication-timing (wave-artifact) bias correction, panel-of-normals/PCA denoising, diploid-baseline centering, and algorithm selection by sequencing depth and event size. Use when choosing a segmentation algorithm, correcting depth bias, diagnosing oversegmentation or a mis-centered baseline, tuning CBS or HMM parameters, or understanding why a downstream CNV caller produced fragmented or shifted segments.

development817

bio-copy-number-cnv-visualization

Visualize copy number profiles, segments, allele-specific tracks, and cohort patterns from CNVkit, GATK, ASCAT, FACETS, Sequenza, and other callers. Covers genome-wide and per-chromosome log2 scatter plots, B-allele-frequency/minor-allele-fraction tracks, ideograms, cohort heatmaps, circos views, and caller-native plots. Use when creating publication CNV figures, choosing which plot answers a given question, diagnosing a wrong diploid baseline visually, displaying loss of heterozygosity, or deciding what depth-only plots cannot reveal.

development817

bio-copy-number-germline-cnv-interpretation

Classify constitutional (germline) copy number variants for clinical reporting using the 2019 ACMG/ClinGen technical standards points-based framework, with ClassifyCNV and AnnotSV for semi-automated scoring. Covers the separate copy-number-loss and copy-number-gain rubrics, the five-tier classification, ClinGen haploinsufficiency/triplosensitivity and dosage-sensitive regions, de novo and segregation evidence, and population-frequency benign evidence. Use when assigning pathogenic/likely-pathogenic/VUS/likely-benign/benign to a constitutional CNV, scoring a CNV against ACMG/ClinGen criteria, or distinguishing the automatable evidence from the case-specific evidence requiring manual input.

tools817

bio-genome-assembly-assembly-polishing

Decides whether and how to polish a draft genome assembly to raise consensus accuracy (QV) with read-type-matched tools - Racon and medaka (ONT consensus), dorado polish, Polypolish and pypolca (Illumina, repeat-aware), Pilon (legacy short-read), NextPolish/NextPolish2, Hapo-G (haplotype-aware), ntEdit, and DeepPolisher/PEPPER-Margin-DeepVariant for human. Covers the do-not-polish-HiFi rule, the medaka basecaller-model footgun, held-out Merqury QV as the only honest stop signal, and the haplotype-collapse trap. Use when correcting homopolymer indels or residual SNPs in a long-read assembly, deciding if a HiFi assembly needs polishing, or choosing an ONT vs hybrid vs short-read polishing chain.

tools817

bio-data-visualization-color-palettes

Select colormaps and qualitative palettes for scientific figures using perceptual-uniformity, color-vision-deficiency safety, and luminance-monotonicity criteria. Covers Crameri scientific colormaps, viridis/cividis/magma, Okabe-Ito categorical, ColorBrewer, and the rainbow/jet critique. Use when choosing palettes for heatmaps, scatter, networks, or any encoding where color carries quantitative or categorical meaning.

development817

bio-genome-assembly-contamination-detection

Detects and removes contamination in genome assemblies via two disjoint workflows - foreign-sequence screening of a single-organism (eukaryote/isolate) assembly with NCBI FCS-GX (GenBank-submission-mandatory), FCS-adaptor, and BlobToolKit blob plots; and MAG/bin quality assessment with CheckM2 plus GUNC (chimerism) plus GTDB-Tk taxonomy, judged against MIMAG. Covers why CheckM2 alone is blind to disjoint-marker chimeras, the FCS-GX RAM wall, organelle/NUMT triage, strain heterogeneity, and the HGT-vs-contamination (tardigrade) trap. Use when screening an assembly for foreign contamination before GenBank submission, assessing MAG completeness/contamination/chimerism, deciding which contigs to remove, or distinguishing real HGT from contaminant contigs.

tools817

bio-experimental-design-randomization-blocking

Structures biological experiments so inference is valid by construction, covering Fisher's principles (randomization, replication, local control), the experimental-vs-observational unit distinction and pseudoreplication (Hurlbert 1984; Lazic 2018), randomization mechanics (complete, restricted, stratified, rerandomization, run-order), blocking layouts (randomized complete block, Latin square, incomplete block), factorial designs and interactions, and the split-plot/nested error strata hidden inside multi-batch genomics. Use when deciding the experimental unit and what counts as a replicate, planning randomization and run order, choosing a blocked/factorial/split-plot/nested layout, avoiding pseudoreplication in cell-culture or animal studies, or specifying the random-effects structure of the analysis model. For assigning samples to sequencing batches/lanes/plates and batch-effect correction see experimental-design/batch-design; for regulated clinical-trial randomization see clinical-biostatistics.

tools817

bio-crispr-screens-combinatorial-screens

Designs and analyzes combinatorial CRISPR screens covering paired-Cas9 (Big Papi, Najm 2018), enhanced AsCas12a multiplex (enCas12a, DeWeirdt 2021), in4mer 4-guide-array Cas12a (Esmaeili Anvar N et al 2024 Nat Commun 15:3577) and the Inzolia paralog-pair library, paralog-buffering detection (Dede 2020 Genome Biol; Thompson 2021 Cell Reports 36:109597), genetic-interaction (GI) scoring as observed_double_LFC minus expected_additive_double_LFC, synthetic-lethal and synthetic-rescue interaction interpretation, the half-of-essentiality buffered by paralogs phenomenon, multiplex screen statistical analysis with MAGeCK MLE interaction terms, and the relationship to single-cell combinatorial Perturb-seq. Use when designing a paralog or pathway-pair screen, choosing between paired-Cas9 (Big Papi) and Cas12a multiplex (Inzolia), interpreting genetic interaction scores, identifying synthetic-lethal targets for drug development, or scaling beyond single-gene CRISPR screens.

development817
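The genetic-interaction score in the combinatorial screens entry above is defined as the observed double-knockout LFC minus the expected additive LFC. A minimal sketch of that definition follows; the function name and the example LFC values are hypothetical.

```python
def gi_score(lfc_a: float, lfc_b: float, lfc_ab: float) -> float:
    """Genetic-interaction score under an additive null model.

    lfc_a, lfc_b: single-knockout log fold changes.
    lfc_ab: observed double-knockout log fold change.
    """
    expected_additive = lfc_a + lfc_b
    return lfc_ab - expected_additive


# Hypothetical paralog pair: each single knockout is mild, the double is severe.
buffered_pair = gi_score(lfc_a=-0.2, lfc_b=-0.3, lfc_ab=-2.0)

# Hypothetical non-interacting pair: the double matches the additive expectation.
independent_pair = gi_score(lfc_a=-0.5, lfc_b=-1.0, lfc_ab=-1.5)
```

A strongly negative score, as in the first pair, is the pattern usually read as synthetic lethality or paralog buffering; a score near zero indicates no interaction under this null.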

bio-crispr-screens-base-editing-analysis

Analyzes base-editing screens for variant function. Covers library design (Sanson 2020 GRACE, Hanna 2021 BRCA1/2 SNV scanning, Cuella-Martin 2021), CBE vs ABE chemistry choice (BE3/BE4 vs ABE7.10/ABE8.20/ABE8e), editing-window math (positions 4-8 from PAM-distal end, wider for ABE8e), bystander-edit quantification and the variant-call ambiguity it creates, sgRNA-efficiency filtering before hit calling, indel byproduct interpretation, the substitution-vs-indel diagnostic, variant annotation against ClinVar / COSMIC, and the Broad be-validation-pipeline. Use when designing a BE variant screen, choosing CBE vs ABE for a specific edit, interpreting bystander-confounded hits, distinguishing functional signal from indel artifact, integrating CRISPResso2 output with screen scoring, or deciding BE vs PE for SNV installation.

tools817
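To make the editing-window and bystander ideas in the base-editing entry above concrete, the sketch below lists every editable base that falls inside a window of a protospacer. It assumes position 1 is the PAM-distal end and uses the 4-8 window quoted in the entry; the function name and the example sequence are hypothetical, and real editors have variant-specific windows.

```python
def bases_in_window(protospacer: str, target_base: str = "C", window=(4, 8)):
    """Return 1-based protospacer positions of `target_base` inside the window.

    Position 1 is the PAM-distal end. More than one hit means bystander
    edits are possible alongside the intended edit.
    """
    start, end = window
    seq = protospacer.upper()
    return [
        pos
        for pos, base in enumerate(seq, start=1)
        if start <= pos <= end and base == target_base
    ]


guide = "GACCTCAGGTACTGCATGCA"  # hypothetical 20-nt protospacer
cbe_sites = bases_in_window(guide, target_base="C")  # cytosine base editor
abe_sites = bases_in_window(guide, target_base="A")  # adenine base editor
has_bystanders = len(cbe_sites) > 1
```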

bio-copy-number-allele-specific-copy-number

Infer integer allele-specific copy number, tumor purity, and ploidy from tumor sequencing by jointly modeling read depth (logR) and B-allele frequency (BAF) with ASCAT, Sequenza, FACETS, PURPLE, and PureCN (tumor-only). Covers the purity-ploidy identifiability problem, the diploid-baseline (dipLogR) anchor, major/minor copy number, loss of heterozygosity, sunrise/contour fit diagnostics, and reconciliation of conflicting fits. Use when tumor analysis needs absolute copy number rather than relative log2, when estimating purity and ploidy, calling LOH or copy-neutral LOH, resolving whole-genome doubling, running tumor-only allele-specific calling, or choosing among ASCAT, Sequenza, FACETS, and PureCN.

development817

bio-genome-engineering-base-editing-design

Designs cytosine (CBE, C-to-T) and adenine (ABE, A-to-G) base-editor guides by positioning the target base at the activity-peak of the editing window (protospacer positions ~5-7, PAM-distal numbering), minimizing bystander edits for product purity, reading dinucleotide context (APOBEC1 TC favored / GC disfavored), and selecting the editor variant (BE4max, ABEmax, ABE8e, YE1/SECURE, TadCBE, CGBE, SpG/SpRY-BE). Covers knockout by premature stop (CRISPR-STOP/iSTOP) and splice-site disruption, the three off-target classes (Cas-dependent, Cas-independent DNA, RNA), outcome prediction (BE-Hive/DeepBE), and the base-vs-prime-vs-HDR decision. Use when installing a transition mutation without a double-strand break, knocking out a gene without indels, or choosing CBE vs ABE. Generic guide scoring, prime editing, and HDR donors are separate skills.

testing817

bio-experimental-design-batch-design

Designs genomics experiments so technical nuisance variation (batch, lane, plate, flow cell, operator, reagent lot, processing day) is balanced against the biological variable of interest and therefore estimable rather than confounded, using constrained sample-to-batch assignment (designit, OSAT), the confounder/mediator/collider distinction, and the principle that no post-hoc correction recovers a fully confounded design. Covers detecting hidden batches with surrogate variable analysis, a decision table for downstream correction (ComBat-seq, RUVSeq, SVA) whose execution is deferred to differential-expression/batch-correction, and reproducibility metadata. Use when assigning samples to sequencing batches/lanes/plates, avoiding batch-condition confounding, deciding whether a design is salvageable by correction, choosing a correction method, or estimating the number of hidden batches. For the experimental unit, randomization, and blocking concepts see experimental-design/randomization-blocking.

testing817

bio-experimental-design-multiple-testing

Controls error rates across thousands of simultaneous tests in genomics discovery using false-discovery-rate methods (Benjamini-Hochberg 1995; Benjamini-Yekutieli 2001 for arbitrary dependence; Storey q-value with pi0 estimation; local FDR; independent filtering Bourgon 2010; covariate-weighted FDR via IHW Ignatiadis 2016), plus family-wise error control (Bonferroni, Holm) and the GWAS genome-wide threshold. Covers the FDR-versus-FWER choice as the discovery-versus-confirmatory distinction, the dependence assumptions behind BH (PRDS) versus BY, pi0 estimation, the independent-filtering and false-coverage-rate traps, and reproducibility ranking via IDR (Li 2011). Use when correcting p-values from genome-wide tests, choosing between BH/BY/q-value/Bonferroni, setting an FDR threshold, applying IHW or independent filtering, or interpreting q-values. For confirmatory trials with few pre-specified endpoints (closed testing, graphical/gatekeeping), see clinical-biostatistics/multiplicity-graphical.

tools817
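As an illustration of the Benjamini-Hochberg step-up adjustment named in the bio-experimental-design-multiple-testing entry above, here is a minimal pure-Python sketch. The function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the skill itself. It assumes the p-values satisfy the independence or PRDS condition the entry mentions.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-value-like) in the original input order.

    Hypothetical helper for illustration; assumes independent or PRDS tests.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, purely for demonstration.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
qvals = benjamini_hochberg(pvals)
significant = [p for p, q in zip(pvals, qvals) if q <= 0.05]
print(significant)
```

Walking from the largest rank downward keeps the adjusted values monotone, which is why 0.039 is not called at a 5% FDR here even though it is below 0.05.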

bio-copy-number-recurrent-cnv

Identify recurrent and driver copy number alterations across a tumor cohort with GISTIC2 (G-score, Ziggurat deconstruction, focal vs broad/arm-level analysis, q-values from permutation) and quantify copy-number signatures with the Steele 2022 COSMIC framework and the Drews 2022 CINSignatures framework. Covers driver-gene localization from recurrence peaks, distinguishing focal drivers from arm-level passengers, and the caller-sensitivity caveats of copy-number signatures. Use when finding recurrently amplified or deleted regions in a cohort, localizing driver genes, separating focal from broad events, running GISTIC2, or extracting copy-number mutational signatures.

development817

bio-copy-number-gatk-cnv

Call copy number variants with the GATK best-practices workflows — the somatic CNV pipeline (CollectReadCounts, DenoiseReadCounts with tangent normalization, ModelSegments, CallCopyRatioSegments) and the germline GATK-gCNV pipeline (DetermineGermlineContigPloidy, GermlineCNVCaller cohort/case mode, PostprocessGermlineCNVCalls). Covers panel-of-normals construction, AnnotateIntervals/FilterIntervals, allelic-count integration, and QS-based filtering. Use when integrating CNV calling into a GATK variant pipeline, calling rare germline CNVs from an exome cohort, deciding between the somatic and germline GATK workflows, or diagnosing why tangent normalization removed a real event or why gCNV output has low precision.

testing817

bio-differential-expression-edger-basics

Performs differential expression on bulk RNA-seq count data with edgeR's negative-binomial GLM and quasi-likelihood F-test framework. Covers DGEList construction, filterByExpr, TMM/TMMwsp normalization, robust dispersion estimation, glmQLFit/glmQLFTest, TREAT for magnitude-bounded hypotheses, contrasts via no-intercept designs, voom and voomWithQualityWeights for heterogeneous samples, and the edgeR v4 bias-corrected APL changes. Use when running bulk DE with edgeR, choosing edgeR over DESeq2 (small n, transcript DE via catchSalmon, large samples), needing TREAT for a fold-change-threshold hypothesis, troubleshooting v3-to-v4 reproducibility, building paired or interaction designs, or handling library-quality heterogeneity.

development817

bio-epidemiological-genomics-amr-surveillance

Detects acquired antimicrobial-resistance determinants and chromosomal point-mutation resistance in bacterial assemblies using AMRFinderPlus, ResFinder 4.0 (acquired + PointFinder), CARD-RGI, abritAMR, staramr, and species-specific callers (TB-Profiler, Mykrobe). Harmonises cross-tool output via hAMRonization, contextualises determinants with mobile-genetic-element annotation (MOB-suite, PlasmidFinder, MobileElementFinder, ICEberg), predicts phenotype against EUCAST or CLSI breakpoints, and translates calls into WHO GLASS reporting categories. Use when screening clinical or surveillance isolates for AMR, distinguishing acquired vs intrinsic vs point-mutation resistance, calling rpoB / katG / pncA / gyrA / mgrB mutations, reconciling AMRFinderPlus vs RGI vs ResFinder disagreement, contextualising carbapenemases or mcr alleles on plasmids, predicting susceptibility from genotype against the WHO Mtb 2nd-edition catalogue, or building a hAMRonized multi-lab AMR surveillance pipeline.

tools817

bio-genome-assembly-long-read-assembly

Assembles genomes de novo from noisy long reads (Oxford Nanopore R9/R10/Dorado, PacBio CLR) with Flye (repeat graph), Canu (correct-trim-assemble OLC), NextDenovo, Shasta, Raven, wtdbg2, or miniasm, and reconciles bacterial assemblies into a consensus with Trycycler/Autocycler. Covers matching the input flag to the basecaller era (--nano-hq vs --nano-raw), why a raw long-read assembly is contiguous but low-QV and not finished until polished, haplotig false-duplication and purge_dups, coverage and read-N50 as non-substitutable inputs, and mid-read adapter de-chimerization. Use when assembling a bacterial or eukaryotic genome from ONT or PacBio noisy reads, choosing a long-read assembler, or diagnosing an over-collapsed or duplicated assembly. For PacBio HiFi use hifi-assembly instead.

testing817

bio-ecological-genomics-edna-metabarcoding

Processes eDNA metabarcoding from raw paired-end reads to species tables, navigating the ASV (DADA2, UNOISE3) vs OTU (swarm v2) decision (Callahan 2017 vs Schloss multi-copy-16S critique), marker/primer choice (Leray COI, MiFish 12S, 515F/806R 16S, ITS2) with primer-specific bias, the OBITools3 v3 command-name break (obi stats plural; .tar.gz taxonomy), tag-jumping with dual-indexing (Schnell 2015; NovaSeq 10x MiSeq), decontam as screening-not-classifier (Davis 2018), the read-counts-not-abundance critique (Lamb 2019), site-occupancy modeling (Ficetola 2015), Naive-Bayes calibration limits (Bokulich 2018), and eDNA decay (Strickler 2015). Use when going from raw eDNA FASTQ to species tables, picking a marker and denoising pipeline, deciding whether read counts represent abundance, applying occupancy modeling, configuring OBITools3 v3, or interpreting decontam output. Not for clinical 16S microbiome (see microbiome/amplicon-processing).

tools817

bio-genome-assembly-genome-profiling

Profiles a genome from raw reads BEFORE assembly with a k-mer spectrum (KMC or Jellyfish histogram), then models it with GenomeScope2 to estimate genome size, heterozygosity, repeat content, and ploidy, and Smudgeplot to infer ploidy from heterozygous k-mer pairs (diploid AB vs triploid AAB vs tetraploid AABB). Covers choosing k via Merqury best_k.sh, the k-mer-coverage vs sequencing-coverage confusion, reading het/repeat/contamination/organelle peaks, why noisy ONT must not be used for counting, and how the estimate becomes the NG50 denominator, the Flye -g value, the hifiasm --hom-cov/purge setting, and the 1.5-2x-too-big haplotig sanity check. Use when starting any de novo assembly, deciding whether short reads can work, estimating genome size for an unknown organism, diagnosing ploidy, or sanity-checking an assembly's size against expectation.

development817

bio-geo-data

Query and download from NCBI Gene Expression Omnibus (GEO) and EMBL-EBI's BioStudies/ArrayExpress mirror. Use when finding expression datasets, navigating SuperSeries vs SubSeries, choosing between series-matrix (submitter-normalized) and raw supplementary files, downloading via GEOparse (Python) or GEOquery (R/Bioconductor), linking GEO to SRA for raw reads, or distinguishing GSE/GSM/GPL/GDS record types. Encodes the SuperSeries trap, the series-matrix normalization-trust caveat, GEOmetadb deprecation, ArrayExpress migration to BioStudies, and processed-vs-raw decision matrix.

development817

bio-genome-annotation-annotation-transfer

Transfers gene annotations between genome assemblies via coordinate liftover (UCSC liftOver, CrossMap for same-species version updates) or feature/sequence projection (Liftoff for same/close species, miniprot for protein-level cross-species, TOGA/GeMoMa/CAT for distant clades). Covers the coordinate-vs-projection decision by divergence, why a successful lift is not biological confirmation, reference bias, the silent-dropping of unmapped features, build/PAR/MHC/inversion hazards, and transfer-vs-de-novo validation. Use when annotating a new assembly of a species with an existing reference, harmonizing coordinates across builds, or mapping annotations across related species.

development817

bio-flow-cytometry-doublet-detection

Detects and removes doublets/aggregates from flow, spectral, and mass cytometry before clustering or quantification. Covers FSC-A vs FSC-H singlet discrimination (the Area-Height non-proportionality, not a 1D area gate), FSC-W/SSC width gating, CyTOF Gaussian discrimination parameters (Center/Offset/Width/Residual/Event_length) and DNA intercalator gating, and the residual heterotypic conjugates that survive scatter gating and masquerade as double-positive populations. Use when filtering aggregates before phenotyping, choosing a doublet method for flow vs CyTOF, or diagnosing a suspicious double-positive cluster.

testing817

bio-copy-number-cnv-annotation

Annotate copy number variant segments with overlapping genes, dosage-sensitivity scores, cancer driver databases, population frequencies, and clinical-variant content. Covers bedtools/pybedtools interval intersection, AnnotSV comprehensive annotation and ranking, ClinGen haploinsufficiency/triplosensitivity scoring, gnomAD-SV/DGV frequency filtering, COSMIC Cancer Gene Census, and ClinVar overlap. Use when interpreting which genes a CNV affects, distinguishing the driver gene of a focal event from passengers, filtering against population CNVs, separating whole-gene from partial-gene overlap, or preparing CNVs for clinical classification.

tools817

bio-biomart-queries

Bulk-query Ensembl BioMart (and other BioMart instances) for cross-database ID mapping, gene/transcript/exon coordinates, and ortholog tables. Use when batch-converting Ensembl IDs to other namespaces (HGNC, RefSeq, UniProt, Entrez), pulling gene coordinate tables for thousands of genes, building ortholog wide-tables across species, or replacing slow Ensembl REST loops with one-shot bulk export. Encodes BioMart's XML query format, R biomaRt vs Python pybiomart trade-off, mart-vs-dataset hierarchy, and the URL endpoint that's BioMart-specific (separate from rest.ensembl.org).

development817

bio-ecological-genomics-community-ecology

Analyzes species-environment relationships with constrained ordination (CCA, RDA, db-RDA), variance partitioning, indicator species (indicspecies IndVal.g group-equalized), PERMANOVA paired MANDATORILY with PERMDISP (Anderson & Walsh 2013; dispersion confounds centroid tests), Joint Species Distribution Models (HMSC, sjSDM, gjam) with explicit rejection of "residual covariance equals biotic interaction", phylogenetic community ecology (SES_MPD/MNTD), trait-environment via RLQ + fourth-corner with corrected modeltype=6 (Dray 2014), bipartite network metrics (NODF, modularity) with curveball null (Strona 2014), and Mantel-test replacements (dbRDA, GDM) for spatial data. Use when testing how environmental gradients structure communities, identifying habitat indicator taxa, partitioning variance among predictors, deciding whether PERMANOVA significance is location vs dispersion, picking among HMSC/sjSDM/gjam, or replacing Mantel tests for landscape data.

testing817

bio-interaction-databases

Query protein-protein and gene interaction databases (STRING, BioGRID, IntAct, SIGNOR, Reactome, HuRI, HuMAP, OmniPath, ConsensusPathDB, DIP). Use when building PPI networks, choosing between physical vs functional vs genetic interactions, signed/directed vs undirected, high-throughput vs curated, picking confidence thresholds, aggregating across resources, or navigating license constraints. Encodes the database decision matrix, STRING v12 channel semantics, OmniPath as meta-database, SIGNOR for signed signaling, and per-resource rate limits.

development817

bio-genome-assembly-hifi-assembly

Assembles haplotype-resolved diploid and telomere-to-telomere (T2T) genomes from PacBio HiFi reads with hifiasm (HiFi-only, Hi-C, or trio phasing) and verkko (HiFi + ultralong ONT for T2T), extracting contigs from GFA and routing phasing QC to k-mer/trio metrics. Covers why a primary assembly is a haplotype mosaic that exists in no cell, partial-vs-full phasing (the .bp. vs .dip. filename convention), the purge-default trap on inbred samples, the --hom-cov coverage-estimate alarm, and verkko-vs-hifiasm for T2T. Use when assembling a diploid eukaryote from HiFi, phasing haplotypes with parents (trio) or Hi-C, deciding whether to chase T2T, or diagnosing switch errors invisible to N50/BUSCO/QV.

testing817

bio-genome-assembly-scaffolding

Orders and orients assembled contigs into chromosome-scale scaffolds from long-range linking data, inserting N-gap spacers (adds no sequence). Covers Hi-C/Omni-C scaffolding (YaHS, SALSA2, 3D-DNA/Juicer), Hi-C read-mapping prerequisites (map each end separately, no mate rescue, dedup, enzyme-aware), reading the contact map for misjoins/inversions/false-duplications, manual curation in Juicebox/PretextView (the VGP/DToL standard), reference-guided scaffolding (RagTag) and its karyotype-erasure hazard, genetic-map (ALLMAPS) and Bionano optical-map integration, chimera-breaking before scaffolding, gap-filling, and telomere/contig-vs-scaffold-N50 QC (tidk). Use when turning contigs into chromosomes with Hi-C, integrating a linkage map or optical map, choosing a scaffolder by available linking data, or judging whether a chromosome-scale assembly is trustworthy.

development817

bio-data-visualization-oncoprint-mutation-matrices

Build OncoPrint and co-mutation matrix plots from somatic-variant cohorts using ComplexHeatmap, maftools, and comut.py with alteration-type stacking, sample ordering by mutational burden, mutual-exclusivity overlays, and clinical annotation tracks. Use when visualizing per-sample mutation patterns across recurrent driver genes, comparing alteration classes, or identifying mutually-exclusive / co-occurring driver pairs.

tools817

bio-ncbi-datasets-cli

Download genome assemblies, gene records, and ortholog data from NCBI using the modern Datasets v2 CLI (replaces assembly_summary.txt scraping and many EFetch workflows). Use when bulk-pulling genome assemblies, gene metadata across species, ortholog sets, or BLAST databases; when E-utilities are too slow for genome-scale work; or when automatic checksum verification, parallel download, and clean accession-driven retrieval are required. Encodes the JSON-lines output format, dataformat conversion, --dehydrated for cloud workflows, and when Datasets is/isn't the right tool.

tools817

bio-alignment-io

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

development816

bio-expression-matrix-normalization

Normalizes and transforms RNA-seq count matrices for DE, visualization, clustering, and ML. Covers between-sample (TMM, TMMwsp, RLE/median-of-ratios, upper quartile), within-sample (TPM, FPKM/RPKM), variance-stabilizing (VST, rlog, log-CPM), GC-content correction (cqn, EDASeq), and single-cell (scran deconvolution, scanpy normalize_total). Encodes the composition-bias rationale, the "most genes not DE" assumption and its catastrophic failure modes (MYC amplification, apoptosis, viral host shutoff, prokaryotic stress), the "lengthScaledTPM is not TPM" naming trap, the "TPM is not for DE" rule, the blind=TRUE vs FALSE decision, ERCC spike-in normalization (SBN), and the single-cell zero-inflation breakdown of TMM/RLE. Use when choosing or applying normalization, debugging shifted-MA-plot diagnostics, handling zero-heavy single-cell data, or correcting GC bias.

development782
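The within-sample TPM transform listed in the bio-expression-matrix-normalization entry above can be sketched in a few lines. This is an illustrative sketch only: the function name `tpm`, the toy counts, and the choice to treat zero-length features as contributing zero are assumptions, not the skill's own code.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million for one sample.

    counts: read counts per feature; lengths_bp: (effective) feature lengths.
    Zero-length features are given a rate of 0 here (an assumption) to avoid
    dividing by zero.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Length-normalize first: reads per base for each feature.
    rates = [c / l if l > 0 else 0.0 for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so the sample sums to one million.
    return [r / total * 1e6 for r in rates]


# Toy data, purely for demonstration.
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 1000]
toy_tpm = tpm(toy_counts, toy_lengths)
print(toy_tpm)
```

Because each sample is rescaled to a fixed total, TPM values are relative within a sample. That is consistent with the entry's rule that TPM is not an input for differential expression.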

bio-ecological-genomics-biodiversity-metrics

Quantifies biodiversity from species abundance/incidence tables using Hill numbers (iNEXT) with coverage-based rarefaction-extrapolation (Chao & Jost 2012), asymptotic richness via Chao1/ACE/jackknife as a lower bound, Baselga turnover/nestedness partition with the Podani alternative as sensitivity check, mandatory Hellinger transformation before ordination (Legendre & Gallagher 2001), Faith PD and SES_MPD/SES_MNTD with explicit null-model choice, and Maire 2015 functional-diversity dimensionality optimization. Use when comparing diversity across sites with unequal sampling effort, picking the right richness estimator for singleton-heavy amplicon data, partitioning beta diversity into turnover vs nestedness, reporting Hill-number effective species counts rather than raw entropies, computing SES_MPD with explicit null-model justification, or deciding whether to apply standard metrics to compositional amplicon data. Not for clinical 16S microbiome diversity (see microbiome/diversity-analysis).

tools782
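To make the "effective species counts rather than raw entropies" point in the bio-ecological-genomics-biodiversity-metrics entry above concrete, the following sketch computes empirical Hill numbers of order q from an abundance vector. It does not reproduce iNEXT's rarefaction or extrapolation. The function name `hill_number` and the example communities are hypothetical.

```python
import math


def hill_number(abundances, q):
    """Empirical Hill number (effective number of species) of order q.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson concentration.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must contain at least one positive value")
    props = [a / total for a in abundances if a > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


# Made-up communities, purely for demonstration.
even = [10, 10, 10, 10]
uneven = [50, 30, 15, 5]
for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(uneven, q))
```

For the perfectly even community every order returns 4, while the uneven community drops below 4 as q increases and rare species are down-weighted.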

bio-sra-data

Download raw sequencing reads from NCBI SRA using sra-tools (prefetch, fasterq-dump, vdb-validate) or the ENA mirror. Use when pulling FASTQ for SRR/ERR/DRR accessions, deciding between SRA-direct, ENA mirror, or AWS/GCP cloud mirror (STRIDES), handling --include-technical for 10x and other single-cell records, validating with MD5/vdb-validate, navigating SRR/SRX/SRS/SRP/PRJNA hierarchy, or finding accessions via pysradb. Encodes SRA cloud-egress economics, the fasterq-dump uncompressed-scratch trap, and the --max-size default that silently truncates large prefetches.

tools782

bio-differential-expression-timeseries-de

Analyzes time-series and longitudinal RNA-seq for differential expression and trajectory structure. Covers DESeq2 LRT with reduced models, time as factor vs continuous vs natural splines, maSigPro (Nueda 2014 for RNA-seq), ImpulseDE2 with explicit impulse-model failure modes, DREAM for repeated measures via linear mixed models, pseudoreplication avoidance, conditional vs marginal modeling, and trajectory clustering with DPGP, Mfuzz (with Schwämmle 2010 fuzzifier estimation), and splines+k-means. Use when modeling time-course or longitudinal expression, choosing factor vs spline, handling repeated measures from the same subject, avoiding pseudoreplication, clustering temporal trajectories, or selecting between dedicated time-course tools and pairwise+LRT.

tools782

bio-expression-matrix-sparse-handling

Stores and operates on sparse expression matrices for single-cell and large bulk RNA-seq, covering dgCMatrix/dgRMatrix/dgTMatrix when-each-is-fast, the dgCMatrix (CSC, R) <-> CSR (Python) implicit transpose, AnnData (cells-rows) <-> SingleCellExperiment (cells-cols) orientation flip, HDF5/h5ad vs Zarr cloud-native shift, HDF5SummarizedExperiment + DelayedArray for out-of-memory bulk, scanpy backed mode for large h5ad, the ~10-15% density crossover where dense beats sparse, 10X format proliferation (MTX vs CellRanger H5 vs h5ad), the dense-conversion memory blow-up, and Dask + Zarr for consortium-scale matrices. Use when choosing sparse format, working with single-cell-sized matrices, importing/exporting 10X, debugging R/Python interop transposes, processing matrices too large for RAM, or building cloud-native pipelines.

development782

bio-expression-matrix-metadata-joins

Aligns sample metadata with count matrices and constructs design matrices for downstream DE, handling the alphabetical-reference-level trap (relevel BEFORE DESeq), LRT reduced-model rules, the interaction-term resultsNames trap, continuous-covariate scaling and splines, repeated measures via duplicateCorrelation or dream, high-cardinality categorical pseudo-singular designs, sample swap detection via XIST/RPS4Y1 expression and somalier/NGSCheckMate genotypes, SABV (sex-as-biological-variable) mandate, Simpson's-paradox collapsing of technical replicates, and the `~ 0 + group` parameterization for clean contrasts. Use when building a design matrix, troubleshooting reversed fold-change direction, encoding paired or repeated-measures designs, detecting sample swaps, deciding sex-as-covariate, or aggregating technical replicates.

development782

bio-differential-expression-de-visualization

Creates DE-specific diagnostic and result visualizations using DESeq2/edgeR built-in functions and lightweight ggplot2 wrappers. Covers MA plot (with the shrunken-LFC compression effect), volcano (with the apeglm caveat that p-values are unchanged), PCA on VST/rlog (never raw counts), sample distance heatmaps, top-DE-gene heatmaps with the row-scaling trap, dispersion / BCV plot interpretation, p-value histogram diagnostics, plotCounts for individual genes, blind=TRUE vs FALSE rationale, and the n=3 visualization stake. Use when generating DE diagnostic plots, choosing VST vs rlog for visualization, troubleshooting suspicious plot patterns (shifted MA cloud, batch-dominated PCA, anti-conservative p-value histogram), or building a standard QC figure panel.

development782

bio-expression-matrix-counts-ingest

Imports gene expression count matrices from featureCounts, HTSeq, STAR ReadsPerGene, Salmon/kallisto via tximport or tximeta, RSEM, 10X Genomics MTX/H5, AnnData H5AD, and RDS. Handles silent-miscounting traps (featureCounts -p v2.0.2 API break, STAR strandedness column choice, salmon NumReads-sum without tximport, RSEM non-integer expected_count, GENCODE _PAR_Y suffix, zero-length-transcript TPM divide-by-zero), and encodes the tximport countsFromAbundance decision tree with the "lengthScaledTPM is not TPM" warning. Use when assembling a gene-by-sample count matrix from aligner or quantifier output, importing salmon/kallisto for DESeq2 vs limma-voom, choosing strandedness column for STAR, debugging zero-count panics, or building tx2gene mapping.

development782

bio-ecological-genomics-conservation-genetics

Assesses genetic health of populations for conservation with Ne estimation across time horizons (LDNe NeEstimator V2 option-file API + SNeP physical-linkage correction; recent trajectory via GONE/GONE2; deep history via Stairway Plot 2 / dadi / fastsimcoal2 / PSMC), F-statistics, runs of homozygosity binned by length class to date inbreeding, genetic-load decomposition (Bertorelle 2022 realized vs masked), the modern 100/1000 Ne rule (Frankham 2014), Ne 2-6 orders of magnitude below Nc in marine fish (Hauser & Carvalho 2008), tree-sequence forward simulations (SLiM 4 + pyslim + tskit), and the Sukumaran-Knowles caveat against MSC methods for management-unit definition. Use when estimating Ne by time horizon, detecting inbreeding via F_ROH, decomposing genetic load, justifying conservation thresholds, distinguishing ESU/MU/DPS, configuring NeEstimator V2, or correcting LDNe physical linkage.

development782

bio-ecological-genomics-species-delimitation

Delimits putative species boundaries from molecular data within the de Queiroz 2007 unified-lineage framework using ASAP (Puillandre 2021 successor to ABGD), mPTP C++ (Kapli 2017 successor to bPTP; bPTP is Python NOT R), GMYC single/multi-threshold (Pons 2006; Fujisawa 2013), multilocus BPP v4 with prior calibration from data (NOT defaults; Yang 2015), SNAPP + BFD* for SNP delimitation, DELINEATE (Sukumaran 2021) speciation-process modeling to address Sukumaran & Knowles 2017 PNAS critique that MSC delimits structure not species, integrative-taxonomy congruence (Padial 2010; Carstens 2013), Dsuite for introgression testing before sister claims (Malinsky 2021), and Meyer & Paulay 2005 barcoding-gap-absence caveat. Use when delineating species from DNA barcoding data, resolving cryptic complexes, choosing among ASAP/mPTP/BPP/DELINEATE, calibrating BPP priors, distinguishing introgression from ILS, or applying the Sukumaran-Knowles oversplitting correction.

development782

bio-ecological-genomics-landscape-genomics

Tests genotype-environment associations and identifies adaptive loci while correcting for the four-confound landscape (structure, demography, background selection, sampling design) using LFMM2 with mandatory K via sNMF cross-entropy elbow (LEA 3), BayPass Core/AUX/C2/IS with Omega covariance matrix, RDA / pRDA for polygenic adaptation (Forester 2018; requires imputed genotypes), OutFLANK with trimmed FST null, pcadapt, gradient forests (Ellis-Smith-Pitcher 2012, NOT mis-cited Ellis-Manel), Capblancq & Forester 2021 RDA Swiss-army-knife, genomic-offset prediction with Lind & Lotterhos 2025 three-regime caveat, Lotterhos-Whitlock sampling optima, Wang & Bradburd 2014 IBD vs IBE, and Circuitscape + ResistanceGA. Use when identifying adaptive loci across gradients, choosing K for LFMM2, deciding among GEA methods, predicting maladaptation with the novel-environment caveat, distinguishing IBD vs IBE, or optimizing sampling design.

testing782

bio-expression-matrix-gene-id-mapping

Maps between gene identifier systems (Ensembl, Entrez, HGNC symbol, UniProt, RefSeq, MANE) using AnnotationDbi, biomaRt, mygene, pyensembl, and Ensembl REST. Encodes Ensembl version stripping with GENCODE _PAR_Y preservation, the Ziemann 2016 Excel autocorrect debacle and Bruford 2020 HGNC renames (SEPT*->SEPTIN*, MARCH*->MARCHF*, MARC*->MTARC*, DEC1->DELEC1), OCT4/POU5F1 alias resolution, biomaRt archive endpoints for release pinning, the `filters` (plural) gotcha, MANE Select for clinical reporting, cross-species orthology via Ensembl Compara / OMA / OrthoDB, and tx2gene construction for tximport. Use when converting gene IDs across systems, handling renamed symbols, building tx2gene, pinning to a specific Ensembl release for reproducibility, or mapping cross-species orthologs.

tools782
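The "Ensembl version stripping with GENCODE _PAR_Y preservation" item in the bio-expression-matrix-gene-id-mapping entry above could look roughly like the sketch below. The regular expression, the function name `strip_version`, and the reading of "preservation" as keeping the `_PAR_Y` suffix after removing the version are assumptions. The motivation is that, in GENCODE releases that use this suffix, the X and Y pseudoautosomal copies share the same base ID, so dropping the suffix would create duplicate row names.

```python
import re

# Matches a versioned Ensembl-style stable ID with an optional GENCODE
# _PAR_Y suffix, e.g. ENSG00000182378.14_PAR_Y (pattern is an assumption).
_VERSIONED_ID = re.compile(r"^(ENS[A-Z]*\d+)\.\d+(_PAR_Y)?$")


def strip_version(identifier):
    """Remove the .N version from an Ensembl ID, keeping any _PAR_Y suffix.

    Identifiers that do not look like versioned Ensembl IDs are returned
    unchanged.
    """
    match = _VERSIONED_ID.match(identifier)
    if match is None:
        return identifier
    base, par_suffix = match.group(1), match.group(2) or ""
    return base + par_suffix


# Example identifiers, used only to show the behaviour.
examples = [
    "ENSG00000141510.18",
    "ENSG00000182378.14_PAR_Y",
    "ENSMUSG00000059552.14",
    "TP53",
]
stripped = [strip_version(x) for x in examples]
print(stripped)
```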

bio-data-visualization-upset-plots

Build UpSet plots to visualize set intersections beyond 4 sets (where Venn fails) using ComplexUpset (modern, ggplot2-grammar) or the unmaintained UpSetR, with explicit cardinality vs degree sorting, attribute panels, and query highlighting. Use when comparing overlap across many gene sets, peak sets, variant lists, or any set membership matrix where Venn diagrams become illegible.

development776

bio-blast-searches

Run remote BLAST searches against NCBI servers using Biopython Bio.Blast.NCBIWWW. Use when identifying unknown sequences, finding homologs, picking the correct BLAST program (blastn/blastp/blastx/tblastn/tblastx/psiblast/megablast/dc-megablast), interpreting Karlin-Altschul E-values, avoiding the max_target_seqs trap (Shah 2019), choosing composition-based statistics, or limiting searches by organism. Covers RID lifecycle, database choice (nt/nr/refseq_select/swissprot), word-size and CBS taxonomy.

development776

bio-data-visualization-manhattan-qq-locuszoom

Build Manhattan, Miami, QQ, and locuszoom-style regional plots from GWAS, TWAS, PWAS, and QTL summary statistics with correct genomic-inflation diagnostics, multi-trait overlays, lead-SNP labeling, and LD-aware regional rendering. Use when visualizing association results across the genome, comparing two traits, computing genomic inflation lambda, or zooming into a locus with LD coloring.

development776
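The genomic inflation factor lambda mentioned in the bio-data-visualization-manhattan-qq-locuszoom entry above is commonly computed as the median observed 1-df chi-square statistic divided by the median of the null chi-square distribution (about 0.4549). A minimal standard-library sketch follows. The function name `genomic_inflation` and the synthetic p-values are hypothetical, and the sketch assumes two-sided p-values from 1-df tests.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(pvalues):
    """Genomic inflation factor (lambda GC) from two-sided 1-df p-values."""
    normal = NormalDist()
    chi2 = []
    for p in pvalues:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        # A two-sided p-value maps to |z| via the lower tail p / 2;
        # squaring z gives the 1-df chi-square statistic.
        z = normal.inv_cdf(p / 2.0)
        chi2.append(z * z)
    return median(chi2) / CHI2_1DF_MEDIAN


# Synthetic p-values: an evenly spaced "null" set and an inflated set.
n = 1001
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
inflated_p = [p * p for p in null_p]
lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation(inflated_p)
print(round(lambda_null, 3), round(lambda_inflated, 3))
```

Using the lower tail `p / 2` rather than `1 - p / 2` avoids losing precision for very small p-values, which are common in GWAS summary statistics.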

bio-batch-downloads

Download large datasets from NCBI efficiently using EPost, history server, batching, rate limiting, and retry logic. Use when bulk-fetching tens of thousands of sequences, pulling all results of a large ESearch, designing reproducible pipelines, comparing E-utilities to NCBI Datasets v2 CLI, or implementing checksum-validated downloads. Encodes WebEnv TTL (~8h), EPost 200-ID limit, retmax caps, parallelization design, and integrity verification.

tools776

bio-entrez-link

Find cross-database references between NCBI databases using Biopython Bio.Entrez (ELink). Use when navigating gene to protein/structure, sequence to publication, PubMed to GEO, BioProject to SRA runs, or discovering all link relationships for a record. Covers linkname semantics, cmd= variants, asymmetric link warnings, neighbor_history for >200 input IDs, and per-database link tables.

development776

bio-entrez-fetch

Retrieve records from NCBI databases using Biopython Bio.Entrez (EFetch, ESummary). Use when downloading sequences, fetching GenBank/GenPept records, getting document summaries, parsing nested XML, navigating GI deprecation, choosing between rettype+retmode combinations, and parsing into Biopython SeqRecord/SwissProt objects. Covers nucleotide, protein, gene, pubmed, sra, gds, taxonomy, snp, clinvar.

tools776

bio-data-visualization-lollipop-protein-maps

Plot per-gene mutation distributions on a protein-domain map (lollipop / needle plots) showing mutation position, recurrence count, and variant classification with maftools, g3-lollipop, trackViewer, and ProteinPaint. Use when visualizing recurrent mutation hotspots on a single gene's protein, marking domain boundaries from UniProt/Pfam, comparing missense vs truncating distributions, or contrasting two cohorts on the same lollipop.

tools776

bio-data-visualization-statistical-annotation

Add p-value brackets, significance asterisks, and effect-size annotations to distribution plots using ggpubr, ggsignif, and statannotations with correct test selection (parametric vs non-parametric vs paired), multiple-testing adjustment, and rendering of negative results. Use when a boxplot/violin/raincloud needs in-figure statistical comparisons between groups.

testing776

bio-entrez-search

Search NCBI databases using Biopython Bio.Entrez (ESearch, EInfo, EGQuery, ESpell). Use when finding records by keyword, building reproducible field-qualified queries, navigating the Entrez Query Translator, exploiting the history server for large result sets, handling retmax caps, or interpreting weekly index lag. Covers PubMed, Nucleotide, Protein, Gene, SRA, GEO, Assembly, Taxonomy, ClinVar, dbSNP.

tools776

bio-data-visualization-sequence-logos

Build sequence logos from aligned DNA, RNA, or protein motifs using ggseqlogo (R), Logomaker (Python), or WebLogo with explicit bits vs probability encoding, background-frequency correction, custom alphabets, and multi-logo stacking. Use when visualizing motif PWMs (TF binding, splice sites, CRISPR spacers), aligned-position composition, or comparing two motif sets.

development776

bio-data-visualization-network-visualization

Visualize biological networks (PPI, gene-regulatory, co-expression, pathway) with layout algorithm choice (ForceAtlas2, Fruchterman-Reingold, Kamada-Kawai, hive plots), edge bundling, community-based coloring, and reproducible seeds using NetworkX, PyVis, igraph, and Cytoscape automation. Use when rendering biological networks for static publication, interactive HTML exploration, or Cytoscape-format export.

tools776

bio-crispr-screens-mageck-analysis

Analyzes pooled CRISPR screens with MAGeCK (Li et al 2014), covering count generation (mageck count), the RRA two-condition workflow (mageck test using alpha-RRA over per-sgRNA negative-binomial p-values), the MLE multi-condition workflow (mageck mle with explicit design matrix and beta-score output), normalization choice (median vs total vs control-sgRNA vs spike-in), sgRNA efficiency injection, paired-sample testing, time-course design, drug-screen versus dropout-screen design matrices, MAGeCKFlute and MAGeCK-VISPR downstream visualization, and decision logic for when to use MAGeCK vs JACKS / BAGEL2 / drugZ / Chronos. Use when running a fresh CRISPR screen analysis, picking RRA vs MLE for the experimental design, choosing a normalization method from QC signatures, debugging MLE convergence failure or NaN beta scores, comparing MAGeCK output across tools, or building a batch-aware multi-cell-line / multi-condition MLE design matrix.

tools776

bio-ensembl-rest

Query the Ensembl REST API for gene/transcript/protein lookup, sequence retrieval, comparative genomics (Compara), variant effect prediction (VEP), regulatory features, and cross-species ortholog/paralog calls. Use when pulling Ensembl-native data (Ensembl Gene IDs, version-pinned releases, archive endpoints for reproducibility), gene/transcript/exon structure with stable IDs, or VEP for variant annotation. Encodes the 15 req/sec rate limit, archive (e110.rest.ensembl.org) for reproducibility, Ensembl divisions (vertebrates / plants / fungi / metazoa / bacteria), and the symbol-vs-ID stability problem.

development 776

bio-data-visualization-distribution-plots

Plot per-group distributions of continuous data using boxplots, violins, beeswarms, quasirandom jitter, and raincloud plots with sample-size honesty (Weissgerber 2015), KDE-bandwidth awareness, and N-aware encoding choices. Use when comparing distributions across a small number of groups — expression per cluster, biomarker per arm, scores per condition — and the bar-of-mean default is misleading.

devops 776

bio-data-visualization-heatmaps-clustering

Build clustered heatmaps for expression matrices and other features-by-samples data with rigorous distance/linkage/scaling choices, robust color mapping, optimal leaf ordering, and ComplexHeatmap/pheatmap/seaborn rendering. Covers the ward.D vs ward.D2 trap, the row-vs-column scaling decision, multi-track annotations, oncoPrint, and raster rendering for large matrices. Use when visualizing expression patterns across samples or identifying co-regulated clusters.

development 776

bio-data-visualization-circos-plots

Build circular genome visualizations using circlize (R), pyCirclize (Python), or Circos (Perl CLI) with ideogram tracks, multi-data tracks (scatter, histogram, heatmap), chord/link arcs for interactions, and explicit circos.clear() between plots. Covers when circular is appropriate vs when Cartesian wins (Cleveland-McGill 1984), karyograms, and chromosome adjacency in chord diagrams. Use when adjacency on the circle conveys meaning — chromosome-level overview, structural variants, Hi-C interactions, cross-genome comparisons.

tools 776

bio-data-visualization-genome-tracks

Build genome-browser-style multi-track figures with pyGenomeTracks (config-driven), Gviz (R), and IGV batch screenshotting. Covers BigWig coverage tracks, BED/peak overlays, gene-model rendering, Hi-C matrix tracks, BedPE link arcs, spike-in-aware normalization, and the bamCoverage --normalizeUsing trap. Use when producing publication figures of genomic loci with stacked aligned tracks (coverage, peaks, genes, interactions) for ChIP-seq, ATAC-seq, RNA-seq, Hi-C, or generic locus visualization.

development 776

bio-data-visualization-multipanel-figures

Compose multi-panel publication figures with patchwork, cowplot, gridExtra (R), or matplotlib GridSpec/subfigures (Python) including shared axes/legends/guides collection, panel labels in Nature/Cell convention, and journal-spec sizing. Covers patchwork ≥1.2.0 axes='collect' feature, Type-42 font embedding, and the cairo_pdf save path. Use when composing 2+ subpanels into a single figure for journal submission.

development 776

bio-data-visualization-ggplot2-fundamentals

Build publication-quality figures in R with ggplot2 using the grammar of graphics (data + aesthetics + geometries + scales + facets + themes) with CVD-safe palettes, cairo_pdf TrueType embedding, programmatic aes via tidy evaluation, and the theme_classic publication baseline. Use when producing static figures in R for papers, presentations, or reports.

development 776

bio-crispr-screens-copy-number-correction

Corrects the gene-independent copy-number artifact in CRISPR-Cas9 screens (Aguirre 2016 / Munoz 2016 Cancer Discov) where amplified loci appear essential from DNA-damage burden of simultaneous cuts. Covers the p53-dependent G2-arrest mechanism, CRISPRcleanR (Iorio 2018) unsupervised pre-hoc correction, CERES (Meyers 2017) joint CN + gene-effect model, Chronos (Dempster 2021) DepMap-standard population-dynamics + CN model with lowest residual bias, the decision tree by data availability, the Spearman LFC-vs-CN diagnostic, focal-amplification examples (ERBB2 in HER2+, MYC in colorectal, FGFR1 in head and neck), and CRISPRi/a alternatives that bypass the artifact. Use when screening cancer cell lines, diagnosing essentiality at amplified loci, choosing CRISPRcleanR / CERES / Chronos, deciding whether CN correction is needed before MAGeCK / BAGEL2 / drugZ, or switching from Cas9 to CRISPRi.

testing 770

bio-crispr-screens-hit-calling

Cross-method decision tree for calling hits in pooled CRISPR screens. Catalogs statistical models (MAGeCK RRA, MAGeCK MLE, BAGEL2, drugZ, JACKS, Chronos, CERES), experimental designs each is built for, failure modes outside design domain, reconciliation when methods disagree, multiple-testing and effect-size thresholds, the order of operations (count -> QC -> CN-correct -> hit-call -> validate), the second-best-sgRNA conservative rule, and consensus-hit strategy. Use when choosing among MAGeCK / BAGEL2 / drugZ / JACKS / Chronos for a given design, reconciling disagreement across two or three methods on the same screen, deciding whether to require consensus, gating downstream validation by hit-confidence tier, or interpreting unstable hit lists across reruns.

testing 770

bio-crispr-screens-crispresso-editing

Quantifies CRISPR editing outcomes with CRISPResso2 (Clement 2019 Nat Biotechnol) across Cas9-nuclease (indels, HDR), CBE and ABE base editors (target conversion + bystander), and prime editor (pegRNA-templated) modes. Covers single-amplicon (CRISPResso), multi-sample batch (CRISPRessoBatch), pooled-amplicon (CRISPRessoPooled), WGS off-target (CRISPRessoWGS), and sample-comparison (CRISPRessoCompare) workflows; quantification-window math that controls what is called edited; substitution-vs-indel diagnostic to distinguish BE from Cas9 contamination; MMEJ deletion pattern interpretation; allele-frequency tables; and failure modes from amplicon misalignment or contamination. Use when quantifying editing from amplicon sequencing, choosing CRISPResso mode by design, distinguishing intended edits from bystanders and indel byproducts, debugging low-alignment runs, or generating publication-grade editing reports.

development 770

bio-crispr-screens-drugz-chemogenomic

Analyzes CRISPR drug-modifier (chemogenomic) screens with drugZ (Colic et al 2019 Genome Med), a bidirectional Z-score method that identifies synthetic-lethal sensitizing genes and resistance-conferring suppressor genes from vehicle vs drug comparisons. Covers vehicle-anchored design (not Day-0), the bidirectional Z math giving 2-3x sensitivity over MAGeCK / STARS / edgeR / RIGER on drug screens, per-gene sumZ and normZ, synth (sensitizer) vs supp (suppressor) FDR, multi-dose handling, integration with control sgRNAs, and comparison with MAGeCK MLE with dose covariate. Use when running a drug-modifier CRISPR screen, identifying sensitizing or resistance genes for a drug candidate, choosing drugZ vs MAGeCK MLE for chemogenomic analysis, troubleshooting low-effect drug screens where MAGeCK lacks sensitivity, or designing a drug-screen layout (vehicle vs drug arms).

data-ai 770

bio-data-visualization-volcano-customization

Create publication-ready volcano plots with custom thresholds, gene labels, and highlighting using ggplot2, EnhancedVolcano, or matplotlib. Use when visualizing differential expression or association results with gene annotations.

testing 770

bio-crispr-screens-library-design

Designs pooled sgRNA libraries for CRISPR knockout, interference (CRISPRi), activation (CRISPRa), Cas12a multiplex, base-editor, and prime-editor screens. Covers on-target scoring (Rule Set 2, Azimuth, DeepSpCas9, CRISPRon), off-target scoring (CFD, MIT), TSS-relative positioning for CRISPRi/a (Horlbeck, Dolcetto, Calabrese), PAM-variant chemistries, control-guide composition, oligo cloning architecture, and library QC. Use when choosing a genome-wide library (GeCKOv2 vs Avana vs Brunello vs TKOv3 vs Inzolia), designing a focused or paralog-focused custom library, picking CRISPRi vs CRISPRa TSS windows, deciding control-guide proportions, or diagnosing library skew and dropout in a freshly cloned pool.

development 770

bio-crispr-screens-perturb-seq-analysis

Analyzes single-cell pooled CRISPR screens (Perturb-seq, CROP-seq, Perturb-CITE-seq, ECCITE-seq, multiome) where each cell carries an sgRNA and a scRNA-seq / surface-protein / chromatin readout. Covers experimental design (direct-capture Perturb-seq Dixit 2016 vs CROP-seq 3'UTR-barcoded Datlinger 2017 vs ECCITE-seq vs Multiome), MOI for sgRNA assignment, escaper-cell filtering (Mixscape, Papalexi 2021), SCEPTRE NB GLM + permutation for low-MOI (Barry 2024 Genome Biol 25:124), the Pertpy framework, factor decomposition, genome-scale Perturb-seq (Replogle 2022 Cell, 2.5M cells), and per-perturbation single-cell DE. Use when running a single-cell CRISPR screen, choosing direct-capture vs CROP-seq architecture, filtering escaper cells, performing single-cell DE, integrating Perturb-seq with pathway analysis, scaling to GW CRISPRi via Replogle protocol, or analyzing multi-omics screens.

development 770

bio-sequence-similarity

Find homologous sequences using iterative BLAST (PSI-BLAST), profile HMMs (HMMER), and reciprocal best hit analysis. Use when identifying orthologs, distant homologs, or protein family members where standard BLAST is not sensitive enough.

development 770

bio-crispr-screens-jacks-analysis

Runs JACKS (Joint Analysis of CRISPR/Cas9 Knockout Screens; Allen et al 2019 Genome Research) which models per-sgRNA log-fold-change as the product of a treatment-dependent gene-essentiality term and a treatment-independent guide-efficacy term. Covers the Bayesian decomposition math, the hierarchical efficacy prior shared across screens performed with the same library, when JACKS outperforms MAGeCK (multi-screen joint analysis, libraries with broad efficacy variance) and when it does not (single screen, novel libraries with no prior efficacy), library-reuse efficacy transfer, downstream essentiality interpretation, and the 2.5x sample-size reduction enabled by efficacy-aware testing. Use when running multiple screens with the same library, when guide-level noise is suspected to dominate per-gene signal, when reusing published essentiality reference screens for efficacy priors, or when comparing screens performed across cell lines that share library but differ biologically.

development 770

bio-crispr-screens-batch-correction

Corrects batch effects in CRISPR screens with ComBat empirical-Bayes, RUV, SVA, control-sgRNA normalization, or the model-based alternative of including batch as a covariate in MAGeCK MLE or Chronos. Covers screen-specific batch sources (passage cohort, library lot, infection day, sequencing run, Cas9 lot, FBS lot), the PCA + variance-decomposition diagnostic for deciding whether correction is needed, how over-correction harms biology by absorbing condition into batch, limma removeBatchEffect for visualization-only correction, and the relationship to multi-condition design matrices. Use when combining screens for joint analysis, when passage cohort confounds biology, when DepMap-style panels need Chronos with batch covariates, when picking ComBat vs RUV, or when correction harms biology and should be replaced with explicit covariate modeling.

development 770

bio-data-visualization-genome-browser-tracks

Generate genome browser visualizations using pyGenomeTracks or IGV batch scripting for publication figures. Use when creating publication figures of genomic regions with multiple data tracks.

data-ai 769

bio-data-visualization-specialized-omics-plots

Reusable plotting functions for common omics visualizations. Custom ggplot2/matplotlib implementations of volcano, MA, PCA, enrichment dotplots, boxplots, and survival curves. Use when creating volcano, MA, or enrichment plots.

testing 769

bioskills

Installs 425 bioinformatics skills covering sequence analysis, RNA-seq, single-cell, variant calling, metagenomics, structural biology, and 56 more categories. Use when setting up bioinformatics capabilities or when a bioinformatics task requires specialized skills not yet installed.

development 582

Adoption

GPTomics

bio-comparative-genomics-whole-genome-duplication

bio-clip-seq-ago-clip-mirna-targets

bio-clip-seq-clip-motif-analysis

bio-clip-seq-binding-site-annotation

bio-clip-seq-clip-preprocessing

bio-comparative-genomics-ortholog-inference

bio-comparative-genomics-comparative-annotation-projection

bio-clip-seq-stamp-antibody-free

bio-comparative-genomics-gene-tree-species-tree-reconciliation

bio-clip-seq-m6a-clip

bio-comparative-genomics-genome-distance-and-species-delineation

bio-comparative-genomics-whole-genome-alignment

bio-comparative-genomics-gene-family-evolution

bio-comparative-genomics-hgt-detection

bio-comparative-genomics-introgression-detection

bio-clip-seq-crosslink-site-detection

bio-comparative-genomics-synteny-analysis

bio-clip-seq-clip-peak-calling

bio-clip-seq-clip-deep-learning

bio-comparative-genomics-ancestral-reconstruction

bio-comparative-genomics-pangenome-analysis

bio-workflows-clip-pipeline

bio-comparative-genomics-positive-selection

bio-clip-seq-differential-clip

bio-clip-seq-clip-alignment

bio-clinical-databases-somatic-signatures

bio-clinical-databases-polygenic-risk

bio-workflows-neoantigen-pipeline

bio-clinical-databases-clinvar-lookup

bio-clinical-databases-acmg-classification

bio-clinical-databases-myvariant-queries

bio-clinical-databases-gnomad-frequencies

bio-clinical-databases-hla-typing

bio-clinical-databases-dbsnp-queries

bio-clinical-databases-variant-prioritization

bio-clinical-databases-pharmacogenomics

bio-clinical-databases-tumor-mutational-burden

bio-clinical-databases-msi-detection

bio-clinical-biostatistics-missing-data

bio-chipseq-super-enhancers

bio-clinical-biostatistics-subgroup-analysis

bio-clinical-biostatistics-survival-analysis

bio-clinical-biostatistics-multiplicity-graphical

bio-clinical-biostatistics-power-sample-size

bio-clinical-biostatistics-trial-reporting

bio-clinical-biostatistics-logistic-regression

bio-workflows-clinical-trial-pipeline

bio-chipseq-allele-specific-binding

bio-chipseq-chip-deep-learning

bio-chipseq-motif-analysis

bio-chipseq-differential-binding

bio-chipseq-cut-and-run-tag

bio-chipseq-peak-annotation

bio-chipseq-spike-in-normalization

bio-chipseq-chromatin-state-segmentation

bio-chipseq-qc

bio-chipseq-visualization

bio-clinical-biostatistics-bayesian-trials

bio-clinical-biostatistics-adaptive-designs

bio-clinical-biostatistics-categorical-tests

bio-chipseq-peak-calling

bio-clinical-biostatistics-cdisc-data

bio-clinical-biostatistics-effect-measures

bio-causal-genomics-transcriptome-wide-association

bio-causal-genomics-mediation-analysis

bio-causal-genomics-proteome-mr-drug-target

bio-causal-genomics-mendelian-randomization

bio-causal-genomics-fine-mapping

bio-causal-genomics-genetic-correlation

bio-causal-genomics-genomic-sem

bio-workflows-causal-genomics-pipeline

bio-causal-genomics-heritability-partitioning

bio-causal-genomics-colocalization-analysis

bio-causal-genomics-effector-gene-prioritization

bio-causal-genomics-pleiotropy-detection

bio-admet-prediction

bio-reaction-enumeration

bio-conformer-generation