genome-annotation/ncrna-annotation/SKILL.md
Identifies non-coding RNAs (tRNA, rRNA, snoRNA, snRNA, riboswitches, sRNAs) using Infernal covariance-model search against Rfam, tRNAscan-SE 2.0 for tRNA, barrnap for rRNA, and ARAGORN for tmRNA, plus the small-RNA-seq boundary for miRNA and the transcript-assembly boundary for lncRNA. Covers the structure-conserved-not-sequence-conserved principle (why BLAST fails), GA-threshold and clan-competition correctness, tRNAscan-SE domain modes and pseudogene flags, rDNA copy-number collapse, and why homology annotation is a recall floor. Use when performing genome-wide ncRNA annotation, choosing the right tool for an RNA class, or interpreting ncRNA counts.
npx skillsauth add GPTomics/bioSkills bio-genome-annotation-ncrna-annotationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Infernal 1.1.4+, Rfam 15+ (CM library), tRNAscan-SE 2.0.12+, barrnap 0.9+, ARAGORN 1.2+, pandas 2.2+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesThe Rfam release version drives results (Rfam 15 has ~4,200+ families); record it. The downloaded Rfam.cm ships pre-calibrated, so only cmpress is needed before cmscan. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Find non-coding RNAs in my genome" -> Scan an assembly for structured ncRNA families using covariance models (sequence + secondary structure jointly), with specialist detectors for tRNA and the expression boundary for miRNA/lncRNA.
cmscan --cut_ga --rfam --nohmmonly --fmt 2 --clanin Rfam.clanin Rfam.cm genome.fa (Infernal), tRNAscan-SE -B genome.faProtein annotation rides on sequence/ORF signal; structured-ncRNA annotation rides on base-pairing covariation. A G-C pair can become A-U across evolution while the structure is preserved - both positions mutate together (compensatory substitution). To a sequence-only tool these look like two mismatches; to a covariance model (a profile SCFG, the Rfam/Infernal engine) the correlated change is the strongest possible evidence of homology. Three consequences:
| RNA class | Tool | Citation | Method |
|-----------|------|----------|--------|
| tRNA | tRNAscan-SE 2.0 | Chan 2021 NAR | isotype-specific Infernal CMs + pseudogene/high-confidence logic |
| rRNA (fast) | barrnap | Seemann (software) | nhmmer HMM profiles; kingdom flag; prokaryotic-pipeline default |
| rRNA (structure-aware) | Infernal + Rfam SSU/LSU | Nawrocki 2013 | CM; better boundaries / unusual taxa |
| tmRNA (+ bacterial tRNA) | ARAGORN | Laslett 2004 | heuristic cloverleaf + tmRNA models |
| miRNA | miRDeep2 (+ small-RNA-seq) | Friedländer 2012 | Dicer-processing model on read pileups |
| C/D, H/ACA snoRNA | snoscan / snoReport | Lowe 1999 | guide-target complementarity / SVM |
| Everything else structured | Infernal cmscan vs Rfam | Nawrocki 2013 | covariance models; the general engine |
| lncRNA | StringTie + CPC2/CPAT/FEELnc | - | transcript assembly + coding-potential, NOT CM search |
RNAmmer is the legacy rRNA tool (license-encumbered, HMMER2) - use barrnap instead unless reproducing old annotations.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Prokaryote, fast complete annotation | barrnap + tRNAscan-SE -B/ARAGORN + Infernal/Rfam | what Bakta/Prokka/PGAP wrap |
| tRNA is the question | always tRNAscan-SE 2.0 (not Rfam's generic tRNA model) | isotype, pseudogene, intron, high-confidence logic |
| rRNA, speed matters | barrnap | seconds per genome |
| Broad ncRNA sweep of a new genome | Infernal cmscan vs full Rfam.cm (GA + clan competition) | structure-aware, family-typed |
| miRNA | demand small-RNA-seq; miRDeep2 | genomic hairpin prediction is unreliable |
| lncRNA | transcript assembly + coding-potential | not structurally conserved; no CM |
| Claim a conserved structure | R-scape covariation test (report power) | thermodynamic fold != selected structure |
| Bacterial AMR/CRISPR arrays | -> prokaryotic-annotation / CRISPRCasFinder | array detection is a separate tool class |
# Rfam ships pre-calibrated; press it once, then scan (cmscan = many models vs one genome)
cmpress Rfam.cm
cmscan -Z <dbsize_Mb> --cut_ga --rfam --nohmmonly \
--tblout out.tblout --fmt 2 --clanin Rfam.clanin \
Rfam.cm genome.fa > out.cmscan
grep -v " = " out.tblout > out.deoverlapped.tblout # drop within-clan overlaps
--cut_ga (gathering threshold): the single most important correctness flag. Each family has a curator-set, per-family bit-score threshold; a fixed E-value would treat a 70-nt tRNA model and a 2,900-nt LSU model identically, which is wrong. Overriding GA to "find more" imports the false positives the curator deliberately excluded.--nohmmonly forces full CM (structure-aware) scoring so scores are GA-comparable.-Z <dbsize_Mb> = total_residues x 2 / 1e6 (both strands), making E-values run-comparable.--fmt 2 --clanin Rfam.clanin + the grep -v " = " deoverlap step is mandatory, not a nicety: clans group related families (the tRNA models, SSU/LSU rRNA), so one locus hits several models and the raw table double-counts (a 16S locus becomes "several rRNA genes").tRNAscan-SE -B -o trna.out -f trna.ss -m trna.stats --gff trna.gff3 genome.fa # bacterial
Modes: -E eukaryotic (default), -B bacterial, -A archaeal, -G general (mixed/metagenome), -M mammal/-M vert mitochondrial, -O organellar (disables pseudogene checking). Domain choice is not cosmetic - the wrong mode mis-scores and miscalls isotypes; there is no auto-detect. Report the high-confidence set, not raw hits (raw counts include pseudogenes/SINEs and can be 2-10x inflated in eukaryotes). The pseudogene flag is reliable in eukaryotic nuclear genomes but false-positive-prone in organelles/odd mito-tRNAs (truncated arms read as "decayed") - hence -O/-D.
barrnap --kingdom bac genome.fa > rrna.gff3 # bac | arc | euk | mito
Reports partial rRNA at contig edges as (partial). rDNA copy number from an assembly is essentially always wrong - near-identical rRNA arrays collapse in short-read assemblies, so the annotated count is a floor (off by orders of magnitude in eukaryotes); use long reads or read depth for true copy number.
Goal: Merge Infernal and tRNAscan-SE into one ncRNA annotation, preferring the specialist for tRNA.
Approach: Parse the deoverlapped Infernal table, drop its tRNA rows (tRNAscan-SE is the authority for tRNA), and combine with the tRNAscan-SE high-confidence set; keep evidence provenance per class.
import pandas as pd
def parse_infernal_tbl(tbl_file):
rows = []
with open(tbl_file) as f:
for line in f:
if line.startswith('#'):
continue
p = line.split()
if len(p) < 18:
continue
rows.append({'rfam_name': p[1], 'seqid': p[3], 'strand': p[11],
'score': float(p[16]), 'evalue': float(p[17])})
df = pd.DataFrame(rows)
return df[~df['rfam_name'].str.contains('tRNA', case=False, na=False)] # tRNAscan-SE owns tRNA
Trigger: using BLAST to find ncRNA homologs. Mechanism: BLAST scores sequence identity, blind to covariation. Symptom: misses structure-conserved homologs; ranks pseudogenes above real homologs. Fix: Infernal/Rfam covariance models.
Trigger: cmscan without --cut_ga/--nohmmonly, or without --clanin/--fmt 2 deoverlap. Mechanism: wrong thresholds; HMM-only scores not GA-comparable; within-clan overlaps uncollapsed. Symptom: inflated/redundant family calls (one locus counted several times). Fix: the full canonical command + grep -v " = ".
Trigger: -E on a bacterium, or default on a metagenome. Mechanism: domain CMs differ; no auto-detect. Symptom: mis-scored/miscalled isotypes. Fix: -B/-A/-G per source; report the high-confidence set.
Trigger: reporting raw tRNAscan-SE hits or genomic rRNA as copy number. Mechanism: SINEs/pseudogenes/NUMTs inflate tRNA; rDNA arrays collapse. Symptom: 150+ bacterial "tRNAs"; "5 copies of 18S". Fix: high-confidence set; treat copy numbers as floors; flag NUMT-type organellar tRNAs in the nucleus.
Trigger: calling miRNAs from hairpins or lncRNAs from CM search. Mechanism: no genomic signal suffices. Symptom: false-positive "genes". Fix: small-RNA-seq (miRNA); transcript assembly + coding-potential (lncRNA); label homology-only as candidates.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| --cut_ga per-family GA threshold | Rfam curation | family-specific noise floor; never override genome-wide |
| -Z = residues x 2 / 1e6 | Infernal | run-comparable E-values (both strands) |
| Bacterial tRNA ~28-90 (E. coli ~89; symbionts ~28-35) | GtRNAdb | 150+ implies fragments/pseudogenes |
| Eukaryotic tRNA ~170-570 (amplified) | tRNA literature | raw hits far higher (SINEs/pseudogenes); use high-confidence set |
| Bacterial rRNA operons 1-15 (E. coli ~7) | copy-number norm | annotated count is a floor (rDNA collapse) |
| miRNA needs small-RNA-seq Dicer signature | Fromm 2015 | foldability is not a miRNA; >1/3 miRBase is artifact |
| R-scape covariation + power before claiming structure | Rivas 2017/2020 | thermodynamic fold != selected structure |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| cmscan slow | full Rfam scan | --rfam preset; split genome; parallelize |
| Redundant overlapping calls | no clan competition | --fmt 2 --clanin; grep -v " = " |
| Missing expected ncRNAs | stale Rfam / cmpress not run | check Rfam version; verify .i1{f,i,m,p} files |
| Too many tRNA "pseudogenes" | normal in eukaryotes; false-positive in organelles | high-confidence set; -O/-D for organelles |
| Implausible rRNA copy number | rDNA array collapse | report as floor; use read depth/long reads |
| Two species differ by 1000s of miRNAs | miRBase contamination | use MirGeneDB; demand processing evidence |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.