genome-annotation/repeat-annotation/SKILL.md
Discovers, classifies, and masks repetitive elements and transposable elements with RepeatModeler2 (de novo family library), RepeatMasker (masking against a library), EDTA (plant/structural TEs), or EarlGrey (auto-curating wrapper), and quantifies TE expression from RNA-seq with TEtranscripts/SQuIRE. Covers de-novo-library-as-curation-project, soft-vs-hard masking, the domesticated-gene over-masking massacre, Dfam-vs-RepBase, TE classification (Class I/II, family-vs-copy), Kimura repeat landscapes, LAI, and the RNA-seq multimapping problem. Use when masking repeats before gene prediction, building a TE library for a non-model genome, or analyzing transposable-element content or expression.
Reference examples tested with: RepeatModeler 2.0.5+, RepeatMasker 4.1.5+, EDTA 2.1+, EarlGrey 4.0+, TEtranscripts 2.2+, matplotlib 3.8+, pandas 2.2+.
Before using code patterns, verify installed versions match. If versions differ:

- `<tool> --version` then `<tool> --help` to confirm flags
- `pip show <package>` then `help(module.function)` to check signatures

The library database version matters as much as the binary: RepeatMasker now ships with Dfam (open); RepBase has been paywalled since May 2019, so any pipeline that "requires RepBase" is a reproducibility/access hazard - record the Dfam release and library provenance. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Mask repeats in my genome assembly" -> Build a de novo repeat-family library, annotate copies genome-wide, and soft-mask them as a prerequisite for gene prediction.
RepeatModeler -database mydb -LTRStruct (library), RepeatMasker -lib lib.fa -xsmall assembly.fa (soft-mask), or EarlGrey/EDTA.pl (wrappers)

Two load-bearing truths the masker hides:
De novo library construction is a curation research project, not a button. A RepeatModeler2 run emits mydb-families.fa overnight - a draft of a draft: consensi are routinely 5'-truncated (L1 looks 1.5 kb when the active element is 6 kb), boundary-bled into flanking unique sequence, chimeric (two families merged), and 30-60% "Unknown" on a non-model genome. The dominant error in published TE annotations is the library, not the masker engine. Crucially, masking percentage is robust to a bad library (a chimeric consensus still masks roughly the right real estate), so the headline number survives while everything downstream rots: inflated family counts, wrong classification, distorted age landscapes, and - the killer - host-gene-contaminated consensi that silently mask real genes. Masking + gross % can use an automated library; any per-family biological claim (this family is young/active/novel) needs curation (Goubert 2022 Mob DNA 13:7; TE-Aid; MCHelper).
Annotation quality is capped by assembly quality. Short-read de Bruijn assemblers collapse near-identical TE copies and drop the youngest (most identical, most biologically active) ones, so short-read assemblies systematically under-count TEs and bias the age distribution toward "old" - which masquerades as the real signal "this lineage has no recent activity." Software cannot recover what the assembler threw away. Always ask what assembly a "% repeat" came from; HiFi/T2T raised the ceiling (LAI measures it) but T2T satellite/centromere repeats still exceed what the standard TE toolchain can annotate.
| Tool | Citation | Role | When |
|------|----------|------|------|
| RepeatModeler2 | Flynn 2020 PNAS | de novo family discovery -> consensus library | discover genome-specific families (run -LTRStruct) |
| RepeatMasker | Smit/Hubley/Green (software) | annotate/mask a genome against a library | the masking step; does not discover families |
| EarlGrey | Baril 2024 MBE | wraps RepeatModeler2 + auto consensus-elongation + RepeatMasker + plots | non-model default; minimal hand-work |
| EDTA | Ou 2019 Genome Biol | structural LTR/TIR/Helitron discovery + filtering | plant / structurally-rich genomes |
| LTR_retriever | Ou & Jiang 2018 Plant Physiol | isolate intact LTR-RTs; feeds LAI | LTR focus / assembly-quality (LAI) |
| TRF | Benson 1999 NAR | tandem/satellite repeats | a different algorithm class from TE maskers |
| RepeatClassifier / DeepTE / TERL | Flynn 2020; Yan 2020 | classify unknown consensi | attack the "Unknown" fraction (validate; can mislabel) |
Dfam (open) vs RepBase (paywalled since 2019) is the database schism - modern open pipelines build on Dfam + de novo. Engine -e: rmblast (default, consensus FASTA) vs nhmmer (Dfam profile HMMs, more sensitive for ancient repeats, slower) - the same genome reads a higher % with HMM detection.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Non-model eukaryote, defensible answer, minimal hand-work | EarlGrey | RepeatModeler2 + auto-curation + clean outputs |
| Plant / structurally-rich TE genome | EDTA | best-in-class LTR/TIR/Helitron structural annotation + host-gene filtering |
| Well-covered vertebrate, just need masking | RepeatMasker -species against Dfam | curated families already exist |
| Mask before gene prediction | RepeatMasker -xsmall (soft) + decontaminated library | predictors need soft-masking |
| Publication-grade TE biology claim | de novo -> manual curation (Goubert protocol, TE-Aid) | automated library is the start, not the end |
| Tandem/satellite/centromeric repeats | TRF + satellite tools (not RepeatMasker) | library-based TE tools don't see tandem arrays |
| TE expression from RNA-seq | -> TEtranscripts/SQuIRE (EM multimapper handling) | see expression section |
| TE insertion polymorphisms from reads | -> variant-calling (MELT/TEPID) | out of scope here |
```bash
# 1. De novo family discovery -> mydb-families.fa
BuildDatabase -name mydb assembly.fa
RepeatModeler -database mydb -threads 16 -LTRStruct   # -LTRStruct enables the LTR structural pipeline

# 2. (Recommended) decontaminate the library against host proteins, then UNION with Dfam clade
#    -> pull any consensus whose best hit is a host gene with no transposase/RT/integrase domain

# 3. Soft-mask against the custom library for gene prediction (-a also writes the .align file)
RepeatMasker -lib mydb-families.fa -xsmall -gff -a -e rmblast -pa 16 -dir rm_out assembly.fa
```
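Step 2 above is described only in comments. The following is a minimal sketch of the filtering logic, assuming you have already searched the library against host proteins (for example a BLASTX run with tabular `-outfmt 6`) and have separately collected the IDs of consensi that carry a transposase/RT/integrase domain from whatever domain scan you trust. The function name, the e-value cutoff, and both inputs are illustrative assumptions, not part of RepeatModeler or RepeatMasker.

```python
import csv
import io


def host_gene_consensi(blast_tsv_lines, te_domain_ids, max_evalue=1e-10):
    """Return consensus IDs whose best protein hit is a host gene and that
    carry no TE domain (candidates to drop from the library).

    blast_tsv_lines: iterable of BLAST outfmt-6 lines (query = consensus ID).
    te_domain_ids:   set of consensus IDs with a transposase/RT/integrase hit
                     (hypothetical input from a separate domain scan).
    max_evalue:      illustrative cutoff, not a recommended value.
    """
    best = {}  # consensus ID -> (bitscore, evalue) of its best hit
    for row in csv.reader(blast_tsv_lines, delimiter='\t'):
        if len(row) < 12:  # skip blank or malformed rows
            continue
        qseqid, evalue, bitscore = row[0], float(row[10]), float(row[11])
        if qseqid not in best or bitscore > best[qseqid][0]:
            best[qseqid] = (bitscore, evalue)
    return sorted(q for q, (_, ev) in best.items()
                  if ev <= max_evalue and q not in te_domain_ids)


# Made-up rows for illustration only
example = io.StringIO(
    "fam1#Unknown\thostP1\t92.0\t300\t24\t0\t1\t900\t1\t300\t1e-80\t500\n"
    "fam2#LTR/Gypsy\thostP2\t60.0\t200\t80\t0\t1\t600\t1\t200\t1e-30\t150\n"
    "fam3#Unknown\thostP3\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t20\n"
)
to_drop = host_gene_consensi(example, te_domain_ids={"fam2#LTR/Gypsy"})
print(to_drop)
```

Treat the returned IDs as candidates for inspection, not an automatic delete list: a consensus that hits a host protein may still be a genuine TE whose domain the scan missed.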
-xsmall = soft-mask (lowercase) - the key flag, the one people get wrong. Default .masked output hard-masks with N; -x masks with X. -nolow skips low-complexity/simple repeats (often wanted before gene prediction - see below). Outputs: .masked, .out, .tbl (summary %), and .align (written only when -a is passed; needed for the landscape).
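One way to confirm which masking mode a file actually carries before handing it to a gene predictor is to count lowercase bases (soft-masked), N, and X. This is a small sketch in plain Python; the function name and the inline FASTA are illustrative. Assembly gaps are also N, so a nonzero N count alone does not prove hard-masking.

```python
def masking_profile(fasta_lines):
    """Count soft-masked (lowercase acgt), N and X bases in FASTA text lines."""
    counts = {'total': 0, 'soft': 0, 'N': 0, 'X': 0}
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith('>'):
            continue
        counts['total'] += len(line)
        counts['soft'] += sum(1 for c in line if c in 'acgt')
        counts['N'] += line.count('N') + line.count('n')
        counts['X'] += line.count('X') + line.count('x')
    return counts


# Toy record for illustration only
toy = [">chr1", "ACGTacgtacgtACGT", "NNNNACGT"]
profile = masking_profile(toy)
print(profile)
```

For a real genome, pass an open file handle instead of the list. A soft-masked file should show a large `soft` count and no `X`.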
Before gene prediction, consider skipping low-complexity masking (-nolow) - simple repeats overlap real coding microsatellites and low-complexity protein domains.

Goal: Summarize masked content by class and plot the Kimura-divergence landscape (a relative within-genome age readout).
Approach: Parse the RepeatMasker .out file, group by class for bp and genome fraction, then histogram percent divergence stratified by major TE class (x = divergence-from-consensus ~ relative age).
```python
import pandas as pd


def parse_repeatmasker_out(out_file):
    """Parse a RepeatMasker .out file into one row per annotated repeat copy."""
    records = []
    with open(out_file) as f:
        for i, line in enumerate(f):
            if i < 3:  # skip the three header lines
                continue
            parts = line.split()
            if len(parts) < 15:  # skip blank or truncated lines
                continue
            records.append({'perc_div': float(parts[1]), 'seqid': parts[4],
                            'repeat_class': parts[10],
                            'length': int(parts[6]) - int(parts[5]) + 1})
    return pd.DataFrame(records)


def repeat_summary(rm_df, genome_size):
    """Return the percent of the genome covered by each repeat class."""
    # Summing per-copy lengths counts overlapping annotations twice.
    by_class = rm_df.groupby('repeat_class')['length'].sum().sort_values(ascending=False)
    total = rm_df['length'].sum()
    print(f'Total masked: {total/genome_size:.1%} of genome (a LOWER bound; ancient copies decay past detection)')
    return by_class / genome_size * 100
```
The landscape is right-censored - the most ancient TEs decayed past alignment detection, so "no old activity" can mean "old activity is invisible." A sharp left (low-divergence) peak is a recent/ongoing burst; treat presence of a recent peak as informative and absence of an old hump cautiously. A truncated/chimeric consensus distorts the whole x-axis (another reason curation matters); never compare landscapes across genomes annotated with different libraries.
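The binning behind such a landscape can be sketched as follows. The function takes plain `(repeat_class, perc_div, length)` tuples, so it works on rows from any parser. It uses the raw percent-divergence column of the .out file as a stand-in; a proper Kimura landscape is computed from the .align file. The function name, the 1% bin width, and the 50% cap are illustrative choices.

```python
from collections import defaultdict


def divergence_landscape(copies, genome_size, bin_width=1.0, max_div=50.0):
    """Bin masked bp by divergence for each major class.

    copies: iterable of (repeat_class, perc_div, length) tuples.
    Returns {major_class: [percent of genome per divergence bin]}.
    """
    n_bins = int(max_div / bin_width)
    landscape = defaultdict(lambda: [0.0] * n_bins)
    for repeat_class, perc_div, length in copies:
        major = repeat_class.split('/')[0]                 # 'LINE/L1' -> 'LINE'
        idx = min(int(perc_div / bin_width), n_bins - 1)   # clamp the tail
        landscape[major][idx] += 100.0 * length / genome_size
    return dict(landscape)


# Made-up copies for illustration only
toy_copies = [('LINE/L1', 2.3, 6000), ('LINE/L1', 2.9, 4000),
              ('LTR/Gypsy', 18.0, 5000), ('DNA/hAT', 31.5, 1000)]
land = divergence_landscape(toy_copies, genome_size=1_000_000)
print(round(land['LINE'][2], 3), round(land['LTR'][18], 3))
```

Each per-class list can be drawn as one layer of a stacked bar chart, with the bin index on the x-axis and percent of genome on the y-axis.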
A read from a young high-copy family maps equally to hundreds of near-identical loci. Unique-only mapping (standard RNA-seq QC) discards most TE signal and biases toward old, uniquely-mappable copies - measuring the least active elements. Use EM/probabilistic reassignment: TEtranscripts/TElocal (Jin 2015), SQuIRE (Yang 2019), Telescope (Bendall 2019). Subfamily-level (TEtranscripts: "L1 went up", high power, no locus) vs locus-level (SQuIRE/TElocal/Telescope: "this HERV-K on chr7 is on", noisy, mappability-sensitive) changes the conclusion, not just the resolution. The dominant false positive: a TE in an intron or downstream of an expressed gene is not "expressed" - read-through/intron-retention piles reads on it; distinguish autonomous transcription from passenger signal by strand and continuity (TEspeX filters embedded-TE reads). Be skeptical of any "TEs reactivated in disease/aging" headline that used unique-only mapping.
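To illustrate how EM reassignment differs from unique-only counting, here is a toy EM over reads that are each compatible with one or more TE families. It is a teaching sketch of the general idea only, not the algorithm as implemented in TEtranscripts, SQuIRE, or Telescope; the function names and data are hypothetical.

```python
def em_family_abundance(read_compat, n_iter=100):
    """Estimate relative family abundance from multimapping reads.

    read_compat: list of sets; each set holds the families a read aligns to.
    Returns {family: estimated fraction of reads}.
    """
    families = sorted(set().union(*read_compat))
    theta = {f: 1.0 / len(families) for f in families}  # uniform start
    for _ in range(n_iter):
        expected = dict.fromkeys(families, 0.0)
        for compat in read_compat:
            norm = sum(theta[f] for f in compat)
            if norm == 0.0:
                continue
            for f in compat:  # E-step: split the read by current abundances
                expected[f] += theta[f] / norm
        total = sum(expected.values())
        theta = {f: expected[f] / total for f in families}  # M-step
    return theta


def unique_only_counts(read_compat):
    """Count only reads compatible with exactly one family."""
    counts = {}
    for compat in read_compat:
        if len(compat) == 1:
            (f,) = compat
            counts[f] = counts.get(f, 0) + 1
    return counts


# Toy data: most reads are shared between a young and an old family.
reads = [{'L1_young', 'L1_old'}] * 8 + [{'L1_young'}] * 2 + [{'L1_old'}] * 1
theta = em_family_abundance(reads)
unique = unique_only_counts(reads)
print(theta, unique)
```

In this toy set, unique-only counting keeps 3 of 11 reads, while EM distributes all 11 across the two families in proportion to the estimated abundances.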
Trigger: running RepeatMasker without -xsmall (default hard-masks with N). Mechanism: masked sequence is destroyed. Symptom: genes overlapping repeats silently absent from the GFF. Fix: -xsmall; hand a soft-masked genome to the predictor.
Trigger: masking with an uncurated de novo library. Mechanism: multicopy gene families look repetitive and enter the library. Symptom: suspiciously few NLR/ZNF/OR genes; domesticated genes (RAG1, CENP-B) missing. Fix: BLAST the library against a protein DB; drop consensi hitting host genes with no TE domain.
Trigger: comparing TE content across studies/assemblies. Mechanism: short reads collapse/drop young copies; % depends on library+engine+assembly. Symptom: "low TE, all ancient" or non-comparable cross-study tables. Fix: check LAI/assembly type; report method + assembly with every number; never compare published % across papers.
Trigger: unique-only TE quantification. Mechanism: young high-copy families are not uniquely mappable. Symptom: most TE signal lost, bias to old elements. Fix: EM tools (TEtranscripts/SQuIRE/Telescope); separate read-through from autonomous transcription.
Trigger: shipping a 40%-Unknown library without inspection. Mechanism: classification is the hardest, last, most-skipped step. Symptom: weak biological annotation; possible gene-family contamination hiding in Unknown. Fix: RepeatClassifier/DeepTE to triage; curate; note DB-coverage limits.
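A quick way to inspect a library before shipping it is to tally consensi by the class label in each FASTA header and report the Unknown fraction alongside consensus lengths. The sketch assumes the `>name#Class/Family` header convention seen in RepeatModeler output; the function name and toy records are illustrative.

```python
from collections import Counter


def library_class_profile(fasta_lines):
    """Return (Counter of major classes, {consensus_id: length})."""
    classes, lengths, current = Counter(), {}, None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith('>'):
            current = line[1:].split()[0]          # 'rnd-1_family-1#LINE/L1'
            label = current.split('#', 1)[1] if '#' in current else 'Unknown'
            classes[label.split('/')[0]] += 1
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return classes, lengths


# Made-up library for illustration only
toy_lib = [">rnd-1_family-1#LINE/L1 ( toy )", "ACGT" * 300,
           ">rnd-1_family-2#Unknown", "ACGT" * 50,
           ">rnd-1_family-3#Unknown", "ACGT" * 20,
           ">rnd-1_family-4#DNA/hAT", "ACGT" * 100]
classes, lengths = library_class_profile(toy_lib)
unknown_frac = classes['Unknown'] / sum(classes.values())
print(f'Unknown: {unknown_frac:.0%} of {sum(classes.values())} consensi')
```

Sorting `lengths` by value surfaces the shortest consensi, which are the first candidates to check for truncation.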
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| -xsmall soft-mask before gene prediction | predictor requirement | hard-mask truncates repeat-overlapping genes |
| TE content scales with genome size (human ~50%, maize ~85%, Arabidopsis ~20-25%, fungi ~1-20%) | clade norms (approx) | main driver of the C-value enigma; sanity-check vs genome size |
| "Unknown" ~<15% (mammal) vs 30-50% (non-model) | DB coverage | high Unknown bounds biological claims; very low on non-model = over-assignment |
| LAI <10 draft / 10-20 reference / >20 gold | Ou 2018 NAR | LTR-RT-resolution metric; only valid for LTR-rich genomes |
| Report library + engine + assembly with any % | reproducibility | % masked is non-comparable across methods |
| 80-80-80 (≥80% id over ≥80% length over ≥80 bp) | Wicker lineage | dereplication threshold, NOT a quality check |
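The 80-80-80 row can be written as a predicate. This sketch encodes the rule as worded in the table; using the shorter sequence as the length denominator is an assumption, and the function name is hypothetical.

```python
def passes_80_80_80(pct_identity, aln_len, len_a, len_b):
    """True if an alignment between two sequences meets the 80-80-80 rule:
    >=80% identity, covering >=80% of the shorter sequence, over >=80 bp."""
    shorter = min(len_a, len_b)
    if shorter == 0:
        return False
    coverage = 100.0 * aln_len / shorter
    return pct_identity >= 80.0 and coverage >= 80.0 and aln_len >= 80


print(passes_80_80_80(85.0, 900, 1000, 1200))  # high identity, 90% coverage
print(passes_80_80_80(85.0, 60, 70, 1200))     # alignment shorter than 80 bp
print(passes_80_80_80(75.0, 900, 1000, 1200))  # identity below 80%
```

Passing this test only says two sequences would be merged as one family during dereplication; it says nothing about whether either consensus is complete or correctly classified.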
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Gene prediction finds too few genes | hard-masked, or over-masked low-complexity | -xsmall; -nolow before gene prediction |
| Suspiciously few NLR/ZNF/OR genes | uncurated library masked gene families | decontaminate library against a protein DB |
| Low masking percentage | novel repeats absent from DB | run RepeatModeler2 first; union de novo + Dfam |
| RepeatModeler very slow | normal for large genomes | -threads; consider EDTA (plants) or EarlGrey |
| "TE re-activated" result looks too clean | unique-only mapping / read-through | EM tools; check strand + continuity from neighbor |
| Cross-study % repeat disagree | different library/engine/assembly | re-annotate uniformly; report method |