genome-intervals/proximity-operations/SKILL.md
Performs proximity operations on genomic intervals with bedtools (closest, window, flank, slop) and pybedtools - nearest-feature queries with signed/strand-aware distance, fixed-radius window searches, strand-aware promoter construction, and interval extension. Covers the closest -d/-D a/b/ref/-t/-k/-io/-iu/-id flags, the -D ref strand sign-flip, silent chromosome-end clipping in slop/flank, -t all tie double-counting, and the critical distinction between a geometry answer (nearest TSS) and a biology answer (which gene an element regulates). Use when assigning peaks or variants to genes, defining promoters from a gene model, building distance-to-TSS distributions, finding features within a window, or extending intervals - and when deciding whether nearest-gene is a fair prior (GWAS locus) or a trap (distal enhancer).
npx skillsauth add GPTomics/bioSkills bio-genome-intervals-proximity-operationsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bedtools 2.31+, pybedtools 0.10+.
Before using code patterns, verify installed versions match. If versions differ:
bedtools --version then bedtools <subcommand> --help to confirm flagspip show pybedtools then help(pybedtools.BedTool.closest) to check signaturesflank and slop REQUIRE a chrom-sizes (genome.txt, two columns: chrom<TAB>length) file via -g; closest requires both inputs coordinate-sorted (sort -k1,1 -k2,2n). If code throws an error, introspect the installed tool and adapt rather than retrying.
"Which gene is nearest to each peak, and is that the gene it regulates?" -> Compute interval geometry (nearest feature, signed distance, window membership, strand-aware promoters) with bedtools, then decide honestly whether geometry answers the biological question.
bedtools closest -D b -t first -a peaks.bed -b genes.bed, bedtools window -w 50000, bedtools slop -s -l 2000 -r 200 -g genome.txtpeaks.closest(genes.sort(), D='b', t='first'), peaks.window(genes, w=50000), tss.slop(g='genome.txt', s=True, l=2000, r=200) (pybedtools)bedtools closest answers "what is the nearest annotated TSS?" - a coordinate fact. The user almost always wants "which gene does this element regulate?" - a biology claim. For distal regulatory elements these disagree the majority of the time. In the CRISPRi-FlowFISH gold standard (Fulco 2019 Nat Genet 51:1664), assigning each tested distal element to the closest expressed gene gave only ~47% precision and ~37% recall - the nearest gene was the wrong target most of the time, and the method missed nearly two-thirds of real links. Enhancers routinely skip intervening genes: the canonical case is the obesity-associated FTO intron regulating IRX3 ~500 kb away, not FTO (Smemo 2014 Nature 507:371). Do the bedtools arithmetic flawlessly here, then route real enhancer->gene linking to activity/contact/QTL methods (ABC: Fulco 2019, Nasser 2021; PCHi-C; eQTL-coloc) at atac-seq/enhancer-gene-linking - never present "nearest gene" as a regulatory call for a distal element.
The deeper twist - two regimes, opposite advice, identical command:
Conflating the two regimes is the real error. The discriminator: is the question the target of an enhancer (distrust nearest) or the gene under a GWAS peak (nearest is a fine first pass)?
| Operation | What it computes | Strand-aware? | Needs genome file? |
|-----------|------------------|---------------|--------------------|
| closest | For each A, the nearest B (+ optional signed distance) | optional (-s/-S, -D a/-D b) | no |
| window | For each A, all B within +-W bp (fuzzy intersect) | optional (-sw/-sm/-Sm) | no |
| slop | Grow each interval by N bp, keeping it one feature | optional (-s) | yes (-g) |
| flank | Emit the regions BESIDE each interval, dropping the body | optional (-s) | yes (-g) |
closest/window are queries (A vs B); slop/flank are transforms (A only, + genome file). The slop-vs-flank distinction trips people: slop -b 1000 makes a peak 2 kb wider (one feature); flank -b 1000 returns only the left/right neighboring 1 kb regions and discards the peak itself (two features). window -w 0 is approximately intersect.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Nearest gene to a promoter-proximal mark (H3K4me3, Pol II, CAGE) | closest -D b -io -t first | the peak really is at the gene it marks; closest is honest here |
| Distal enhancer / ATAC peak -> which gene? | closest/window as candidates, then -> atac-seq/enhancer-gene-linking | nearest is wrong the majority of the time (ABC/PCHi-C/eQTL link it) |
| GWAS credible-set SNP -> implicated gene | closest to nearest protein-coding gene, then -> causal-genomics/colocalization-analysis | nearest-coding-gene is a ~50-65% prior; a fair first pass |
| All candidate genes near an element | window -w 50000 (or TAD-scale) | honest "candidate set", not a single call |
| Build promoters from a gene model | collapse to TSS, then slop -s -l UP -r DOWN -g | a promoter is an imposed definition, strand-aware, from the TSS |
| Distance-to-TSS distribution | closest -D b -d then plot signed distance | a distribution beats a binary "promoter vs distal" threshold |
| Upstream-only / downstream-only nearest | closest -D b -iu / -id | direction must be strand-relative (-D b), never -D ref |
| Peak-set GO enrichment from proximity | -> GREAT/rGREAT (regulatory-domain model) | avoids the -t all double-counting and distal mis-assignment |
| Regions flanking a feature (splice/boundary context) | flank -s -b N -g | the regions outside the feature, strand-aware |
| Peaks not yet called | -> chip-seq/peak-calling, atac-seq/atac-peak-calling | this skill operates on existing intervals |
Default: for each A, report the single nearest B; on ties, report ALL tied B (-t all is the default - the double-counting trap below). Both inputs must be sorted. When A's chromosome has no B feature, bedtools prints none for B columns and -1 for distance - filter this sentinel before any numeric summary.
# Nearest gene, signed distance by the GENE's strand, ignore overlaps, one row per peak
bedtools sort -i peaks.bed > peaks.sorted.bed
bedtools sort -i genes.bed > genes.sorted.bed
bedtools closest -a peaks.sorted.bed -b genes.sorted.bed -D b -io -t first > nearest.bed
# ^^^^ sign by gene strand (biology, not coordinates)
# ^^^ closest non-overlapping gene
# ^^^^^^^^ resolve ties deterministically (document this)
# k=3 nearest with unsigned distance (k>1 intentionally multiplies rows)
bedtools closest -a peaks.sorted.bed -b genes.sorted.bed -k 3 -d > top3.bed
import pybedtools
peaks = pybedtools.BedTool('peaks.bed').sort()
genes = pybedtools.BedTool('genes.bed').sort()
near = peaks.closest(genes, D='b', io=True, t='first') # -D b -io -t first
near = near.filter(lambda x: int(x.fields[-1]) != -1) # drop the no-feature sentinel
near.saveas('nearest.bed')
Key flags: -d unsigned distance (overlaps = 0); -D ref signed by coordinate only (strand-agnostic - see Failure Modes); -D a/-D b signed by A's / B's strand; -t all|first|last; -k N k-nearest; -io ignore overlapping B; -iu/-id ignore upstream/downstream (require -D); -fu/-fd first upstream/downstream; -s/-S same/opposite strand; -N require different names; -mdb each|all and -names/-filenames for multiple -b files.
window reports all B within a window around each A (default 1000 bp each side). Use it for the honest "candidate genes near this element" framing.
# All genes within 50 kb of each peak, counted per peak
bedtools window -a peaks.bed -b genes.bed -w 50000 -c > peak_gene_counts.bed
Flags: -w N symmetric (default 1000); -l N/-r N asymmetric (coordinate left/right); -sw define -l/-r BY STRAND; -sm/-Sm keep only same/opposite-strand B; -u boolean (A once if any B); -c count of B per A; -v A with no B in window. -sw controls where the window is; -sm/-Sm control which B count - distinct concerns.
Both need -g genome.txt precisely so they can clip at chromosome boundaries - extension past coordinate 0 or past chrom length is silently truncated (start floored at 0, end capped). Flags: -b N both sides; -l N/-r N per side (coordinate unless -s); -s strand-aware (on a --strand feature -l adds to the END, so -l always means "upstream of the feature"); -pct treat N as a fraction of feature length; -header echo input header.
Goal: Produce a promoter BED (TSS -2000 / +200 bp, strand-aware) that is correct for both strands - the right way to define "promoter", which is a choice imposed on a TSS, not an annotated feature.
Approach: Collapse genes to their TSS first (start for +, end-1 for -), THEN slop -s so "upstream" tracks strand. Running slop -b 2000 on a gene BODY is the wrong promoter (it grows the whole gene, ignores strand).
# 1) TSS BED from a BED6 gene model (strand-aware single base)
awk -v OFS='\t' '{ if ($6=="+") print $1,$2,$2+1,$4,$5,$6; else print $1,$3-1,$3,$4,$5,$6 }' genes.bed > tss.bed
# 2) Promoter = TSS -2000 / +200, strand-aware (-l is always the upstream side under -s)
bedtools slop -i tss.bed -g genome.txt -s -l 2000 -r 200 > promoters.bed
import pybedtools
UP = 2000 # bp upstream of TSS; common core-promoter convention, NOT a fact -- report it and tune per assay
DOWN = 200 # bp downstream of TSS; asymmetric on purpose (+1 nucleosome / 5'UTR sit downstream)
genes = pybedtools.BedTool('genes.bed')
tss = genes.each(lambda f: pybedtools.create_interval_from_list([f[0], str(f.start) if f.strand == '+' else str(f.end - 1), str(f.start + 1) if f.strand == '+' else str(f.end), f.name, f.score, f.strand])).saveas()
promoters = tss.slop(g='genome.txt', s=True, l=UP, r=DOWN).saveas('promoters.bed')
flank shares the flag vocabulary but emits the regions BESIDE each feature and drops the original (two intervals per input, used for splice/boundary context):
bedtools flank -i exons.bed -g genome.txt -s -b 1000 > exon_flanks.bed # 1 kb each side, strand-aware
Trigger: using closest -D ref and interpreting the sign as upstream/downstream. Mechanism: -D ref signs by genomic coordinate only (lower = negative); for a --strand gene the TSS is at the HIGHER coordinate, so "upstream" runs to higher coordinates and the coordinate sign is inverted relative to biology. Symptom: half the genes (the --strand ones) are folded the wrong way; symmetric QC (TSS-enrichment plot) still looks fine, but any "enhancers preferentially upstream" claim washes out or inverts. Fix: use -D b (sign by the gene's strand) for any upstream/downstream biology; reserve -D ref for pure left/right genomic distance.
Trigger: slop -b 2000 (or -l 2000 -r 0 without -s) on a gene-body BED, called "the promoter". Mechanism: it grows the window around the whole gene, not the TSS, and without -s adds the "upstream" side to the wrong (3') end on --strand genes. Symptom: a 100 kb gene becomes a 104 kb "promoter"; every --strand promoter is shifted into the gene body. Fix: collapse to TSS first, then slop -s -l UP -r DOWN.
Trigger: fixed-width windows near contig starts / telomeres. Mechanism: slop/flank truncate at 0 and chrom length with no warning. Symptom: a TSS 800 bp from a contig start yields a 1000-bp (not 2000-bp) upstream window - quietly asymmetric, biasing per-window normalization (reads/kb, motif density); flank can drop a region entirely, breaking a 2:1 feature->flank assumption. Fix: after slop verify end-start == requested width; after flank verify the per-feature flank count; treat chrom-end features as edge cases.
Trigger: letting default -t all rows flow into a per-gene tally, wc -l peak count, or GO/hypergeometric enrichment. Mechanism: a peak equidistant to two TSSs emits two rows; ties concentrate NON-randomly at bidirectional (head-to-head) promoters and gene-dense regions. Symptom: association counts inflated exactly where biology is most interesting; broken independence inflates significance. Fix: -t first (deterministic but arbitrary - document it) OR -t all then aggregate counting distinct PEAKS not rows; for enrichment prefer GREAT/rGREAT, whose regulatory-domain model exists to avoid this artifact.
Trigger: closest on a BED that was filtered/edited and not re-sorted. Mechanism: closest assumes coordinate-sorted input. Symptom: wrong nearest feature or an error. Fix: bedtools sort (or .sort() in pybedtools) both A and B first.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Promoter TSS -2000 / +200 bp (strand-aware) | common convention | a CHOICE, not a fact; asymmetric because core-promoter elements sit upstream and the +1 nucleosome / 5'UTR downstream. Report it; "% promoter-proximal" is sensitive to it | | GREAT basal domain 5 kb up / 1 kb down, extension <=1 Mb | McLean 2010 Nat Biotechnol 28:495 | the principled "proximity++": asymmetric basal domain + extension to the neighbor, a far better proximity heuristic than raw closest | | ChIPseeker default promoter +-3 kb | tool default | shows the convention spans an order of magnitude (+-500 bp to +-10 kb across tools) | | ABC candidate window 5 Mb | Fulco 2019 | activity-by-contact scores all elements within 5 Mb of a gene's promoter - "distal" is tens of kb to megabases | | Nearest gene precision/recall ~47% / ~37% (enhancers) | Fulco 2019 CRISPRi-FlowFISH | the empirical ceiling on nearest-gene for distal-enhancer targeting | | Nearest protein-coding gene ~50-65% right (GWAS loci) | fine-mapping/coloc literature | the GWAS-regime baseline; strong, hard to beat, but imperfect | | Distal flag |dist| > ~50-100 kb | field convention | beyond promoter scale; flag for ABC/PCHi-C/eQTL validation rather than trusting nearest |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Nearest-gene call wrong for an enhancer | geometry != regulation for distal elements | treat as candidate; route to atac-seq/enhancer-gene-linking (ABC/PCHi-C/eQTL) |
| Upstream/downstream asymmetry washes out or inverts | -D ref mis-signs --strand genes | use -D b |
| Promoter window includes the whole gene | slop -b on a gene body, not the TSS | collapse to TSS, then slop -s -l UP -r DOWN |
| Per-gene counts inflated near bidirectional promoters | -t all rows counted as peaks | -t first or aggregate by distinct peak; or use GREAT |
| Asymmetric "fixed-width" windows near contig ends | silent slop/flank clipping | verify end-start; treat chrom-end features as edge cases |
| none / -1 rows poison a mean distance | no B feature on that chromosome | filter the -1 sentinel before summarizing |
| Wrong nearest feature, or closest errors | unsorted input | bedtools sort both A and B |
| Empty output | chr1 vs 1 naming mismatch between A, B, genome file | harmonize chromosome naming across all files |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.