skills/tooluniverse-microbial-genome-characterization/SKILL.md
Genome-ASSEMBLY discovery, QC, and replicon mapping for any organism (bacteria, archaea, fungi, and beyond) using NCBI Datasets. Resolves an organism name or taxid to assemblies, picks the reference/representative or best-quality assembly, pulls assembly QC metrics (total length, contig/scaffold N50, contig count, GC%, assembly level, RefSeq category), enumerates chromosomes and plasmids via per-replicon sequence reports, and compares candidate assemblies on quality. Use for "what genomes are available for [organism]", "assembly stats / N50 / GC content for [GCF_/GCA_ accession]", "how many plasmids does [strain] have", "compare assemblies for [species]", "find the reference genome for [taxon]", "is this assembly Complete Genome or just contigs". NOT for gene-level orthology/synteny (use tooluniverse-comparative-genomics), plant gene structure (use tooluniverse-plant-genomics), de novo assembly from raw reads (no tool exists), or taxonomy-only name/lineage lookups.
Discover, quality-control, and structurally map genome ASSEMBLIES for any organism using the keyless NCBI Datasets genome tools. Organism/taxon in → assembly inventory, QC metrics, and chromosome/plasmid map out.
When uncertain about an accession, assembly level, replicon count, or N50, CALL the tool. Never report assembly statistics from memory — accessions and metrics change with each RefSeq release. A live NCBI Datasets answer is always more reliable than a guess.
When comparing multiple assemblies or ranking by quality, retrieve each via the tools, then write and run Python (pandas) over the returned JSON to sort, score, and tabulate. Don't describe what you would compute — execute it and report actual numbers.
Triggers:
Use Cases:
NOT this skill (point elsewhere):
- Gene-level orthology/synteny: use tooluniverse-comparative-genomics
- Plant gene structure: use tooluniverse-plant-genomics

| Tool | Key params | Returns |
|------|-----------|---------|
| NCBIDatasets_suggest_taxonomy | query (organism name string) | candidate matches: scientific_name, tax_id, rank, group_name |
| NCBIDatasets_get_taxonomy | tax_id (string/int) | organism_name, rank, lineage, children |
| NCBIDatasets_list_genomes_by_taxon | taxon (name OR taxid), limit, reference_only (bool) | assembly list (accession, assembly_level, refseq_category, total_sequence_length, contig_n50, gc_percent, number_of_chromosomes, number_of_contigs); metadata.total_available = full count |
| NCBIDatasets_get_genome_assembly | accession (GCF_/GCA_) | full QC: total_sequence_length, number_of_chromosomes, number_of_contigs, contig_n50, scaffold_n50, gc_percent, assembly_level, assembly_status, refseq_category, release_date, submitter, annotation_provider |
| NCBIDatasets_get_sequence_reports | accession (GCF_/GCA_) | per-replicon list: chr_name, role, refseq_accession, genbank_accession, length, gc_percent |

Param note: get_taxonomy requires tax_id (NOT taxon). list_genomes_by_taxon accepts either a name or a taxid in its taxon field. Always pass an accession to the assembly/sequence-report tools.
If the user gives an organism name, resolve it to a tax id first:
NCBIDatasets_suggest_taxonomy {"query": "Escherichia coli"}
Pick the candidate whose scientific_name/rank matches the user's intent (species vs. a specific strain). Optionally confirm lineage/children with NCBIDatasets_get_taxonomy {"tax_id": "562"}.
If the user already gave a GCF_/GCA_ accession, skip to Phase 2.
List what exists for the taxon. Start with reference_only: true to surface the curated reference/representative genome(s); set it to false to see the full set.
NCBIDatasets_list_genomes_by_taxon {"taxon": "562", "limit": 5, "reference_only": true}
Read metadata.total_available for the true count (large taxa return thousands — the data array is only the first limit rows). Note each candidate's assembly_level, refseq_category, contig_n50, and number_of_contigs.
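A minimal sketch of reading a listing result, assuming the tool returns a dict with a `data` array and a `metadata.total_available` field as described above. The `summarize_listing` helper is hypothetical, and the sample response is an abbreviated stand-in built from the M. tuberculosis values quoted in this skill, not a live call.

```python
def summarize_listing(response: dict) -> dict:
    """Compare the rows actually returned with the full count the tool reports."""
    rows = response.get("data", [])
    # Fall back to the row count if the metadata block is missing (assumption).
    total = response.get("metadata", {}).get("total_available", len(rows))
    return {
        "returned": len(rows),
        "total_available": total,
        "truncated": total > len(rows),
    }


# Hypothetical, abbreviated response shape; field names follow the tool table.
sample_response = {
    "data": [
        {
            "accession": "GCA_000195955.2",
            "assembly_level": "Complete Genome",
            "refseq_category": "reference genome",
            "contig_n50": 4411532,
            "number_of_contigs": 1,
        },
        {
            "accession": "GCF_000195955.2",
            "assembly_level": "Complete Genome",
            "refseq_category": "reference genome",
            "contig_n50": 4411532,
            "number_of_contigs": 1,
        },
    ],
    "metadata": {"total_available": 16311},
}

summary = summarize_listing(sample_response)
print(summary)
```

When `truncated` is true, the count to report is `total_available`, not the number of rows in hand.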
Prefer, in order:
1. refseq_category == "reference genome" (NCBI's single designated reference)
2. refseq_category == "representative genome"
3. Highest assembly_level (Complete Genome > Chromosome > Scaffold > Contig)
4. Highest contig_n50 and lowest number_of_contigs among same-level candidates

NCBIDatasets_get_genome_assembly {"accession": "GCF_000005845.2"}
Report: total length, # chromosomes, # contigs, contig N50, scaffold N50, GC%, assembly level, RefSeq category, release date, annotation provider.
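The preference order used to choose the accession for the QC call above can be sketched as a sort key. This is one possible encoding, not the skill's own implementation: `preference_key`, `pick_assembly`, and the candidate rows (including their placeholder accessions) are hypothetical.

```python
# Lower rank = more preferred. Unknown values sort last (assumption).
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def preference_key(row: dict) -> tuple:
    """RefSeq category first, then assembly level, then N50 (high), then contigs (low)."""
    return (
        CATEGORY_RANK.get(row.get("refseq_category"), len(CATEGORY_RANK)),
        LEVEL_RANK.get(row.get("assembly_level"), len(LEVEL_RANK)),
        -(row.get("contig_n50") or 0),
        row.get("number_of_contigs") or float("inf"),
    )


def pick_assembly(rows: list) -> dict:
    """Return the most preferred candidate under the key above."""
    return min(rows, key=preference_key)


# Invented candidates for illustration only; not real accessions or metrics.
candidates = [
    {"accession": "EXAMPLE_A", "refseq_category": None,
     "assembly_level": "Complete Genome", "contig_n50": 5_000_000, "number_of_contigs": 1},
    {"accession": "EXAMPLE_B", "refseq_category": "representative genome",
     "assembly_level": "Scaffold", "contig_n50": 200_000, "number_of_contigs": 40},
    {"accession": "EXAMPLE_C", "refseq_category": None,
     "assembly_level": "Complete Genome", "contig_n50": 4_000_000, "number_of_contigs": 2},
    {"accession": "EXAMPLE_D", "refseq_category": None,
     "assembly_level": "Contig", "contig_n50": 50_000, "number_of_contigs": 300},
]

best = pick_assembly(candidates)
ordered = [row["accession"] for row in sorted(candidates, key=preference_key)]
print(best["accession"], ordered)
```

Under this key a designated representative genome outranks an undesignated Complete Genome, which matches the stated order; whether that is the right call for a given question is a judgment the caller still has to make.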
NCBIDatasets_get_sequence_reports {"accession": "GCF_000005845.2"}
Each row is one replicon. Distinguish chromosomes from plasmids by chr_name / role: a row with a plasmid-style chr_name (e.g., pO157 or pOSAK1) is a plasmid; a row named chromosome is a chromosome. To answer "how many plasmids", count the assembled-molecule rows that are not chromosomes.
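As a rough sketch of that counting rule, the snippet below splits sequence-report rows into chromosomes and plasmids. It assumes each row carries `chr_name` and `role`, that assembled replicons have `role == "assembled-molecule"`, and that chromosome rows contain the word "chromosome" in `chr_name`; these are assumptions about the payload, and `split_replicons` is a hypothetical helper. The sample rows reuse the GCF_000008865.2 values quoted in this skill.

```python
def split_replicons(rows: list) -> tuple:
    """Split assembled-molecule rows into (chromosomes, plasmids) by chr_name."""
    chromosomes, plasmids = [], []
    for row in rows:
        # Ignore anything that is not a fully assembled replicon (assumed role value).
        if row.get("role") != "assembled-molecule":
            continue
        name = (row.get("chr_name") or "").lower()
        if "chromosome" in name:
            chromosomes.append(row)
        else:
            plasmids.append(row)
    return chromosomes, plasmids


# Rows modeled on the E. coli O157:H7 Sakai example; the role values are assumed.
sample_rows = [
    {"chr_name": "chromosome", "role": "assembled-molecule",
     "refseq_accession": "NC_002695.2", "length": 5498578},
    {"chr_name": "pOSAK1", "role": "assembled-molecule",
     "refseq_accession": "NC_002127.1", "length": 3306},
    {"chr_name": "pO157", "role": "assembled-molecule",
     "refseq_accession": "NC_002128.1", "length": 92721},
]

chromosomes, plasmids = split_replicons(sample_rows)
plasmid_names = [row["chr_name"] for row in plasmids]
print(len(chromosomes), "chromosome(s);", len(plasmids), "plasmid(s):", plasmid_names)
```

Organisms whose chromosomes are labeled differently (for example numbered chromosomes) would need a different test than the name match used here.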
When the user wants the best of several assemblies, fetch each accession, build a pandas table, and sort by (assembly_level rank, then contig_n50 desc, then number_of_contigs asc). Report the winner with the metrics that decided it.
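One way to write that comparison in pandas is sketched below. The records are invented placeholders (not real accessions or metrics) standing in for the JSON returned by each get_genome_assembly call, and `rank_assemblies` is a hypothetical helper name.

```python
import pandas as pd

# Lower rank = more contiguous assembly level.
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def rank_assemblies(records: list) -> pd.DataFrame:
    """Sort by assembly-level rank, then contig_n50 descending, then number_of_contigs ascending."""
    df = pd.DataFrame(records)
    # Unknown levels sort after all known levels (assumption).
    df["level_rank"] = df["assembly_level"].map(LEVEL_RANK).fillna(len(LEVEL_RANK))
    return df.sort_values(
        by=["level_rank", "contig_n50", "number_of_contigs"],
        ascending=[True, False, True],
    ).reset_index(drop=True)


# Invented records for illustration only.
records = [
    {"accession": "EXAMPLE_1", "assembly_level": "Scaffold",
     "contig_n50": 350_000, "number_of_contigs": 60},
    {"accession": "EXAMPLE_2", "assembly_level": "Complete Genome",
     "contig_n50": 4_900_000, "number_of_contigs": 3},
    {"accession": "EXAMPLE_3", "assembly_level": "Complete Genome",
     "contig_n50": 5_100_000, "number_of_contigs": 2},
    {"accession": "EXAMPLE_4", "assembly_level": "Contig",
     "contig_n50": 90_000, "number_of_contigs": 210},
]

ranked = rank_assemblies(records)
winner = ranked.iloc[0]
print(ranked[["accession", "assembly_level", "contig_n50", "number_of_contigs"]])
print("winner:", winner["accession"])
```

Printing the sorted table alongside the winner keeps the deciding metrics visible in the answer.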
Assembly level (contiguity, best → worst):

| Level | Meaning |
|-------|---------|
| Complete Genome | Every replicon (each chromosome + each plasmid) fully resolved as one gapless sequence. Gold standard. |
| Chromosome | Chromosome(s) assembled to near-complete, but may contain gaps; plasmids/organelles may be incomplete. |
| Scaffold | Contigs ordered/oriented into scaffolds using gap-spanning evidence; gaps remain. Draft. |
| Contig | Only contiguous stretches; no scaffolding. Most fragmented draft. |

Contiguity metrics (a typical bacterial genome is 2–6 Mb):
RefSeq category:

| Value | Meaning |
|-------|---------|
| reference genome | NCBI's single, most-curated assembly for the taxon; the default to cite. |
| representative genome | A high-quality assembly chosen to represent the species when no formal reference is designated. |
| (null / none) | An ordinary submitted assembly, not specially designated. |

GCF_ vs GCA_: GCF_ = RefSeq (NCBI-curated, consistent annotation). GCA_ = GenBank (as submitted by the author). They share the numeric core (e.g., GCF_000005845.2 / GCA_000005845.2); prefer GCF_ when both exist.
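The pairing rule can be illustrated with a small helper that groups accessions by their numeric core and keeps the GCF_ member when both exist. This is a sketch under the assumption that accessions follow the `GC[AF]_` + nine digits + `.version` pattern seen in the examples here; `parse_accession`, `is_pair`, and `prefer_refseq` are hypothetical names.

```python
import re

# Assumed accession shape: GCA_ or GCF_, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def parse_accession(accession: str) -> dict:
    """Split an assembly accession into prefix, numeric core, and version."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a GCF_/GCA_ assembly accession: {accession!r}")
    prefix, core, version = match.groups()
    return {"prefix": prefix, "core": core, "version": int(version)}


def is_pair(first: str, second: str) -> bool:
    """True when one accession is GCF_ and the other GCA_ with the same numeric core."""
    a, b = parse_accession(first), parse_accession(second)
    return a["core"] == b["core"] and a["prefix"] != b["prefix"]


def prefer_refseq(accessions: list) -> list:
    """Keep one accession per numeric core, choosing GCF_ when both are present."""
    chosen = {}
    for accession in accessions:
        core = parse_accession(accession)["core"]
        current = chosen.get(core)
        if current is None or (accession.startswith("GCF_") and not current.startswith("GCF_")):
            chosen[core] = accession
    return list(chosen.values())


paired = is_pair("GCF_000005845.2", "GCA_000005845.2")
kept = prefer_refseq(["GCA_000195955.2", "GCF_000195955.2", "GCA_000005845.2"])
print(paired, kept)
```

The helper only compares strings; whether a GCF_ counterpart actually exists for a given GCA_ accession still has to be confirmed with a tool call.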
1. NCBIDatasets_suggest_taxonomy {"query":"Escherichia coli"} → species tax id 562.
2. NCBIDatasets_list_genomes_by_taxon {"taxon":"562","limit":5,"reference_only":true} → top hit GCF_000005845.2 (E. coli str. K-12 substr. MG1655), assembly_level Complete Genome, refseq_category reference genome, total 4,641,652 bp, contig_n50 4,641,652, GC 51%. metadata.total_available = 2 reference-grade.
3. NCBIDatasets_get_genome_assembly {"accession":"GCF_000005845.2"} → 4.64 Mb, 1 chromosome, 1 contig, contig N50 = scaffold N50 = 4,641,652 (the entire genome is one gapless contig), GC 51%, Complete Genome, released 2013-09-26, annotated by NCBI RefSeq.
4. NCBIDatasets_get_sequence_reports {"accession":"GCF_000005845.2"} → one replicon: chromosome, RefSeq NC_000913.3 (GenBank U00096.3), 4,641,652 bp, GC 51%. Zero plasmids.

Answer: The E. coli K-12 reference genome is GCF_000005845.2, a 4.64 Mb Complete Genome with a single chromosome (NC_000913.3), no plasmids, GC 51%.

NCBIDatasets_get_sequence_reports {"accession":"GCF_000008865.2"} → three replicons:
- chromosome: NC_002695.2 (5,498,578 bp)
- pOSAK1 (plasmid): NC_002127.1 (3,306 bp)
- pO157 (plasmid): NC_002128.1 (92,721 bp)

Answer: 1 chromosome + 2 plasmids (pOSAK1 ~3.3 kb, pO157 ~92.7 kb). Note: the assembly's number_of_chromosomes field reports 3 (it counts all assembled molecules); the sequence report is authoritative for telling chromosomes from plasmids by name/role.

NCBIDatasets_list_genomes_by_taxon {"taxon":"Mycobacterium tuberculosis","limit":3,"reference_only":false} → metadata.total_available = 16,311 assemblies; first rows include GCA_000195955.2 and its RefSeq pair GCF_000195955.2 (both Complete Genome, reference genome, contig N50 4,411,532, 1 contig). Use reference_only:true to cut 16k assemblies down to the curated reference; never page through all of them.
- number_of_chromosomes counts assembled molecules, not strictly chromosomes; for some bacteria it includes plasmids. Always use get_sequence_reports to separate chromosomes from plasmids by replicon name/role.
- list_genomes_by_taxon returns only limit rows; trust metadata.total_available for the count and refine with reference_only:true rather than fetching thousands.

Before answering, confirm you have:
- Used metadata.total_available when reporting "how many genomes exist"
- Taken QC metrics from a live get_genome_assembly call
- Used get_sequence_reports (not number_of_chromosomes) to count chromosomes vs plasmids