skills/tooluniverse-sequence-retrieval/SKILL.md
Retrieve DNA/RNA/protein sequences from NCBI and ENA with disambiguation. Quality hierarchy: RefSeq (NM_/NP_) > RefSeq predicted (XM_/XP_) > GenBank submissions. Use for fetching specific sequences by accession, gene-symbol-to-sequence lookup, transcript-isoform retrieval, and curated-vs-raw-submission preference.
Install this skill globally with one command: `npx skillsauth add mims-harvard/tooluniverse tooluniverse-sequence-retrieval`. Works with Claude Code, Cursor, and Windsurf.
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT: Always use English terms in tool calls. Only try original-language terms as fallback. Respond in the user's language.
LOOK UP DON'T GUESS: Never assume accession numbers or sequence versions. Always retrieve and verify from NCBI or ENA.
Sequence quality hierarchy: RefSeq (NM_/NP_ = curated) > RefSeq predicted (XM_/XP_) > GenBank (submitted). Prefer the MANE Select transcript for human canonical isoforms. Check version numbers -- annotations improve across versions.
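The hierarchy above can be sketched as a small ranking helper. This is a minimal illustration and not part of the skill's tools: `quality_rank` and `pick_preferred` are hypothetical names, the inputs are placeholder accessions, and treating every non-RefSeq accession as a GenBank submission is a simplifying assumption. MANE Select status cannot be inferred from a prefix and still has to be looked up.

```python
# Hypothetical helper: rank accessions by the quality hierarchy described above.
# Lower rank means more preferred.
CURATED_REFSEQ = ("NC_", "NM_", "NR_", "NP_")    # curated RefSeq records
PREDICTED_REFSEQ = ("XM_", "XR_", "XP_")         # computationally predicted RefSeq


def quality_rank(accession: str) -> int:
    """Return 0 for curated RefSeq, 1 for predicted RefSeq, 2 otherwise."""
    acc = accession.strip().upper()
    if acc.startswith(CURATED_REFSEQ):
        return 0
    if acc.startswith(PREDICTED_REFSEQ):
        return 1
    return 2  # assumption: everything else is treated as a GenBank submission


def pick_preferred(accessions):
    """Sort accessions from most to least preferred (stable within a tier)."""
    return sorted(accessions, key=quality_rank)


# Placeholder accessions, for illustration only.
ranked = pick_preferred(["U00000.1", "XM_000000.1", "NM_000000.1"])
print(ranked)
```

The sort is stable, so candidates within the same tier keep the order in which the search returned them.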
Phase 0: Clarify (if needed) → Phase 1: Disambiguate Gene/Organism → Phase 2: Search & Retrieve → Phase 3: Report
Ask ONLY if: gene exists in multiple organisms, sequence type unclear, or strain matters. Skip for: specific accessions, clear organism+gene combos, complete genome requests with organism.
| Prefix | Type | Use With |
|--------|------|----------|
| NC_/NM_/NR_/NP_/XM_/NZ_ | RefSeq | NCBI only |
| U*/M*/K*/X*/CP* | GenBank | NCBI or ENA |
| EMBL format | EMBL | ENA preferred |
CRITICAL: Never try ENA tools with RefSeq accessions -- they return 404.
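One way to enforce this rule before any tool call is to route on the accession's shape. The sketch below assumes the RefSeq convention of a two-letter prefix followed by an underscore, which GenBank/ENA accessions do not use. `is_refseq` and `allowed_sources` are hypothetical helper names, not ToolUniverse functions.

```python
import re

# Assumption: RefSeq accessions look like "NM_000546.6" (two letters, underscore,
# identifier, optional ".version"); GenBank/ENA accessions contain no underscore.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def is_refseq(accession: str) -> bool:
    """True if the accession has the RefSeq prefix shape."""
    return bool(REFSEQ_PATTERN.match(accession.strip().upper()))


def allowed_sources(accession: str) -> list:
    """RefSeq accessions go to NCBI only; everything else may also use ENA."""
    return ["NCBI"] if is_refseq(accession) else ["NCBI", "ENA"]


print(allowed_sources("NC_000913"))  # RefSeq: NCBI only
print(allowed_sources("U00096"))     # GenBank: NCBI or ENA
```

Checking the shape up front avoids spending a request on an ENA lookup that is expected to fail.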
Retrieve silently. Do NOT narrate the search process.
```python
# Assumes `tu` is an initialized ToolUniverse client and that organism, gene,
# strain, keywords, and seq_type hold the disambiguated query from Phase 1.

# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search", organism=organism, gene=gene,
    strain=strain, keywords=keywords, seq_type=seq_type, limit=10
)

# Get accessions from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession", uids=result["data"]["uids"]
)

# Retrieve sequence (FASTA or GenBank format).
# `accession` is the single accession selected from `accessions`.
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence", accession=accession, format="fasta"
)

# ENA alternative (non-RefSeq accessions only)
entry = tu.tools.ena_get_entry(accession=accession)
fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
```
| Primary | Fallback | Notes |
|---------|----------|-------|
| NCBI_get_sequence | ENA (if GenBank format) | NCBI unavailable |
| ena_get_entry | NCBI_get_sequence | ENA doesn't have RefSeq |
| NCBI_search_nucleotide | Try broader keywords | No results |
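The first row of the fallback table can be sketched as a wrapper around two fetch callables. Everything here is an assumption made for illustration: `fetch_with_fallback` is a hypothetical name, the fetchers are injected (in practice they might wrap `NCBI_get_sequence` and `ena_get_sequence_fasta`), and failures are assumed to surface as exceptions, which the source does not confirm.

```python
def fetch_with_fallback(accession, ncbi_fetch, ena_fetch):
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions.

    ncbi_fetch and ena_fetch are caller-supplied callables taking an accession.
    Returns a (source, payload) tuple.
    """
    # Assumption: a two-letter prefix followed by "_" marks a RefSeq accession.
    refseq = len(accession) > 2 and accession[2] == "_"
    try:
        return "NCBI", ncbi_fetch(accession)
    except Exception as err:
        if refseq:
            # ENA does not hold RefSeq records, so there is nothing to fall back to.
            raise RuntimeError(f"NCBI failed and {accession} is RefSeq-only") from err
        return "ENA", ena_fetch(accession)


# Stub fetchers standing in for the real tool calls.
def working_fetch(acc):
    return ">" + acc


def failing_fetch(acc):
    raise ConnectionError("service unavailable")


print(fetch_with_fallback("U00096", failing_fetch, working_fetch))
```

Returning the source alongside the payload lets the final report state which database the sequence came from.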
Present as a Sequence Profile Report. Hide search process. Include:
| Tier | Prefix | Description |
|------|--------|-------------|
| RefSeq Reference (best) | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | Various | Submitted, some curation |
| GenBank Direct | Various | Direct submission |
| Third Party | TPA_ | Third-party annotation |
Sequence quality: Prefer RefSeq over GenBank. Check version numbers. Sequences with "PREDICTED" in definition are not experimentally validated.
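Both checks can be read off a FASTA header line. The sketch below assumes an NCBI-style header of the form `>ACCESSION.VERSION definition text`. `describe_header` is a hypothetical helper and the header shown is a made-up placeholder, not a real record.

```python
def describe_header(fasta_header: str) -> dict:
    """Split a FASTA header into accession, version, and a 'predicted' flag."""
    line = fasta_header.lstrip(">").strip()
    accession_version, _, definition = line.partition(" ")
    accession, _, version = accession_version.partition(".")
    return {
        "accession": accession,
        # None when the header carries no numeric version suffix.
        "version": int(version) if version.isdigit() else None,
        # "PREDICTED" in the definition means no experimental validation.
        "predicted": "PREDICTED" in definition.upper(),
    }


# Placeholder header, for illustration only.
info = describe_header(">XM_000000.2 PREDICTED: Example species example gene, mRNA")
print(info)
```

Reporting the parsed version alongside the accession makes it obvious when a newer version of the same record should be checked.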
Accession guidance: RefSeq = NCBI-only. GenBank = mirrored in ENA/EMBL. Default to RefSeq mRNA (NM_) for human/model organisms; most complete genome assembly for microbial queries.
Cross-database reconciliation: Same sequence may have different accessions (e.g., GenBank U00096 = RefSeq NC_000913 for E. coli K-12). Always report both when available. Discrepancies between GenBank/RefSeq typically indicate RefSeq curation corrected submission errors.
| Error | Response |
|-------|----------|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Likely RefSeq -- use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead |
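The error table can be expressed as a simple lookup. This is a rough sketch under the assumption that errors arrive as message strings containing the quoted phrases; the actual error format returned by the tools is not specified here, and `suggest_response` is a hypothetical name.

```python
# Each entry: (substrings that must all appear in the message, suggested response).
ERROR_RESPONSES = [
    (("no search criteria provided",), "Add organism, gene, or keywords"),
    (("ena", "404"), "Likely RefSeq -- use NCBI only"),
    (("no results found",), "Broaden search, check spelling, try synonyms"),
    (("sequence too large",), "Note size, provide download link instead"),
]


def suggest_response(error_message: str):
    """Return the suggested response for a known error, or None if unrecognized."""
    text = error_message.lower()
    for needles, response in ERROR_RESPONSES:
        if all(needle in text for needle in needles):
            return response
    return None


print(suggest_response("ENA 404 error"))
```

Substring matching is deliberately crude; an unrecognized message returns `None` so the caller can surface it instead of guessing.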
NCBI Tools: NCBI_search_nucleotide (search), NCBI_fetch_accessions (UID→accession), NCBI_get_sequence (retrieve)
ENA Tools (GenBank/EMBL only): ena_get_entry (metadata), ena_get_sequence_fasta (FASTA), ena_get_entry_summary (summary)
NCBI_search_nucleotide: operation="search", organism (scientific name), gene (symbol), strain, keywords, seq_type (complete_genome/mrna/refseq), limit
NCBI_get_sequence: operation="fetch_sequence", accession, format (fasta/genbank)
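A small builder can validate search arguments against the parameter lists above before calling the tool. This is a sketch under assumptions: `build_search_params` is a hypothetical helper, and it assumes optional parameters may simply be omitted, which the source does not state.

```python
VALID_SEQ_TYPES = {"complete_genome", "mrna", "refseq"}


def build_search_params(organism=None, gene=None, strain=None,
                        keywords=None, seq_type=None, limit=10):
    """Assemble keyword arguments for NCBI_search_nucleotide, dropping unset ones."""
    if not any([organism, gene, keywords]):
        # Mirrors the "No search criteria provided" error case.
        raise ValueError("No search criteria provided: add organism, gene, or keywords")
    if seq_type is not None and seq_type not in VALID_SEQ_TYPES:
        raise ValueError(f"seq_type must be one of {sorted(VALID_SEQ_TYPES)}")
    params = {
        "operation": "search",
        "organism": organism,   # scientific name
        "gene": gene,           # gene symbol
        "strain": strain,
        "keywords": keywords,
        "seq_type": seq_type,
        "limit": limit,
    }
    # Assumption: parameters left as None can be omitted from the call.
    return {key: value for key, value in params.items() if value is not None}


params = build_search_params(organism="Escherichia coli", seq_type="complete_genome")
print(params)
```

The resulting dictionary could then be passed as `tu.tools.NCBI_search_nucleotide(**params)`.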