scientific-skills/jaspar-database/SKILL.md
Query JASPAR for transcription factor binding site (TFBS) profiles (PWMs/PFMs). Search by TF name, species, or class; scan DNA sequences for TF binding sites; compare matrices; essential for regulatory genomics, motif analysis, and GWAS regulatory variant interpretation.
JASPAR (https://jaspar.elixir.no/) is the gold-standard open-access database of curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFMs). JASPAR 2024 contains 1,210 non-redundant TF binding profiles for 164 eukaryotic species. Each profile is experimentally derived (ChIP-seq, SELEX, HT-SELEX, protein binding microarray, etc.) and rigorously validated.
Key resources:
jaspar (via Biopython) or direct API

Use JASPAR when:
Base URL: https://jaspar.elixir.no/api/v1/
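import requests

BASE_URL = "https://jaspar.elixir.no/api/v1"

def jaspar_get(endpoint, params=None):
    """GET a JASPAR REST endpoint and return the decoded JSON body."""
    url = f"{BASE_URL}/{endpoint}"
    # A timeout keeps a stalled connection from blocking the caller indefinitely.
    response = requests.get(url, params=params, headers={"Accept": "application/json"}, timeout=30)
    response.raise_for_status()
    return response.json()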
def search_jaspar(
    tf_name=None,
    species=None,
    collection="CORE",
    tf_class=None,
    tf_family=None,
    page=1,
    page_size=25
):
    """Search JASPAR for TF binding profiles."""
    params = {
        "collection": collection,
        "page": page,
        "page_size": page_size,
        "format": "json"
    }
    if tf_name:
        params["name"] = tf_name
    if species:
        params["species"] = species  # Use taxonomy ID or name, e.g., "9606" for human
    if tf_class:
        params["tf_class"] = tf_class
    if tf_family:
        params["tf_family"] = tf_family
    return jaspar_get("matrix", params)
# Examples:
# Search for human CTCF profile
ctcf = search_jaspar("CTCF", species="9606")
print(f"Found {ctcf['count']} CTCF profiles")
# Search for all homeobox TFs in human
hox_tfs = search_jaspar(tf_class="Homeodomain", species="9606")
# Search for a TF family
nfkb = search_jaspar(tf_family="NF-kappaB")
def get_matrix(matrix_id):
    """Fetch a specific JASPAR matrix by ID (e.g., 'MA0139.1' for CTCF)."""
    return jaspar_get(f"matrix/{matrix_id}/")

# Example: Get CTCF matrix
ctcf_matrix = get_matrix("MA0139.1")

# Matrix structure:
# {
#   "matrix_id": "MA0139.1",
#   "name": "CTCF",
#   "collection": "CORE",
#   "tax_group": "vertebrates",
#   "pfm": { "A": [...], "C": [...], "G": [...], "T": [...] },
#   "consensus": "CCGCGNGGNGGCAG",
#   "length": 19,
#   "species": [{"tax_id": 9606, "name": "Homo sapiens"}],
#   "class": ["C2H2 zinc finger factors"],
#   "family": ["BEN domain factors"],
#   "type": "ChIP-seq",
#   "uniprot_ids": ["P49711"]
# }
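The `pfm` field maps each nucleotide to a list of counts, one entry per motif position. The sketch below shows one way to work with that layout, using a made-up three-column PFM rather than real JASPAR data. `pfm_consensus` is a hypothetical helper that picks the most frequent base per column; it ignores IUPAC ambiguity codes, so its output will not necessarily match the `consensus` string a real record carries.

```python
def pfm_consensus(pfm):
    """Return the most frequent base at each position of a PFM dict."""
    bases = "ACGT"
    lengths = {len(pfm[b]) for b in bases}
    if len(lengths) != 1:
        raise ValueError("all four rows of a PFM must have the same length")
    motif_len = lengths.pop()
    # On ties, max() keeps the first base in ACGT order.
    return "".join(
        max(bases, key=lambda b: pfm[b][j]) for j in range(motif_len)
    )


# Toy counts (illustrative only, not a JASPAR profile).
toy_pfm = {
    "A": [8, 1, 0],
    "C": [1, 7, 1],
    "G": [0, 1, 9],
    "T": [1, 1, 0],
}
print(pfm_consensus(toy_pfm))  # ACG
```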
import numpy as np
def get_pwm(matrix_id, pseudocount=0.8):
    """
    Fetch a PFM from JASPAR and convert to PWM (log-odds).
    Returns numpy array of shape (4, L) in order A, C, G, T.
    """
    matrix = get_matrix(matrix_id)
    pfm = matrix["pfm"]
    # Convert PFM to numpy
    pfm_array = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float)
    # Add pseudocount
    pfm_array += pseudocount
    # Normalize to get PPM
    ppm = pfm_array / pfm_array.sum(axis=0, keepdims=True)
    # Convert to PWM (log-odds relative to background 0.25)
    background = 0.25
    pwm = np.log2(ppm / background)
    return pwm, matrix["name"]
# Example
pwm, name = get_pwm("MA0139.1") # CTCF
print(f"PWM for {name}: shape {pwm.shape}")
max_score = pwm.max(axis=0).sum()
print(f"Maximum possible score: {max_score:.2f} bits")
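The conversion above assumes a uniform background of 0.25 per base. The sketch below repeats the same PFM to PWM steps on a made-up matrix, with an optional non-uniform background (for example, for a GC-rich genome). The `pfm_to_pwm` name, the toy counts, and the background values are illustrative assumptions, not JASPAR defaults.

```python
import numpy as np

def pfm_to_pwm(pfm_array, pseudocount=0.8, background=(0.25, 0.25, 0.25, 0.25)):
    """Convert a (4, L) count matrix in ACGT order to log-odds scores."""
    counts = np.asarray(pfm_array, dtype=float) + pseudocount
    ppm = counts / counts.sum(axis=0, keepdims=True)       # column-wise probabilities
    bg = np.asarray(background, dtype=float).reshape(4, 1)  # one background value per base
    return np.log2(ppm / bg)


# Toy counts (illustrative only): rows are A, C, G, T.
toy_pfm = np.array([
    [10, 2, 0],
    [0, 2, 0],
    [0, 3, 10],
    [0, 3, 0],
])
uniform = pfm_to_pwm(toy_pfm)
gc_rich = pfm_to_pwm(toy_pfm, background=(0.2, 0.3, 0.3, 0.2))
print(uniform.round(2))
print(gc_rich.round(2))
```

With the GC-rich background, a G at the last position earns a lower score than under the uniform background, because G is less surprising there.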
import numpy as np
from typing import List, Tuple
NUCLEOTIDE_MAP = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
                  'a': 0, 'c': 1, 'g': 2, 't': 3}

def scan_sequence(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8) -> List[dict]:
    """
    Scan a DNA sequence for TF binding sites using a PWM.

    Args:
        sequence: DNA sequence string
        pwm: PWM array (4 x L) in ACGT order
        threshold_pct: Fraction of max score to use as threshold (0-1)

    Returns:
        List of hits with position, score, and matched sequence
    """
    motif_len = pwm.shape[1]
    max_score = pwm.max(axis=0).sum()
    min_score = pwm.min(axis=0).sum()
    threshold = min_score + threshold_pct * (max_score - min_score)
    hits = []
    seq = sequence.upper()
    for i in range(len(seq) - motif_len + 1):
        subseq = seq[i:i + motif_len]
        # Skip if contains non-ACGT
        if any(c not in NUCLEOTIDE_MAP for c in subseq):
            continue
        score = sum(pwm[NUCLEOTIDE_MAP[c], j] for j, c in enumerate(subseq))
        if score >= threshold:
            relative_score = (score - min_score) / (max_score - min_score)
            hits.append({
                "position": i + 1,  # 1-based
                "score": score,
                "relative_score": relative_score,
                "sequence": subseq,
                "strand": "+"
            })
    return hits
# Example: Scan a promoter sequence for CTCF binding sites
promoter = "AGCCCGCGAGGNGGCAGTTGCCTGGAGCAGGATCAGCAGATC"
pwm, name = get_pwm("MA0139.1")
hits = scan_sequence(promoter, pwm, threshold_pct=0.75)
for hit in hits:
    print(f" Position {hit['position']}: {hit['sequence']} (score: {hit['score']:.2f}, {hit['relative_score']:.0%})")
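The scanner above scores one window at a time in a Python loop. For long sequences, the same per-window sums can be computed in bulk. The following is a sketch under the assumption that NumPy 1.20 or later is available (for `sliding_window_view`); `window_scores`, `BASE_INDEX`, and the toy PWM are hypothetical and not part of the original skill.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_scores(sequence, pwm):
    """Score every window of motif length; windows with non-ACGT bases get NaN."""
    motif_len = pwm.shape[1]
    codes = np.array([BASE_INDEX.get(c, -1) for c in sequence.upper()], dtype=int)
    if codes.size < motif_len:
        return np.empty(0)
    windows = sliding_window_view(codes, motif_len)    # shape (n_windows, motif_len)
    valid = (windows >= 0).all(axis=1)
    safe = np.where(windows >= 0, windows, 0)           # placeholder row for invalid bases
    # Pick pwm[base, position] for every window position, then sum per window.
    scores = pwm[safe, np.arange(motif_len)].sum(axis=1)
    scores[~valid] = np.nan
    return scores


# Toy 3-column PWM favouring "ACG" (illustrative values only).
toy_pwm = np.array([
    [1.5, -2.0, -2.0],
    [-2.0, 1.5, -2.0],
    [-2.0, -2.0, 1.5],
    [-2.0, -2.0, -2.0],
])
scores = window_scores("TACGNACG", toy_pwm)
print(scores)
```

Index `i` of the returned array corresponds to 1-based position `i + 1` in the hit dictionaries above; applying the threshold is then a single comparison such as `np.flatnonzero(scores >= threshold)`.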
def reverse_complement(seq: str) -> str:
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    return ''.join(complement.get(b, 'N') for b in reversed(seq.upper()))

def scan_both_strands(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8):
    """Scan forward and reverse complement strands."""
    fwd_hits = scan_sequence(sequence, pwm, threshold_pct)
    for h in fwd_hits:
        h["strand"] = "+"
    rev_seq = reverse_complement(sequence)
    rev_hits = scan_sequence(rev_seq, pwm, threshold_pct)
    seq_len = len(sequence)
    for h in rev_hits:
        h["strand"] = "-"
        h["position"] = seq_len - h["position"] - len(h["sequence"]) + 2  # Convert to fwd coords
    all_hits = fwd_hits + rev_hits
    return sorted(all_hits, key=lambda x: x["position"])
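The reverse-strand coordinate conversion is easy to get wrong by one. The short check below confirms that a 1-based start `p` on the reverse complement maps to forward start `seq_len - p - motif_len + 2`. The `rev_to_fwd_start` name and the test sequence are made up for illustration.

```python
def reverse_complement(seq):
    complement = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(complement.get(b, "N") for b in reversed(seq.upper()))

def rev_to_fwd_start(rev_pos, motif_len, seq_len):
    """Map a 1-based start on the reverse complement to the 1-based forward start."""
    return seq_len - rev_pos - motif_len + 2


seq = "AACCGGTTAC"
motif_len = 4
rev_seq = reverse_complement(seq)

# Every reverse-strand window must be the reverse complement of the
# forward window that the converted coordinate points to.
for rev_pos in range(1, len(seq) - motif_len + 2):
    fwd_pos = rev_to_fwd_start(rev_pos, motif_len, len(seq))
    fwd_window = seq[fwd_pos - 1:fwd_pos - 1 + motif_len]
    rev_window = rev_seq[rev_pos - 1:rev_pos - 1 + motif_len]
    assert reverse_complement(fwd_window) == rev_window

print(rev_to_fwd_start(1, motif_len, len(seq)))  # 7
```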
def variant_tfbs_impact(ref_seq: str, alt_seq: str, pwm: np.ndarray,
                        tf_name: str, threshold_pct: float = 0.7):
    """
    Assess impact of a SNP on TF binding by comparing ref vs alt sequences.
    Both sequences should be centered on the variant with flanking context.
    """
    ref_hits = scan_both_strands(ref_seq, pwm, threshold_pct)
    alt_hits = scan_both_strands(alt_seq, pwm, threshold_pct)
    max_ref = max((h["score"] for h in ref_hits), default=None)
    max_alt = max((h["score"] for h in alt_hits), default=None)
    result = {
        "tf": tf_name,
        "ref_max_score": max_ref,
        "alt_max_score": max_alt,
        "ref_has_site": len(ref_hits) > 0,
        "alt_has_site": len(alt_hits) > 0,
    }
    # Compare against None explicitly: a score of exactly 0.0 is a valid hit but is falsy.
    if max_ref is not None and max_alt is not None:
        result["score_change"] = max_alt - max_ref
        if max_alt > max_ref:
            result["effect"] = "gained"
        elif max_alt < max_ref:
            result["effect"] = "disrupted"
        else:
            result["effect"] = "unchanged"  # Equal scores: the variant does not alter the best site
    elif max_ref is not None:
        result["effect"] = "disrupted"
    elif max_alt is not None:
        result["effect"] = "gained"
    else:
        result["effect"] = "no_site"
    return result
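Because the comparison above uses thresholded hits from the whole sequence, a site that does not overlap the variant can dominate both maxima. One possible refinement, sketched here on the forward strand only, scores just the windows that cover the variant position. `best_overlapping_score`, the toy PWM, and the toy sequences are hypothetical; the PWM is indexed as `pwm[row][col]`, which works for nested lists and NumPy arrays alike.

```python
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_overlapping_score(seq, var_index, pwm, motif_len):
    """Best forward-strand score among windows covering 0-based var_index."""
    best = None
    first = max(0, var_index - motif_len + 1)
    last = min(var_index, len(seq) - motif_len)
    for start in range(first, last + 1):
        window = seq[start:start + motif_len].upper()
        if any(c not in BASE_INDEX for c in window):
            continue  # skip windows with ambiguous bases
        score = sum(pwm[BASE_INDEX[c]][j] for j, c in enumerate(window))
        if best is None or score > best:
            best = score
    return best


# Toy 3-column PWM favouring "ACG" (illustrative values only).
toy_pwm = [
    [1.5, -2.0, -2.0],
    [-2.0, 1.5, -2.0],
    [-2.0, -2.0, 1.5],
    [-2.0, -2.0, -2.0],
]
ref_seq = "TTACGTT"
alt_seq = "TTATGTT"  # C>T at 0-based index 3
ref_best = best_overlapping_score(ref_seq, 3, toy_pwm, 3)
alt_best = best_overlapping_score(alt_seq, 3, toy_pwm, 3)
print(ref_best, alt_best, alt_best - ref_best)  # 4.5 1.0 -3.5
```

A full version would repeat the calculation on the reverse complement, as `scan_both_strands` does.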
import requests
import numpy as np

# 1. Get relevant TF matrices (e.g., all human TFs in CORE collection)
response = requests.get(
    "https://jaspar.elixir.no/api/v1/matrix/",
    params={"species": "9606", "collection": "CORE", "page_size": 500, "page": 1}
)
matrices = response.json()["results"]

# 2. For each matrix, compute PWM and scan promoter
promoter = "CCCGCCCGCCCGCCGCCCGCAGTTAATGAGCCCAGCGTGCC"  # Example
all_hits = []
for m in matrices[:10]:  # Limit for demo
    # Fetch the full record; the list endpoint result is only used for its ID and name
    pfm_data = requests.get(f"https://jaspar.elixir.no/api/v1/matrix/{m['matrix_id']}/").json()
    pfm = pfm_data["pfm"]
    pfm_arr = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float) + 0.8
    ppm = pfm_arr / pfm_arr.sum(axis=0)
    pwm = np.log2(ppm / 0.25)
    hits = scan_sequence(promoter, pwm, threshold_pct=0.8)  # scan_sequence as defined above
    for h in hits:
        h["tf_name"] = m["name"]
        h["matrix_id"] = m["matrix_id"]
    all_hits.extend(hits)

print(f"Found {len(all_hits)} TF binding sites")
for h in sorted(all_hits, key=lambda x: -x["score"])[:5]:
    print(f" {h['tf_name']} ({h['matrix_id']}): pos {h['position']}, score {h['score']:.2f}")
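When many matrices are scanned, the flat hit list can contain several hits for the same TF. The sketch below shows one way to reduce it to the strongest hit per TF, using the `tf_name` and `score` keys set in the workflow above. `best_hit_per_tf` is a hypothetical helper, and the hit records, including their IDs and scores, are made up for the demonstration.

```python
def best_hit_per_tf(hits):
    """Keep the highest-scoring hit for each TF, sorted by descending score."""
    best = {}
    for h in hits:
        current = best.get(h["tf_name"])
        if current is None or h["score"] > current["score"]:
            best[h["tf_name"]] = h
    return sorted(best.values(), key=lambda h: -h["score"])


# Made-up hits (placeholder names, IDs, and scores; not real scan output).
toy_hits = [
    {"tf_name": "TF_A", "matrix_id": "ID_A", "position": 3, "score": 7.1},
    {"tf_name": "TF_B", "matrix_id": "ID_B", "position": 12, "score": 9.4},
    {"tf_name": "TF_A", "matrix_id": "ID_A", "position": 20, "score": 8.0},
]
for h in best_hit_per_tf(toy_hits):
    print(h["tf_name"], h["position"], h["score"])
```

Raw log-odds scores are not directly comparable between motifs of different lengths, so ranking across TFs by `relative_score` may be the more meaningful choice.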
| Collection | Description | Profiles |
|------------|-------------|----------|
| CORE | Non-redundant, high-quality profiles | ~1,210 |
| UNVALIDATED | Experimentally derived but not validated | ~500 |
| PHYLOFACTS | Phylogenetically conserved sites | ~50 |
| CNE | Conserved non-coding elements | ~30 |
| POLII | RNA Pol II binding profiles | ~20 |
| FAM | TF family representative profiles | ~170 |
| SPLICE | Splice factor profiles | ~20 |
from Bio import motifs