skills/cell-biology/pathml/SKILL.md
Computational pathology toolkit for whole-slide images (WSIs): load slides, extract tiles, normalize stains, segment nuclei, extract features, and train ML models. Supports H&E and multiplex imaging. Use it for end-to-end pipelines from raw WSIs to quantitative outputs.
PathML is a Python toolkit designed for computational pathology workflows on whole-slide images (WSIs). It provides a unified pipeline from raw slide files (SVS, NDPI, MRXS, TIFF) through tile extraction, preprocessing (stain normalization, nuclear segmentation, tissue detection), feature extraction, and machine learning. PathML integrates with popular Python ML and image processing libraries while abstracting the complexity of WSI handling through its SlideData and Pipeline abstractions.
scikit-image or cellpose directly without PathML overhead.
Required packages: pathml, torch, torchvision, numpy, scikit-image, openslide-python
# Install system dependency first
conda install -c conda-forge openslide
# Install PathML
pip install pathml
# For GPU support
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
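Before loading a slide, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the import names assume the usual mapping (`skimage` for scikit-image, `openslide` for openslide-python).

```python
import importlib.util

# Import names for the required packages. The pip name and the import name
# differ for scikit-image (skimage) and openslide-python (openslide).
REQUIRED_MODULES = ["pathml", "torch", "torchvision", "numpy", "skimage", "openslide"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable")
```

`find_spec` only checks that a package is installed. It does not import it, so a missing OpenSlide C library would still surface later, on the first real import.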
from pathml.core import SlideData
from pathml.preprocessing import Pipeline
from pathml.preprocessing.transforms import BoxBlur, TissueDetectionHE
# Load → build pipeline → tile → preprocess
slide = SlideData("tumor.svs", name="demo")
pipeline = Pipeline([BoxBlur(kernel_size=3), TissueDetectionHE(mask_name="tissue")])
slide.run(pipeline, tile_size=256, tile_stride=256)
# Inspect tiles
from pathml.core import Tile
tiles = [t for t in slide.tiles if t.masks["tissue"].any()]
print(f"Tissue tiles: {len(tiles)} of {len(slide.tiles)}")
from pathml.core import SlideData
# Load an H&E whole-slide image
slide = SlideData("path/to/slide.svs", name="tumor_slide_001")
print(f"Slide name: {slide.name}")
print(f"Slide shape: {slide.shape}")
# Vendor metadata lives on the underlying OpenSlide object (OpenSlide backend only)
print(f"Slide properties: {dict(slide.slide.slide.properties)}")
from pathml.preprocessing import Pipeline
from pathml.preprocessing.transforms import (
    BoxBlur,
    TissueDetectionHE,
    StainNormalizationHE,
)

# Build a preprocessing pipeline for H&E slides
pipeline = Pipeline([
    BoxBlur(kernel_size=5),                    # smooth image
    TissueDetectionHE(mask_name="tissue"),     # detect tissue regions
    StainNormalizationHE(target="normalize"),  # normalize H&E staining
])
print(f"Pipeline steps: {len(pipeline.transforms)}")
# Tile the slide into non-overlapping 256x256 patches at pyramid level 0
# (full resolution) and apply the preprocessing pipeline to every tile.
# slide.run() generates the tiles itself; generate_tiles() is a lazy
# generator and does not populate slide.tiles on its own.
slide.run(
    pipeline,
    distributed=False,  # process serially instead of through Dask
    tile_size=256,
    tile_stride=256,
    level=0,            # pyramid level 0 = highest resolution
    tile_pad=False,
)
print(f"Total tiles generated: {len(slide.tiles)}")
print("Pipeline complete: tiles preprocessed")
# Inspect a single tile
tile = slide.tiles[0]
print(f"Tile shape: {tile.image.shape}") # (256, 256, 3)
print(f"Tile masks: {list(tile.masks.keys())}")
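from skimage.measure import label
from pathml.preprocessing.transforms import NucleusDetectionHE

# Detect nuclei from the hematoxylin channel (produces a binary mask per tile)
seg_pipeline = Pipeline([
    TissueDetectionHE(mask_name="tissue"),
    NucleusDetectionHE(mask_name="nuclei"),
])
# Re-running on a slide that already has tiles requires overwriting them
slide.run(seg_pipeline, distributed=False, overwrite_existing_tiles=True)

# Count nuclei per tile
for tile in list(slide.tiles)[:5]:
    # The mask is binary, so label connected components to count nuclei
    n_nuclei = int(label(tile.masks["nuclei"] > 0).max())
    print(f"Tile {tile.coords}: {n_nuclei} nuclei detected")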
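If the nucleus mask is binary (as with threshold-based detection), its maximum is 1 no matter how many nuclei it contains, so counting needs connected-component labeling. In practice `skimage.measure.label` does this. The sketch below is a dependency-free illustration of the idea using flood fill with 4-connectivity on a nested list; `count_components` is a hypothetical helper, not part of PathML.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected foreground regions in a 2D binary mask (list of rows)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New region found: flood-fill it so it is counted only once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                i, j = queue.popleft()
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= ni < rows and 0 <= nj < cols and mask[ni][nj] and not seen[ni][nj]:
                        seen[ni][nj] = True
                        queue.append((ni, nj))
    return count


# Toy 5x5 mask with three separate blobs (invented data).
toy_mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
print(count_components(toy_mask))  # 3 regions, although the mask maximum is 1
```

Touching nuclei merge into a single component under this scheme, so the count is a lower bound in crowded regions.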
import pandas as pd
from skimage.measure import label

features = []
for tile in slide.tiles:
    if "tissue" in tile.masks and tile.masks["tissue"].any():
        img = tile.image
        # A binary nucleus mask has max() == 1, so count connected components instead
        n_nuclei = int(label(tile.masks["nuclei"] > 0).max()) if "nuclei" in tile.masks else 0
        feat = {
            "mean_r": img[:, :, 0].mean(),
            "mean_g": img[:, :, 1].mean(),
            "mean_b": img[:, :, 2].mean(),
            "std_r": img[:, :, 0].std(),
            "n_nuclei": n_nuclei,
            # PathML tile coords are (i, j) = (row, column), so x is coords[1]
            "tile_x": tile.coords[1],
            "tile_y": tile.coords[0],
        }
        features.append(feat)

df = pd.DataFrame(features)
df.to_csv("slide_features.csv", index=False)
print(f"Extracted features from {len(df)} tissue tiles -> slide_features.csv")
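Per-tile features are often summarized into one row per slide before modeling. The following is a minimal sketch, assuming the `mean_r` and `n_nuclei` columns written to `slide_features.csv`. The helper `summarize_tiles` is hypothetical, and the in-memory rows are invented values for illustration, not real results.

```python
import csv
import io
import statistics


def summarize_tiles(rows):
    """Aggregate per-tile feature dicts into slide-level summary statistics."""
    nuclei = [int(r["n_nuclei"]) for r in rows]
    red = [float(r["mean_r"]) for r in rows]
    return {
        "n_tiles": len(rows),
        "total_nuclei": sum(nuclei),
        "median_nuclei_per_tile": statistics.median(nuclei),
        "mean_red_intensity": statistics.fmean(red),
    }


# Illustrative stand-in for slide_features.csv (values are invented).
example_csv = io.StringIO(
    "mean_r,n_nuclei,tile_x,tile_y\n"
    "200.0,12,0,0\n"
    "180.0,30,256,0\n"
    "190.0,18,0,256\n"
)
summary = summarize_tiles(list(csv.DictReader(example_csv)))
print(summary)
```

For a real run, replace the `StringIO` object with `open("slide_features.csv", newline="")`. The function raises on an empty list of rows, which is a useful signal that no tissue tiles were found.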
# Save slide data (tiles + masks) to HDF5
slide.write("processed_slide.h5")
print("Slide saved to processed_slide.h5")
# Reload for downstream use by passing the saved file back to SlideData
from pathml.core import SlideData
slide_loaded = SlideData("processed_slide.h5", backend="h5path")
print(f"Reloaded: {len(slide_loaded.tiles)} tiles")
| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| tile_size in slide.run (shape in generate_tiles) | 256 | (64,64) – (1024,1024) | Tile dimensions in pixels |
| tile_stride in slide.run (stride in generate_tiles) | equals tile size | any positive value | Step between adjacent tiles; a stride smaller than the tile size gives overlapping tiles |
| level | 0 | 0 – max pyramid level | Pyramid resolution level (0 = full resolution) |
| kernel_size | 5 | odd integers 3–21 | Smoothing kernel size in BoxBlur |
| mask_name | required | any string | Name of output mask stored in tile.masks |
| distributed | True | True, False | Use Dask distributed processing; pass False to run serially |
| tile_pad in slide.run (pad in generate_tiles) | False | True, False | Pad edge tiles to the full tile size |
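The interaction between tile size, stride, and padding determines how many tiles a slide produces, which in turn drives memory and runtime. The following is a minimal sketch of that arithmetic. `tiles_along_axis` and `tile_grid` are hypothetical helpers, and PathML's own edge handling may differ in detail.

```python
import math


def tiles_along_axis(length, tile, stride, pad=False):
    """Number of tile positions along one axis of the given pixel length."""
    if length <= 0 or tile <= 0 or stride <= 0:
        raise ValueError("length, tile, and stride must be positive")
    if pad:
        # With padding, every stride step that starts inside the image yields a tile.
        return math.ceil(length / stride)
    if length < tile:
        return 0
    # Without padding, only tiles that fit entirely inside the image are kept.
    return (length - tile) // stride + 1


def tile_grid(shape, tile=256, stride=None, pad=False):
    """Return (rows, cols) of the tile grid for an image of shape (height, width)."""
    stride = tile if stride is None else stride
    return tuple(tiles_along_axis(n, tile, stride, pad) for n in shape)


print(tile_grid((1000, 1500)))              # (3, 5): non-overlapping, no padding
print(tile_grid((1000, 1500), pad=True))    # (4, 6): edge tiles padded
print(tile_grid((1000, 1500), stride=128))  # (6, 10): 50% overlap
```

Halving the stride roughly quadruples the tile count, so overlapping tiles are best reserved for cases that need them, such as smoothing predictions across tile borders.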
When to use: Exclude background tiles to reduce memory and computation in downstream steps.
# Filter tiles to only tissue regions after running tissue detection pipeline
tissue_tiles = [t for t in slide.tiles if "tissue" in t.masks and t.masks["tissue"].mean() > 0.5]
print(f"Tissue tiles: {len(tissue_tiles)} / {len(slide.tiles)} total")
When to use: Create a labeled tile dataset for training a custom classifier in PyTorch.
from PIL import Image
import numpy as np
from pathlib import Path

output_dir = Path("tiles_png")
output_dir.mkdir(exist_ok=True)
n_saved = 0  # count only the tiles that pass the tissue filter
for i, tile in enumerate(slide.tiles):
    if "tissue" in tile.masks and tile.masks["tissue"].mean() > 0.5:
        img = Image.fromarray(tile.image.astype(np.uint8))
        # PathML tile coords are (i, j) = (row, column), so x is coords[1]
        img.save(output_dir / f"tile_{i:05d}_x{tile.coords[1]}_y{tile.coords[0]}.png")
        n_saved += 1
print(f"Saved {n_saved} tiles to {output_dir}/")
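torchvision's `ImageFolder` expects one subdirectory per class (`root/<class_name>/<image>`), so flat exported tiles need to be sorted by label first. The following is a minimal sketch, assuming a hypothetical `labels` dict that maps tile filenames to class names. Where those labels come from (region annotations, slide-level diagnosis) is outside this sketch.

```python
import shutil
from pathlib import Path


def arrange_for_imagefolder(tile_dir, out_dir, labels):
    """Copy PNG tiles into out_dir/<class_name>/ based on a filename -> label map.

    Tiles without an entry in `labels` are skipped. Returns the number copied.
    """
    tile_dir, out_dir = Path(tile_dir), Path(out_dir)
    copied = 0
    for png in sorted(tile_dir.glob("*.png")):
        label = labels.get(png.name)
        if label is None:
            continue
        class_dir = out_dir / label
        class_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(png, class_dir / png.name)
        copied += 1
    return copied
```

After a call such as `arrange_for_imagefolder("tiles_png", "dataset", labels)`, the `dataset` directory can be passed to `torchvision.datasets.ImageFolder`. Copying rather than moving keeps the original export intact while the labeling scheme is still changing.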
When to use: Running the same preprocessing pipeline on a directory of WSI files.
from pathlib import Path
from pathml.core import SlideData
from pathml.preprocessing import Pipeline
from pathml.preprocessing.transforms import TissueDetectionHE, StainNormalizationHE

pipeline = Pipeline([
    TissueDetectionHE(mask_name="tissue"),
    StainNormalizationHE(target="normalize"),
])
wsi_dir = Path("slides/")
out_dir = Path("processed")
out_dir.mkdir(exist_ok=True)  # make sure the output directory exists
for wsi_path in sorted(wsi_dir.glob("*.svs")):
    slide = SlideData(str(wsi_path), name=wsi_path.stem)
    # slide.run() tiles the slide and applies the pipeline in one step
    slide.run(pipeline, distributed=False, tile_size=256, tile_stride=256, level=0)
    slide.write(str(out_dir / f"{wsi_path.stem}.h5"))
    print(f"Processed {wsi_path.name}: {len(slide.tiles)} tiles")
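In a long batch run, one corrupt or unsupported slide should not abort the whole job. The following is a minimal sketch of a wrapper that records failures and keeps going. `run_batch` and `process_one` are hypothetical names, with `process_one` standing in for the load, run, and write steps of the batch loop.

```python
from pathlib import Path


def run_batch(paths, process_one):
    """Apply process_one to each path; collect successes and failures separately."""
    done, failed = [], {}
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # keep going; report the failure afterwards
            failed[str(path)] = f"{type(exc).__name__}: {exc}"
        else:
            done.append(str(path))
    return done, failed


def fake_process(path):
    # Stand-in for the real slide processing; fails on one file on purpose.
    if Path(path).suffix != ".svs":
        raise ValueError("unsupported format")


done, failed = run_batch(["a.svs", "b.txt", "c.svs"], fake_process)
print(f"{len(done)} processed, {len(failed)} failed")
for path, reason in failed.items():
    print(f"  {path}: {reason}")
```

Returning the failures instead of only logging them makes it easy to retry just those slides or to write them to a report at the end of the run.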
- slide.tiles: iterable of Tile objects, each with .image (numpy array) and .masks (dict of numpy arrays)
- slide_features.csv: tabular per-tile features (color statistics, nucleus counts, coordinates)
- processed_slide.h5: HDF5 file with tiles, masks, and metadata for downstream use
- ImageFolder dataset loading

| Problem | Cause | Solution |
|---------|-------|----------|
| openslide.lowlevel.OpenSlideUnsupportedFormatError | OpenSlide C library not installed or WSI format unsupported | conda install -c conda-forge openslide; check format compatibility |
| CUDA out of memory during segmentation | Tile size too large for GPU | Reduce tile shape to (128, 128) or run with distributed=False on CPU |
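| slide.tiles is empty after generate_tiles | generate_tiles is a lazy generator and does not store tiles; slide.tiles is populated by slide.run() | Call slide.run() with the pipeline and the desired tile_size and level; check the pyramid depth with slide.slide.level_count |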
| Stain normalization produces black tiles | Source slide too low contrast or failed tissue detection | Apply TissueDetectionHE before normalization; inspect tissue mask coverage |
| KeyError: 'nuclei' in tile.masks | Nucleus detection pipeline not yet run | Run a pipeline containing NucleusDetectionHE(mask_name="nuclei") with slide.run() before accessing the mask |
| Very slow tile generation | High-resolution level 0 on large SVS | Use a lower pyramid level (level=1 or level=2) for faster prototyping |
| AttributeError: SlideData has no attribute 'write' | Old PathML version | pip install --upgrade pathml to get HDF5 save/load support |