skills/scientific-computing/exploratory-data-analysis/SKILL.md
Methodology for exploratory data analysis on scientific files. Decision frameworks by data type (tabular, sequence, image, spectral, structural, omics), quality assessment, report generation, format detection across 200+ formats. Use when given a data file for initial exploration or to pick an analysis before a pipeline.
`npx skillsauth add jaechang-hits/scicraft exploratory-data-analysis`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Exploratory data analysis (EDA) is the systematic examination of scientific data files to understand their structure, content, quality, and characteristics before formal analysis. This knowhow covers methodology for detecting file types, selecting appropriate analysis approaches, assessing data quality, and generating comprehensive reports across all major scientific data domains.
| Category | Common Formats | Typical Analysis | Key Libraries |
|----------|---------------|-----------------|---------------|
| Tabular | CSV, TSV, XLSX, Parquet | Summary statistics, distributions, correlations, missing values | pandas, polars |
| Sequence | FASTA, FASTQ, SAM/BAM | Length distribution, quality scores, GC content, alignment stats | BioPython, pysam |
| Image/Microscopy | TIFF, ND2, CZI, DICOM | Dimensions (XYZCT), intensity stats, metadata, calibration | tifffile, aicsimageio, nd2reader |
| Spectral | mzML, SPC, JCAMP, FID | Peak detection, baseline, S/N ratio, resolution | pymzml, nmrglue, pyteomics |
| Structural | PDB, CIF, MOL, SDF | Atom counts, bond validation, B-factors, completeness | BioPython, RDKit, MDAnalysis |
| Array/Tensor | NPY, HDF5, Zarr, NetCDF | Shape, dtype, value range, NaN/Inf check, chunk structure | numpy, h5py, zarr, xarray |
| Omics | H5AD, MTX, VCF, BED | Feature/sample counts, sparsity, annotation completeness | scanpy, pyranges, cyvcf2 |
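Identify formats from magic bytes when the extension is unknown (e.g., HDF5: `\x89HDF`, GZIP: `\x1f\x8b`).
Handle compound extensions such as `.ome.tiff`, `.nii.gz`, and `.tar.gz` by checking from the rightmost extension inward.

Data file received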
├── What is the file type?
│ ├── Known extension → Look up in format reference
│ ├── Unknown extension → Magic bytes / content sniffing
│ └── Directory (e.g., .d, .zarr) → Check internal structure
│
├── What category does it belong to?
│ ├── Tabular → Summary stats, distributions, correlations
│ ├── Sequence → Length/quality distributions, composition
│ ├── Image → Dimensions, channels, intensity, metadata
│ ├── Spectral → Peaks, baseline, resolution, S/N
│ ├── Structural → Atom/bond validation, geometry checks
│ ├── Array → Shape, dtype, value range, sparsity
│ └── Omics → Feature counts, sample QC, annotation check
│
├── How large is the file?
│ ├── Small (<100 MB) → Load fully, comprehensive analysis
│ ├── Medium (100 MB–1 GB) → Sample or lazy evaluation
│ └── Large (>1 GB) → Stream/chunk, representative sampling
│
└── What is the analysis goal?
  ├── Pre-pipeline QC → Focus on completeness, format compliance
  ├── Data understanding → Statistics, distributions, patterns
  ├── Troubleshooting → Compare against expected format/values
  └── Documentation → Full report with recommendations
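
The first branch of the tree (extension lookup, then magic bytes) can be sketched with the standard library alone. This is a minimal illustration, not the skill's own implementation: `KNOWN_EXTENSIONS`, `MAGIC_SIGNATURES`, `match_extension`, and `sniff_magic` are hypothetical names, and the lookup tables are deliberately tiny stand-ins for the format reference.

```python
from pathlib import Path

# Byte signatures checked against the start of the file.
MAGIC_SIGNATURES = {
    b"\x89HDF\r\n\x1a\n": "hdf5",
    b"\x1f\x8b": "gzip",
}

# Hypothetical mini lookup table; a real one would cover far more formats.
KNOWN_EXTENSIONS = {
    ".ome.tiff": "image",
    ".nii.gz": "image",
    ".tar.gz": "archive",
    ".tiff": "image",
    ".csv": "tabular",
    ".gz": "compressed",
}


def match_extension(filename):
    """Return the longest known (possibly compound) extension, or None."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    best = None
    candidate = ""
    # Start at the rightmost suffix and grow the candidate inward,
    # keeping the longest candidate found in the lookup table.
    for suffix in reversed(suffixes):
        candidate = suffix + candidate
        if candidate in KNOWN_EXTENSIONS:
            best = candidate
    return best


def sniff_magic(path):
    """Return a format name if the file starts with a known signature."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for signature, name in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None
```

For example, `match_extension("sample.ome.tiff")` resolves to `".ome.tiff"` rather than `".tiff"`, and a file with no recognized extension can fall through to `sniff_magic`.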
| Data Type | First Check | Core Analysis | Visualization |
|-----------|------------|---------------|---------------|
| Tabular | dtypes, shape, nulls | describe(), correlations, outliers | histograms, scatter, heatmap |
| Sequence | record count, format | length dist., quality, composition | quality plots, length histogram |
| Image | dimensions, bit depth | intensity stats, channel info | thumbnail, histogram |
| Spectral | scan count, m/z range | peak detection, TIC, baseline | spectrum plot, TIC chromatogram |
| Structural | atom/residue count | B-factors, missing residues | Ramachandran, contact map |
| Array | shape, dtype | statistics, NaN check | slice visualization |
| Omics | genes × cells matrix | sparsity, QC metrics | violin plots, PCA |
`pl.scan_parquet()`, h5py dataset slicing, pysam indexed access prevent memory overflows

Assuming CSV means clean tabular data: CSV files can have inconsistent delimiters, mixed encodings, embedded newlines, or malformed quoting. How to avoid: Use `pd.read_csv(engine='python')` for robustness; check encoding with chardet
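
Before handing a delimited file to a reader, it can help to guess its encoding and delimiter from a small sample. The sketch below uses only the standard library (`csv.Sniffer` and a short list of candidate encodings) instead of chardet. The function name `sniff_csv` and the candidate lists are assumptions for illustration.

```python
import codecs
import csv


def sniff_csv(path, encodings=("utf-8", "latin-1"), sample_bytes=64_000):
    """Guess (encoding, delimiter) from the first bytes of a delimited file."""
    with open(path, "rb") as handle:
        raw = handle.read(sample_bytes)
    for encoding in encodings:
        try:
            # The incremental decoder tolerates a multi-byte character
            # that was cut off at the end of the sample.
            text = codecs.getincrementaldecoder(encoding)().decode(raw)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings decoded the sample")
    # Raises csv.Error if no consistent delimiter is found.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    return encoding, dialect.delimiter
```

The returned values can then be passed on explicitly, for example as the `encoding` and `sep` arguments of `pd.read_csv`. Note that `latin-1` decodes any byte sequence, so it only works as a last-resort fallback.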
Ignoring missing value encoding: scientific data uses diverse null representations such as NA, NaN, -999, empty string, #N/A, and `.`. How to avoid: Specify the na_values parameter; check for sentinel values in numeric columns
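
One way to spot such encodings before parsing is to count suspicious tokens per column while every value is still a string. This is a minimal sketch: `NULL_TOKENS` mirrors the list above, and `count_null_tokens` is a hypothetical helper that expects rows shaped like those produced by `csv.DictReader`.

```python
from collections import Counter

# Tokens taken from the pitfall above; extend per instrument or database.
NULL_TOKENS = {"", "NA", "NaN", "-999", "#N/A", "."}


def count_null_tokens(rows, null_tokens=NULL_TOKENS):
    """Count suspected null encodings per column in rows of string values."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            # csv.DictReader yields None for fields missing from short rows.
            if (value or "").strip() in null_tokens:
                counts[column] += 1
    return dict(counts)


rows = [
    {"od": "0.4", "temp": "-999"},
    {"od": "NA", "temp": "21.5"},
    {"od": ".", "temp": ""},
]
print(count_null_tokens(rows))
```

Columns with non-zero counts are candidates for an explicit `na_values` list when the file is read for real.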
Drawing conclusions from truncated files — large file transfers can fail silently. How to avoid: Check file size, verify record counts against expected values, check for EOF markers
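
For gzip-compressed inputs, one inexpensive integrity check is to decompress the whole stream and see whether it reaches its end-of-stream marker. The sketch below covers only gzip; other formats need their own EOF checks. `gzip_is_complete` is a hypothetical helper name, and it treats any read error (including a missing file) as "not complete".

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip stream decompresses through to its end."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard in chunks so memory use stays bounded.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # A truncated stream typically surfaces as EOFError.
        return False
    return True
```

A passing check shows only that the compressed container is intact. Record counts should still be compared against expected values.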
Applying wrong reader to file — some extensions are ambiguous (.raw = Thermo MS, XRD, or image; .d = Agilent directory or generic data). How to avoid: Use magic bytes and context (source instrument) to disambiguate
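Memory overflow on large datasets: loading a 10 GB CSV into a pandas DataFrame can exhaust available RAM on a typical workstation, because the in-memory representation is often larger than the file on disk. How to avoid: Check file size first; use chunked reading, lazy evaluation, or sampling for files >100 MB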
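
The size thresholds from the decision tree can be encoded as a small helper that runs before any reader is chosen. This is a sketch under assumptions: the function names and strategy labels are hypothetical, and binary units (1 MB = 1024² bytes) are an arbitrary choice the source does not specify.

```python
import os

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def loading_strategy(size_bytes):
    """Map a file size onto the three tiers of the decision tree."""
    if size_bytes < 100 * MB:
        return "load_fully"
    if size_bytes <= 1 * GB:
        return "sample_or_lazy"
    return "stream_or_chunk"


def strategy_for_file(path):
    """Look up the on-disk size and pick a loading strategy."""
    return loading_strategy(os.path.getsize(path))
```

Compressed files deserve extra caution, since their on-disk size understates the memory needed after decompression.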
Ignoring coordinate systems and units — microscopy data may use pixels vs microns; spectroscopy data may use wavelength vs wavenumber vs energy. How to avoid: Extract and report units from metadata; verify calibration information
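
Reporting values in more than one unit makes mismatches easier to catch. The conversions below are standard physics, but the function names are hypothetical, and the pixel size must come from the file's calibration metadata rather than being guessed.

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV·nm


def wavelength_nm_to_wavenumber_cm1(wavelength_nm):
    """Convert wavelength in nm to wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1e7 / wavelength_nm


def wavelength_nm_to_energy_ev(wavelength_nm):
    """Convert wavelength in nm to photon energy in eV."""
    return HC_EV_NM / wavelength_nm


def pixels_to_microns(length_px, microns_per_pixel):
    """Scale a pixel length by the calibrated pixel size."""
    return length_px * microns_per_pixel
```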
Treating all columns as independent — scientific tabular data often has hierarchical structure (replicates nested within conditions). How to avoid: Identify experimental design from column names and metadata before computing correlations
Skipping format-specific quality metrics — generic statistics miss domain-specific issues (e.g., Phred quality scores in FASTQ, R-factors in crystallography, mass accuracy in MS). How to avoid: Consult the format reference for domain-specific QC metrics
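
As one example of a domain-specific metric, FASTQ quality strings can be decoded into Phred scores with plain Python. The sketch assumes the common Phred+33 ASCII offset; older Illumina files may use +64, so the offset is a parameter. The function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of the Phred scores, or 0.0 for an empty string."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


print(phred_scores("I!5"))  # 'I' -> 40, '!' -> 0, '5' -> 20
```

The arithmetic mean is a quick summary only. Because Phred scores are logarithmic, averaging the underlying error probabilities gives a different and often more conservative picture.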
Overinterpreting small samples — EDA on first 1000 rows may not represent the full dataset's distribution. How to avoid: Sample from multiple positions in the file; report sample size and sampling method
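
Reservoir sampling is one way to draw a uniform sample from the whole file in a single streaming pass, without loading it into memory. This is a swapped-in technique for illustration, not one named by the source. `reservoir_sample` is a hypothetical helper, and the fixed seed makes the reported sample reproducible.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = line
    return sample
```

When sampling a text file, read the header line first and pass the remaining file handle to `reservoir_sample`, then record both `k` and the seed in the report.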
Not checking for duplicates — duplicate records are common in merged datasets and database exports. How to avoid: Check for exact and near-duplicates early; report the duplication rate
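
A duplication rate can be computed directly from record counts. In this sketch, "near-duplicate" is narrowed to records that match after trimming whitespace and lowercasing, which is an assumption; real near-duplicate detection may need fuzzy matching. Both function names are hypothetical.

```python
from collections import Counter


def normalize(record):
    """Trim whitespace and lowercase each field of a record."""
    return tuple(field.strip().lower() for field in record)


def duplication_rate(records):
    """Fraction of records that repeat an earlier, identical record."""
    counts = Counter(records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


records = [("a", "1"), ("a", "1"), ("b", "2"), ("A ", "1")]
exact_rate = duplication_rate(records)
near_rate = duplication_rate([normalize(r) for r in records])
```

Here one of the four records is an exact repeat, and a second one becomes a repeat only after normalization. Reporting both rates shows how much duplication is hidden by formatting differences.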
.ome.tiff, .nii.gz)
pip install command)

Generate a structured markdown report containing:
Save as {original_filename}_eda_report.md.
references/file_format_reference.md — Quick-reference catalog of the most common scientific file formats across all 6 categories (bioinformatics, chemistry, microscopy, spectroscopy, proteomics/metabolomics, general), with extension, description, Python library, and key EDA approach for each format.

Not migrated from original: The 6 category-specific format catalog files (3,616 lines total) contained detailed entries for 200+ formats. The bundled reference consolidates the ~50 most commonly encountered formats. For rare or vendor-specific formats, consult official library documentation.