skills/genomics-bioinformatics/star-rna-seq-aligner/SKILL.md
Splice-aware RNA-seq aligner producing sorted BAM and splice junction tables. Builds genome index, runs two-pass alignment for better junctions. Outputs sorted BAM, junctions (SJ.out.tab), stats (Log.final.out), optional gene counts. Use Salmon for fast pseudoalignment; STAR when a BAM is needed for variant calling, IGV, or ENCODE pipelines.
npx skillsauth add jaechang-hits/scicraft star-rna-seq-alignerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
STAR (Spliced Transcripts Alignment to a Reference) aligns RNA-seq reads to a genome in a splice-aware manner, identifying novel and annotated splice junctions in a single pass. It generates coordinate-sorted BAM files compatible with samtools, IGV, deeptools, and GATK. STAR's 2-pass mode re-aligns reads using junctions discovered in the first pass, improving sensitivity for novel splice sites. With --quantMode GeneCounts, STAR simultaneously produces gene-level read count tables without requiring a separate featureCounts or HTSeq step.
--quantMode GeneCounts--outFilterMismatchNmaxCheck before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v STARfirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run STARrather than bareSTAR.
# Install with conda (recommended)
conda install -c bioconda star
# Verify
STAR --version
# STAR_2.7.11a
# Or compile from source
git clone https://github.com/alexdobin/STAR
cd STAR/source && make STAR
# 1. Generate genome index (~30 min, run once)
STAR --runMode genomeGenerate \
--runThreadN 8 \
--genomeDir genome/star_index \
--genomeFastaFiles genome/GRCh38.fa \
--sjdbGTFfile genome/gencode.v47.gtf \
--sjdbOverhang 100 # ReadLength - 1
# 2. Align paired-end reads (~10-20 min)
STAR --runThreadN 8 \
--genomeDir genome/star_index \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample/
# 3. Index the BAM
samtools index results/sample/Aligned.sortedByCoord.out.bam
Download a genome FASTA and matching GTF annotation (same assembly version).
# Download GRCh38 genome and GENCODE annotation
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v47.primary_assembly.annotation.gtf.gz
mkdir -p genome/star_index
echo "Genome and GTF ready."
ls -lh GRCh38.primary_assembly.genome.fa gencode.v47.primary_assembly.annotation.gtf
Build the STAR genome index — required once per genome/read-length combination.
# Standard human genome index (requires ~32 GB RAM)
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir genome/star_index/ \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v47.primary_assembly.annotation.gtf \
--sjdbOverhang 100
# For small genomes (e.g., E. coli ~4.6 Mb), reduce genomeSAindexNbases
# STAR --runMode genomeGenerate \
# --genomeSAindexNbases 11 \
# --genomeDir genome/ecoli_index/ ...
echo "Index complete: $(ls genome/star_index/ | wc -l) files"
Align single-end or paired-end FASTQ files to the indexed genome.
# Single-end alignment
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS NM MD \
--outFileNamePrefix results/sample1/
# Paired-end alignment
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS NM MD \
--outFileNamePrefix results/sample1/
echo "BAM: results/sample1/Aligned.sortedByCoord.out.bam"
Two-pass mode collects splice junctions from the first pass and uses them as annotation for the second pass.
# First pass — collect splice junctions
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype None \
--outFileNamePrefix pass1/sample1/
# Second pass — realign with all junctions from pass 1
SJ_FILES=$(ls pass1/*/SJ.out.tab | tr '\n' ' ')
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--sjdbFileChrStartEnd $SJ_FILES \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample1/
# Alternative: single-command 2-pass
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--twopassMode Basic \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix results/sample1/
Parse the alignment log to assess mapping rate and read quality.
# View the alignment summary
cat results/sample1/Log.final.out
# Parse key metrics with python
python3 - << 'EOF'
import re, sys
from pathlib import Path
log = Path("results/sample1/Log.final.out").read_text()
metrics = {}
for line in log.splitlines():
if "|" in line:
key, _, val = line.partition("|")
metrics[key.strip()] = val.strip()
print(f"Unique mapping: {metrics.get('Uniquely mapped reads %', 'N/A')}")
print(f"Multi-mapping: {metrics.get('% of reads mapped to multiple loci', 'N/A')}")
print(f"Too many mismatches:{metrics.get('% of reads unmapped: too many mismatches', 'N/A')}")
print(f"Total input reads: {metrics.get('Number of input reads', 'N/A')}")
EOF
Enable simultaneous gene counting during alignment using --quantMode GeneCounts.
# Align and count simultaneously
STAR --runThreadN 8 \
--genomeDir genome/star_index/ \
--readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix results/sample1/
# ReadsPerGene.out.tab has 4 columns:
# gene_id unstranded stranded_fwd stranded_rev
head results/sample1/ReadsPerGene.out.tab
# Load into pandas (select column based on library strandedness)
python3 - << 'EOF'
import pandas as pd
df = pd.read_csv("results/sample1/ReadsPerGene.out.tab",
sep="\t", header=None, skiprows=4,
names=["gene_id", "unstranded", "fwd", "rev"])
# For unstranded library: use column 2 (unstranded)
counts = df.set_index("gene_id")["unstranded"]
print(f"Genes with counts > 0: {(counts > 0).sum()}")
print(counts[counts > 0].sort_values(ascending=False).head())
EOF
| Parameter | Default | Range/Options | Effect |
|-----------|---------|---------------|--------|
| --runThreadN | 1 | 1–64 | CPU threads for alignment |
| --sjdbOverhang | 99 | ReadLength-1 | Splice junction overhang; set to ReadLength-1 |
| --outSAMtype | SAM | BAM SortedByCoordinate, BAM Unsorted | Output format and sort order |
| --outFilterMismatchNmax | 10 | 0–33 | Max mismatches per read; lower for stricter mapping |
| --outFilterMultimapNmax | 10 | 1–9999 | Max genomic loci per read; reads exceeding limit marked unmapped |
| --quantMode | – | GeneCounts, TranscriptomeSAM | Enable gene counting or transcriptome BAM |
| --twopassMode | None | None, Basic | Enable 2-pass alignment for novel junction discovery |
| --alignIntronMax | 1000000 | 1–1e9 | Maximum intron length; reduce for bacterial genomes |
| --outReadsUnmapped | None | Fastx | Write unmapped reads to FASTQ |
| --genomeSAindexNbases | 14 | 10–14 | SA index size; set log2(GenomeSize)/2 − 1 for small genomes |
#!/bin/bash
# Align all paired-end samples in a directory
SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
INDEX="genome/star_index"
DATA="data"
OUT="results"
THREADS=12
mkdir -p "$OUT"
for sample in "${SAMPLES[@]}"; do
echo "Aligning: $sample"
mkdir -p "$OUT/$sample"
STAR --runThreadN "$THREADS" \
--genomeDir "$INDEX" \
--readFilesIn "$DATA/${sample}_R1.fastq.gz" "$DATA/${sample}_R2.fastq.gz" \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--twopassMode Basic \
--outFileNamePrefix "$OUT/$sample/"
samtools index "$OUT/$sample/Aligned.sortedByCoord.out.bam"
echo "Done: $sample — $(grep 'Uniquely mapped reads %' $OUT/$sample/Log.final.out | awk '{print $NF}')"
done
import pandas as pd
from pathlib import Path
results_dir = Path("results")
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
strandedness = "unstranded" # or "fwd" / "rev"
col_map = {"unstranded": 1, "fwd": 2, "rev": 3}
col = col_map[strandedness]
counts = {}
for sample in samples:
count_file = results_dir / sample / "ReadsPerGene.out.tab"
df = pd.read_csv(count_file, sep="\t", header=None, skiprows=4)
counts[sample] = df.set_index(0)[col]
matrix = pd.DataFrame(counts)
matrix = matrix[matrix.sum(axis=1) > 0] # drop zero-count genes
matrix.to_csv("gene_count_matrix.tsv", sep="\t")
print(f"Count matrix: {matrix.shape} (genes × samples)")
print(matrix.head())
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
# Load count matrix from STAR output
counts = pd.read_csv("gene_count_matrix.tsv", sep="\t", index_col=0).T
metadata = pd.DataFrame({
"condition": ["control", "control", "treated", "treated"]
}, index=counts.index)
# Run DESeq2
dds = DeseqDataSet(counts=counts, metadata=metadata,
design_factors="condition",
inference=DefaultInference(n_cpus=4))
dds.deseq2()
print("DESeq2 complete — see dds.varm['LFC'] for results")
| Output | Format | Description |
|--------|--------|-------------|
| Aligned.sortedByCoord.out.bam | BAM | Coordinate-sorted aligned reads; index with samtools index |
| SJ.out.tab | TSV | Splice junction table with coverage, motif, and novelty flags |
| Log.final.out | Text | Alignment statistics: unique mapping %, multimappers %, etc. |
| ReadsPerGene.out.tab | TSV | Gene counts (4 columns: unstranded/fwd/rev) when --quantMode GeneCounts |
| Unmapped.out.mate1/2 | FASTQ | Unmapped reads (when --outReadsUnmapped Fastx) |
| Log.out | Text | Verbose run log; check for warnings and parameter echoes |
| Problem | Cause | Solution |
|---------|-------|----------|
| Unique mapping < 60% | Wrong genome assembly or species contamination | Verify genome FASTA matches sample species; run FastQC to check overrepresented sequences |
| Fatal error: genome files not found | Wrong --genomeDir path or incomplete index | Re-run genomeGenerate; check genomeDir contains Genome, SA, SAindex files |
| Out of memory during genome generation | Not enough RAM for genome SA index | Add --genomeSAindexNbases 13 (or lower) for small genomes; request ≥32 GB RAM for human |
| .gz files not decompressed | Missing --readFilesCommand zcat | Add --readFilesCommand zcat for gzip-compressed inputs |
| Error: number of input files differ | R1/R2 read count mismatch | Verify FASTQ files with zcat file.fastq.gz | wc -l; re-download if corrupted |
| ReadsPerGene.out.tab missing | --quantMode GeneCounts not set | Re-run with --quantMode GeneCounts or use featureCounts on BAM |
| Very high multimapping (>20%) | Highly repetitive genome or wrong --outFilterMultimapNmax | Reduce --outFilterMultimapNmax; use --outSAMmultNmax 1 to output only one alignment per read |
| Genome index takes too long | Large genome + slow disk | Use SSD storage; pre-built indices available from ENCODE and Ensembl |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.