skills/hpc-deployment/hpc-runtime-doctor/SKILL.md
Diagnose HPC runtime and scheduler problems for materials simulations, including MPI/OpenMP/GPU layout, modules, CUDA/Kokkos hints, scratch paths, walltime, job arrays, restart strategy, scheduler portability, and resource mismatch. Use when jobs fail, run slowly, get killed, or behave differently on a cluster than on a workstation.
npx skillsauth add HeshamFS/materials-simulation-skills hpc-runtime-doctor
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Turn cluster symptoms into a resource-layout diagnosis, environment checklist, and safe retry plan.
| Input | Description | Example |
|-------|-------------|---------|
| Scheduler | SLURM, PBS, LSF, local | slurm |
| Nodes/tasks/threads | Runtime layout | 2 nodes, 128 tasks, 2 threads |
| GPUs | GPUs requested | 4 |
| Symptoms | Observed failure | oom,killed,slow-gpu |
| MPI/OpenMP/GPU use | Parallel modes | mpi+openmp+gpu |
| Walltime | Requested time | 12:00:00 |
| Scratch | Whether scratch is used | true |
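
The layout inputs in the table can be sanity-checked with simple arithmetic before a retry. The following is a minimal sketch, not the bundled script's implementation: the `RuntimeLayout` class and its method names are hypothetical, and it assumes the GPU count is a job-wide total rather than a per-node figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeLayout:
    """Hypothetical container for the layout inputs listed in the table."""

    nodes: int
    tasks: int
    cpus_per_task: int
    gpus: int = 0  # assumed to be the total across the job

    def tasks_per_node(self) -> float:
        # MPI ranks that land on each node if tasks are spread evenly.
        return self.tasks / self.nodes

    def cores_per_node(self) -> float:
        # CPU cores each node must provide: ranks per node times threads per rank.
        return self.tasks_per_node() * self.cpus_per_task

    def tasks_per_gpu(self) -> Optional[float]:
        # Ranks sharing one GPU; None when the job requests no GPUs.
        return self.tasks / self.gpus if self.gpus else None


# Example row from the table: 2 nodes, 128 tasks, 2 threads, 4 GPUs.
layout = RuntimeLayout(nodes=2, tasks=128, cpus_per_task=2, gpus=4)
print("tasks per node:", layout.tasks_per_node())
print("cores per node:", layout.cores_per_node())
print("tasks per GPU:", layout.tasks_per_gpu())
```

If the cores-per-node figure exceeds what the target node type physically offers, or many ranks share one GPU, the layout itself is a plausible cause of symptoms such as `oom` or `slow-gpu`.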
scripts/hpc_runtime_doctor.py emits:
- resource_layout
- diagnoses
- environment_checks
- retry_plan
- scheduler_notes

python3 skills/hpc-deployment/hpc-runtime-doctor/scripts/hpc_runtime_doctor.py \
--scheduler slurm \
--nodes 2 \
--tasks 128 \
--cpus-per-task 2 \
--gpus 4 \
--symptoms oom,slow-gpu \
--uses-mpi \
--uses-openmp \
--uses-gpu \
--json
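
The command above can also be assembled and run from Python. This is a sketch under stated assumptions: `build_command` and `run_doctor` are hypothetical helpers, only flags that appear in the command above are used, and `run_doctor` assumes that `--json` makes the script print a single JSON document to stdout.

```python
import json
import subprocess

SCRIPT = "skills/hpc-deployment/hpc-runtime-doctor/scripts/hpc_runtime_doctor.py"


def build_command(scheduler, nodes, tasks, cpus_per_task, gpus, symptoms,
                  uses_mpi=False, uses_openmp=False, uses_gpu=False):
    """Build the argv list using only the flags shown in the command above."""
    cmd = [
        "python3", SCRIPT,
        "--scheduler", scheduler,
        "--nodes", str(nodes),
        "--tasks", str(tasks),
        "--cpus-per-task", str(cpus_per_task),
        "--gpus", str(gpus),
        "--symptoms", ",".join(symptoms),
    ]
    if uses_mpi:
        cmd.append("--uses-mpi")
    if uses_openmp:
        cmd.append("--uses-openmp")
    if uses_gpu:
        cmd.append("--uses-gpu")
    cmd.append("--json")
    return cmd


def run_doctor(cmd):
    """Run the script and parse stdout (assumed to be one JSON document)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(f"doctor exited with {result.returncode}: {result.stderr}")
    return json.loads(result.stdout)


cmd = build_command("slurm", 2, 128, 2, 4, ["oom", "slow-gpu"],
                    uses_mpi=True, uses_openmp=True, uses_gpu=True)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when symptom names contain unusual characters.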
Invalid resource counts stop with exit code 2. Unknown symptoms are preserved as custom items for human review.
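
The two behaviours above can be illustrated with a short sketch. It is not the script's actual code: the `KNOWN_SYMPTOMS` set contains only the symptom names that appear as examples in this document, and the rule for what counts as an invalid resource count (fewer than one node, task, or CPU per task, or a negative GPU count) is an assumption.

```python
# Assumed set: only the symptom names used as examples in this document.
KNOWN_SYMPTOMS = {"oom", "killed", "slow-gpu"}


def split_symptoms(raw):
    """Split a comma-separated string into (known, custom) symptom lists."""
    known, custom = [], []
    for item in raw.split(","):
        name = item.strip().lower()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        (known if name in KNOWN_SYMPTOMS else custom).append(name)
    return known, custom


def validate_counts(nodes, tasks, cpus_per_task, gpus):
    """Return error messages for counts below their assumed minimum."""
    checks = (
        ("nodes", nodes, 1),
        ("tasks", tasks, 1),
        ("cpus-per-task", cpus_per_task, 1),
        ("gpus", gpus, 0),
    )
    return [f"{label} must be >= {minimum}, got {value}"
            for label, value, minimum in checks if value < minimum]


def exit_code_for(errors):
    """Map validation errors to the exit code named in the text above."""
    return 2 if errors else 0


known, custom = split_symptoms("oom,weird-hang")
errors = validate_counts(nodes=0, tasks=128, cpus_per_task=2, gpus=4)
print(known, custom, errors, exit_code_for(errors))
```

Keeping unknown symptoms in a separate list, rather than rejecting them, lets a reviewer see exactly what the user reported.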
This skill does not query a live scheduler. It diagnoses from the submitted layout and symptoms.
Bash only to run its bundled script.
references/hpc_runtime_patterns.md for scheduler and runtime diagnosis patterns.