engineering/devops/skills/docker-kubernetes/SKILL.md
This skill should be used when the user asks about "Dockerfile", "Docker image", "containerize", "docker build", "docker-compose", "Kubernetes", "k8s", "kubectl", "pod", "deployment", "service", "ingress", "namespace", "Helm chart", "ConfigMap", "Secret", "PersistentVolume", "RBAC", "resource limits", "liveness probe", "readiness probe", "pod scheduling", "node affinity", "taint", "toleration", "StatefulSet", "DaemonSet", "CronJob", "HPA", "VPA", or "cluster". Also trigger for "why is my pod crashing", "OOMKilled", "CrashLoopBackOff", "ImagePullBackOff", or "Pending" pod states.
Production-grade guidance for building container images and orchestrating workloads on Kubernetes.
Pin base image versions. Never use latest. Use node:20.11-alpine3.19 not node:latest. Unpinned tags change without warning and break reproducible builds.
Use distroless or Alpine for production. Smaller attack surface, faster pulls, less to patch. gcr.io/distroless/nodejs20-debian11 is a good default for Node.js.
Never run as root. Add a non-root user and switch to it before the final CMD:
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
One process per container. Don't run nginx + app + cron in one container. Use separate containers and orchestrate with Kubernetes.
Use .dockerignore. Exclude node_modules/, .git/, *.log, **/*.test.ts, .env. A bloated build context slows everything.
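A minimal sketch of a `.dockerignore` holding the patterns listed above, written from the shell. It assumes the command runs in the build context root and that overwriting any existing `.dockerignore` is acceptable.

```shell
# Write a .dockerignore in the build context root.
# Assumption: the current directory is the one passed to `docker build`;
# the redirect overwrites an existing file.
cat > .dockerignore <<'EOF'
node_modules/
.git/
*.log
**/*.test.ts
.env
EOF
```

Docker reads this file from the root of the build context, so it needs to sit next to the directory you pass to `docker build`.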
# Stage 1: Build
FROM node:20.11-alpine3.19 AS builder
WORKDIR /app
COPY package*.json ./
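# Install all dependencies; the build step typically needs devDependencies
RUN npm ci
COPY . .
# Build, then drop devDependencies so only production modules reach the runtime stage
RUN npm run build && npm prune --omit=dev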
# Stage 2: Runtime (no dev deps, no source)
FROM node:20.11-alpine3.19 AS runtime
WORKDIR /app
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Order Dockerfile instructions from least-changed to most-changed:
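1. Base image (FROM)
2. System packages (apt-get, apk)
3. Dependency manifests (package.json, go.mod, requirements.txt)
4. Dependency install
5. Application source (COPY . .)

This way, a source code change only invalidates layers 5+, not the expensive dependency install.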
Always scan images before pushing to production:
# Trivy (recommended — free, fast)
trivy image <image>:<tag>
# Docker Scout (built into Docker Desktop)
docker scout cves <image>:<tag>
Fail CI/CD pipelines on HIGH or CRITICAL vulnerabilities.
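One way to enforce this gate is to let Trivy's exit code fail the job. The sketch below wraps the call in a helper; `scan_image` is a hypothetical name, and the image reference is whatever your pipeline builds.

```shell
# Fail the pipeline step when Trivy reports HIGH or CRITICAL findings.
# scan_image is a hypothetical helper name; pass the image reference as $1.
scan_image() {
  # --exit-code 1 makes trivy return non-zero if any finding matches --severity
  trivy image --severity HIGH,CRITICAL --exit-code 1 "$1"
}

# Example call in a CI job:
#   scan_image registry.example.com/api-service:1.2.3
```

Because the function returns Trivy's exit status, most CI systems will mark the step as failed without extra logic.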
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
  labels:
    app: api-service
    version: "1.2.3"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # At most 1 pod down at a time
      maxSurge: 1 # At most 1 extra pod during rollout
  template:
    metadata:
      labels:
        app: api-service
        version: "1.2.3"
    spec:
      # Always set a termination grace period
      terminationGracePeriodSeconds: 30
      # Security context: no root, read-only filesystem
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
      containers:
        - name: api-service
          image: registry.example.com/api-service:1.2.3
          ports:
            - containerPort: 3000
          # Resource requests and limits: ALWAYS set both
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          # Readiness: traffic only sent when ready
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          # Liveness: pod restarted if unhealthy
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          # Startup probe for slow-starting apps
          startupProbe:
            httpGet:
              path: /health/ready
              port: 3000
            failureThreshold: 30
            periodSeconds: 10
          # Env from ConfigMap and Secret: never hardcode
          envFrom:
            - configMapRef:
                name: api-service-config
            - secretRef:
                name: api-service-secrets
          # Read-only root filesystem (strong security posture)
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          # Writable temp dir if needed
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      # Avoid running all replicas on the same node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["api-service"]
                topologyKey: kubernetes.io/hostname
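To see the rolling update strategy in action, it can help to apply the manifest and block until the rollout finishes. The sketch below does that and reverts on failure; `deploy_and_wait` is a hypothetical helper, the 120s timeout is an assumed value, and the Deployment name and namespace are taken from the manifest above.

```shell
# Apply a manifest and wait for the rolling update; roll back if it stalls.
# deploy_and_wait is a hypothetical helper; the 120s timeout is an assumption.
deploy_and_wait() {
  manifest="$1"
  kubectl apply -f "$manifest" || return 1
  if ! kubectl rollout status deployment/api-service -n production --timeout=120s; then
    # Rollout did not complete in time: revert to the previous revision
    kubectl rollout undo deployment/api-service -n production
    return 1
  fi
}

# Example call:
#   deploy_and_wait deployment.yaml
```

With `maxUnavailable: 1` and `maxSurge: 1`, `kubectl rollout status` reports progress as pods are replaced one at a time and exits non-zero if the timeout is reached first.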
| Service Type | CPU Request | CPU Limit | Mem Request | Mem Limit |
|---|---|---|---|---|
| Lightweight API | 50m | 250m | 64Mi | 256Mi |
| Standard API | 100m | 500m | 128Mi | 512Mi |
| CPU-intensive (ML inference) | 500m | 2000m | 512Mi | 2Gi |
| Worker/queue consumer | 100m | 1000m | 256Mi | 1Gi |
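Rule: Requests = what the scheduler reserves for the pod when placing it. Limits = the hard ceiling: CPU use above the limit is throttled, and memory use above the limit gets the container OOMKilled. Set limits to 3–5× requests for burstable workloads.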
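As a quick check of the ratio rule against the table above, the following sketch computes limit ÷ request for CPU values in millicores. `limit_ratio` is a hypothetical helper and only understands the `<n>m` form.

```shell
# Ratio of CPU limit to CPU request, both given in millicores (e.g. 100m, 500m).
# limit_ratio is a hypothetical helper; it handles only values ending in "m".
limit_ratio() {
  req="${1%m}"   # strip the trailing "m"
  lim="${2%m}"
  echo $(( $lim / $req ))
}

limit_ratio 100m 500m    # Standard API row: prints 5
limit_ratio 500m 2000m   # CPU-intensive row: prints 4
```

The Worker/queue consumer row (100m to 1000m) comes out at 10, so treat the 3–5× range as a starting point and adjust it from observed usage.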
Readiness probe: "Am I ready to serve traffic?"
GET /health/ready

Liveness probe: "Am I alive (not deadlocked)?"
GET /health/live (simpler check than readiness)

Common mistake: Using the same endpoint for both. If your DB is down, liveness should still return 200 (the app is alive, just degraded). Readiness should return 503.
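To confirm the two endpoints really behave differently, you can print the status code of each. This is a sketch under assumptions: the app is reachable on `localhost:3000` (for example through `kubectl port-forward`), and `probe_status` is a hypothetical helper name.

```shell
# Print the HTTP status of the liveness and readiness endpoints.
# Assumptions: the app answers on localhost:3000 unless a base URL is passed;
# probe_status is a hypothetical helper name.
probe_status() {
  base="${1:-http://localhost:3000}"
  for path in /health/live /health/ready; do
    # -w '%{http_code}' prints only the status code; the body is discarded
    code="$(curl -s -o /dev/null -w '%{http_code}' "$base$path")"
    echo "$path $code"
  done
}
```

With the database down, an app that follows the split described above would report `/health/live 200` and `/health/ready 503`.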
Every workload gets its own namespace. Never use default for production workloads.
# Namespace per environment
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    env: production
Use ServiceAccounts with least-privilege RBAC. A pod should only have the permissions it actually needs:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-service-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-service-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api-service-sa
    namespace: production
roleRef:
  kind: Role
  name: api-service-role
  apiGroup: rbac.authorization.k8s.io
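A quick way to verify that the binding grants only what was intended is `kubectl auth can-i` with impersonation. The sketch below assumes your own user is allowed to impersonate ServiceAccounts; `check_sa_access` is a hypothetical helper name.

```shell
# Check what the ServiceAccount bound above is allowed to do.
# check_sa_access is a hypothetical helper; it needs impersonation rights.
check_sa_access() {
  sa="system:serviceaccount:production:api-service-sa"
  # Expected answer "yes": the Role grants get/watch/list on configmaps
  kubectl auth can-i list configmaps --as="$sa" -n production
  # Expected answer "no": the Role grants nothing on secrets
  kubectl auth can-i get secrets --as="$sa" -n production
}
```

Note that the RoleBinding only takes effect for pods that run as this account: the ServiceAccount `api-service-sa` has to exist in the namespace, and the pod spec has to set `serviceAccountName: api-service-sa`.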
Never store secrets in plain YAML committed to git. Sync them from an external secret manager instead, for example with the External Secrets Operator:
# External Secrets Operator example
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-service-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: ClusterSecretStore
  target:
    name: api-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: production/api-service
        property: database_url
Use Helm for reusable, parameterized manifests. Structure:
charts/
└── api-service/
    ├── Chart.yaml
    ├── values.yaml           # Default values
    ├── values-staging.yaml   # Staging overrides
    ├── values-prod.yaml      # Production overrides
    └── templates/
        ├── deployment.yaml
        ├── service.yaml
        ├── hpa.yaml
        ├── ingress.yaml
        └── _helpers.tpl
Always lint before deploying: helm lint charts/api-service
Template rendering preview: helm template api-service charts/api-service -f charts/api-service/values-prod.yaml
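The lint and render steps can be chained in front of the actual release so a broken chart never reaches the cluster. This is a sketch assuming Helm 3, a release named `api-service`, the `production` namespace, and execution from the repository root; `release_prod` is a hypothetical helper name and the 5m timeout is an assumed value.

```shell
# Lint, render, then install or upgrade; each step runs only if the previous one passed.
# Assumptions: Helm 3, run from the repository root; release_prod is a hypothetical name.
release_prod() {
  chart="charts/api-service"
  values="$chart/values-prod.yaml"
  helm lint "$chart" -f "$values" &&
    helm template api-service "$chart" -f "$values" > /dev/null &&
    helm upgrade --install api-service "$chart" -n production -f "$values" --wait --timeout 5m
}
```

`--wait` makes Helm block until the released resources report ready, so a failing readiness probe surfaces as a failed release instead of a silent bad deploy.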
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
| CrashLoopBackOff | App crashes on startup | Check kubectl logs <pod> --previous |
| OOMKilled | Memory limit too low | Increase limits.memory or fix memory leak |
| ImagePullBackOff | Wrong image tag or registry auth | Verify image exists; check imagePullSecrets |
| Pending | Insufficient node resources | Check kubectl describe pod for events |
| High P99 latency | CPU throttling | Increase limits.cpu or scale horizontally |
| Readiness failing | Dependency not ready | Check dependency health; add retry logic |
| Pods evicted | Node under memory pressure | Check kubectl get events for eviction events |
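The checks in the table can be bundled into a first-pass triage routine. In this sketch, `triage_pod` is a hypothetical helper; it takes a pod name and an optional namespace (defaulting to `production` as an assumption).

```shell
# First-pass triage for a misbehaving pod, following the table above.
# triage_pod is a hypothetical helper; $1 is the pod name, $2 the namespace.
triage_pod() {
  pod="$1"
  ns="${2:-production}"
  kubectl get pod "$pod" -n "$ns" -o wide            # phase, restart count, node
  kubectl describe pod "$pod" -n "$ns"               # events: scheduling, image pulls, OOMKilled
  kubectl logs "$pod" -n "$ns" --previous --tail=50  # output of the last crashed container
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 20
}

# Example call:
#   triage_pod <pod> production
```

`kubectl logs --previous` returns an error when the container has never restarted; that is expected for `Pending` or `ImagePullBackOff` pods, where the `describe` output is the useful part.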
For production-grade manifests and container hardening templates, see:
references/k8s-manifests.md: complete Deployment, Service, HPA, PDB, NetworkPolicy, and RBAC manifests ready to adapt
references/docker-patterns.md: multi-stage Dockerfiles, secrets handling, and container security hardening patterns