infrastructure/servers/gpu-server-management/SKILL.md
Set up and manage NVIDIA GPU servers for AI workloads — driver installation, CUDA toolkit, container toolkit, MIG partitioning, GPU health monitoring, and multi-GPU configuration for LLM inference and training.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add bagelhole/devops-security-agent-skills gpu-server-management
Provision, configure, and monitor NVIDIA GPU servers for AI inference and training workloads.
Use this skill when:
# Remove old drivers
sudo apt purge -y 'nvidia*' 'cuda*' 'libcuda*'
sudo apt autoremove -y
# Add the NVIDIA Container Toolkit package repository
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
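# Install driver and CUDA toolkit (560.x / CUDA 12.6 shown; check NVIDIA's release notes for the current production branch)
# Note: cuda-toolkit-12-6 is published in NVIDIA's CUDA apt repository, which must be configured separately from the container toolkit repository added above
sudo apt install -y nvidia-driver-560 cuda-toolkit-12-6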
# Install NVIDIA Container Toolkit (Docker GPU support)
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify
nvidia-smi
nvcc --version
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
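The verification commands above only show that the tools run. The following is a minimal sketch of an automated driver check, assuming GNU `sort -V` is available. The helper names (`version_ge`, `installed_driver_version`, `check_driver`) are hypothetical, and the minimum version is something you look up yourself in NVIDIA's compatibility matrix.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when dotted version A is >= B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# installed_driver_version: prints the driver version reported for the first GPU.
installed_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# check_driver MIN: compares the installed driver against MIN and reports the result.
check_driver() {
  local current
  current=$(installed_driver_version)
  if [ -z "$current" ]; then
    echo "nvidia-smi returned no driver version" >&2
    return 2
  fi
  if version_ge "$current" "$1"; then
    echo "driver $current OK (>= $1)"
  else
    echo "driver $current is older than $1" >&2
    return 1
  fi
}
```

After sourcing the file, `check_driver "$MIN_DRIVER"` can gate a provisioning script, where `MIN_DRIVER` is a placeholder for the version your CUDA release needs.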
# Enable persistence mode (reduces driver initialization latency)
sudo nvidia-smi -pm 1
# Set power limit (reduce heat/noise on inference servers)
sudo nvidia-smi -pl 350 # watts; check TDP for your GPU model
# Disable ECC on inference servers (frees ~6% VRAM on GPUs that reserve memory for ECC; removes memory error protection)
sudo nvidia-smi --ecc-config=0 # requires reboot
# Check interconnect topology before multi-GPU NVLink training (this only reports the topology; it does not enable P2P)
nvidia-smi topo -m
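The power limit passed to `-pl` must fall inside the range the board allows. This sketch clamps a requested wattage to that range before applying it. It assumes the `power.min_limit` and `power.max_limit` query fields return numeric values on your GPU (some boards report `[N/A]`), and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# clamp_power_limit REQUESTED MIN MAX: prints REQUESTED clamped to [MIN, MAX].
# Values may carry decimals (nvidia-smi prints e.g. 100.00); the fraction is dropped.
clamp_power_limit() {
  local want=${1%.*} min=${2%.*} max=${3%.*}
  if [ "$want" -lt "$min" ]; then
    echo "$min"
  elif [ "$want" -gt "$max" ]; then
    echo "$max"
  else
    echo "$want"
  fi
}

# safe_set_power_limit GPU_INDEX WATTS: queries the allowed range, then applies the clamped value.
safe_set_power_limit() {
  local range min max
  range=$(nvidia-smi -i "$1" --query-gpu=power.min_limit,power.max_limit \
            --format=csv,noheader,nounits) || return 1
  min=${range%%,*}   # text before the first comma
  max=${range##*, }  # text after the last ", "
  sudo nvidia-smi -i "$1" -pl "$(clamp_power_limit "$2" "$min" "$max")"
}
```

For example, `safe_set_power_limit 0 350` would apply 350 W to GPU 0, or the nearest allowed value if 350 W is out of range.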
# Real-time monitoring (like htop for GPUs)
watch -n 1 nvidia-smi
# Detailed stats
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,\
utilization.memory,memory.used,memory.free,power.draw,clocks.current.graphics \
--format=csv --loop=1
# DCGM — production monitoring daemon (for clusters)
sudo apt install -y datacenter-gpu-manager
sudo systemctl enable --now nvidia-dcgm # service is named nvidia-dcgm in current DCGM packages
dcgmi discovery -l # list GPUs
dcgmi diag -r 1 # quick health check
dcgmi diag -r 3 # full diagnostic (takes ~20 min)
# Check GPU errors (XID errors — important for stability)
sudo dmesg | grep -i "NVRM\|nvidia\|XID"
nvidia-smi --query-gpu=ecc.errors.corrected.volatile.total \
--format=csv,noheader
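Grepping the kernel log shows raw lines; for triage it helps to reduce them to device and XID code. This is a minimal sketch. It assumes the driver logs XID events in the form `NVRM: Xid (PCI:<address>): <code>, ...`, and `extract_xids` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# extract_xids: reads kernel log lines on stdin and prints
# "<PCI address> <XID code>" for every line that reports an XID event.
extract_xids() {
  sed -n -E 's/.*NVRM: Xid \((PCI:[^)]*)\): ([0-9]+),.*/\1 \2/p'
}
```

A possible use is `sudo dmesg | extract_xids | sort | uniq -c`, which counts events per GPU and code.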
# Deploy DCGM Exporter for Prometheus scraping (in production, pin a specific image tag from the NGC catalog instead of :latest)
docker run -d \
--name dcgm-exporter \
--gpus all \
--cap-add SYS_ADMIN \
-p 9400:9400 \
--restart unless-stopped \
nvcr.io/nvidia/k8s/dcgm-exporter:latest
# Key metrics exposed:
# DCGM_FI_DEV_GPU_UTIL - GPU utilization %
# DCGM_FI_DEV_MEM_COPY_UTIL - Memory bandwidth utilization
# DCGM_FI_DEV_FB_USED - Framebuffer memory used (MB)
# DCGM_FI_DEV_SM_CLOCK - SM clock speed (MHz)
# DCGM_FI_DEV_GPU_TEMP - Temperature (°C)
# DCGM_FI_DEV_POWER_USAGE - Power draw (W)
# DCGM_FI_DEV_XID_ERRORS - Last XID error code reported (0 = none)
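The metrics listed above are exposed in Prometheus text format, so they can also be checked without a full Prometheus stack. This sketch flags GPUs above a temperature threshold. It assumes each sample line carries a `gpu="<index>"` label and ends with the value, and `gpus_over_temp` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpus_over_temp LIMIT: reads Prometheus text on stdin and prints
# "<gpu index> <temperature>" for each GPU whose temperature exceeds LIMIT.
gpus_over_temp() {
  awk -v limit="$1" '
    /^DCGM_FI_DEV_GPU_TEMP[{]/ {
      temp = $NF                       # the sample value is the last field
      if (match($0, /gpu="[0-9]+"/)) {
        # skip the 5 characters of gpu=" and drop the closing quote
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
        if (temp + 0 > limit + 0) print gpu, temp
      }
    }'
}
```

With the exporter listening on port 9400 as configured above, `curl -s http://localhost:9400/metrics | gpus_over_temp 85` would list any GPU hotter than 85 °C.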
MIG (Multi-Instance GPU) allows slicing one GPU into isolated smaller GPUs.
# Enable MIG mode (requires reboot or restart of all processes)
sudo nvidia-smi -mig 1
sudo systemctl restart nvidia-persistenced
# List available MIG profiles (A100 80GB example)
nvidia-smi mig -lgip
# 1g.10gb — 1 slice, 10GB (max 7 instances)
# 2g.20gb — 2 slices, 20GB (max 3 instances)
# 3g.40gb — 3 slices, 40GB (max 2 instances)
# 7g.80gb — full GPU, 80GB (max 1 instance)
# Create MIG instances (e.g., 3× 2g.20gb + 1× 1g.10gb = 7 compute slices, multi-tenant)
sudo nvidia-smi mig -cgi 2g.20gb,2g.20gb,2g.20gb,1g.10gb -C
# List created instances
nvidia-smi mig -lgi
nvidia-smi mig -lcgi
# Use in Docker
docker run --gpus '"device=MIG-GPU-xxx/0/0"' ...
# Disable MIG
sudo nvidia-smi mig -i 0 -dci
sudo nvidia-smi mig -i 0 -dgi
sudo nvidia-smi -mig 0
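A MIG layout has to fit the GPU's compute slice budget (7 on the A100 80GB profiles listed above). This sketch sums the slices in a comma-separated profile list before it is passed to `-cgi`. It is only a necessary check, not a sufficient one: real placement rules also constrain memory slices and positions. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# mig_slices_total PROFILES: prints the number of compute slices requested by a
# comma-separated profile list such as "2g.20gb,1g.10gb".
mig_slices_total() {
  local total=0 profile
  local IFS=,
  for profile in $1; do
    # the digits before "g." are the compute slice count of the profile
    total=$(( total + ${profile%%g.*} ))
  done
  echo "$total"
}

# mig_layout_fits PROFILES [BUDGET]: succeeds when the layout needs at most
# BUDGET compute slices (default 7).
mig_layout_fits() {
  [ "$(mig_slices_total "$1")" -le "${2:-7}" ]
}
```

For instance, `mig_layout_fits "3g.40gb,3g.40gb" && echo ok` prints `ok`, since that layout needs 6 of the 7 slices.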
# Increase file descriptor limits
echo '* soft nofile 1048576' | sudo tee -a /etc/security/limits.conf
echo '* hard nofile 1048576' | sudo tee -a /etc/security/limits.conf
# Disable transparent huge pages (reduces latency jitter)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# Persist via rc.local or systemd unit:
cat <<'EOF' | sudo tee /etc/rc.local
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
nvidia-smi -pm 1
exit 0
EOF
sudo chmod +x /etc/rc.local
# Disable clock auto-boost for consistent performance (only supported on some older GPUs)
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi --auto-boost-permission=0
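After a reboot it is worth confirming that the transparent huge page setting actually persisted. The kernel marks the active mode with square brackets, for example `always madvise [never]`. This is a small sketch that extracts it; `thp_active_mode` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# thp_active_mode [FILE]: prints the bracketed (active) mode from a THP sysfs file.
# FILE defaults to the "enabled" knob; pass the "defrag" path to check that one.
thp_active_mode() {
  sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```

A startup check could then be written as `[ "$(thp_active_mode)" = never ] || echo "THP is still enabled" >&2`.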
# Check NVLink and PCIe topology
nvidia-smi topo -m
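# Output shows interconnect type:
# NV# = bonded set of # NVLinks (e.g. NV4 = 4 links; the number is a link count, not the NVLink generation)
# PHB = traverses a PCIe host bridge, typically the CPU (slower; avoid for tensor parallel training)
# PIX = traverses at most one PCIe bridge (fast)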
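# Bandwidth test between GPUs (recent CUDA toolkits no longer bundle the samples; build p2pBandwidthLatencyTest from NVIDIA's cuda-samples repository if this path is missing)
/usr/local/cuda/samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest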
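The topology matrix gets hard to read on 8-GPU hosts. This sketch lists GPU pairs that are not connected over NVLink. It assumes plain-text `nvidia-smi topo -m` output whose header row names the GPU columns in order (`GPU0`, `GPU1`, ...); newer driver versions may add terminal formatting codes that would need stripping first. `non_nvlink_pairs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# non_nvlink_pairs: reads `nvidia-smi topo -m` output on stdin and prints
# "<GPU a> <GPU b> <link type>" for each pair whose link is not NV#.
non_nvlink_pairs() {
  awk '
    !ngpu {
      # first non-empty line: count the GPU columns in the header
      for (i = 1; i <= NF; i++) { if ($i ~ /^GPU[0-9]+$/) ngpu++ }
      next
    }
    $1 ~ /^GPU[0-9]+$/ {
      row = substr($1, 4) + 0
      for (i = 2; i <= ngpu + 1; i++) {
        col = i - 2
        # report each pair once (row < col) and skip self and NVLink entries
        if (row < col && $i != "X" && $i !~ /^NV[0-9]+$/) {
          print $1, "GPU" col, $i
        }
      }
    }'
}
```

Running `nvidia-smi topo -m | non_nvlink_pairs` with no output would suggest every GPU pair has an NVLink path.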
| Issue | Cause | Fix |
|-------|-------|-----|
| nvidia-smi: command not found | Driver not installed | Follow driver installation steps above |
| Driver version mismatch | CUDA/driver incompatibility | Check compatibility matrix at developer.nvidia.com |
| GPU temperature >85°C | Poor airflow or fan failure | Check nvidia-smi -q -d TEMPERATURE; reseat cooler |
| XID 79 errors | GPU hardware error | Run dcgmi diag -r 3; may need GPU replacement |
| failed to open device in container | Container toolkit not configured | Run nvidia-ctk runtime configure --runtime=docker |
| Low PCIe bandwidth | Wrong slot or power limit | Check nvidia-smi -q \| grep PCIe; use x16 slot |
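Several rows in the table come down to a tool that is not installed or not on `PATH`. This sketch is a preflight check that reports missing commands before deeper troubleshooting; `missing_cmds` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# missing_cmds CMD...: prints each command name that cannot be found on PATH.
missing_cmds() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}
```

For the tools used in this guide, `missing_cmds nvidia-smi nvcc docker nvidia-ctk dcgmi` prints nothing when everything is installed.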
Enable persistence mode (nvidia-smi -pm 1); this reduces first-request latency.