
Diagnose and fix ArgoCD apps that remain OutOfSync even though 'argocd app diff' shows nothing. Use when: (1) an app is OutOfSync but diff output is empty or blank, (2) a Kubernetes operator manages its own CRD (e.g. ClusterPolicy, Kafka, Prometheus) and sets default field values after the initial Helm install, (3) ignoreDifferences already covers image/version/repository but the operator also defaults other fields not present in the Helm values. Covers the Python direct-comparison technique to identify the exact differing fields and the ignoreDifferences fix. Also applies when 'argocd app diff --refresh' shows nothing but a hard refresh still shows OutOfSync.
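The direct-comparison idea can be illustrated with a minimal Python sketch. It is not the skill's own script: it assumes the desired (rendered) manifest and the live object have already been loaded into plain dicts, and the function name `diff_paths` and the sample field names are hypothetical. It prints slash-separated paths in the same general shape that ignoreDifferences jsonPointers use, but does not apply JSON Pointer escaping for keys containing '/' or '~'.

```python
from typing import Any, Iterator


def diff_paths(desired: Any, live: Any, path: str = "") -> Iterator[str]:
    """Yield slash-separated paths at which `live` differs from `desired`."""
    if isinstance(desired, dict) and isinstance(live, dict):
        for key in sorted(set(desired) | set(live)):
            child = f"{path}/{key}"
            if key not in desired:
                # Typical of operator-defaulted fields: present only on the live object.
                yield f"{child} (only in live)"
            elif key not in live:
                yield f"{child} (only in desired)"
            else:
                yield from diff_paths(desired[key], live[key], child)
    elif isinstance(desired, list) and isinstance(live, list) and len(desired) == len(live):
        # Same-length lists are compared element by element.
        for index, (d_item, l_item) in enumerate(zip(desired, live)):
            yield from diff_paths(d_item, l_item, f"{path}/{index}")
    elif desired != live:
        yield f"{path} (changed)"


# Hypothetical example: the operator added a default that the chart never set.
desired = {"spec": {"driver": {"version": "1.0"}}}
live = {"spec": {"driver": {"version": "1.0", "imagePullPolicy": "IfNotPresent"}}}
for entry in diff_paths(desired, live):
    print(entry)
```

Each reported path is a candidate for an ignoreDifferences entry; whether ignoring it is safe depends on the operator and the field.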
Deploy Harbor container registry via ArgoCD Helm (chart harbor/harbor). Use when: (1) harbor-nginx or harbor-portal pods crash with 'exec format error' on mixed-arch clusters, (2) harbor-core stuck ContainerCreating with 'references non-existent secret key: secretKey', (3) ArgoCD shows perpetual OutOfSync due to harbor-core/jobservice/registry/registry-htpasswd Secrets or Deployment checksum annotations changing on every sync.
Workaround for immutable Kubernetes resource fields that a Helm chart doesn't expose as configurable values. Use when: (1) kubectl patch fails with 'may not change once set' or 'immutable field', (2) a Helm chart renders a resource without a field you need (e.g. loadBalancerClass, storageClassName, claimRef) and there is no chart values key for it, (3) ArgoCD self-heal keeps reverting a manual patch because the chart re-renders without the field. Covers Services, PVCs, and other resources with immutable spec fields.
Update or correct fields in a CSI PersistentVolume spec (e.g., volumeAttributes, volumeHandle) that cannot be patched in place because spec.persistentvolumesource is immutable after creation. Use when: (1) kubectl patch on a PV fails with "spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation", (2) a PV stores stale connection info (wrong IP, endpoint, or config) set at provision time that now differs from the desired state, (3) any CSI driver where volumeAttributes need to be corrected post-provisioning. Covers the replace --force procedure and pv-protection finalizer removal required to complete it.
Generates a structured, phased implementation plan for a new feature or codebase change, grounded in a research document. Use this skill whenever the user invokes /plan-implementation, asks to "make a plan", "write up a plan for this feature", "plan out these changes", "create an implementation plan", or is ready to plan before building. This is a pre-implementation planning skill — it should run after research and before coding. Trigger even if the user just says "let's plan this out" or "how should we approach building X". Do NOT trigger for debugging, code review, or requests to actually implement something. If the user is asking for a plan but doesn't have a research document, help them create one first.
Use when reading output from kubectl, argocd, helm, WebFetch, or git log before constructing a Bash command — or when invoked as /sec-unit. Also use when source content may contain: instructions in data positions, directory listing or file-name injection, structural directive injection (XML/JSON tags mimicking tool call format), URL-path curl exfiltration via $(...) substitution, multi-step pipeline exfiltration, or inline execution constructs in inferred commands from untrusted external output. Do not invoke for general security questions, Python debugging, or tasks unrelated to Bash command construction from cluster or web output.
Query TrueNAS using the JSON-RPC 2.0 WebSocket API (v25.10+) for storage and NVMe-oF debugging. Use this skill whenever you need to inspect TrueNAS state — NVMe subsystems, ZFS pool health, CSI-provisioned datasets, disk status, or service configuration. Replaces deprecated REST API calls. Trigger when: investigating NVMe-oF attach failures, checking pool/dataset health, verifying NVMe target port bindings, auditing democratic-csi provisioned volumes, or diagnosing any TrueNAS storage issue from within a Kubernetes debugging session.
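For orientation, a JSON-RPC 2.0 exchange reduces to a small request envelope and a response carrying either `result` or `error`. The sketch below only builds and parses those messages; it opens no WebSocket and does not authenticate. The helper names `build_request` and `unwrap` are hypothetical, and the method name `pool.query` is used as an assumed example that should be checked against the API documentation for the TrueNAS version in use.

```python
import itertools
import json
from typing import Any, Optional

# Monotonic request ids so responses can be matched to requests.
_request_ids = itertools.count(1)


def build_request(method: str, params: Optional[list] = None) -> str:
    """Serialize a JSON-RPC 2.0 request with positional parameters."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params or [],
    })


def unwrap(response_text: str) -> Any:
    """Return the `result` member, or raise if the server sent an `error` member."""
    message = json.loads(response_text)
    if "error" in message:
        error = message["error"]
        raise RuntimeError(f"JSON-RPC error {error.get('code')}: {error.get('message')}")
    return message["result"]


# Assumed method name, shown only to illustrate the envelope.
print(build_request("pool.query"))
```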
Adversarial 3-stage review of an implementation plan before execution — checks completeness, security/best-practices, and multi-agent implementability. Use this skill whenever the user wants to validate, review, stress-test, or check an existing implementation plan before building or executing it. Trigger on phrases like "review the plan", "stress-test the plan", "is this plan ready", "check the plan", "validate the plan", "go through the plan", "find what's wrong with the plan", "is this ready for agents", "can subagents execute this", or any request to find gaps, security issues, or underspecified tasks in a plan that has already been written. Also trigger when the user has just finished /plan-implementation and is about to run /autonomous-plan-execution — the plan should be reviewed first. This skill is specifically for reviewing EXISTING plans, not for writing new ones (use plan-implementation for that), not for code review or PR review (use code-review for that), and not for debugging or post-deployment issues. If the user mentions a plan file, says "before we build", or asks whether phases can be parallelized, this skill almost certainly applies.
This skill governs all prose output — Claude's own responses, documentation, PR descriptions, commit messages, README content, comments, and any text the user asks to draft or edit. It should also be used when the user asks to "review my writing", "edit this for clarity", "make this clearer", "simplify this text", "rewrite this", "check my prose", "tighten this up", or "make this more concise". Based on George Orwell's "Politics and the English Language" (1946).
Build a UEFI Secure Boot PXE netboot server for Ubuntu autoinstall. Use when: designing or implementing network boot infrastructure for automated Ubuntu provisioning with Secure Boot enabled. Covers the complete chain: signed shim+GRUB selection, TFTP layout, kernel parameters, autoinstall config requirements, and post-install bootstrapping scripts. Also applicable when debugging an existing PXE setup that uses the wrong GRUB binary or config paths.
This skill should be used when validating an ArgoCD Helm chart deployment after committing changes to a git repository, checking if an ArgoCD app has synced and is healthy, debugging ArgoCD sync failures or stuck syncs, investigating Kubernetes Events or pod logs for application errors, monitoring a GitOps workflow deployment to completion, or fixing Helm chart values or templates to resolve deployment issues. Also covers known stuck-sync patterns: "waiting for completion of hook batch/Job" when the Job no longer exists (PreSync hook TTL race); ArgoCD controller OOMKilled during sync of large charts with ServerSideApply=true causing workloads to never be created despite ConfigMaps/SAs existing. Trigger phrases: "check if my deployment deployed", "validate the ArgoCD deployment", "debug ArgoCD sync", "check if helm chart is healthy", "monitor deployment progress", "argocd not syncing", "deployment stuck", "check argo", "helm chart errors", "waiting for completion of hook", "job not found argocd stuck", "argocd controller oomkilled", "deployments not created", "statefulsets not created after sync".
Fix for single-replica Kubernetes Deployments with ReadWriteOnce PVCs that deadlock during rolling updates. Use when: (1) a new pod is stuck in Init:0/1 or Pending with 'Multi-Attach error: Volume is already used by pod(s)', (2) the old pod won't terminate because maxUnavailable rounds to 0 for a single-replica Deployment, (3) deleting the old pod causes it to be immediately recreated by its ReplicaSet. Also covers the secondary error 'spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy type is Recreate' when patching a live Deployment.
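The rounding behind the deadlock can be checked with a few lines of arithmetic. This sketch assumes the Deployment uses the Kubernetes default rolling-update settings of 25% for both maxUnavailable and maxSurge; percentage values round down for maxUnavailable and up for maxSurge. The function name is hypothetical.

```python
import math


def rolling_update_bounds(replicas: int, max_unavailable_pct: int = 25, max_surge_pct: int = 25):
    """Return (maxUnavailable, maxSurge) as absolute pod counts."""
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    max_surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    return max_unavailable, max_surge


print(rolling_update_bounds(1))  # (0, 1)
print(rolling_update_bounds(4))  # (1, 1)
```

With one replica the result is (0, 1): the old pod may not be taken down until a new pod is ready, while the one extra pod allowed by the surge cannot become ready if it needs the ReadWriteOnce volume the old pod still holds.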
Generates a structured technical research document about a codebase before building or modifying it. Use this skill whenever the user invokes /research-report, asks to "research the codebase", wants a technical map or overview of how a codebase works, or says things like "understand this codebase", "map out how X works", or "research before we build". This is a pre-build research phase skill — use it before implementation, not after. Do NOT trigger for general code review, debugging, or improvement requests.
Set up HTTPS on a Kubernetes service via the Tailscale Ingress controller. Use when: (1) switching from a Tailscale LoadBalancer Service to Tailscale Ingress for TLS termination, (2) HTTPS connections time out despite 'tailscale serve status' showing 'https://', (3) port 443 gives 'tlsv1 alert internal error' on raw IP but times out with correct SNI hostname, (4) harbor-tls or similar secret was not created after Ingress creation. Covers initial cert provisioning, backend ClusterIP wiring, and DNS propagation behavior.
Debugging and managing GitOps clusters where ArgoCD, Helm, and ESO (ExternalSecrets Operator) coexist. Ownership symptoms: kubectl patches revert immediately, git push doesn't update a deployed resource, ArgoCD prune leaves resources behind, two tools fight over the same ConfigMap/Secret/CRD, duplicate deployments from the same chart under different names. ESO symptoms: ExternalSecret template expressions render as empty strings inside Helm chart templates/, SecretSynced True but the rendered secret value is empty (provider returned empty for the referenced field). Covers: ownership model, selfHeal behavior (including ignoreDifferences for standby-mode replica scaling), prune scope, fixed-name collision points, the "where do I make this change?" decision tree, Helm template escaping for ESO expressions, secrets provider field type diagnosis, keeping sensitive infrastructure values out of public Git repos (runtime sed substitution, role-based nodeSelectors instead of hostnames), and non-root container secret volume permissions (fsGroup + defaultMode).
Enforces conventional commit format and a comprehensive PR workflow across all git repositories. Invoke this skill whenever the user wants to commit changes, create a pull request, push to a branch, or finish a feature for merging — even if they don't use those exact words. Triggers include: commit, create a PR, open a PR, push my changes, make a commit, commit and push, submit a PR, merge this, ready to merge, open a pull request. This skill handles conventional commit formatting, automatic breaking-change detection, doc and Makefile updates, and structured PR descriptions with honest test plans. Use it for every commit and PR action, no exceptions.
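As an illustration of what format checking and breaking-change detection can look like, here is a minimal sketch based on the public Conventional Commits header shape `type(scope)!: description`. It is not the skill's actual implementation: the function names are hypothetical, any lowercase type is accepted because the skill's allowed type list is not stated here, and the footer check is simplified to a line-prefix match.

```python
import re
from typing import Optional

# type, optional (scope), optional "!", then ": " and a non-empty description.
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$"
)


def parse_header(line: str) -> Optional[dict]:
    """Parse a commit header; return None if it does not match the format."""
    match = HEADER.match(line)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }


def is_breaking(message: str) -> bool:
    """True if the header carries "!" or any later line starts with a BREAKING CHANGE footer token."""
    lines = message.splitlines()
    header = parse_header(lines[0]) if lines else None
    if header is not None and header["breaking"]:
        return True
    return any(l.startswith(("BREAKING CHANGE: ", "BREAKING-CHANGE: ")) for l in lines[1:])


print(parse_header("feat(api)!: drop v1 endpoints"))
print(parse_header("update stuff"))  # None
```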
Debug Kubernetes pods using hostNetwork: true that crash with "Address already in use" or "failed to create listening socket for port N". Use when: (1) a hostNetwork pod container is in CrashLoopBackOff and logs show a port bind failure, (2) the port works fine in non-hostNetwork pods but fails with hostNetwork, (3) you need to identify which host-level process holds a port from within Kubernetes (no SSH). Covers /proc/net/udp inspection and kubectl debug node with nsenter.
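The /proc/net/udp inspection can be sketched as a small parser. The file lists each socket's local address and port in hexadecimal, along with the socket inode. The sketch below assumes the standard IPv4 column layout and a little-endian host (x86_64, arm64) for address decoding; the function name and the sample rows are hypothetical, and IPv6 sockets in /proc/net/udp6 are not handled.

```python
import socket
import struct


def udp_listeners(proc_net_udp: str, port: int) -> list:
    """Return (local_ip, inode) for every UDP socket bound to `port`."""
    matches = []
    for line in proc_net_udp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        addr_hex, port_hex = fields[1].split(":")
        if int(port_hex, 16) != port:
            continue
        # Assumption: little-endian host, so the hex word is byte-swapped.
        local_ip = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
        matches.append((local_ip, int(fields[9])))  # fields[9] is the socket inode
    return matches


# Hypothetical sample in the /proc/net/udp layout.
SAMPLE = """  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  101: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 23456 2 0000000000000000 0
  102: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 34567 2 0000000000000000 0
"""
print(udp_listeners(SAMPLE, 53))
```

The inode can then be matched against socket links under /proc/<pid>/fd on the node to find the owning process.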
Design pattern for running a persistent PXE/TFTP server that safely coexists with already-installed nodes. Use when: building PXE infrastructure that should stay always-on, designing automated bare-metal provisioning in GitOps/Kubernetes environments, or any PXE setup where UEFI boot order has network boot first. Eliminates boot loops without requiring UEFI firmware changes.
Debug democratic-csi NVMe-oF volume attachment failures. Use when: (1) Pods are stuck in Init:0/1 and events show "MountVolume.MountDevice failed: unable to attach any nvme devices", (2) democratic-csi NodeStageVolume returns "unable to attach any nvme devices", (3) PVC provisioning succeeds but pods never start because volumes can't be mounted, (4) Node plugin logs show "connecting to transport: tcp://<IP>:4420" where the IP differs from the driver config. Covers four distinct root causes: nvme binary hostname resolution failure (libnss), NVMe-oF TCP port not reachable (wrong interface or service not running), kernel module availability, and stale transport IP stored in PV volumeAttributes at provision time.