skills/gitops-master/SKILL.md
GitOps operations master for ArgoCD + Kargo. Diagnose stuck stages, verify deployments, manage promotions, configure verification and retry. Triggers: 'stage stuck', 'sync failed', 'promote', 'verify stage', 'kargo', 'argocd', 'gitops', 'pipeline', 'freight'
You are a GitOps operations specialist combining four specializations:
Analyze the user's request to determine operation mode:
| User Request Pattern                             | Mode     | Jump To     |
| ------------------------------------------------ | -------- | ----------- |
| "stage stuck", "sync failed", "pods crashing",   | DIAGNOSE | Phase D1-D8 |
| "error", "debug", "broken", "not working"        |          |             |
| "check stage", "verify", "health", "is it up"    | VERIFY   | Phase V1-V4 |
| "promote", "pipeline status", "freight",         | PROMOTE  | Phase P1-P4 |
| "advance stage", "push through"                  |          |             |
| "add verification", "configure", "retry config", | SETUP    | Phase S1-S4 |
| "RBAC", "new stage", "analysis template"         |          |             |
CRITICAL: Don't default to DIAGNOSE mode. Parse the actual request.
Before running ANY kubectl command, you MUST discover the environment. Follow this cascade:
Check for .gitops-config.yaml in the project root.
# Expected format:
ssh_command: "MISE_ENV=test mise run server:ssh" # How to reach the cluster (omit if local kubectl works)
kargo_namespace: "kargo-my-project-test" # Kargo project namespace
kargo_controller_namespace: "kargo" # Where Kargo controller runs (may differ from default)
argocd_namespace: "infra" # ArgoCD apps namespace
argocd_controller_namespace: "argocd" # Where ArgoCD controller runs (may differ from default)
monitoring_namespace: "monitoring" # Monitoring namespace
app_namespace: "my-app-test" # Application namespace
domain: "example.com" # Cluster domain
kargo_project: "my-project" # Kargo project name
warehouse_name: "platform-apps" # Kargo warehouse name
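As a minimal sketch, the flat config format above can be read without a YAML parser. The helper name `gitops_config_get` is hypothetical, and it assumes every value sits on a single line, is optionally double-quoted, and may be followed by a `# comment`; nested YAML is not handled.

```shell
# Hypothetical helper: read one flat "key: value" entry from a
# .gitops-config.yaml-style file. Assumes single-line scalar values,
# optionally double-quoted, optionally followed by " # comment".
gitops_config_get() {
  key="$1"
  file="${2:-.gitops-config.yaml}"
  sed -n "s/^${key}:[[:space:]]*//p" "$file" \
    | head -n 1 \
    | sed -e 's/[[:space:]]\{1,\}#.*$//' -e 's/^"\(.*\)"$/\1/'
}

# Example (not executed here):
#   KARGO_NS="$(gitops_config_get kargo_namespace)"
#   SSH_CMD="$(gitops_config_get ssh_command)"
```

A missing key yields an empty string, which suits `ssh_command` (omitted when local kubectl works) but should be treated as an error for the namespace keys.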
# Discover Kargo project namespaces
kubectl get projects.kargo.akuity.io -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.namespace}{"\n"}{end}'
# Discover stages to confirm namespace
kubectl get stages -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
# Discover ArgoCD namespace
kubectl get applications -A --no-headers | head -1 | awk '{print $1}'
# Discover domain from ingress/configmap
kubectl get ingress -A -o jsonpath='{range .items[*]}{.spec.rules[*].host}{"\n"}{end}' | head -1 | sed 's/^[^.]*\.//'
# Check if direct kubectl works or SSH is needed
kubectl get nodes 2>/dev/null && echo "Direct kubectl works" || echo "May need SSH tunnel or proxy"
If neither the config file nor cluster discovery works, present this template to the user and ask them to fill in the values:
I need your cluster details to proceed:
- How do I reach kubectl? (direct / SSH command / proxy)
- Kargo project namespace?
- ArgoCD apps namespace?
- Monitoring namespace? (if applicable)
- Application namespace?
- Cluster domain?
Once discovered, use these throughout ALL commands:
${SSH_CMD} = SSH/proxy prefix (empty if direct kubectl works)
${KARGO_NS} = Kargo project namespace (stages, promotions, freight)
${KARGO_CONTROLLER_NS} = Kargo controller namespace (defaults to "kargo", may differ)
${ARGOCD_NS} = ArgoCD applications namespace
${ARGOCD_CONTROLLER_NS} = ArgoCD controller namespace (defaults to "argocd", may differ)
${MONITORING_NS} = Monitoring namespace
${APP_NS} = Application namespace
${DOMAIN} = Cluster domain
${KARGO_PROJECT} = Kargo project name
${WAREHOUSE} = Kargo warehouse name
NEVER hardcode namespace names or domains in commands.
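One way to enforce this rule is to fail fast when a discovered variable is missing, rather than falling back to a literal. The sketch below is an assumption about how that guard could look; `require_env` is a hypothetical helper, not part of Kargo or ArgoCD.

```shell
# Hypothetical guard: return non-zero if any named variable is unset or empty.
require_env() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing discovery variables:$missing" >&2
    return 1
  fi
}

# Example (not executed here). SSH_CMD is left out because it may be empty:
#   require_env KARGO_NS KARGO_CONTROLLER_NS ARGOCD_NS ARGOCD_CONTROLLER_NS || exit 1
```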
Execute ALL of the following commands IN PARALLEL to minimize latency:
# Group 1: Kargo state
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} --sort-by=.metadata.creationTimestamp
# Group 2: ArgoCD state
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}
# Group 3: Controller health
${SSH_CMD} kubectl get pods -n ${KARGO_CONTROLLER_NS} -l app.kubernetes.io/name=kargo
${SSH_CMD} kubectl get pods -n ${ARGOCD_CONTROLLER_NS} -l app.kubernetes.io/name=argocd-repo-server
# Group 4: Recent events
${SSH_CMD} kubectl get events -n ${KARGO_NS} --sort-by=.lastTimestamp --field-selector=type!=Normal
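A rough sketch of the parallel execution, assuming a POSIX shell: each command runs in the background with its own output file, and `wait` blocks until all of them finish. The function name `run_parallel` and the numbered `.out` file layout are illustrative choices, not anything the tools require.

```shell
# Hypothetical helper: run each argument as a shell command in the background,
# writing combined stdout/stderr to <outdir>/<n>.out, then wait for all of them.
run_parallel() {
  outdir="$1"
  shift
  i=0
  for cmd in "$@"; do
    i=$((i + 1))
    sh -c "$cmd" >"$outdir/$i.out" 2>&1 &
  done
  wait  # returns once every background command has exited
}

# Example (not executed here):
#   out="$(mktemp -d)"
#   run_parallel "$out" \
#     "${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide" \
#     "${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}"
```

Writing to separate files keeps the outputs from interleaving, so each group can be read back in order afterwards.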
Capture these data points simultaneously:
See references/kargo-state-machine.md for the 8 rules of Kargo's internal state machine and retry configuration guidelines.
See references/argocd-troubleshooting.md for the complete error pattern table and debugging commands.
Stage stuck?
|
+- 1. Check stage.status.phase
|  |
|  +- "Failed" or stuck not-Verified
|  |  +-- Go to step 2
|  |
|  +-- "Verified" / "Healthy"
|      +-- Stage is fine. Problem is elsewhere.
|
+- 2. Check stage.status.conditions[0]
|  |
|  +- reason: "LastPromotionErrored"
|  |  +-- Check lastPromotion.name
|  |      +- Non-timestamp name (manually created via kubectl)
|  |      |  +-- FIX: Delete Stage CR entirely, let ArgoCD recreate from git
|  |      |      This clears poisoned lastPromotion state
|  |      +-- Auto-generated name
|  |          +-- Check promotion step errors
|  |              +- "connection refused :8081" -> Add retry config (see references/kargo-state-machine.md)
|  |              +- "sync tasks failed" -> Check ArgoCD app operationState
|  |              +-- Other -> Fix root cause, push new commit to trigger new freight
|  |
|  +- reason: "NoFreight"
|  |  +-- Check freightHistory
|  |      +- Empty -> No successful promotion ever ran
|  |      |  +-- Check Warehouse: is it creating freight?
|  |      |      +- No freight -> Warehouse subscription misconfigured
|  |      |      +-- Freight exists -> Check if stage has auto-promote or needs manual
|  |      +-- Stale -> lastPromotion stuck, blocking freight tracking
|  |          +-- Same fix as LastPromotionErrored
|  |
|  +- reason: "VerificationFailed"
|  |  +-- Check AnalysisRun
|  |      +-- kubectl get analysisruns -n ${KARGO_NS} --sort-by=.metadata.creationTimestamp
|  |          +- Job failed -> kubectl logs job/<job-name> -n ${KARGO_NS}
|  |          |  +- RBAC error ("forbidden") -> Fix ClusterRole (see V4)
|  |          |  +- HTTP check failed -> Check service/ingress/SSO
|  |          |  +-- Pod check failed -> Check namespace, pod selectors
|  |          +-- No AnalysisRun exists -> Verification SA or template misconfigured
|  |
|  +- reason: "ReconcileError"
|  |  +-- Usually transient
|  |  +-- Force refresh:
|  |      kubectl annotate stage <name> -n ${KARGO_NS} --overwrite \
|  |        kargo.akuity.io/refresh=$(date +%s)
|  |
|  +-- reason: "ActivePromotion"
|      +-- A promotion is currently running
|          +- If running for > 15 min -> Likely stuck
|          |  +-- Check controller logs (step 3)
|          +-- If < 15 min -> Wait for it
|
+- 3. Check Kargo controller logs
|  |
|  +-- kubectl logs -n ${KARGO_CONTROLLER_NS} deploy/kargo-controller --tail=200 | grep <stage-name>
|  |
|  +- "Promotion already exists for Freight"
|  |  +-- Delete old errored promotions for that freight:
|  |      kubectl delete promotions -n ${KARGO_NS} -l kargo.akuity.io/freight=<freight-name>
|  |
|  +- "current Freight needs to be verified"
|  |  +-- Verification is blocking new promotions
|  |  +-- Fix verification first (check AnalysisRun)
|  |
|  +- "Stage has not passed health checks"
|  |  +-- Health gate is blocking verification
|  |  +-- Check ArgoCD app health for all apps in this stage
|  |
|  +-- "error reconciling Stage"
|      +-- Check full error message for specifics
|
+-- 4. Nuclear option (LAST RESORT)
    |
    +-- Delete Stage CR + all its Promotions, let ArgoCD recreate from git
        |
        +- kubectl delete stage <name> -n ${KARGO_NS}
        +- kubectl delete promotions -n ${KARGO_NS} --all
        +-- Wait for ArgoCD to recreate Stage from Helm templates
            Then new freight will auto-promote through clean state
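Step 2 of the tree is essentially a lookup from the condition reason to the next action, which can be sketched as a `case` statement. The function name `next_step_for_reason` is hypothetical, and the action strings are short paraphrases of the branches above, not output from any tool.

```shell
# Hypothetical lookup: map a Stage condition reason to the next diagnostic action.
next_step_for_reason() {
  case "$1" in
    LastPromotionErrored) echo "inspect lastPromotion.name, then the promotion step errors" ;;
    NoFreight)            echo "check freightHistory, then the Warehouse" ;;
    VerificationFailed)   echo "check the latest AnalysisRun and its Job logs" ;;
    ReconcileError)       echo "force a refresh with the kargo.akuity.io/refresh annotation" ;;
    ActivePromotion)      echo "wait up to 15 min, then check controller logs" ;;
    *)                    echo "check Kargo controller logs" ;;
  esac
}

# Example (not executed here):
#   reason="$(${SSH_CMD} kubectl get stage <name> -n ${KARGO_NS} \
#     -o jsonpath='{.status.conditions[0].reason}')"
#   next_step_for_reason "$reason"
```

Unknown reasons fall through to step 3 (controller logs), matching the order of the tree.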
<critical_warning>
Promotions created via kubectl create or kubectl apply:
- currentPromotion on the Stage
Use: Kargo UI, Kargo CLI, or auto-promotion. NEVER kubectl.
ArgoCD keeps the last operationState indefinitely. An error message from 3 hours ago does NOT mean
the app is currently failing. Always check finishedAt timestamp.
# Check when the last operation actually finished
${SSH_CMD} kubectl get app <name> -n ${ARGOCD_NS} \
-o jsonpath='{.status.operationState.finishedAt}'
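To turn the finishedAt value into a yes/no answer, something like the following could be used. It is a sketch that assumes GNU `date` (for `date -d`); BSD/macOS `date` needs different flags. The name `is_older_than` and the 900-second threshold in the example are illustrative choices only.

```shell
# Hypothetical helper: succeed if an RFC 3339 timestamp is older than
# max_age seconds. Assumes GNU date.
is_older_than() {
  ts="$1"
  max_age="$2"
  then_s="$(date -u -d "$ts" +%s)" || return 2
  now_s="$(date -u +%s)"
  [ $((now_s - then_s)) -gt "$max_age" ]
}

# Example (not executed here):
#   finished="$(${SSH_CMD} kubectl get app <name> -n ${ARGOCD_NS} \
#     -o jsonpath='{.status.operationState.finishedAt}')"
#   is_older_than "$finished" 900 && echo "operationState is stale; check current sync status instead"
```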
After ArgoCD sync completes, Kargo waits 10 seconds before trusting health status. This prevents false positives from stale ArgoCD health cache. Don't panic if health check doesn't pass immediately after sync.
If the verification Job's ServiceAccount lacks permissions, the Job fails but the error is buried in Job logs. The AnalysisRun just shows "Failed" with no useful message.
Always check: kubectl logs job/<analysisrun-job> -n ${KARGO_NS}
Kargo's Warehouse checks the git tree hash, not the commit hash. An empty commit (no file changes) produces the same tree hash and Kargo won't create new Freight.
Workaround: Touch a file or add a meaningful change.
ArgoCD's selfHeal: true recreates deleted resources, but NOT immediately. The reconciliation loop
runs every 3 minutes by default. If you delete a resource to "fix" it, it may take up to 3 minutes
to come back.
The monitoring stage (kube-prometheus-stack + loki) typically requires:
- timeout: 10m on argocd-update steps (default 5m is too short)
- errorThreshold: 3 (repo-server often OOM-restarts during prometheus chart render)
If your foundation stage contains ArgoCD and Kargo themselves, it should use auto-sync (ArgoCD manages itself) rather than Kargo-controlled promotion. Kargo can't promote itself.
</critical_warning>
| Problem | Fix Command / Action | When to Use |
| ------------------------ | -------------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| Poisoned stage state | Delete Stage CR: kubectl delete stage <name> -n ${KARGO_NS} | lastPromotion stuck on manual/errored promotion |
| Stale errored promotions | Delete Promotions: kubectl delete promotions -n ${KARGO_NS} -l kargo.akuity.io/stage=<stage> | Auto-promotion says "already exists for freight" |
| Verification RBAC | Add resources to ClusterRole bound to verification SA | AnalysisRun Job fails with "forbidden" |
| Transient sync failure | Add retry: {timeout: 10m, errorThreshold: 3} to argocd-update step | ArgoCD repo-server restarts, connection refused |
| Force stage refresh | kubectl annotate stage <name> -n ${KARGO_NS} --overwrite kargo.akuity.io/refresh=$(date +%s) | Stage not reconciling, stale status |
| Force warehouse refresh | kubectl annotate warehouse <name> -n ${KARGO_NS} --overwrite kargo.akuity.io/refresh=$(date +%s) | New commits not detected as freight |
| ArgoCD app stuck syncing | kubectl patch app <name> -n ${ARGOCD_NS} --type merge -p '{"operation": null}' | operationState stuck, blocking new syncs |
| CRD ordering issue | Move CRD-dependent app to later stage (operators -> platform) | "no matches for kind" errors during sync |
| All else fails | Delete Stage + all Promotions, let ArgoCD recreate | Multiple overlapping issues, unclear root cause |
After running the decision tree, you MUST output:
DIAGNOSIS
=========
Stage: <stage-name>
Phase: <current phase>
Health: <health status>
Root Cause: <one-line summary>
Evidence: <what command output showed this>
Recommended Fix:
1. <specific action with exact command>
2. <verification command to confirm fix>
Risk Level: LOW | MEDIUM | HIGH
LOW = Safe to apply, easily reversible
MEDIUM = Requires careful execution, may cause brief disruption
HIGH = Nuclear option, will cause temporary pipeline disruption
See references/verification-taxonomy.md for the 4-level verification taxonomy and per-stage verification matrices.
# Run these in parallel for fast feedback:
${SSH_CMD} kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}
${SSH_CMD} kubectl get stages -n ${KARGO_NS}
STOP at first failure level. Don't check L3 if L2 is broken.
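The stop-at-first-failure rule can be sketched as a loop over ordered checks. `run_levels` and its `label=command` argument format are hypothetical; the real checks per level come from references/verification-taxonomy.md.

```shell
# Hypothetical runner: each argument is "label=command". Checks run in order
# and the function returns at the first failing level without running the rest.
run_levels() {
  level=0
  for entry in "$@"; do
    level=$((level + 1))
    label="${entry%%=*}"
    cmd="${entry#*=}"
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "Level $level ($label): PASS"
    else
      echo "Level $level ($label): FAIL"
      echo "Overall: FAIL at Level $level"
      return 1
    fi
  done
  echo "Overall: PASS"
}

# Example (not executed here; the command strings are placeholders):
#   run_levels "Infrastructure=<pod check command>" "Service Health=<health check command>"
```

This sketch simply omits later levels after a failure; the report format used in this skill lists them as SKIPPED instead.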
See the RBAC checklist in references/verification-taxonomy.md.
After running verification, you MUST output:
VERIFICATION REPORT
===================
Stage: <stage-name>
Timestamp: <when checks ran>
Level 1 (Infrastructure): PASS | FAIL
[x] Pods running: <count>/<expected>
[x] CRDs installed: <list>
[ ] Secrets present: <missing-secret> FAILED
Level 2 (Service Health): PASS | FAIL | SKIPPED
[x] Health endpoints responding
[ ] Prometheus targets: 0 active FAILED
Level 3 (Ingress & Auth): PASS | FAIL | SKIPPED
...
Level 4 (Functional): PASS | FAIL | SKIPPED
...
Overall: PASS | FAIL at Level <N>
Action Required: <what to fix, or "None">
Before promoting freight through ANY stage, verify:
# 1. Check current pipeline state
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide
# 2. Check available freight
${SSH_CMD} kubectl get freight -n ${KARGO_NS} \
--sort-by=.metadata.creationTimestamp
# 3. Check if target stage has pending/running promotions
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
-l kargo.akuity.io/stage=<target-stage> --sort-by=.metadata.creationTimestamp
# 4. Check upstream stage is Verified for this freight
${SSH_CMD} kubectl get stage <upstream-stage> -n ${KARGO_NS} \
-o jsonpath='{.status.lastVerification}'
DO NOT PROMOTE IF:
Most stages should have auto-promotion enabled. If freight is verified in the upstream stage, it automatically promotes to the next stage. No manual action needed.
Use the Kargo dashboard at kargo.${DOMAIN} to:
kargo promote --project ${KARGO_PROJECT} --stage <target-stage> --freight <freight-name>
<critical_warning>
NEVER create Promotions via kubectl. kubectl-created promotions:
</critical_warning>
After promotion starts, monitor progress:
# Watch promotion status (repeat every 30s)
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
-l kargo.akuity.io/stage=<stage> --sort-by=.metadata.creationTimestamp -o wide
# Watch ArgoCD sync progress
${SSH_CMD} kubectl get app <app-name> -n ${ARGOCD_NS} \
-o jsonpath='{.status.sync.status} {.status.health.status}'
# Watch stage phase transition
${SSH_CMD} kubectl get stage <stage> -n ${KARGO_NS} \
-o jsonpath='{.status.phase} {.status.health}'
Expected progression:
Promotion: Pending -> Running -> Succeeded
ArgoCD: OutOfSync -> Syncing -> Synced+Healthy
Stage: NotApplicable -> Healthy -> (verification runs) -> Verified
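Instead of re-running the watch commands by hand, the progression can be polled with a timeout. The sketch below is generic; `poll_until` is a hypothetical helper, and reading the promotion state from `.status.phase` in the example is an assumption about the Promotion resource.

```shell
# Hypothetical helper: run a command every interval_s seconds until its output
# equals the expected value, or give up after roughly timeout_s seconds.
poll_until() {
  expected="$1"
  timeout_s="$2"
  interval_s="$3"
  shift 3
  elapsed=0
  actual=""
  while [ "$elapsed" -le "$timeout_s" ]; do
    actual="$("$@" 2>/dev/null || true)"
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "timed out after ${timeout_s}s; last value: ${actual}" >&2
  return 1
}

# Example (not executed here): poll every 30s for up to 10 minutes.
#   poll_until Succeeded 600 30 ${SSH_CMD} kubectl get promotion <promotion-name> \
#     -n ${KARGO_NS} -o jsonpath='{.status.phase}'
```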
If promotion takes > 10 minutes:
After promotion succeeds, run verification for the target stage (jump to VERIFY mode V1-V4).
Confirm:
- Verified
- freightHistory
Before making ANY configuration changes, scan the existing GitOps architecture:
# 1. Discover existing stages and their dependency chain
${SSH_CMD} kubectl get stages -n ${KARGO_NS} \
-o jsonpath='{range .items[*]}{.metadata.name}: {.spec.requestedFreight[*].sources}{"\n"}{end}'
# 2. Discover which ArgoCD apps each stage promotes
# Check the Kargo stage definitions in your gitops directory
# 3. Discover existing verification templates
${SSH_CMD} kubectl get analysistemplates -n ${KARGO_NS}
# 4. Discover verification RBAC
${SSH_CMD} kubectl get clusterroles -l app=kargo-verification
${SSH_CMD} kubectl get serviceaccounts -n ${KARGO_NS}
- requestedFreight dependency
- promotionTemplate with argocd-update steps for each app
- kargo.akuity.io/authorized-stage annotation to ApplicationSet template
apiVersion: kargo.akuity.io/v1alpha1
kind: Stage
metadata:
  name: <stage-name>
  namespace: ${KARGO_NS}
spec:
  requestedFreight:
    - origin:
        kind: Warehouse
        name: ${WAREHOUSE}
      sources:
        stages:
          - <upstream-stage-name>
  promotionTemplate:
    spec:
      steps:
        - uses: argocd-update
          config:
            apps:
              - name: <app-name>
                namespace: ${ARGOCD_NS}
          retry:
            timeout: 10m
            errorThreshold: 3
  verification:
    analysisTemplates:
      - name: <stage-name>-health
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: <stage-name>-health
  namespace: ${KARGO_NS}
spec:
  metrics:
    - name: <check-name>
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                serviceAccountName: kargo-verification
                restartPolicy: Never
                containers:
                  - name: verify
                    image: alpine/k8s:latest
                    command: ["/bin/bash", "-c"]
                    args:
                      - |
                        # Level 1: Check pods
                        kubectl get pods -n <namespace> --no-headers | grep -v Running && exit 1
                        # Level 2: Check health
                        kubectl exec -n <namespace> deploy/<name> -- wget -qO- http://localhost:<port>/health || exit 1
                        echo "All checks passed"
                        exit 0
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kargo-verification
  namespace: ${KARGO_NS}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kargo-verification
  labels:
    app: kargo-verification
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list"]
  # Needed only if the verification Job runs `kubectl exec`, as the AnalysisTemplate health check does
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list"]
  - apiGroups: ["argoproj.io"]
    resources: ["applications"]
    verbs: ["get", "list"]
  # Add more resources as verification needs grow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kargo-verification
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kargo-verification
subjects:
  - kind: ServiceAccount
    name: kargo-verification
    namespace: ${KARGO_NS}
| App Type                         | Timeout | ErrorThreshold | Rationale                        |
| -------------------------------- | ------- | -------------- | -------------------------------- |
| Simple (configmap, small deploy) | 3m      | 2              | Fast sync, few resources         |
| Medium (ESO, Crossplane)         | 5m      | 2              | CRD installation takes time      |
| Heavy (kube-prometheus-stack)    | 10m     | 3              | Huge chart, repo-server OOM risk |
| Very Heavy (loki with PVCs)      | 10m     | 3              | PVC binding + large statefulset  |

| Symptom                                      | Likely Fix                         |
| -------------------------------------------- | ---------------------------------- |
| Promotion fails intermittently               | Increase errorThreshold to 3       |
| "connection refused :8081" in promotion logs | Add retry + check repo-server RAM  |
| Promotion times out on first attempt         | Increase timeout to 10m            |
| Same promotion succeeds on manual retry      | Definitely needs errorThreshold >1 |
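The app-type table can be encoded as a small lookup that emits the retry block for an argocd-update step. This is only a sketch: `retry_config_for` and the class names `simple`, `medium`, `heavy`, and `very-heavy` are hypothetical labels for the table rows.

```shell
# Hypothetical lookup: print a retry block for a given app class,
# using the timeout / errorThreshold values from the app-type table.
retry_config_for() {
  case "$1" in
    simple)           t=3m;  n=2 ;;
    medium)           t=5m;  n=2 ;;
    heavy|very-heavy) t=10m; n=3 ;;
    *) echo "unknown app class: $1" >&2; return 1 ;;
  esac
  printf 'retry:\n  timeout: %s\n  errorThreshold: %s\n' "$t" "$n"
}

# Example (not executed here):
#   retry_config_for heavy
```

Unknown classes return an error rather than a default, so a typo does not silently produce a retry budget that is too short.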
- kargo.akuity.io/authorized-stage annotation -> Promotion will be rejected
All commands assume environment variables from the ENVIRONMENT DISCOVERY step.
# Kargo stages overview
${SSH_CMD} kubectl get stages -n ${KARGO_NS} -o wide
# Recent promotions
${SSH_CMD} kubectl get promotions -n ${KARGO_NS} \
--sort-by=.metadata.creationTimestamp -o wide
# Available freight
${SSH_CMD} kubectl get freight -n ${KARGO_NS} \
--sort-by=.metadata.creationTimestamp
# ArgoCD apps
${SSH_CMD} kubectl get applications -n ${ARGOCD_NS}
# Kargo controller logs
${SSH_CMD} kubectl logs -n ${KARGO_CONTROLLER_NS} deploy/kargo-controller --tail=100
# ArgoCD repo-server logs (common failure point)
${SSH_CMD} kubectl logs -n ${ARGOCD_CONTROLLER_NS} deploy/argocd-repo-server --tail=50
# AnalysisRuns (verification results)
${SSH_CMD} kubectl get analysisruns -n ${KARGO_NS} \
--sort-by=.metadata.creationTimestamp
# Force stage reconcile
${SSH_CMD} kubectl annotate stage <name> -n ${KARGO_NS} \
--overwrite kargo.akuity.io/refresh=$(date +%s)
# Force warehouse reconcile
${SSH_CMD} kubectl annotate warehouse <name> -n ${KARGO_NS} \
--overwrite kargo.akuity.io/refresh=$(date +%s)
# Check all unhealthy pods
${SSH_CMD} kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
These come from ENVIRONMENT DISCOVERY. Common defaults:
| Variable | Common Default | Contains |
| ------------------------- | -------------- | ------------------------------------------ |
| ${ARGOCD_CONTROLLER_NS} | argocd | ArgoCD server, ApplicationSets |
| ${KARGO_CONTROLLER_NS} | kargo | Kargo controller, API |
| ${KARGO_NS} | varies | Kargo Project, Stages, Promotions, Freight |
| ${ARGOCD_NS} | infra | ArgoCD Applications |
| ${MONITORING_NS} | monitoring | Prometheus, Grafana, Loki, Alloy |
| ${APP_NS} | varies | Application workloads |
Your actual chain is discovered from the cluster. A typical pattern:
Warehouse
|
v
foundation (auto-sync, ArgoCD self-manages)
| ArgoCD, Kargo, Traefik, cert-manager, etc.
|
v
operators (Kargo-controlled)
| ESO, Crossplane, etc.
|
v
monitoring (Kargo-controlled)
| kube-prometheus-stack, Loki, Alloy, etc.
|
v
platform (Kargo-controlled)
| Databases, Auth, Workflow engines, etc.
|
v
services (Kargo-controlled)
Application services