seaworld008/cc-devops-skills

Name: cc-devops-skills
Author: seaworld008

openclaw-skills/cc-devops-skills/SKILL.md

npx skillsauth add seaworld008/commonly-used-high-value-skills cc-devops-skills

CC DevOps Skills

Use this skill when working on infrastructure, delivery pipelines, Kubernetes operations, observability, PromQL, incident response, Terraform, Docker, shell automation, and reliability engineering. It provides a unified SRE/DevOps operating model rather than a single vendor-specific command set.

The skill is inspired by the Apache-2.0 cc-devops-skills repository, but this version is self-contained for this curated skill catalog.

When to Use

Designing, reviewing, or fixing CI/CD pipelines.
Creating or validating Kubernetes manifests, Helm values, Kustomize overlays, or deployment workflows.
Debugging pods, services, ingress, DNS, network policy, probes, autoscaling, or rollout issues.
Writing PromQL queries, alert rules, recording rules, SLO dashboards, or runbooks.
Building Dockerfiles, Compose stacks, image hardening, or multi-stage builds.
Reviewing Terraform, Terragrunt, Ansible, or infrastructure-as-code changes.
Handling incidents, postmortems, operational readiness, release safety, and rollback planning.
Improving reliability, deployment frequency, recovery time, observability, and operational toil.

Skip When

The task is pure application logic without deployment, runtime, reliability, or operational impact.
The user asks for business strategy, product design, or frontend-only polish.
A cloud-provider-specific skill already covers the whole task more precisely, such as a deep Azure Kubernetes operation.

Core Capabilities

Translate product and engineering needs into reliable delivery systems.
Build CI/CD workflows with clear stages, caching, artifact flow, gates, and rollback paths.
Design Kubernetes resources with probes, requests, limits, disruption budgets, security context, and deployment strategy.
Debug live clusters using repeatable evidence gathering.
Write PromQL that respects labels, cardinality, windows, and alert semantics.
Review infrastructure-as-code for drift, blast radius, secrets, and lifecycle risk.
Harden containers and supply-chain paths.
Produce incident runbooks and postmortem-ready timelines.

Operating Principles

Automate the path, but make the failure mode visible.
Prefer declarative infrastructure and reproducible builds.
Treat secrets as toxic data: never print, commit, or echo them.
Every alert needs an owner, severity, symptom, impact, and action.
Every deployment needs a rollback or forward-fix decision point.
Production changes should be observable before they are trusted.
CI should fail early on cheap checks and reserve expensive checks for later gates.
Kubernetes readiness is not the same as liveness; do not use one probe for both.
PromQL queries must be tested against expected label sets and time windows.

CI/CD Workflow

Use this pipeline shape unless the repo already has a stronger local convention:

stages:
  - lint
  - unit-test
  - build
  - security-scan
  - integration-test
  - package
  - deploy-staging
  - smoke-test
  - promote-production

For each stage, define:

Inputs and outputs.
Cache keys and invalidation rules.
Required secrets and their scope.
Failure ownership.
Timeout.
Retry policy.
Artifact retention.
Required status checks before merge.
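The per-stage fields above could be captured in a small spec and checked before a pipeline change is accepted. The sketch below is illustrative only: `StageSpec`, its field names, the `REQUIRED` tuple, and `missing_fields` are hypothetical and do not correspond to any CI system's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StageSpec:
    """Hypothetical per-stage definition; fields mirror the list above."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    cache_key: Optional[str] = None
    secrets: list[str] = field(default_factory=list)  # secret names only, never values
    owner: Optional[str] = None                        # failure ownership
    timeout_minutes: Optional[int] = None
    max_retries: Optional[int] = None
    artifact_retention_days: Optional[int] = None
    required_for_merge: Optional[bool] = None


# Fields that must be stated explicitly; empty inputs or secrets are valid answers.
REQUIRED = ("owner", "timeout_minutes", "max_retries",
            "artifact_retention_days", "required_for_merge")


def missing_fields(stage: StageSpec) -> list[str]:
    """Return the required fields that were left undefined (None)."""
    return [name for name in REQUIRED if getattr(stage, name) is None]


lint = StageSpec(name="lint", owner="platform-team", timeout_minutes=5,
                 max_retries=0, artifact_retention_days=7, required_for_merge=True)
build = StageSpec(name="build", cache_key="lockfile-hash", timeout_minutes=20)

for stage in (lint, build):
    print(stage.name, missing_fields(stage))
```

Treating `None` as "not decided yet" keeps an explicit `0` retries or `False` merge requirement distinct from a field nobody filled in.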

CI/CD Review Checklist

Build is deterministic and does not depend on local developer state.
Lockfiles are respected.
Tests run in the same major runtime version used in production.
Secrets are read from the platform secret store, not committed files.
Deployment jobs require protected environments or approvals when needed.
The pipeline uploads test results, coverage, logs, and build artifacts.
Rollbacks are documented and tested.
Concurrency controls prevent two production deploys racing.
Scheduled jobs and branch filters cannot deploy unreviewed code.
Container images are pinned by digest for production where feasible.
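The digest-pinning item can be checked mechanically. The following is a minimal sketch, assuming image references are available as plain strings; the function names and the `registry.example.com` host are hypothetical.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True when the image reference is pinned to a sha256 digest."""
    return DIGEST_RE.search(image) is not None


def uses_mutable_latest(image: str) -> bool:
    """True when the reference has no digest and resolves to the 'latest' tag."""
    if is_digest_pinned(image):
        return False
    # Look only at the last path component so a registry port
    # (host:5000/app) is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    tag = last.split(":", 1)[1] if ":" in last else "latest"  # no tag means latest
    return tag == "latest"


pinned = "registry.example.com/app@sha256:" + "a" * 64
for ref in (pinned, "registry.example.com/app:1.4.2", "registry.example.com:5000/app"):
    print(ref[:45], is_digest_pinned(ref), uses_mutable_latest(ref))
```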

Kubernetes Readiness Checklist

resources.requests and resources.limits are set with realistic values.
readinessProbe checks whether the pod can receive traffic.
livenessProbe checks whether the process should be restarted.
startupProbe protects slow boot paths.
PodDisruptionBudget exists for replicated workloads.
Deployment strategy is compatible with state and traffic behavior.
securityContext drops unnecessary privileges.
Service account permissions are least privilege.
ConfigMaps and Secrets are mounted or injected intentionally.
Ingress, service, and pod selectors match.
HPA metrics are stable and not based on noisy low-volume signals.
NetworkPolicy does not block required DNS, egress, or service traffic.
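A few of these checklist items can be automated against a manifest that has already been parsed into a dictionary. This is a sketch under that assumption, covering only resources, probes, and privilege escalation; `check_container` and the sample container are hypothetical, and the field names follow the Kubernetes container spec.

```python
def check_container(container: dict) -> list[str]:
    """Return checklist findings for one container spec from a Deployment."""
    findings = []

    resources = container.get("resources", {})
    if not resources.get("requests"):
        findings.append("resources.requests is not set")
    if not resources.get("limits"):
        findings.append("resources.limits is not set")

    readiness = container.get("readinessProbe")
    liveness = container.get("livenessProbe")
    if readiness is None:
        findings.append("readinessProbe is missing")
    if liveness is None:
        findings.append("livenessProbe is missing")
    # Identical probes suggest one check is doing both jobs.
    if readiness is not None and readiness == liveness:
        findings.append("readinessProbe and livenessProbe are identical")

    security = container.get("securityContext", {})
    if security.get("allowPrivilegeEscalation", True):
        findings.append("securityContext.allowPrivilegeEscalation is not false")
    return findings


container = {
    "name": "api",
    "image": "registry.example.com/api:1.4.2",
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
    "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
}

for finding in check_container(container):
    print("-", finding)
```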

Kubernetes Debug Flow

Use a read-only evidence path first:

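# Overview of workloads, services, and ingress in the namespace
kubectl get deploy,rs,pod,svc,ingress -n <namespace> -o wide
# Pod spec, container states, and recent events for one pod
kubectl describe pod <pod> -n <namespace>
# Logs from the previous container instance (useful after a restart or crash loop)
kubectl logs <pod> -n <namespace> --previous
# Namespace events, oldest first
kubectl get events -n <namespace> --sort-by=.lastTimestamp
# Whether the current rollout has completed
kubectl rollout status deploy/<name> -n <namespace>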

Then isolate by layer:

Scheduling: pending pods, taints, node pressure, quotas.
Image: pull errors, registry auth, platform mismatch.
Config: missing env vars, invalid secret keys, wrong mount paths.
Runtime: crash loops, OOMKilled, failed probes, dependency timeouts.
Network: service selector, endpoints, DNS, ingress, TLS, network policy.
Capacity: CPU throttling, memory pressure, queue depth, saturation.
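One way to start the layer isolation is to map the status reasons that Kubernetes reports on pods and containers to the layers above. The mapping below is an illustrative, non-exhaustive sketch; `LAYER_BY_REASON`, `layer_for`, and `summarize` are hypothetical names, and the reasons are assumed to have been collected already (for example from `kubectl describe pod` output).

```python
from collections import Counter

# Common pod/container status reasons and the layer to inspect first.
LAYER_BY_REASON = {
    "Unschedulable": "Scheduling",
    "ErrImagePull": "Image",
    "ImagePullBackOff": "Image",
    "CreateContainerConfigError": "Config",
    "CrashLoopBackOff": "Runtime",
    "OOMKilled": "Runtime",
}


def layer_for(reason: str) -> str:
    """Return the layer to investigate for a status reason."""
    return LAYER_BY_REASON.get(reason, "Unclassified")


def summarize(reasons: list[str]) -> dict[str, int]:
    """Count observed reasons per layer to show where failures cluster."""
    return dict(Counter(layer_for(reason) for reason in reasons))


observed = ["ImagePullBackOff", "OOMKilled", "CrashLoopBackOff", "Evicted"]
print(summarize(observed))
```

Reasons that fall into "Unclassified" still need manual triage; network and capacity problems in particular rarely surface as a single status reason.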

PromQL Patterns

Choose rate windows that cover several scrape intervals (a common rule of thumb is at least four) and that match the time scale on which users feel the impact.

sum by (service) (
  rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (service) (
  rate(http_requests_total[5m])
)
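The arithmetic behind the error-ratio query above can be sketched in a few lines. This is a simplification: Prometheus's `rate()` also handles counter resets and extrapolates to the window boundaries, which the hypothetical `per_second_rate` helper below does not. The counter values are made up for illustration.

```python
from typing import Optional


def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Simplified counter rate: increase over the window divided by its length."""
    return (last - first) / window_seconds


def error_ratio(errors_rate: float, total_rate: float) -> Optional[float]:
    """Share of requests that failed; None when there was no traffic."""
    if total_rate == 0:
        return None
    return errors_rate / total_rate


WINDOW = 5 * 60  # the [5m] range, in seconds

# Hypothetical counter samples at the start and end of the window.
total_rate = per_second_rate(first=120_000, last=123_000, window_seconds=WINDOW)
errors_rate = per_second_rate(first=400, last=430, window_seconds=WINDOW)

ratio = error_ratio(errors_rate, total_rate)
print(f"total={total_rate:.1f}/s errors={errors_rate:.2f}/s ratio={ratio:.2%}")
```

Returning `None` for zero traffic makes the no-data case explicit instead of dividing by zero, which matters when absence of traffic is itself a signal.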

For SLO burn alerts, prefer multi-window checks:

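# 0.001 is the error budget, which corresponds to a 99.9% SLO.
# Fast burn: 14.4x burn rate over 1h, confirmed by the 5m window.
(
  job:slo_errors_per_request:ratio_rate5m > 14.4 * 0.001
and
  job:slo_errors_per_request:ratio_rate1h > 14.4 * 0.001
)
or
# Slow burn: 6x burn rate over 6h, confirmed by the 30m window.
(
  job:slo_errors_per_request:ratio_rate30m > 6 * 0.001
and
  job:slo_errors_per_request:ratio_rate6h > 6 * 0.001
)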
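The factors 14.4 and 6 in the rule above are burn rates. A worked sketch of where they come from, assuming a 30-day SLO period and the budget fractions commonly cited for this pattern in the Google SRE Workbook (2% of the budget in 1 hour, 5% in 6 hours); the constants and function names are illustrative assumptions, not taken from the source.

```python
SLO_TARGET = 0.999             # assumed from the 0.001 factor in the rule above
ERROR_BUDGET = 1 - SLO_TARGET  # about 0.001
PERIOD_HOURS = 30 * 24         # assumed 30-day SLO period


def burn_rate(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the period's budget in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


FAST = burn_rate(0.02, 1)  # about 14.4
SLOW = burn_rate(0.05, 6)  # about 6


def should_alert(r5m: float, r1h: float, r30m: float, r6h: float) -> bool:
    """Mirror the rule: both windows of a pair must exceed the threshold."""
    fast = r5m > FAST * ERROR_BUDGET and r1h > FAST * ERROR_BUDGET
    slow = r30m > SLOW * ERROR_BUDGET and r6h > SLOW * ERROR_BUDGET
    return fast or slow


print(round(FAST, 1), round(SLOW, 1))
print(should_alert(r5m=0.02, r1h=0.02, r30m=0.0, r6h=0.0))   # sustained fast burn
print(should_alert(r5m=0.02, r1h=0.001, r30m=0.0, r6h=0.0))  # short spike only
```

Requiring the short window as well as the long one lets the alert clear quickly once the error rate drops, while the long window keeps brief spikes from paging.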

PromQL Review Checklist

Query uses rate() or increase() for counters.
Aggregation keeps only labels needed for routing or diagnosis.
Regex matchers do not explode cardinality.
Alert window is long enough for the scrape interval.
Alert has for: where short spikes should not page.
Dashboard query and alert query agree on units.
Recording rules name the unit and aggregation.
Missing metrics are handled when absence itself is meaningful.

Terraform and IaC Checks

Pin provider versions.
Keep state backend remote, locked, and encrypted.
Review plan output for destructive actions before apply.
Use modules for repeated infrastructure, not for single-use complexity.
Keep secrets out of variables files and state where possible.
Add lifecycle rules only with a clear reason.
Detect drift before assuming code matches production.
Prefer small, reviewable plans over giant mixed changes.
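Reviewing a plan for destructive actions can be partly automated from Terraform's JSON plan representation (as produced by `terraform show -json <planfile>`), where each entry in `resource_changes` carries a `change.actions` list. A minimal sketch, assuming the plan JSON has already been produced; the sample resources and the `destructive_changes` helper are hypothetical.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Addresses of resources the plan would delete or replace."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # "delete" appears alone for removals and alongside "create" for replacements.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Hypothetical excerpt of a JSON plan.
SAMPLE_PLAN = """
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["update"]}},
    {"address": "aws_sqs_queue.old", "change": {"actions": ["delete"]}}
  ]
}
"""

print(destructive_changes(json.loads(SAMPLE_PLAN)))
```

A check like this is a gate for human review, not a substitute for it: an in-place update can still be disruptive even though it contains no delete action.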

Incident Response Flow

1. Declare incident and assign roles.
2. Define user impact and start timeline.
3. Stabilize: rollback, disable feature, scale, or shed load.
4. Gather evidence without destroying state.
5. Communicate status on a fixed cadence.
6. Resolve or mitigate.
7. Capture follow-up actions with owners and dates.

Anti-Patterns

Paging on symptoms nobody can act on.
Using CPU percentage alone as a service health signal.
Deploying without smoke tests or rollback instructions.
Running production migrations as an unobserved CI side effect.
Giving CI broad cloud credentials across all branches.
Using latest image tags in production.
Adding Kubernetes liveness probes that restart slow but healthy apps.
Writing PromQL with unbounded high-cardinality labels.

Output Format

For reviews:

## Findings
- Severity:
- Evidence:
- Impact:
- Fix:

## Validation
- Commands:
- Expected result:

For implementation:

## Plan
- Delivery path:
- Rollback:
- Observability:
- Security:
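The review template above can be filled in programmatically so that every finding carries the same four fields. A small sketch; the `Finding` class, `render_review`, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str
    evidence: str
    impact: str
    fix: str


def render_review(findings: list[Finding], commands: list[str], expected: str) -> str:
    """Render findings and validation steps in the review template's layout."""
    lines = ["## Findings"]
    for f in findings:
        lines += [
            f"- Severity: {f.severity}",
            f"- Evidence: {f.evidence}",
            f"- Impact: {f.impact}",
            f"- Fix: {f.fix}",
            "",  # blank line between findings
        ]
    lines += [
        "## Validation",
        f"- Commands: {'; '.join(commands)}",
        f"- Expected result: {expected}",
    ]
    return "\n".join(lines)


report = render_review(
    [Finding("high", "livenessProbe and readinessProbe share /healthz",
             "slow dependencies trigger restarts", "split the probes")],
    ["kubectl rollout status deploy/api -n prod"],
    "rollout completes without restarts",
)
print(report)
```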

Boundaries

Do not run destructive cloud or cluster operations without explicit user approval. Prefer read-only inspection first. Never print or persist secrets.

SRE, DevOps, Kubernetes, CI/CD, PromQL, Terraform, Docker, and incident operations playbook for building reliable delivery and operations workflows.

48 stars

development

Updated Jun 29, 2026

npx skillsauth add seaworld008/commonly-used-high-value-skills cc-devops-skills

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%.
Semgrep: Static code analysis for vulnerabilities. Clean, 95%.
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%.
Snyk (dep): Open source security scanning. Skipped, 50%.
Socket.dev: Supply chain security analysis. Skipped, 50%.
VirusTotal: Multi-engine malware detection. Skipped, 50%.
CrowdStrike: Advanced threat intelligence. Skipped, 50%.
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%.
OWASP Dep-Check: Skipped, 50%.

Last scanned: Jun 29, 2026, 5:06 AM (325.8s, 1 file scanned)

SKILL.md

name: cc-devops-skills
description: SRE, DevOps, Kubernetes, CI/CD, PromQL, Terraform, Docker, and incident operations playbook for building reliable delivery and operations workflows.
zh_description: For cc, DevOps, and skills; supports deployment, monitoring, troubleshooting, and release management.
version: 1.0.0
author: seaworld008
source: github:akin-ozer/cc-devops-skills
source_url: https://github.com/akin-ozer/cc-devops-skills
license: Apache-2.0
tags: [sre, devops, kubernetes, cicd, promql, terraform, docker, observability, incident-response]
created_at: 2026-06-03
updated_at: 2026-06-03
quality: 4
complexity: advanced

Related Skills

seaworld008/omen
development
Verified, Trusted, Community
Enumerating failure modes via pre-mortem analysis. Systematically identifies failure scenarios for plans, designs, and features, scoring them with RPN/AP. Does not write code.
62 · SKILL.md · Updated Jun 16, 2026

seaworld008/nexus
testing
Verified, Trusted, Community
Orchestrating specialist AI agent teams as a meta-coordinator. Decomposes requests into minimum viable chains, spawns each as an independent session in AUTORUN modes, and drives to final output. Use when a task spans multiple specialist domains, requires parallel agent execution, or needs hub-and-spoke routing across the skill ecosystem.
62 · SKILL.md · Updated Jun 11, 2026

seaworld008/morph
development
Verified, Trusted, Community
Converting document formats (Markdown/Word/Excel/PDF/HTML). Converts specs from Scribe and reports from Harvest into distributable formats; generates reusable conversion scripts. Use when converting documents, building accessibility-compliant PDFs, or creating Pandoc/LibreOffice pipelines.
62 · SKILL.md · Updated Jun 11, 2026

seaworld008/lore
testing
Verified, Trusted, Community
Curating cross-agent knowledge and guarding institutional memory. Extracts patterns from agent journals into METAPATTERNS.md, detects knowledge decay, propagates best practices, prevents organizational forgetting. Use when consolidating cross-agent insights, curating memory, or auditing knowledge decay.
62 · SKILL.md · Updated Jun 11, 2026

Install manually

# Clone the repo
git clone https://github.com/seaworld008/commonly-used-high-value-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r commonly-used-high-value-skills/openclaw-skills/cc-devops-skills ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository

seaworld008/commonly-used-high-value-skills

48 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT