skills/cognitive-platform-engineering-autonomous/SKILL.md
Build autonomous cloud operations using a four-plane cognitive architecture (Sensing, Reasoning, Orchestration, Experience) with Kubernetes, Terraform, OPA, and ML anomaly detection. Use when: 'set up self-healing Kubernetes infrastructure', 'add anomaly detection to my cloud platform', 'create OPA policies for autonomous remediation', 'build a cognitive operations pipeline', 'implement intent-based infrastructure management', 'add intelligent auto-scaling with feedback loops'.
This skill enables Claude to design and implement autonomous cloud operations systems using the four-plane Cognitive Platform Engineering architecture from Punniyamoorthy et al. (2026). Instead of reactive, rule-based DevOps automation, this approach embeds sensing, reasoning, and autonomous action directly into the platform lifecycle through a continuous feedback loop. The result is infrastructure that detects anomalies via ML, evaluates responses through OPA policies, executes remediation via Kubernetes operators and Terraform, and surfaces decisions to humans through an experience layer -- reducing mean time to resolution, improving resource efficiency, and maintaining compliance without manual intervention.
The core insight is structuring autonomous operations into four cooperating planes rather than a flat automation pipeline. Traditional DevOps tools (Prometheus alerts -> PagerDuty -> human -> kubectl) create a serial chain with humans as the bottleneck. The cognitive architecture replaces this with parallel, feedback-driven planes:
Sensing Plane collects telemetry (metrics via Prometheus, logs via Fluentd/Loki, traces via Jaeger/OpenTelemetry) and normalizes it into a unified data model. The key innovation is not just collection but correlation -- linking a latency spike in service A to a memory pressure event on node B.
Reasoning Plane applies ML models (isolation forests for anomaly detection, statistical deviation models for trend analysis) to the correlated telemetry, producing structured assessments: anomaly type, confidence score, probable root cause, and recommended action. This replaces static threshold alerts with contextual understanding.
Orchestration Plane evaluates recommendations against OPA policies (business rules, compliance constraints, blast radius limits) and executes approved actions via Kubernetes operators or Terraform applies. Crucially, policies gate what the system may do autonomously vs. what requires human approval.
Experience Plane presents decisions, rationale, and outcomes to operators, and feeds human corrections back into the Reasoning Plane for continuous improvement.
The continuous feedback loop is what distinguishes this from one-shot automation: every autonomous action's outcome is measured, and the system learns whether the action actually improved the situation or made it worse. This closes the loop between action and observation, enabling reinforcement-learning-style improvement over time.
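One pass through the four planes can be sketched in plain Python. This is only an illustration of the control flow, not the architecture from the paper: `Assessment`, `sense`, `reason`, `orchestrate`, `run_once`, and the share-of-samples confidence rule are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    anomaly_type: str
    confidence: float
    recommended_action: str


def sense(raw_samples):
    # Sensing: normalize raw readings into one numeric series.
    return [float(s) for s in raw_samples]


def reason(series, limit):
    # Reasoning: a placeholder rule standing in for an ML model.
    # Confidence is the share of samples above the limit (an assumption).
    if series[-1] <= limit:
        return None
    above = sum(1 for x in series if x > limit)
    return Assessment("memory_pressure", above / len(series), "restart_pod")


def orchestrate(assessment, min_confidence=0.80):
    # Orchestration: policy gate; anything below the bar goes to a human.
    return "execute" if assessment.confidence > min_confidence else "escalate"


def run_once(raw_samples, limit, history):
    assessment = reason(sense(raw_samples), limit)
    if assessment is None:
        return "healthy"
    decision = orchestrate(assessment)
    # Experience: keep the decision so operators can review and correct it.
    history.append((assessment, decision))
    return decision
```

In a real deployment each function would be a separate service, and `history` would be the store the Reasoning Plane learns from.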
Audit the existing infrastructure stack -- Identify what telemetry sources exist (Prometheus, CloudWatch, Datadog), what IaC tool manages resources (Terraform, Pulumi, Helm), and what policy enforcement is in place. Map these to the four planes to find gaps.
Implement the Sensing Plane -- Deploy or configure Prometheus with ServiceMonitor CRDs for metrics, set up structured logging with correlation IDs, and configure OpenTelemetry collectors. Write a telemetry normalizer that emits a unified event schema:
# Unified telemetry event schema
apiVersion: cognitive.platform/v1
kind: TelemetryEvent
metadata:
  source: prometheus
  correlationId: "abc-123"
spec:
  metric: container_memory_usage_bytes
  value: 1.8Gi
  threshold: 2Gi
  node: worker-3
  namespace: production
  timestamp: "2026-01-24T10:30:00Z"
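A telemetry normalizer for the Prometheus source might look like the sketch below. It assumes the input is one element of an instant-query result from the Prometheus HTTP API (a `metric` label map plus a `[timestamp, "value"]` pair). The function name `normalize_prometheus_sample` is hypothetical, and the sketch emits the value as a float in bytes rather than a quantity string such as `1.8Gi`.

```python
from datetime import datetime, timezone


def normalize_prometheus_sample(sample, correlation_id, threshold=None):
    """Map one Prometheus instant-vector sample to the unified event shape."""
    labels = sample["metric"]
    # Prometheus returns the sample as [unix_seconds, "value as string"].
    ts, value = sample["value"]
    timestamp = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "apiVersion": "cognitive.platform/v1",
        "kind": "TelemetryEvent",
        "metadata": {"source": "prometheus", "correlationId": correlation_id},
        "spec": {
            "metric": labels["__name__"],
            "value": float(value),
            "threshold": threshold,
            # Labels are optional; missing ones are emitted as None.
            "node": labels.get("node"),
            "namespace": labels.get("namespace"),
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }
```

Normalizers for logs and traces would emit the same shape, so the Reasoning Plane can correlate events by `correlationId` regardless of source.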
Build the Reasoning Plane -- Implement an anomaly detection service that consumes normalized telemetry. Use isolation forests for multivariate anomaly detection and EWMA (Exponentially Weighted Moving Average) for trend detection. Output structured assessments:
{
  "anomalyId": "anom-456",
  "type": "memory_pressure",
  "confidence": 0.87,
  "affectedResources": ["pod/api-server-7b9c", "node/worker-3"],
  "probableCause": "memory_leak_in_container",
  "recommendedAction": "restart_pod",
  "severity": "warning"
}
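The EWMA half of the Reasoning Plane can be sketched with the standard library alone. The function below tracks an exponentially weighted mean and variance and flags samples that deviate by more than `k` standard deviations. The parameter defaults (`alpha`, `k`, `warmup`) are assumptions for illustration, and the isolation forest is left out because it needs an ML library such as scikit-learn.

```python
def ewma_anomalies(values, alpha=0.3, k=3.0, warmup=5):
    """Return indices whose value lies more than k EWMA std-devs from the EWMA mean."""
    mean = values[0]
    var = 0.0
    flagged = []
    for i, x in enumerate(values[1:], start=1):
        diff = x - mean
        std = var ** 0.5
        # Score against statistics from *before* this sample,
        # so a spike cannot inflate the variance that judges it.
        if i >= warmup and std > 0 and abs(diff) > k * std:
            flagged.append(i)
        # Incremental EWMA updates for mean and variance.
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged


memory_mb = [100, 101, 99, 100, 102, 100, 101, 99, 100, 180]
print(ewma_anomalies(memory_mb))
```

A flagged index would then be wrapped in an assessment like the JSON above, with the confidence derived from how far the sample sits outside the band.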
Define OPA policies for the Orchestration Plane -- Write Rego policies that gate autonomous actions based on confidence thresholds, blast radius, time-of-day, and compliance requirements:
package cognitive.orchestration

# rego.v1 syntax: rule bodies need the `if` keyword on OPA 1.0 and later.
import rego.v1

default allow_autonomous_action := false

allow_autonomous_action if {
    input.assessment.confidence > 0.80
    input.assessment.severity != "critical"
    blast_radius_acceptable
    within_change_window
}

blast_radius_acceptable if {
    count(input.assessment.affectedResources) <= 3
}

# time.clock returns [hour, minute, second] in UTC for a nanosecond timestamp.
within_change_window if {
    hour := time.clock(time.now_ns())[0]
    hour >= 6
    hour <= 22
}
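To consult this policy from the Orchestration Plane, a service can post the assessment to OPA's Data API. The sketch below only builds the request and parses a response body, so it runs without a server. The address `http://localhost:8181` and the helper names `build_opa_query` and `parse_opa_decision` are assumptions.

```python
import json
import urllib.request


def build_opa_query(assessment, opa_url="http://localhost:8181"):
    """Build (but do not send) a Data API request for allow_autonomous_action."""
    # Package cognitive.orchestration maps to the path cognitive/orchestration.
    url = f"{opa_url}/v1/data/cognitive/orchestration/allow_autonomous_action"
    body = json.dumps({"input": {"assessment": assessment}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_opa_decision(response_body):
    """Treat a missing or non-true 'result' as a denial (fail closed)."""
    return json.loads(response_body).get("result", False) is True
```

Sending the request is then `urllib.request.urlopen(build_opa_query(assessment)).read()`. Failing closed means an undefined rule or a malformed answer never authorizes an action.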
Build Kubernetes operators for autonomous remediation -- Create a custom controller that watches for approved assessments and executes remediation actions (pod restarts, HPA adjustments, node cordoning, rollbacks). Use controller-runtime or Kopf (Python):
# Kopf-based remediation operator (simplified)
import kopf
import kubernetes

@kopf.on.create('cognitive.platform', 'v1', 'remediationorders')
def handle_remediation(spec, **kwargs):
    action = spec.get('action')
    target = spec.get('target')
    if action == 'restart_pod':
        api = kubernetes.client.CoreV1Api()
        api.delete_namespaced_pod(target['name'], target['namespace'])
    elif action == 'scale_hpa':
        # Adjust HPA min/max replicas
        ...
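The operator above reacts to `RemediationOrder` objects, so something has to create them from approved assessments. A minimal sketch of that translation follows. The manifest layout beyond `spec.action` and `spec.target` (for example the `anomalyId` field and the `rem-` name prefix) is an assumption, as is the helper name `build_remediation_order`.

```python
def build_remediation_order(assessment, namespace):
    """Translate an approved assessment into a RemediationOrder manifest."""
    # Resources are written as "kind/name"; only pods can be restarted here.
    pods = [
        ref.split("/", 1)[1]
        for ref in assessment["affectedResources"]
        if ref.startswith("pod/")
    ]
    if not pods:
        raise ValueError("assessment names no pod to act on")
    return {
        "apiVersion": "cognitive.platform/v1",
        "kind": "RemediationOrder",
        # generateName lets the API server append a unique suffix.
        "metadata": {"generateName": "rem-", "namespace": namespace},
        "spec": {
            "action": assessment["recommendedAction"],
            "target": {"name": pods[0], "namespace": namespace},
            "anomalyId": assessment["anomalyId"],  # hypothetical field
        },
    }
```

The resulting dictionary can be submitted with the Kubernetes client's `CustomObjectsApi`, at which point the Kopf handler picks it up.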
Wire Terraform for infrastructure-level remediation -- For actions that go beyond Kubernetes (scaling node groups, adjusting cloud resources), generate Terraform variable overrides and trigger applies through a CI pipeline or Atlantis:
# Auto-generated by Orchestration Plane
variable "node_group_desired_size" {
  default = 5 # Adjusted from 3 based on assessment anom-789
}
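An alternative to rewriting variable defaults is to emit an override file that Terraform loads automatically. The sketch below writes a `*.auto.tfvars.json` file into the module directory. The file-name pattern `cognitive-<anomaly id>` and the helper name `write_tfvars_override` are assumptions. The pipeline or Atlantis run that applies the change is out of scope here.

```python
import json
from pathlib import Path


def write_tfvars_override(directory, overrides, anomaly_id):
    """Write variable overrides to a file Terraform picks up on the next plan."""
    # Terraform auto-loads files ending in .auto.tfvars.json.
    path = Path(directory) / f"cognitive-{anomaly_id}.auto.tfvars.json"
    # Sorted keys keep the file stable, so diffs in review stay small.
    path.write_text(json.dumps(overrides, indent=2, sort_keys=True) + "\n")
    return path
```

Committing this file through the normal review path keeps the change auditable: the assessment id is in the file name, and reverting is a single file deletion.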
Implement the feedback loop -- After every autonomous action, measure the outcome against the original anomaly signal. Store action-outcome pairs for the Reasoning Plane to learn from. Create a CRD that tracks remediation effectiveness:
apiVersion: cognitive.platform/v1
kind: RemediationOutcome
spec:
  remediationId: "rem-101"
  anomalyId: "anom-456"
  actionTaken: "restart_pod"
  outcomeMetrics:
    anomalyResolved: true
    timeToResolution: "45s"
    sideEffects: []
  feedback: "effective"
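Producing such an outcome record comes down to comparing the anomaly signal before and after the action. The sketch below does that for a single numeric signal. The function name `evaluate_outcome`, the placement of `feedback` under `spec`, and the intermediate `"partial"` label are assumptions. Only `"effective"` and `"failed"` appear in this document.

```python
def evaluate_outcome(remediation_id, anomaly_id, action,
                     before, after, threshold, elapsed_seconds):
    """Compare the signal before and after an action and build a RemediationOutcome."""
    resolved = after < threshold
    if resolved:
        feedback = "effective"
    elif after < before:
        feedback = "partial"  # hypothetical label, not from the source
    else:
        feedback = "failed"
    return {
        "apiVersion": "cognitive.platform/v1",
        "kind": "RemediationOutcome",
        "spec": {
            "remediationId": remediation_id,
            "anomalyId": anomaly_id,
            "actionTaken": action,
            "outcomeMetrics": {
                "anomalyResolved": resolved,
                # Time to resolution is only meaningful when resolved.
                "timeToResolution": f"{elapsed_seconds}s" if resolved else None,
                "sideEffects": [],
            },
            "feedback": feedback,
        },
    }
```

Stored action-outcome pairs of this shape are what lets the Reasoning Plane learn which action tends to resolve which anomaly type.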
Build the Experience Plane dashboard -- Create a lightweight UI or CLI tool that shows: active anomalies, pending and executed remediations, policy decisions (allowed/denied with reasons), and feedback loop metrics. Expose this via a Kubernetes service or integrate into Grafana.
Add escalation paths -- For actions blocked by OPA policies (low confidence, critical severity, outside change window), generate structured alerts with full context (anomaly assessment, recommended action, policy denial reason) so humans can make informed decisions quickly.
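A structured escalation needs the denial reason, which a boolean policy result does not carry. One option, sketched below, is to recompute the reasons in the service that raises the alert. The thresholds mirror the Rego policy in this document. The reason strings and the helper names `denial_reasons` and `build_escalation` are hypothetical.

```python
def denial_reasons(assessment, hour_utc):
    """List every gate from the orchestration policy that the assessment fails."""
    reasons = []
    if assessment["confidence"] <= 0.80:
        reasons.append("low_confidence")
    if assessment["severity"] == "critical":
        reasons.append("critical_severity")
    if len(assessment["affectedResources"]) > 3:
        reasons.append("blast_radius_exceeded")
    if not 6 <= hour_utc <= 22:
        reasons.append("outside_change_window")
    return reasons


def build_escalation(assessment, hour_utc):
    """Return an alert payload for a human, or None if the action may run autonomously."""
    reasons = denial_reasons(assessment, hour_utc)
    if not reasons:
        return None
    return {
        "anomalyId": assessment["anomalyId"],
        "recommendedAction": assessment["recommendedAction"],
        "denialReasons": reasons,
        # Full context so the operator does not need to look anything up.
        "assessment": assessment,
    }
```

Duplicating the thresholds in Python risks drift from the policy. In practice the reasons could instead be exposed as an additional rule in the Rego package and queried alongside the decision.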
Validate end-to-end with chaos engineering -- Use Litmus or Chaos Mesh to inject failures (pod kills, CPU stress, network partitions) and verify that the four-plane loop detects, reasons about, remediates, and learns from each scenario.
Example 1: Self-healing memory leak detection
User: "Set up autonomous detection and remediation for memory leaks in our production Kubernetes cluster."
Approach:
container_memory_usage_bytes and container_memory_working_set_bytes scraping at 15s intervals
RemediationOrder CRDs and executes rolling restart of the affected deployment
Output structure:
sensing/
  prometheus-servicemonitor.yaml        # Metrics collection for memory
  telemetry-normalizer-deployment.yaml
reasoning/
  anomaly-detector/
    model.py                            # Isolation forest on memory time series
    assessment-publisher.py             # Emits TelemetryAssessment CRDs
orchestration/
  opa-policies/
    memory-remediation.rego             # Confidence, blast radius, readiness gates
  remediation-operator/
    handler.py                          # Kopf operator for pod restarts
  crds/
    remediationorder-crd.yaml
    remediationoutcome-crd.yaml
experience/
  grafana-dashboard.json                # Anomaly + remediation visibility
Example 2: Intent-based auto-scaling with OPA guardrails
User: "I want to declare that my API should maintain p95 latency under 200ms and the system should auto-adjust replicas and resources."
Approach:
apiVersion: cognitive.platform/v1
kind: OperationalIntent
metadata:
  name: api-latency-intent
spec:
  target: deployment/api-server
  objectives:
    - metric: http_request_duration_seconds_p95
      threshold: 0.2
      operator: "<="
  constraints:
    maxReplicas: 20
    maxCpuPerPod: "2000m"
    maxMonthlyCost: 5000
http_request_duration_seconds histogram and computes p95
Output: The system autonomously scales from 3 to 7 replicas when p95 crosses 180ms, then records that this action reduced p95 to 120ms, reinforcing "scale replicas" as the preferred action for latency-driven anomalies.
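The p95 computation named in the approach can be sketched as a quantile estimate over cumulative histogram buckets, using linear interpolation inside the bucket that contains the target rank. This follows the general idea behind Prometheus's `histogram_quantile` but is a simplified stand-in. The bucket layout in the usage lines is invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds: (le, cumulative count).
latency = [(0.05, 50), (0.1, 80), (0.2, 94), (0.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency)
intent_met = p95 <= 0.2  # threshold and operator from the OperationalIntent
```

With these invented buckets the estimate lands above the 0.2 s objective, which is the condition under which the Reasoning Plane would propose a scaling action within the intent's constraints.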
Example 3: Compliance drift detection and auto-correction
User: "Add continuous compliance checking that automatically fixes Terraform drift for our security policies."
Approach:
terraform plan on a schedule, capturing drift as structured events
Output:
Compliance Report - 2026-01-24
Drifts Detected: 4
- s3://data-bucket: PublicAccessBlock disabled [CRITICAL] -> Auto-corrected
- ebs/vol-abc: Encryption disabled [CRITICAL] -> Auto-corrected
- ec2/i-xyz: Tag "owner" missing [LOW] -> Queued for batch
- rds/prod-db: Backup retention 5d vs policy 7d [MEDIUM] -> Queued for review
Auto-corrections applied: 2
Pending human review: 2
Compliance score: 94% -> 98%
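Turning a scheduled plan into structured drift events and routing them could look like the sketch below. It assumes the plan was exported with `terraform show -json` and that drifted resources appear under a `resource_drift` key with `address`, `type`, and `change.actions` fields. Verify this against the JSON output format of the Terraform version in use. The severity table and the rule that only `CRITICAL` drift is auto-corrected are assumptions read off the sample report above.

```python
# Hypothetical severity per resource type; unknown types default to MEDIUM.
SEVERITY_BY_TYPE = {
    "aws_s3_bucket_public_access_block": "CRITICAL",
    "aws_ebs_volume": "CRITICAL",
    "aws_db_instance": "MEDIUM",
}
AUTO_CORRECT_SEVERITIES = {"CRITICAL"}  # assumption based on the sample report


def drift_events(plan):
    """Extract drifted resources from a parsed JSON plan document."""
    events = []
    for res in plan.get("resource_drift", []):
        if res["change"]["actions"] == ["no-op"]:
            continue  # nothing actually changed
        severity = SEVERITY_BY_TYPE.get(res.get("type"), "MEDIUM")
        events.append({"address": res["address"], "severity": severity})
    return events


def triage(events):
    """Split events into those corrected autonomously and those left for review."""
    report = {"auto_correct": [], "human_review": []}
    for event in events:
        auto = event["severity"] in AUTO_CORRECT_SEVERITIES
        report["auto_correct" if auto else "human_review"].append(event["address"])
    return report
```

In the full loop the routing decision would come from the OPA policy rather than a hard-coded set. The set here only keeps the example self-contained.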
feedback: "failed", and create a structured alert. Do not retry automatically -- infrastructure changes that fail may indicate a deeper problem.
Punniyamoorthy, V., Saksena, N., Sankiti, S.R., Chockalingam, N., & Kirubakaran, A.M. (2026). Cognitive Platform Engineering for Autonomous Cloud Operations. arXiv:2601.17542v1. Key sections: the four-plane reference architecture diagram (Section 3), OPA policy integration patterns (Section 4), and MTTR/resource efficiency results from the Kubernetes prototype (Section 5).