
Guide for writing and improving Siclaw skills. Read this when creating or modifying a skill. Covers skill directory layout, SKILL.md format, script execution modes, and best practices.
Show and ping the gateway of a network interface, on a Kubernetes node or inside a pod's network namespace. Auto-detects the gateway from the routing table (ip -j route), reports interface type (RoCE / Ethernet / IB), and tests reachability with ping. Use for default-route / gateway questions, network reachability checks, RoCE/RDMA data-path validation, and "can this node/pod reach its gateway" investigations.
Retrieve logs from a Kubernetes node. Supports journalctl (systemd units) and file-based logs. Use when you need to inspect node-level logs (containerd, kubelet, etc.). Run via host_script (preferred) or node_script.
Diagnose Kubernetes native ResourceQuota and LimitRange admission rejections (exceeded quota, forbidden by LimitRange, FailedCreate). Checks namespace quotas, current usage, LimitRange constraints, and ReplicaSet events to identify why pods cannot be created. Not applicable to Volcano Queue — use volcano-queue-diagnose for gang scheduling clusters.
Guide for writing and improving Siclaw skills. Read this when creating or modifying a skill. Covers SKILL.md format, script execution modes, and best practices.
Diagnose Volcano Job status and issues. Check Job phases, task statuses, PodGroup associations, and overall job health.
Resource shortage diagnostic guide for Volcano. Use when seeing Insufficient cpu/memory events, OOMKilled pods, or nodes with zero allocatable resources.
Diagnose Deployment rollout failures (stuck rollouts, ProgressDeadlineExceeded, replica mismatch). Checks rollout status, ReplicaSets, and new pod health to identify why an update is failing.
Diagnose pod crash failures (CrashLoopBackOff, OOMKilled, Error, RunContainerError). Checks pod status, events, and previous logs to identify root cause.
Diagnose pod scheduling failures (Pending, Unschedulable). Checks events, node resources, taints, affinity, and PVC bindings to identify why a pod cannot be scheduled.
Diagnose Service connectivity issues (empty Endpoints, selector mismatch, port mismatch, no backend pods). Checks Service, Endpoints, and target pods to identify why traffic is not reaching backends.
Diagnose StatefulSet rollout and scaling failures (ordered update stuck, OnDelete not updating, partition misconfiguration, PVC binding deadlocks). Checks update strategy, pod ordinal progression, PVC bindings, and ordered startup to identify why a StatefulSet is not progressing.
Query cluster node resources for Volcano scheduling. Check allocatable CPU, memory, GPU, and current usage.
View Volcano scheduler configuration. Check scheduler ConfigMap, actions, plugins, and tier settings.
Retrieve and analyze Volcano scheduler logs. Filter by keyword, time range, or pod name to debug scheduling decisions.
Structured hypothesis-driven investigation for complex infrastructure issues. Triggered ONLY by user action: /dp command, Ctrl+I, or [Deep Investigation] UI toggle.
Diagnose DNS resolution failures in the cluster (NXDOMAIN, timeouts, SERVFAIL). Checks CoreDNS health, service endpoints, and DNS configuration.
Diagnose container image pull failures (ErrImagePull / ImagePullBackOff). Checks pod status, containerd logs, and events to identify root cause.
Diagnose Ingress failures (rules not matching, backend unreachable, TLS errors, no address assigned). Checks Ingress resources, IngressClass, backend Services, and controller health to identify why external traffic is not routed correctly.
Check node health and diagnose node-level issues (NotReady, DiskPressure, MemoryPressure, PIDPressure). Inspects node conditions, resource allocation, and real-time usage.
Diagnose NetworkPolicy-related connectivity issues (traffic unexpectedly blocked, default-deny effects, egress blocking DNS). Identifies which NetworkPolicies affect a pod, checks ingress/egress rules, and verifies CNI support.
Show the gateway for a network interface on a Kubernetes node. Runs `ip -j route` on the host via node_script tool.
Ping a pod's gateway for a given network interface. Auto-detects gateway IP from the routing table, then pings it. First resolve_pod_netns, then node_script with netns param.
Show the gateway for a network interface in a Kubernetes pod. Reads the routing table via `ip -j route` from the pod's network namespace. First resolve_pod_netns, then node_script with netns param.
Diagnose PersistentVolumeClaim failures (Pending PVC, StorageClass not found, PV binding issues, capacity mismatch). Checks PVC, PV, StorageClass, and provisioner events to identify why storage is not available.
Interactive session feedback — reviews the diagnostic process with the user, identifies decision points, and saves structured feedback to improve diagnostic capabilities.
Diagnose Volcano-managed Pod scheduling issues. Checks Pod status, PodGroup, events, and Queue to identify scheduling failures.
Gang Scheduling diagnostic guide for Volcano. Use when PodGroup cannot schedule completely, member Pods remain Pending, or minAvailable/minMember constraints are not satisfied.
Diagnose Volcano Queue status and resource allocation. Check queue weights, deserved resources, allocated resources, and identify queue-related scheduling bottlenecks.
Guides the user to the Siclaw Web page to manage Skills. Use this guide when the user requests to create, edit, or view a Skill in a Channel conversation.
Ping a node's gateway for a given network interface. Auto-detects gateway IP from the routing table, then pings it. Execute via node_script tool.
Analyze cluster-wide Kubernetes events to identify issues and patterns. Aggregates Warning events, detects high-frequency patterns, and correlates related events.
Diagnose HorizontalPodAutoscaler failures (not scaling, metrics unavailable, target mismatch). Checks HPA status, metrics-server health, and scaling events to identify why autoscaling is not working.
Fuzzy-match Kubernetes nodes by keyword. Equivalent to `kubectl get nodes -o wide | grep <keyword>`, with header preserved. Use this instead of listing all nodes to keep context minimal.
Diagnose Job and CronJob failures (BackoffLimitExceeded, DeadlineExceeded, pods failing, CronJob not triggering). Checks Job status, pod logs, and CronJob schedule to identify why batch workloads are failing.