plugins/ci/skills/analyze-disruption/SKILL.md
Analyze and compare disruption across one or more Prow CI job runs by examining interval data, audit logs, pod logs, and CPU metrics
This skill analyzes disruption events recorded in Prow CI job runs. It downloads interval/timeline data, audit logs, and pod logs, then correlates disruption across backends and job runs to identify root causes.
- gcloud CLI installation: check with `which gcloud`
- The test-platform-results bucket is publicly accessible; no authentication is required
- Python 3 (3.7 or later)
The user will provide:
- One or more Prow job URLs (required, at least 1)
  - Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn/1983307151598161920
- --backends flag (optional): comma-separated list of backend names to focus on
  - Example: --backends kube-api,oauth-api,openshift-api

Extract job URLs and flags
- Extract the --backends flag if present, split on comma to get backend filter list

Parse each URL to extract bucket path, job name, and build ID
- Support both prow.ci.openshift.org and gcsweb-ci URL formats
- Extract build_id and job_name from each URL

Construct deep links for each job run. These go inline throughout the report wherever the run or a specific artifact is referenced, not in a separate table:
Run-level links (use when first mentioning a run):
- Prow: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}
- Sippy intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/{build_id}/{job_name}/intervals

GCS artifact deep links (use when citing specific evidence):
- Base ({gcs_base}): https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/{job_name}/{build_id}/artifacts/
- Timeline: {gcs_base}{target}/openshift-e2e-test/artifacts/junit/e2e-timelines_spyglass_{timestamp}.json
- Audit logs: {gcs_base}{target}/gather-extra/artifacts/audit_logs/
- etcd pod logs: {gcs_base}{target}/gather-extra/artifacts/pods/openshift-etcd/
- Journal logs: {gcs_base}{target}/gather-extra/artifacts/journal_logs/
- Must-gather: {gcs_base}{target}/gather-extra/artifacts/must-gather/

Where {target} is the ci-operator target extracted from prowjob.json (e.g., e2e-azure-ovn-upgrade).
Inline linking style: When discussing evidence, link directly to the artifact.
For example: "Run 1 ([Prow][prow1] | [Intervals][int1]) showed 11 disruptions in
the [timeline data][timeline1]..." — where [timeline1] links to the specific
e2e-timelines_spyglass_*.json file on gcsweb.
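The URL parsing and link templates above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the regular expression assumes the `/logs/{job_name}/{build_id}` layout shown in the example URL, and other layouts (for example presubmit paths) are not handled.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches prow.ci.openshift.org/view/gs/<bucket>/logs/<job>/<build_id>
# and gcsweb-ci .../gcs/<bucket>/logs/<job>/<build_id>[/...].
_RUN_RE = re.compile(
    r"/(?:view/gs|gcs)/[^/]+/logs/(?P<job_name>[^/]+)/(?P<build_id>\d+)"
)


def parse_run_url(url: str) -> Tuple[str, str]:
    """Return (job_name, build_id) for a Prow or gcsweb job run URL."""
    match = _RUN_RE.search(url)
    if match is None:
        raise ValueError(f"unrecognized job run URL: {url}")
    return match.group("job_name"), match.group("build_id")


def parse_backends(args: List[str]) -> Optional[List[str]]:
    """Return the comma-separated --backends list, or None if the flag is absent."""
    for i, arg in enumerate(args):
        if arg == "--backends" and i + 1 < len(args):
            return [b.strip() for b in args[i + 1].split(",") if b.strip()]
    return None


def run_links(job_name: str, build_id: str) -> Dict[str, str]:
    """Build the run-level links and the gcsweb artifact base for one run."""
    return {
        "prow": "https://prow.ci.openshift.org/view/gs/test-platform-results"
                f"/logs/{job_name}/{build_id}",
        "intervals": "https://sippy.dptools.openshift.org/sippy-ng/job_runs"
                     f"/{build_id}/{job_name}/intervals",
        "gcs_base": "https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs"
                    f"/test-platform-results/logs/{job_name}/{build_id}/artifacts/",
    }
```

Appending `{target}/...` to the `gcs_base` value yields the artifact deep links.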
Compute {date} as today's date in YYYY-MM-DD format (e.g., 2026-03-23).
For each job run:
mkdir -p .work/disruption-analysis/{date}/{build_id}/logs
mkdir -p .work/disruption-analysis/{date}/{build_id}/tmp
Check for existing artifacts first. If .work/disruption-analysis/{date}/{build_id}/logs/ exists with content,
ask user whether to reuse or re-download.
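A minimal sketch of the same directory setup in Python, including the check for previously downloaded artifacts. The function name and return shape are hypothetical; the caller is still expected to ask the user whether to reuse or re-download when the flag is true.

```python
from datetime import date
from pathlib import Path
from typing import Tuple


def prepare_workdir(
    build_id: str, root: Path = Path(".work/disruption-analysis")
) -> Tuple[Path, bool]:
    """Create logs/ and tmp/ for a run; report whether logs/ already had content."""
    run_dir = root / date.today().isoformat() / build_id  # {date} is YYYY-MM-DD
    logs_dir = run_dir / "logs"
    # Check before creating anything so a fresh directory is never reported as reusable.
    had_artifacts = logs_dir.is_dir() and any(logs_dir.iterdir())
    logs_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "tmp").mkdir(parents=True, exist_ok=True)
    return logs_dir, had_artifacts
```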
Use the fetch-prowjob-json skill for each job run URL.
- Save to .work/disruption-analysis/{date}/{build_id}/logs/prowjob.json
- Extract JOB_NAME from .spec.job
- Extract the --target= value from ci-operator args

For each job run:
Important GCS bucket note: Prow URLs may contain origin-ci-test in the path (e.g.,
/view/gs/origin-ci-test/logs/...), but the actual GCS bucket is always test-platform-results.
Always use gs://test-platform-results/... for gcloud storage commands.
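As an illustration of the bucket rule, the helper below rewrites any Prow or gcsweb URL into a `gs://test-platform-results/...` URI regardless of the bucket name that appears in the URL. The helper name is hypothetical, and it assumes the path after the bucket starts with `logs/`.

```python
import re

BUCKET = "test-platform-results"  # the real bucket, even if the URL says origin-ci-test


def prow_url_to_gs_uri(url: str) -> str:
    """Map a Prow/gcsweb job URL to its gs:// URI in the test-platform-results bucket."""
    match = re.search(r"/(?:view/gs|gcs)/[^/]+/(?P<path>logs/.+?)/*$", url)
    if match is None:
        raise ValueError(f"unrecognized job run URL: {url}")
    return f"gs://{BUCKET}/{match.group('path')}"
```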
Recommended approach — use gcloud storage ls then download individually:
The artifact search script and wildcard gcloud storage cp are unreliable for finding timeline
files, especially in upgrade jobs where files are nested under multiple workflow step directories.
Instead, list files first, then download each one:
# Step 1: List all timeline files in the job's artifact tree
gcloud storage ls "gs://test-platform-results/logs/{job_name}/{build_id}/artifacts/**/e2e-timelines_spyglass_*.json"
This returns the full GCS paths for each timeline file. Then download each one individually:
# Step 2: Download each file
gcloud storage cp "gs://test-platform-results/logs/{job_name}/{build_id}/artifacts/{target}/openshift-e2e-test/artifacts/junit/e2e-timelines_spyglass_{timestamp}.json" \
.work/disruption-analysis/{date}/{build_id}/logs/ --no-user-output-enabled
Timeline file locations vary by job type:
Non-upgrade jobs: Usually one timeline file at
artifacts/{target}/openshift-e2e-test/artifacts/junit/e2e-timelines_spyglass_{timestamp}.json
Upgrade jobs: Usually two timeline files (one per phase — upgrade and conformance), which
may be under different workflow step directories. The gcloud storage ls approach handles this
automatically.
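The list-then-download approach can be scripted. The sketch below only builds the `gcloud storage cp` argument lists from the text that `gcloud storage ls` prints; the function name is hypothetical, and running the commands (for example with `subprocess.run`) is left to the caller.

```python
from typing import List


def cp_commands(ls_output: str, dest_dir: str) -> List[List[str]]:
    """Turn `gcloud storage ls` output into one `gcloud storage cp` command per file."""
    commands = []
    for line in ls_output.splitlines():
        uri = line.strip()
        # Skip blank lines and anything that is not a timeline JSON object.
        if not uri.startswith("gs://") or not uri.endswith(".json"):
            continue
        commands.append(
            ["gcloud", "storage", "cp", uri, dest_dir, "--no-user-output-enabled"]
        )
    return commands
```

For an upgrade job this typically yields two commands, one per timeline file.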
Fallback — artifact search script:
If gcloud storage ls doesn't find files, try the artifact search script:
python3 plugins/ci/skills/prow-job-artifact-search/prow_job_artifact_search.py \
<prow-url> search "**/e2e-timelines_spyglass_*.json"
Note: The GCS URIs returned by this script may not always be directly downloadable with
gcloud storage cp. If downloads fail, extract the path components and construct the
gs://test-platform-results/... URI manually, or use gcloud storage ls to verify the
actual file locations.
Use the included parse_disruption.py script to extract and classify disruption events:
python3 plugins/ci/skills/analyze-disruption/parse_disruption.py \
.work/disruption-analysis/{date}/{build_id}/logs/e2e-timelines_spyglass_*.json \
--backends {backend_filter} \
--window 60 \
--format text
Use --format json when you need structured data for further analysis. Omit --backends to
analyze all disrupted backends.
The script automatically:
- Reports the job phase in a summary (phase_breakdown) and on each disruption event.
- Extracts concurrent events around each disruption (within --window seconds)

Review the parser output and use it as the foundation for the analysis. The parser handles Steps 4.2 through 4.6 below.
Each item in the timeline JSON has this structure:
{
"level": "Error",
"source": "Disruption",
"locator": {
"type": "Disruption",
"keys": {
"backend-disruption-name": "host-to-host-new-connections",
"connection": "new",
"disruption": "host-to-host-from-node-...-worker-X-to-node-...-master-0-endpoint-10.0.0.5"
}
},
"message": {
"reason": "DisruptionBegan",
"humanMessage": "... stopped responding to GET requests over new connections",
"annotations": { "reason": "DisruptionBegan" }
},
"from": "2026-03-21T21:50:24Z",
"to": "2026-03-21T21:50:26Z"
}
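For direct inspection, items of this shape can be filtered with a short script. This is a sketch, not the logic of parse_disruption.py: it assumes the file is either a bare list of items or an object with a top-level `items` list, and it treats the backend filter as a substring match.

```python
import json
from datetime import datetime
from typing import Any, Dict, Iterable, List, Optional


def parse_ts(ts: str) -> datetime:
    """Parse timestamps such as 2026-03-21T21:50:24Z into aware datetimes."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def load_items(path: str) -> List[Dict[str, Any]]:
    with open(path) as fh:
        data = json.load(fh)
    # Assumption: items sit under "items"; fall back to a bare list.
    return data.get("items", []) if isinstance(data, dict) else data


def disruption_intervals(
    items: Iterable[Dict[str, Any]], backends: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Return disruption intervals sorted by start time, with durations in seconds."""
    out = []
    for item in items:
        if item.get("source") != "Disruption":
            continue
        if item.get("level") not in ("Error", "Warning"):
            continue
        keys = item.get("locator", {}).get("keys", {})
        backend = keys.get("backend-disruption-name", "")
        if backends and not any(b in backend for b in backends):
            continue
        start, end = parse_ts(item["from"]), parse_ts(item["to"])
        out.append({
            "backend": backend,
            "connection": keys.get("connection"),
            "from": start,
            "to": end,
            "seconds": (end - start).total_seconds(),
        })
    return sorted(out, key=lambda d: d["from"])
```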
Key fields:
- source: Event category. "Disruption" for disruption events. Other useful sources:
  OVSVswitchdLog, CPUMonitor, CloudMetrics, EtcdLog, EtcdDiskCommitDuration,
  EtcdDiskWalFsyncDuration, AuditLog, Alert, NodeMonitor, MachineMonitor,
  ClusterVersion, ClusterOperator, E2ETest, KubeletLog
- level: "Error", "Warning", "Info". Disruption events are Error or Warning.
- locator.keys.backend-disruption-name: The backend being monitored
- locator.keys.disruption: For host-to-host backends, encodes source node, target node,
  and endpoint IP in the format host-to-host-from-node-{src}-to-node-{dst}-endpoint-{ip}
- locator.keys.connection: "new" or "reused"
- message.reason: "DisruptionBegan" or "DisruptionEnded"
- message.humanMessage: Human-readable description with error details

The parser classifies backends automatically. For reference:
- cache → likely etcd or global networking problem

Key diagnostic pattern: When all 4 variants of a backend fail simultaneously (e.g.,
openshift-api-new-connections, openshift-api-reused-connections, cache-openshift-api-new-connections,
cache-openshift-api-reused-connections), the root cause is almost always control plane node
resource exhaustion (disk I/O → etcd stalls → apiserver timeouts), not a networking issue.
Look for etcd slow fdatasync, apply took too long, and ExtremelyHighIndividualControlPlaneCPU
alerts as confirming evidence.
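The four-variant check can be expressed compactly. In this sketch the variant names follow the example above, intervals are `(backend, start, end)` tuples, and "simultaneously" is interpreted as "overlapping the same window", which is an assumption rather than the parser's definition.

```python
from typing import Any, Iterable, Set, Tuple

Interval = Tuple[str, Any, Any]  # (backend name, start, end); start/end must be comparable


def variant_names(base: str) -> Set[str]:
    """The four variants of one backend: cache/non-cache x new/reused connections."""
    return {
        f"{base}-new-connections",
        f"{base}-reused-connections",
        f"cache-{base}-new-connections",
        f"cache-{base}-reused-connections",
    }


def all_variants_disrupted(
    intervals: Iterable[Interval], base: str, start: Any, end: Any
) -> bool:
    """True if every variant of `base` has a disruption overlapping [start, end]."""
    hit = {name for name, s, e in intervals if s <= end and e >= start}
    return variant_names(base) <= hit
```

A `True` result is a prompt to look for the etcd and CPU signals listed above, not a root cause on its own.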
The parser detects source-node patterns automatically. Key patterns:
When a single-source-fan-out pattern is detected, focus the investigation on that specific node: check its CPU, disk I/O, OVS vswitchd logs, and whether it was running heavy workloads.
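To check for a single-source fan-out by hand, the `locator.keys.disruption` value can be split into source node, target node, and endpoint. The rule used below (exactly one source node reaching at least `min_targets` distinct targets) is a hypothetical stand-in for the parser's own detection.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

_H2H = re.compile(
    r"^host-to-host-from-node-(?P<src>.+?)-to-node-(?P<dst>.+)-endpoint-(?P<ip>.+)$"
)


def parse_host_to_host(value: str) -> Optional[Dict[str, str]]:
    """Split a host-to-host disruption key into src, dst, and ip."""
    match = _H2H.match(value)
    return match.groupdict() if match else None


def single_source_fan_out(values: Iterable[str], min_targets: int = 2) -> Optional[str]:
    """Return the source node if one node alone accounts for all host-to-host events."""
    targets = defaultdict(set)
    for value in values:
        parsed = parse_host_to_host(value)
        if parsed:
            targets[parsed["src"]].add(parsed["dst"])
    if len(targets) == 1:
        src, dsts = next(iter(targets.items()))
        if len(dsts) >= min_targets:
            return src
    return None
```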
The parser extracts concurrent events from these sources within the disruption window:
| Source | What it tells you |
|--------|-------------------|
| OVSVswitchdLog | OVS packet processing stalls (poll intervals >500ms = networking frozen) |
| CPUMonitor | Nodes with CPU >95% (starves OVS and other system processes) |
| CloudMetrics | Azure disk IOPS saturation, queue depth, bandwidth (disk I/O pressure) |
| EtcdLog | apply took too long, slow fdatasync, ReadIndex delays |
| EtcdDiskCommitDuration | etcd disk commit above 25ms threshold |
| EtcdDiskWalFsyncDuration | etcd WAL fsync above 10ms threshold |
| AuditLog | API request failures during disruption |
| Alert | Firing Prometheus alerts (ExtremelyHighIndividualControlPlaneCPU, etc.) |
| NodeMonitor / MachineMonitor | Node NotReady, machine phase changes |
| ClusterVersion / ClusterOperator | Upgrade progress, operator status |
| E2ETest | Active test phase (upgrade vs post-upgrade e2e tests) |
If the parser output is insufficient for a particular signal, you can query the timeline JSON directly for deeper investigation.
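A direct query might look like the following sketch. It assumes items carry `source`, `from`, and `to` as in the structure shown earlier, and it widens the disruption window by `window_seconds` on both sides, which may differ from how the parser applies `--window`.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, List, Set


def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def concurrent_events(
    items: Iterable[Dict[str, Any]],
    start: datetime,
    end: datetime,
    sources: Set[str],
    window_seconds: int = 60,
) -> List[Dict[str, Any]]:
    """Return items from `sources` that overlap the padded disruption window."""
    lo = start - timedelta(seconds=window_seconds)
    hi = end + timedelta(seconds=window_seconds)
    hits = []
    for item in items:
        if item.get("source") not in sources:
            continue
        try:
            s, e = parse_ts(item["from"]), parse_ts(item["to"])
        except (KeyError, ValueError, AttributeError, TypeError):
            continue  # skip items with missing or unparseable timestamps
        if s <= hi and e >= lo:
            hits.append(item)
    return hits
```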
The parser output from Step 4.2 already includes audit log events, etcd events, CPU warnings, OVS stalls, and cloud metrics extracted from the timeline files. For most analyses, this is sufficient — the timeline files aggregate the same data that would be found in separate artifact downloads.
Review the parser's concurrent_events and key_signals sections and assess:
The timeline files contain AuditLog entries showing request failures during disruption windows
(e.g., "1 requests made during this time failed out of 611 total").
For kube-api, oauth-api, and openshift-api disruption, check whether:
The timeline files contain EtcdLog, EtcdDiskCommitDuration, and EtcdDiskWalFsyncDuration entries.
Key messages to look for:
- "apply request took too long": etcd under write pressure
- "slow fdatasync": disk I/O bottleneck
- "waiting for ReadIndex response took too long": etcd read latency

The timeline files contain CPUMonitor (>95% threshold) and CloudMetrics (Azure disk IOPS,
queue depth, bandwidth, latency) entries.
Key patterns:
OVSVswitchdLog entries report "Unreasonably long poll interval" warnings when OVS cannot
process packets. Poll intervals >1000ms mean OVS was essentially frozen — no packets forwarded.
This is the most direct cause of host-to-host and pod-to-host disruption.
Query the timeline files for E2ETest source items that overlap the disruption window. The test
name is in locator.keys.e2e-test. For each test active during disruption, note:
- Whether it passed (level: "Info") or failed (level: "Error")

For multi-run analysis: Cross-reference tests active during disruption across runs. Tests appearing in 3+ runs during the disruption window are especially interesting because they may be triggering the resource pressure that causes disruption (e.g., tests that create many resources, run heavy workloads, or cause pod evictions). Include a table of correlated tests in the report with pass/fail status per run.
Note: Tests that fail during the disruption window are usually victims of the disruption, not causes. Tests that pass but consistently appear during disruption across runs are more likely to be contributing to the resource pressure that triggers it.
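The cross-run comparison can be sketched as below. The input shape is hypothetical: for each run, a mapping from test name (taken from `locator.keys.e2e-test`) to the item's `level`, restricted to tests that overlapped the disruption window.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

_STATUS = {"Info": "pass", "Error": "fail"}


def correlated_tests(
    tests_by_run: Dict[str, Dict[str, str]], min_runs: int = 3
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (test name, {run: status}) for tests seen in at least `min_runs` runs."""
    seen = defaultdict(dict)
    for run_id, tests in tests_by_run.items():
        for name, level in tests.items():
            # Unknown levels are passed through unchanged.
            seen[name][run_id] = _STATUS.get(level, level)
    rows = [(name, runs) for name, runs in seen.items() if len(runs) >= min_runs]
    # Most widely shared tests first, then alphabetical for stable output.
    return sorted(rows, key=lambda row: (-len(row[1]), row[0]))
```

Each returned row maps directly onto one line of the correlated-tests table, with one pass/fail column per run.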
Only perform this step if the parser output from Step 4.2 is insufficient for root cause determination — for example, when you need to see the full audit log request details or etcd log context beyond what the timeline summaries provide.
gcloud storage cp -r "gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-extra/artifacts/audit_logs/" \
.work/disruption-analysis/{date}/{build_id}/logs/audit_logs/ --no-user-output-enabled 2>/dev/null || true
Query for sampler requests during disruption windows to identify request gaps.
gcloud storage cp -r "gs://test-platform-results/{bucket-path}/artifacts/{target}/gather-extra/artifacts/pods/openshift-etcd/" \
.work/disruption-analysis/{date}/{build_id}/logs/etcd-pods/ --no-user-output-enabled 2>/dev/null || true
Search for leader changes, write delays, member issues, and disk problems.
If the analysis needs live cluster metrics (not available in artifacts), provide these queries:
# Top CPU consumers across all nodes
topk(25, sum by (namespace) (rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m])))
# CPU on a specific node
topk(25, sum by (namespace) (rate(container_cpu_usage_seconds_total{container!="",pod!="",node="<node-name>"}[5m])))
# E2E test CPU on a specific node
topk(10, sum by (namespace) (rate(container_cpu_usage_seconds_total{container!="",pod!="",node="<node-name>",namespace=~"^e2e-.*"}[5m])))
If disruption coincides with node events, check:
- Was readyz=false reported as expected when the node was shutting down?

Look for these signals in interval files and node-related logs.
Check audit logs for endpoint slice modification events during disruption windows:
- Filter on endpointslices resources

When multiple job run URLs are provided:
For each backend that shows disruption across multiple runs:
Look for common patterns:
Produce a structured Markdown report with inline deep links throughout. Links go where the evidence is discussed, not in a separate section at the end. Use Markdown reference-style links to keep the text readable.
- Run references: add ([Prow]({prow_url}) | [Intervals]({sippy_url})) after the build ID or run number
- [timeline data]({gcsweb_timeline_url}) when discussing disruption events
- [audit logs]({gcsweb_audit_url}) when discussing request gaps
- [etcd pod logs]({gcsweb_etcd_url}) when discussing etcd pressure
- [OVS vswitchd logs]({gcsweb_journal_url}) when discussing OVS stalls
- Use full job names (periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade, not periodic-ci-...-e2e-azure-ovn-upgrade)

For single run:
# Disruption Analysis
## Job Information
- **Prow Job**: [{job-name}]({prow_url})
- **Build ID**: {build_id}
- **Target**: {target}
- **Sippy Intervals**: [View intervals]({sippy_intervals_url})
## Disruption Summary
{Disruption count, backend classification, network-liveness assessment}
## Disruption Timeline
- **{from} — {to}** ({duration}s): {message}
- Concurrent activity from [timeline]({gcsweb_timeline_url}): {events}
- [Audit logs]({gcsweb_audit_url}): {gap analysis}
## Cluster Activity Correlation
{Reference specific artifacts inline, e.g.:}
The [timeline data]({gcsweb_timeline_url}) shows OVS vswitchd poll intervals up to 9s...
[etcd pod logs]({gcsweb_etcd_url}) confirm apply-too-long warnings at 03:56:00Z...
## Root Cause Hypothesis
{Analysis with inline links to supporting evidence}
## Other Disrupted Backends
{When --backends filter was used, list other backends that were disrupted during the
same time window and due to the same root cause. This helps readers understand the full
blast radius — e.g., if openshift-api was requested but kube-api, oauth-api, and
metrics-api were also disrupted simultaneously, that confirms a control plane problem
rather than an openshift-api-specific issue. Only include backends whose disruption
overlaps the same window; exclude unrelated disruption at other times.}
For multiple runs — use the same inline linking pattern:
# Disruption Analysis: {backend_names}
## Runs Analyzed
| # | Build ID | Job | Disrupted Backends | Network Liveness |
|---|----------|-----|-------------------|------------------|
| 1 | {build_id_1} ([Prow]({prow_url}) \| [Intervals]({sippy_url})) | {job} | {backends} | {status} |
| 2 | {build_id_2} ([Prow]({prow_url}) \| [Intervals]({sippy_url})) | {job} | {backends} | {status} |
## Disruption Events
### Run 1 ({build_id_1})
Phase: {upgrade|conformance} — Disruption details with [timeline]({gcsweb_timeline_url}) links
## Cluster Activity Correlation
Run 1 [timeline]({gcsweb_timeline_url_1}) shows OVS stalls at 21:50:24Z...
Run 2 [timeline]({gcsweb_timeline_url_2}) shows disk IOPS at 100% ([cloud metrics]({gcsweb_timeline_url_2}))...
## Cross-Run Comparison
{Pattern analysis referencing specific runs with inline links}
## Root Cause Hypothesis
{Synthesis with links to key evidence}
## Other Disrupted Backends
{When --backends filter was used, list other backends that were disrupted during the
same time window and due to the same root cause — not all disruption in the run, just
what overlaps the identified disruption event. Show a consolidated table with backend
name, type (cache/non-cache/cloud/canary), and how many runs (out of N) showed that
backend disrupted in the same window. Sort by runs-affected descending, then by count.
This reveals the full blast radius and helps confirm root cause — e.g., if every API
backend fails together, the problem is control-plane-wide, not backend-specific.}
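The consolidated table for the multi-run "Other Disrupted Backends" section could be assembled as in this sketch. The input shape is hypothetical (per run, a mapping of backend name to the number of disruption events that overlap the identified window), "then by count" is read as total count descending, and the backend type column is left out because its classification rules are not spelled out here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def blast_radius(
    counts_by_run: Dict[str, Dict[str, int]]
) -> List[Tuple[str, int, int]]:
    """Return (backend, runs_affected, total_events), most widely affected first."""
    runs = defaultdict(int)
    total = defaultdict(int)
    for per_backend in counts_by_run.values():
        for backend, count in per_backend.items():
            if count > 0:
                runs[backend] += 1
                total[backend] += count
    return sorted(
        ((b, runs[b], total[b]) for b in runs),
        key=lambda row: (-row[1], -row[2], row[0]),
    )
```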
Save the report using a filename that references the backends being analyzed:
.work/disruption-analysis/{date}/{backend_names}-analysis.md

Where {backend_names} is a kebab-case join of the disrupted backend base names (e.g.,
image-registry-new-connections-analysis.md or kube-api-oauth-api-analysis.md).
If all backends are analyzed (no --backends filter), use the backends that actually showed
disruption. If the resulting filename would be excessively long (more than 5 backends),
truncate to the first 5 and append -and-more (e.g., kube-api-oauth-api-openshift-api-cache-oauth-api-cache-openshift-api-and-more-analysis.md).
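The naming rule is easy to get subtly wrong, so here is a small sketch of it. The function name is hypothetical, and the input is assumed to be backend names already in kebab-case.

```python
from typing import Iterable


def report_filename(backends: Iterable[str], max_backends: int = 5) -> str:
    """Join backend names with '-', truncating to `max_backends` plus 'and-more'."""
    names = list(dict.fromkeys(backends))  # drop duplicates, keep first-seen order
    parts = names[:max_backends]
    if len(names) > max_backends:
        parts.append("and-more")
    return "-".join(parts) + "-analysis.md"
```

With six or more backends, only the first five appear in the name, matching the example above.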
No disruption found — If interval files show no disruption events, report that the run is clean and no disruption was detected. This is a valid result, not an error.
Audit logs not available — Some jobs may not have audit logs. Note this in the report and continue analysis with available data.
etcd logs not available — If etcd pod logs are not present in gather-extra, note this and skip etcd analysis.
Interval files not found — If no interval/timeline files are found for a job run, this is a critical error for that run. Report it and skip that run if analyzing multiple runs.
gcloud errors — When gcloud storage commands fail, log the error, report which artifacts could not be downloaded, and continue analysis with the remaining available data.
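For the optional artifacts (audit logs, etcd pod logs), the "log, report, continue" behavior might be wrapped as below. The helper is hypothetical and works for any command line; missing timeline files should still be treated as a critical error for that run rather than routed through it.

```python
import subprocess
from typing import List, Sequence, Tuple


def run_optional(commands: Sequence[Tuple[str, List[str]]]) -> List[str]:
    """Run each (label, argv) pair; return the labels of the steps that failed."""
    failed = []
    for label, argv in commands:
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
        except FileNotFoundError as exc:
            # The executable itself is missing (for example gcloud is not installed).
            print(f"[warn] {label}: {exc}")
            failed.append(label)
            continue
        if result.returncode != 0:
            print(f"[warn] {label} failed: {result.stderr.strip()}")
            failed.append(label)
    return failed
```

The returned labels can be listed in the report as artifacts that could not be downloaded.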
- Use --max-bytes limits when fetching large log files to avoid excessive downloads