plugins/ci/skills/prow-job-analyze-test-failure/SKILL.md
Analyze failed Prow CI tests by inspecting test code, downloading artifacts, and optionally integrating must-gather cluster diagnostics for root cause analysis
npx skillsauth add openshift-eng/ai-helpers prow-job-analyze-test-failureInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill analyzes test failures by downloading Prow CI artifacts, checking test logs, inspecting resources and events, analyzing test source code, and optionally integrating cluster diagnostics from must-gather data.
Identical with "prow-job-analyze-resource" skill.
The user will provide:
Prow job URL - gcsweb URL containing test-platform-results/
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hypershift/6731/pull-ci-openshift-hypershift-main-e2e-aws/1962527613477982208Test name (optional) - specific test name that failed
TestKarpenter/EnsureHostedCluster/ValidateMetricsAreExposedTestCreateClusterCustomConfigThe openshift-console downloads pods [apigroup:console.openshift.io] should be scheduled on different nodesOptional flags (optional):
--fast - Skip must-gather extraction and analysis; continue with log/JUnit/step-artifact analysis only (scope is preserved: test-level when test-name is provided, step-level across all failed CI steps when omitted)Use the "Parse and Validate URL" steps from "prow-job-analyze-resource" skill
Check for existing artifacts first
.work/prow-job-analyze-test-failure/{build_id}/logs/ directory exists and has contentrm -rf .work/prow-job-analyze-test-failure/{build_id}/logs/rm -rf .work/prow-job-analyze-test-failure/{build_id}/tmp/Create directory structure
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/logs
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/tmp
.work/prow-job-analyze-test-failure/ as the base directory (already in .gitignore)logs/ subdirectory for all downloadstmp/ subdirectory for temporary files (intermediate JSON, etc.).work/prow-job-analyze-test-failure/{build_id}/Use the fetch-prowjob-json skill to fetch the prowjob.json for this job. See plugins/ci/skills/fetch-prowjob-json/SKILL.md for complete implementation details.
fetch-prowjob-json skill).work/prow-job-analyze-test-failure/{build_id}/logs/prowjob.json--target=([a-zA-Z0-9-]+) in the ci-operator argse2e-aws-ovn) from the --target= argJOB_NAME from .spec.job in prowjob.json (the artifact directory key):
JOB_NAME=$(jq -r '.spec.job' .work/prow-job-analyze-test-failure/{build_id}/logs/prowjob.json)
{target} and {JOB_NAME} often differ. All artifact-path lookups
(JUnit XML, step logs, step artifacts) must use {JOB_NAME}, not {target}Aggregated jobs run the same job in parallel (typically 10 times) and perform statistical
analysis of test results. Detect aggregation by checking for aggregated- prefix in the job
name or an aggregator container/step in prowjob.json.
If the job is aggregated, the failure modes are different from normal jobs:
Statistically significant test failure — The test itself fails frequently enough across runs to be flagged as a regression. This is a real regression that needs investigation.
Insufficient completed runs — Not enough runs completed successfully to perform the statistical test (e.g., only 5 of 10 jobs produced results). This manifests as mass test failures across many unrelated tests. The root cause is whatever prevented the other runs from completing — this could be infrastructure issues, install failures, or an actual product bug causing crashes. Investigate the underlying job runs that did not complete to determine why.
Non-deterministic test presence — A test only ran in a small subset of completed jobs, even though the other jobs completed successfully. The failure message will say something like "Passed 1 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success". Every test must produce results in every job run; it is a bug if it does not. This is a regression — someone introduced a test that doesn't produce results deterministically. Investigate which test is non-deterministic and why it only runs in some jobs.
When analyzing an aggregated job failure, first determine which failure mode applies before diving into individual test analysis. For mode 2, focus on why runs failed rather than individual test results. For mode 3, investigate why the test only ran in some jobs.
Finding underlying job run URLs:
The aggregated junit XML contains links to every underlying job run. Download it from:
gs://test-platform-results/{bucket-path}/artifacts/release-analysis-aggregator/openshift-release-analysis-aggregator/artifacts/release-analysis-aggregator/{job-name}/{payload-tag}/junit-aggregated.xml
Each <testcase> element has a <system-out> with YAML-formatted data including passes:,
failures:, and skips: lists. Each entry has:
jobrunid: The build ID of the underlying job runhumanurl: Prow URL for the job run (e.g., https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job-name}/{jobrunid})gcsartifacturl: Direct link to GCS artifactsUse the humanurl links to investigate individual job run failures with the normal
(non-aggregated) analysis steps below.
gcloud storage cp gs://test-platform-results/{bucket-path}/build-log.txt .work/prow-job-analyze-test-failure/{build_id}/logs/build-log.txt --no-user-output-enabled
.work/prow-job-analyze-test-failure/{build_id}/logs/build-log.txtBranch on whether {test_name} was provided:
This is the common case for multi-step CI workflows (e.g., HyperShift, bare-metal, OADP jobs) where failures occur in CI steps rather than named unit tests.
Discover failed CI steps from JUnit XML
Use {JOB_NAME} (from Step 3, extracted from .spec.job) for all GCS artifact paths —
not {target}. On PR jobs these values differ and {target} will miss the artifacts.
Search for JUnit XML files under the artifacts directory:
gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/{JOB_NAME}/**/junit*.xml" 2>/dev/null
Download all found JUnit XML files:
gcloud storage cp "gs://test-platform-results/{bucket-path}/artifacts/{JOB_NAME}/**/junit*.xml" \
.work/prow-job-analyze-test-failure/{build_id}/logs/ --no-user-output-enabled --recursive 2>/dev/null || true
Parse each downloaded XML file and collect every <testcase> element where:
<failure> or <error> child element is present, OR<testcase> has attribute status="failed"For each failed testcase record:
step_name: value of classname or name attribute (whichever identifies the CI step)failure_message: text content of the <failure> or <error> elementjunit_file: path of the XML file it came fromClassify each failed step by phase
The ci-operator JUnit XML (junit_operator.xml) includes phase-level testcases:
"Run multi-stage test pre phase" — setup/installation steps"Run multi-stage test test phase" — functional test steps"Run multi-stage test post phase" — gather/cleanup stepsUse the phase entries to classify each failed step_name:
"pre phase", "test phase",
or "post phase" in their name) — these tell you which phase failedpre phase → installation/setup stepstest phase → functional test stepspost phase → gather/cleanup steps (rarely the root cause)If the phase-level testcases are absent from the JUnit XML (some jobs omit them),
fall back to the ci-operator-step-graph.json artifact, which lists all steps with
their dependencies and timing — use execution order and naming conventions as a
secondary signal. When still ambiguous, prefer classifying as a test step so
must-gather analysis is not skipped.
Route each failed step to the appropriate analysis path
pre phase steps (installation) → invoke the ci:analyze-prow-job-install-failure
skill for each such step, passing the same Prow job URL. Must-gather is not
attempted for pre phase failures: must-gather requires a live apiserver, and when
the installation phase fails the cluster is not fully up, so collection will fail.
Capture the delegated output and store it
as a per-step entry in the shared results collection used by Step 5 and Step 5.5:
{
step_name: {step_name},
type: "installation",
summary: {one-line summary from ci:analyze-prow-job-install-failure},
evidence: {key error / stack trace from delegated analysis},
recommendation: {recommended action from delegated analysis}
}
test phase steps (functional tests) → proceed with the iteration in item 4
below (download step log, artifacts, extract stack trace) and include must-gather
analysis (Steps 4.4–4.9). Store findings in the same per-step results collection:
{
step_name: {step_name},
type: "test",
summary: {one-line failure summary},
evidence: {stack trace / error output},
recommendation: {recommended action}
}
post phase steps (gather/cleanup) → treat as informational. Download the step
log and note the failure in the report, but do not perform must-gather analysis for
post-phase failures alone (they are usually a consequence of earlier failures).Iterate over each test phase step (pre and post phase steps already routed above)
For each step_name in the test phase, perform the following sub-steps:
a. Download step-specific log
gcloud storage cp \
"gs://test-platform-results/{bucket-path}/artifacts/{JOB_NAME}/{step_name}/build-log.txt" \
.work/prow-job-analyze-test-failure/{build_id}/logs/{step_name}-build-log.txt \
--no-user-output-enabled 2>/dev/null || true
If the step log is not found at that path, fall back to scanning build-log.txt for
lines mentioning {step_name} and collect surrounding context (±50 lines).
b. Download step-specific artifacts
gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/{JOB_NAME}/{step_name}/" \
2>/dev/null
Download any relevant artifacts (e.g., *.json, *.yaml, events*.txt) to
.work/prow-job-analyze-test-failure/{build_id}/logs/{step_name}/.
c. Extract stack trace and context
FAIL, Error:, fatal, or assertion failuresd. Store per-step findings Record for each step:
step_namefailure_message (from JUnit XML)stack_trace (from step log or build-log.txt context)artifacts (list of downloaded artifact paths)Produce a multi-step failure summary
After iterating all failed steps, produce an ordered list of findings:
Failed Steps Discovered:
1. {step_name_1}: {one-line failure summary}
2. {step_name_2}: {one-line failure summary}
...
This list is used in Step 4.4 (evidence gathering) and Step 5 (report) to drive per-step analysis rather than singular test analysis.
gcloud storage ls 'gs://test-platform-results/{bucket-path}/**/e2e-timelines_spyglass_*json'
gcloud storage cp gs://test-platform-results/{bucket-path}/**/e2e-timelines_spyglass_*.json .work/prow-job-analyze-test-failure/{build_id}/logs/ --no-user-output-enabled
source = "E2ETest" and message.annotations.status = "Failed"from and to timestamps on this interval - this indicates when the test was runninglevel = "Error" or level = "Warning"source = "OperatorState"The CI system may attach symptom labels to job runs — machine-detected patterns (e.g., "test failures during high CPU events") stored as JSON artifacts. These are not root causes but provide useful environmental context that may help explain failures when no other cause is found.
List the job_labels directory
gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/job_labels/" 2>/dev/null
Download any JSON symptom files (exclude label-summary.html)
gcloud storage cp "gs://test-platform-results/{bucket-path}/artifacts/job_labels/*.json" \
.work/prow-job-analyze-test-failure/{build_id}/logs/job_labels/ --no-user-output-enabled 2>/dev/null || true
Parse symptom labels — each JSON file describes a detected symptom with a summary and explanation. Collect all symptom summaries for inclusion in the report.
Use symptoms as investigative context — symptoms are environmental observations, NOT definitive causes. They should inform your investigation (e.g., if 2 tests failed while high CPU was measured, CPU pressure could explain the failures if no other cause is found) but you must still perform thorough root cause analysis. Include them in the "Known Symptoms Seen" section of the report.
.work/prow-job-analyze-test-failure/{build_id}/tmp.work/prow-job-analyze-test-failure/{build_id}/logs/openshift/release repository. This is part of evidence gathering — identifying the responsible PR early informs the rest of the analysis. Include the responsible PR in your evidence.Be tenacious. When the build log or test output mentions crash-looping pods, container restarts, or deployments not becoming ready, you MUST trace the failure to its root cause. Never stop at "containers are crash-looping" — find out why.
Pursue all available log sources:
containerStatuses (exitCode, lastState.terminated.reason, restartCount), container logs (current and previous), events, and operator conditions. This is often the richest source for diagnosing crash-looping pods.gather-must-gather (e.g., gather-extra, gather-audit-logs) and download relevant logs.Follow the dependency chain. Always trace upstream to the originating error rather than stopping at the first symptom.
Parse optional flags
--fast flag--fast flag present:
test_name was provided → produce singular test report (Branch A)test_name was omitted → produce multi-step report (Branch B)Extract actual test name from prowjob.json
The artifacts directory uses the test name from prowjob.json, NOT the full URL path.
# Extract test name from prowjob.json (e.g., "e2e-aws-operator-serial-ote")
JOB_NAME=$(jq -r '.spec.job' .work/prow-job-analyze-test-failure/{build_id}/logs/prowjob.json)
# Note: For PR jobs, TARGET contains the full PR path like:
# pr-logs/pull/openshift_service-ca-operator/306/pull-ci-openshift-service-ca-operator-main-e2e-aws-operator-serial-ote
# But artifacts are stored under just the test name:
# e2e-aws-operator-serial-ote
Detect must-gather archive (only if --fast not present)
Use JOB_NAME (not TARGET) for artifact paths.
HyperShift jobs may have different must-gather patterns:
Pattern 1: Unified Archive (dump-management-cluster)
Pattern 2: Dual Archives (gather-must-gather + dump)
Pattern 3: Standard Only (gather-must-gather)
Detection logic:
# Check for Pattern 1: Unified archive (dump-management-cluster)
UNIFIED_DUMP=$(gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/$JOB_NAME/dump-management-cluster/artifacts/artifacts.tar*" 2>/dev/null | head -1 || true)
# Check for Pattern 2/3: Standard must-gather
STANDARD_MG=$(gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/$JOB_NAME/gather-must-gather/artifacts/must-gather.tar" 2>/dev/null || true)
# Check for Pattern 2: Additional hypershift-dump (multiple possible locations)
# Use wildcards to match all current and future HyperShift dump patterns:
# 1. **/artifacts/hypershift-dump.tar (covers dump/, hypershift-mce-dump/, etc.)
# 2. **/artifacts/**/hostedcluster.tar (covers all E2E test patterns)
HYPERSHIFT_DUMP=""
for pattern in \
"**/artifacts/hypershift-dump.tar" \
"**/artifacts/**/hostedcluster.tar"; do
FOUND=$(gcloud storage ls "gs://test-platform-results/{bucket-path}/artifacts/$JOB_NAME/$pattern" 2>/dev/null | head -1 || true)
if [ -n "$FOUND" ]; then
HYPERSHIFT_DUMP="$FOUND"
break
fi
done
# Determine pattern and check for hosted cluster data
if [ -n "$UNIFIED_DUMP" ]; then
# Pattern 1: Unified archive
PATTERN="unified"
# Download temporarily to check for hosted cluster data
TMP_CHECK="/tmp/check-unified-$$.tar"
gcloud storage cp "$UNIFIED_DUMP" "$TMP_CHECK" --no-user-output-enabled
# Check if archive contains hostedcluster-* directory
HAS_HOSTED_CLUSTER=$(tar -tf "$TMP_CHECK" 2>/dev/null | grep -q "hostedcluster-" && echo "true" || echo "false")
rm -f "$TMP_CHECK"
elif [ -n "$STANDARD_MG" ] && [ -n "$HYPERSHIFT_DUMP" ]; then
# Pattern 2: Dual archives
PATTERN="dual"
# Download hypershift-dump temporarily to check for hosted cluster
TMP_CHECK="/tmp/check-dump-$$.tar"
gcloud storage cp "$HYPERSHIFT_DUMP" "$TMP_CHECK" --no-user-output-enabled
# Check if hypershift-dump contains hostedcluster-* directory
HAS_HOSTED_CLUSTER=$(tar -tf "$TMP_CHECK" 2>/dev/null | grep -q "hostedcluster-" && echo "true" || echo "false")
rm -f "$TMP_CHECK"
elif [ -n "$STANDARD_MG" ]; then
# Pattern 3: Standard must-gather only
PATTERN="standard"
HAS_HOSTED_CLUSTER=false
else
# No must-gather found
PATTERN="none"
HAS_HOSTED_CLUSTER=false
fi
Possible outcomes:
Important Pattern Notes:
logs/artifacts/output/, hosted at logs/artifacts/output/hostedcluster-{name}/Ask user if they want must-gather analysis
Only if user chose "Yes" in Step 4.5:
Determine extraction strategy
must-gather/logs/must-gather-mgmt/logs/must-gather-hosted/logs/Check for existing extraction
For single must-gather:
.work/prow-job-analyze-test-failure/{build_id}/must-gather/logs/ exists with contentFor dual must-gather (HyperShift):
.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs/ exists with content.work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs/ exists with contentIf either exists:
rm -rf .work/prow-job-analyze-test-failure/{build_id}/must-gather*/Create must-gather directories
Based on PATTERN from Step 4.5.3:
For Pattern 3 (standard only):
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather/logs
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather/tmp
For Pattern 1 (unified) or Pattern 2 (dual):
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs
mkdir -p .work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/tmp
Download must-gather archives
Use JOB_NAME (from Step 4.5.2) for artifact paths, not {target}:
For Pattern 3 (standard only):
gcloud storage cp "$STANDARD_MG" \
.work/prow-job-analyze-test-failure/{build_id}/must-gather/tmp/must-gather.tar \
--no-user-output-enabled
For Pattern 1 (unified):
# Download unified archive (contains both management and hosted cluster data)
gcloud storage cp "$UNIFIED_DUMP" \
.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/unified-dump.tar \
--no-user-output-enabled
For Pattern 2 (dual):
# Download management cluster must-gather
gcloud storage cp "$STANDARD_MG" \
.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/must-gather.tar \
--no-user-output-enabled
# Download hypershift-dump (may or may not contain hosted cluster)
gcloud storage cp "$HYPERSHIFT_DUMP" \
.work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/tmp/hypershift-dump.tar \
--no-user-output-enabled
Extract archives
For Pattern 3 (standard only):
# Use existing extract_archives.py script for standard must-gather
python3 plugins/ci/skills/prow-job-extract-must-gather/extract_archives.py \
.work/prow-job-analyze-test-failure/{build_id}/must-gather/tmp/must-gather.tar \
.work/prow-job-analyze-test-failure/{build_id}/must-gather/logs
For Pattern 1 (unified):
# Extract unified archive to temporary location
TMP_EXTRACT=".work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/extracted"
mkdir -p "$TMP_EXTRACT"
# Handle both .tar and .tar.gz
if [[ "$UNIFIED_DUMP" == *.tar.gz ]]; then
tar -xzf .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/unified-dump.tar -C "$TMP_EXTRACT"
else
tar -xf .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/unified-dump.tar -C "$TMP_EXTRACT"
fi
# Find the output directory (may be at logs/artifacts/output or just output)
OUTPUT_DIR=$(find "$TMP_EXTRACT" -type d -name "output" | head -1)
if [ -z "$OUTPUT_DIR" ]; then
echo "ERROR: Could not find output directory in unified dump"
rm -rf "$TMP_EXTRACT"
# Clear variables to prevent subsequent usage
HAS_HOSTED_CLUSTER="false"
unset HOSTED_DIR
unset OUTPUT_DIR
# Skip to Step 5 - no must-gather analysis possible
else
# Move management cluster data (root level in output/)
# Exclude hostedcluster-* directories
for item in "$OUTPUT_DIR"/*; do
if [ -e "$item" ] && [[ ! "$(basename "$item")" =~ ^hostedcluster- ]]; then
mv "$item" .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs/
fi
done
# Move hosted cluster data (hostedcluster-* subdirectory)
if [ "$HAS_HOSTED_CLUSTER" = "true" ]; then
HOSTED_DIR=$(find "$OUTPUT_DIR" -maxdepth 1 -type d -name "hostedcluster-*" | head -1)
if [ -n "$HOSTED_DIR" ]; then
mv "$HOSTED_DIR"/* .work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs/
echo "✓ Hosted cluster data extracted from unified archive"
else
echo "WARNING: Expected hosted cluster data but hostedcluster-* directory not found"
fi
fi
# Cleanup temporary extraction directory
rm -rf "$TMP_EXTRACT"
fi
For Pattern 2 (dual):
# Extract management cluster must-gather (standard format)
python3 plugins/prow-job/skills/prow-job-extract-must-gather/extract_archives.py \
.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/tmp/must-gather.tar \
.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs
# Extract hypershift-dump (may contain hosted cluster)
TMP_EXTRACT=".work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/tmp/extracted"
mkdir -p "$TMP_EXTRACT"
tar -xf .work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/tmp/hypershift-dump.tar -C "$TMP_EXTRACT"
if [ "$HAS_HOSTED_CLUSTER" = "true" ]; then
# Look for hostedcluster-* directory in dump
HOSTED_DIR=$(find "$TMP_EXTRACT" -maxdepth 2 -type d -name "hostedcluster-*" | head -1)
if [ -n "$HOSTED_DIR" ]; then
mv "$HOSTED_DIR"/* .work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs/
echo "✓ Hosted cluster data extracted from hypershift-dump"
else
echo "WARNING: HAS_HOSTED_CLUSTER=true but no hostedcluster-* directory found in dump"
echo "Hypershift-dump likely contains only management cluster data"
HAS_HOSTED_CLUSTER=false # Update flag since hosted cluster not actually present
fi
else
echo "INFO: Hypershift-dump does not contain hosted cluster data (management cluster only)"
fi
# Cleanup temporary extraction directory
rm -rf "$TMP_EXTRACT"
Locate and validate content directories
For Pattern 3 (standard only):
# Check for content/ directory first (renamed by extraction script)
if [ -d ".work/prow-job-analyze-test-failure/{build_id}/must-gather/logs/content" ]; then
MUST_GATHER_PATH=".work/prow-job-analyze-test-failure/{build_id}/must-gather/logs/content"
else
# Fall back to finding the directory containing -ci- (e.g., registry-build09-ci-...)
MUST_GATHER_PATH=$(find .work/prow-job-analyze-test-failure/{build_id}/must-gather/logs -maxdepth 1 -type d -name "*-ci-*" | head -1)
fi
# Validate MUST_GATHER_PATH is set and directory exists
if [ -z "$MUST_GATHER_PATH" ] || [ ! -d "$MUST_GATHER_PATH" ]; then
echo "ERROR: Must-gather content directory not found after extraction"
# Skip to Step 5 (scope preserved: Branch A or Branch B per test_name presence)
elif [ -z "$(ls -A "$MUST_GATHER_PATH" 2>/dev/null)" ]; then
echo "ERROR: Must-gather content directory is empty"
# Skip to Step 5 (scope preserved: Branch A or Branch B per test_name presence)
else
echo "✓ Must-gather content located at: $MUST_GATHER_PATH"
# Continue to Step 4.7 with MUST_GATHER_PATH set
fi
For Pattern 1 (unified) or Pattern 2 (dual):
# Management cluster validation
if [ "$PATTERN" = "unified" ]; then
# Pattern 1: Data extracted directly to logs/ directory
MUST_GATHER_MGMT_PATH=".work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs"
else
# Pattern 2: Standard must-gather extraction (look for content/ or hash directory)
if [ -d ".work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs/content" ]; then
MUST_GATHER_MGMT_PATH=".work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs/content"
else
MUST_GATHER_MGMT_PATH=$(find .work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs -maxdepth 1 -type d -name "*-ci-*" | head -1)
fi
fi
# Validate management cluster path
if [ -z "$MUST_GATHER_MGMT_PATH" ] || [ ! -d "$MUST_GATHER_MGMT_PATH" ]; then
echo "ERROR: Management cluster directory not found"
# Skip to Step 5 (scope preserved: Branch A or Branch B per test_name presence)
elif [ -z "$(ls -A "$MUST_GATHER_MGMT_PATH" 2>/dev/null)" ]; then
echo "ERROR: Management cluster directory is empty"
# Skip to Step 5 (scope preserved: Branch A or Branch B per test_name presence)
else
echo "✓ Management cluster data located at: $MUST_GATHER_MGMT_PATH"
fi
# Hosted cluster validation - only if HAS_HOSTED_CLUSTER is true
if [ "$HAS_HOSTED_CLUSTER" = "true" ]; then
MUST_GATHER_HOSTED_PATH=".work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs"
# Validate hosted cluster path
if [ ! -d "$MUST_GATHER_HOSTED_PATH" ]; then
echo "WARNING: Hosted cluster directory not found (expected based on archive detection)"
MUST_GATHER_HOSTED_PATH="" # Clear the path
elif [ -z "$(ls -A "$MUST_GATHER_HOSTED_PATH" 2>/dev/null)" ]; then
echo "WARNING: Hosted cluster directory is empty"
MUST_GATHER_HOSTED_PATH="" # Clear the path
else
echo "✓ Hosted cluster data located at: $MUST_GATHER_HOSTED_PATH"
fi
else
echo "✓ Management cluster must-gather located at: $MUST_GATHER_MGMT_PATH"
fi
# Only validate hosted cluster if HAS_HOSTED_CLUSTER is true
if [ "$HAS_HOSTED_CLUSTER" = "true" ]; then
if [ -z "$MUST_GATHER_HOSTED_PATH" ] || [ ! -d "$MUST_GATHER_HOSTED_PATH" ]; then
echo "ERROR: Hosted cluster must-gather content directory not found"
elif [ -z "$(ls -A "$MUST_GATHER_HOSTED_PATH" 2>/dev/null)" ]; then
echo "ERROR: Hosted cluster must-gather content directory is empty"
else
echo "✓ Hosted cluster must-gather located at: $MUST_GATHER_HOSTED_PATH"
echo "✓ Hosted cluster namespace: $HOSTED_NAMESPACE"
fi
fi
Only if Step 4.6 completed successfully:
Locate must-gather-analyzer scripts
The must-gather plugin provides analysis scripts. Locate the scripts directory:
# Try to find the must-gather-analyzer scripts in common locations
for SEARCH_PATH in \
"plugins/must-gather/skills/must-gather-analyzer/scripts" \
"~/.claude/plugins/cache/*/plugins/must-gather/skills/must-gather-analyzer/scripts" \
"$(find ~ -type d -path "*/must-gather/skills/must-gather-analyzer/scripts" 2>/dev/null | head -1)"; do
SCRIPTS_DIR=$(eval echo "$SEARCH_PATH")
if [ -d "$SCRIPTS_DIR" ] && [ -f "$SCRIPTS_DIR/analyze_clusteroperators.py" ]; then
break
fi
SCRIPTS_DIR=""
done
if [ -z "$SCRIPTS_DIR" ]; then
echo "WARNING: Must-gather analysis scripts not found."
echo "Install the must-gather plugin: /plugin install must-gather@ai-helpers"
# Continue to Step 5 without cluster analysis
fi
Run targeted cluster diagnostics
Focus on issues relevant to test failures (not full cluster analysis).
For single must-gather (standard OpenShift):
# Core diagnostics (only if scripts and path are available)
if [ -n "$SCRIPTS_DIR" ] && [ -n "$MUST_GATHER_PATH" ]; then
python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" "$MUST_GATHER_PATH"
python3 "$SCRIPTS_DIR/analyze_pods.py" "$MUST_GATHER_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_nodes.py" "$MUST_GATHER_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_events.py" "$MUST_GATHER_PATH" --type Warning --count 50
else
echo "WARNING: Skipping must-gather analysis (scripts or path not available)"
fi
For dual must-gather (HyperShift):
# Management cluster diagnostics (only if scripts and path are available)
if [ -n "$SCRIPTS_DIR" ] && [ -n "$MUST_GATHER_MGMT_PATH" ]; then
echo "=== Analyzing Management Cluster ==="
python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" "$MUST_GATHER_MGMT_PATH"
python3 "$SCRIPTS_DIR/analyze_pods.py" "$MUST_GATHER_MGMT_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_nodes.py" "$MUST_GATHER_MGMT_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_events.py" "$MUST_GATHER_MGMT_PATH" --type Warning --count 50
else
echo "WARNING: Skipping management cluster analysis (scripts or path not available)"
fi
# Hosted cluster diagnostics (only if scripts and path are available)
if [ -n "$SCRIPTS_DIR" ] && [ -n "$MUST_GATHER_HOSTED_PATH" ]; then
echo "=== Analyzing Hosted Cluster (Namespace: $HOSTED_NAMESPACE) ==="
python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" "$MUST_GATHER_HOSTED_PATH"
python3 "$SCRIPTS_DIR/analyze_pods.py" "$MUST_GATHER_HOSTED_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_nodes.py" "$MUST_GATHER_HOSTED_PATH" --problems-only
python3 "$SCRIPTS_DIR/analyze_events.py" "$MUST_GATHER_HOSTED_PATH" --type Warning --count 50
else
echo "INFO: Skipping hosted cluster analysis (scripts or path not available)"
fi
Run conditional diagnostics based on test context
# Network diagnostics (if test name suggests network issues)
if [[ "$JOB_NAME" =~ network|ovn|sdn|connectivity|route|ingress|egress ]]; then
if [ -n "$MUST_GATHER_PATH" ]; then
python3 "$SCRIPTS_DIR/analyze_network.py" "$MUST_GATHER_PATH"
fi
if [ -n "$MUST_GATHER_MGMT_PATH" ]; then
echo "=== Management Cluster Network ==="
python3 "$SCRIPTS_DIR/analyze_network.py" "$MUST_GATHER_MGMT_PATH"
fi
if [ -n "$MUST_GATHER_HOSTED_PATH" ]; then
echo "=== Hosted Cluster Network ==="
python3 "$SCRIPTS_DIR/analyze_network.py" "$MUST_GATHER_HOSTED_PATH"
fi
fi
# etcd diagnostics (if test name suggests control-plane issues)
if [[ "$JOB_NAME" =~ etcd|apiserver|control-plane|kube-apiserver ]]; then
if [ -n "$MUST_GATHER_PATH" ]; then
python3 "$SCRIPTS_DIR/analyze_etcd.py" "$MUST_GATHER_PATH"
fi
if [ -n "$MUST_GATHER_MGMT_PATH" ]; then
echo "=== Management Cluster etcd ==="
python3 "$SCRIPTS_DIR/analyze_etcd.py" "$MUST_GATHER_MGMT_PATH"
fi
if [ -n "$MUST_GATHER_HOSTED_PATH" ]; then
echo "=== Hosted Cluster etcd ==="
python3 "$SCRIPTS_DIR/analyze_etcd.py" "$MUST_GATHER_HOSTED_PATH"
fi
fi
See plugins/must-gather/skills/must-gather-analyzer/SKILL.md for all available analysis scripts.
Capture analysis output
Only if Step 4.7 completed:
Temporal correlation
For HyperShift (dual must-gather):
Component correlation
clusters-{namespace})For HyperShift (dual must-gather):
clusters-{namespace} namespace)Generate correlated insights
For HyperShift (dual must-gather):
Prefix correlations with cluster type:
Cross-cluster correlations:
Store these insights for inclusion in Step 4.9 root cause determination
Synthesize all gathered evidence to determine the most likely root cause for the test failure.
CRITICAL: Be tenacious — trace symptoms back to root cause. Never stop at high-level symptoms like "nodes didn't join", "operator unavailable", or "containers are crash-looping". Trace backwards through the dependency chain (pod statuses → container logs → originating error) until you find the specific, actionable root cause.
Analyze all available evidence
Prioritize evidence based on temporal proximity
Generate root cause hypothesis
For HyperShift jobs:
Formulate actionable recommendations
Display structured summary with enhanced formatting
Branch on whether test_name was provided (mirrors Step 4.2 branching):
{test_name} report shape below.For single must-gather or no must-gather (Branch A — test_name provided):
# Test Failure Analysis Complete
## Job Information
- **Prow Job**: {prowjob-name}
- **Build ID**: {build_id}
- **Target**: {target}
- **Test**: {test_name}
## Known Symptoms Seen
*(Only if symptom labels were found in Step 4.3b — omit section entirely if none)*
- {symptom summary}: {symptom explanation}
> **Note**: Symptoms are machine-detected environmental observations, not definitive causes. They add context to help explain failures when correlated with other evidence.
## Test Failure Analysis
### Error
{error message from stack trace}
### Summary
{failure analysis from stack trace and code}
### Evidence
{evidence from build-log.txt and interval files}
### Additional Evidence
{additional evidence from logs/events}
---
## Cluster Diagnostics
*(Only if must-gather was analyzed)*
### Cluster Operators
{output from analyze_clusteroperators.py}
### Problematic Pods
{output from analyze_pods.py --problems-only}
### Node Issues
{output from analyze_nodes.py --problems-only}
### Recent Warning Events
{output from analyze_events.py}
### Network Analysis
*(Only if network-related test)*
{output from analyze_network.py}
### etcd Analysis
*(Only if etcd-related test)*
{output from analyze_etcd.py}
---
## Correlation
*(Only if must-gather was analyzed)*
### Timeline
- **Test started**: {from timestamp}
- **Test failed**: {to timestamp}
- **Cluster events during test**:
- {cluster-event} at {timestamp}
- {cluster-event} at {timestamp}
### Affected Components
- {affected operators/pods/nodes}
### Root Cause Hypothesis
{correlated analysis combining test-level and cluster-level evidence}
---
## Artifacts
- **Test artifacts**: `.work/prow-job-analyze-test-failure/{build_id}/logs/`
- **Must-gather**: `.work/prow-job-analyze-test-failure/{build_id}/must-gather/logs/` *(if extracted)*
For dual must-gather (HyperShift):
# Test Failure Analysis Complete (HyperShift)
## Job Information
- **Prow Job**: {prowjob-name}
- **Build ID**: {build_id}
- **Target**: {target}
- **Test**: {test_name}
- **Hosted Cluster Namespace**: {HOSTED_NAMESPACE}
## Known Symptoms Seen
*(Only if symptom labels were found in Step 4.3b — omit section entirely if none)*
- {symptom summary}: {symptom explanation}
> **Note**: Symptoms are machine-detected environmental observations, not definitive causes. They add context to help explain failures when correlated with other evidence.
## Test Failure Analysis
### Error
{error message from stack trace}
### Summary
{failure analysis from stack trace and code}
### Evidence
{evidence from build-log.txt and interval files}
### Additional Evidence
{additional evidence from logs/events}
---
## Management Cluster Diagnostics
### Cluster Operators
{output from analyze_clusteroperators.py for management cluster}
### Problematic Pods
{output from analyze_pods.py --problems-only for management cluster}
### Node Issues
{output from analyze_nodes.py --problems-only for management cluster}
### Recent Warning Events
{output from analyze_events.py for management cluster}
### Network Analysis
*(Only if network-related test)*
{output from analyze_network.py for management cluster}
### etcd Analysis
*(Only if etcd-related test)*
{output from analyze_etcd.py for management cluster}
---
## Hosted Cluster Diagnostics
**Namespace**: `{HOSTED_NAMESPACE}`
### Cluster Operators
{output from analyze_clusteroperators.py for hosted cluster}
### Problematic Pods
{output from analyze_pods.py --problems-only for hosted cluster}
### Node Issues
{output from analyze_nodes.py --problems-only for hosted cluster}
### Recent Warning Events
{output from analyze_events.py for hosted cluster}
### Network Analysis
*(Only if network-related test)*
{output from analyze_network.py for hosted cluster}
### etcd Analysis
*(Only if etcd-related test)*
{output from analyze_etcd.py for hosted cluster}
---
## Correlation
### Timeline
- **Test started**: {from timestamp}
- **Test failed**: {to timestamp}
- **Management cluster events during test**:
- {cluster-event} at {timestamp}
- **Hosted cluster events during test**:
- {cluster-event} at {timestamp}
### Affected Components
**Management Cluster**:
- {affected operators/pods/nodes}
**Hosted Cluster**:
- {affected operators/pods/nodes}
### Root Cause Hypothesis
{correlated analysis combining:
- Test-level evidence
- Management cluster diagnostics
- Hosted cluster diagnostics
- Cross-cluster interactions}
---
## Artifacts
- **Test artifacts**: `.work/prow-job-analyze-test-failure/{build_id}/logs/`
- **Management cluster must-gather**: `.work/prow-job-analyze-test-failure/{build_id}/must-gather-mgmt/logs/`
- **Hosted cluster must-gather**: `.work/prow-job-analyze-test-failure/{build_id}/must-gather-hosted/logs/`
For Branch B (test_name NOT provided — multi-step failure report):
Iterate over the ordered list of failed steps discovered in Step 4.2 Branch B and produce one section per step, then aggregate into a combined root cause section.
# Test Failure Analysis Complete (Multi-Step)
## Job Information
- **Prow Job**: {prowjob-name}
- **Build ID**: {build_id}
- **Target**: {target}
- **Failed Steps**: {count of discovered failed steps}
## Known Symptoms Seen
*(Only if symptom labels were found in Step 4.3b — omit section entirely if none)*
- {symptom summary}: {symptom explanation}
> **Note**: Symptoms are machine-detected environmental observations, not definitive causes.
---
## Failed Step Analyses
*(Repeat the following block for each failed step discovered in Step 4.2 Branch B)*
### Step: {step_name}
#### Error
{failure_message from JUnit XML for this step}
#### Summary
{analysis of this step's stack trace and log context}
#### Evidence
{stack trace extracted from {step_name}-build-log.txt or build-log.txt context}
#### Artifacts Examined
- `.work/prow-job-analyze-test-failure/{build_id}/logs/{step_name}-build-log.txt`
- `.work/prow-job-analyze-test-failure/{build_id}/logs/{step_name}/` *(step artifacts if downloaded)*
---
## Cluster Diagnostics
*(Only if must-gather was analyzed — same structure as Branch A; include per-cluster
sections as appropriate for standard or HyperShift patterns)*
---
## Aggregated Root Cause
### Failed Steps Summary
| Step | One-line Failure |
|------|-----------------|
| {step_name_1} | {summary} |
| {step_name_2} | {summary} |
| ... | ... |
### Timeline
*(Emit one entry per failed step. If interval data cannot be mapped to a specific
step, omit the Timeline section entirely rather than showing a misleading single window.)*
- **{step_name_1}**: started {from timestamp} — failed {to timestamp}
- Cluster events during this window: {cluster-event} at {timestamp}
- **{step_name_2}**: started {from timestamp} — failed {to timestamp}
- Cluster events during this window: {cluster-event} at {timestamp}
- *(repeat for each failed step with mappable interval data)*
### Root Cause Hypothesis
{synthesized root cause drawn from all per-step findings (install + test steps),
per-step interval events, symptom labels, and cluster diagnostics; clearly
identify whether one step is the root cause or whether multiple independent
failures occurred}
### Recommendations
- {actionable next step 1}
- {actionable next step 2}
---
## Artifacts
- **Test artifacts**: `.work/prow-job-analyze-test-failure/{build_id}/logs/`
- **Must-gather**: `.work/prow-job-analyze-test-failure/{build_id}/must-gather*/logs/` *(if extracted)*
Display completion message
✅ Analysis complete!
📄 Report: .work/prow-job-analyze-test-failure/{build_id}/analysis.md
💡 Tip: The Markdown report can be copied directly into JIRA Description field
Handle errors in the same way as "Error handling" in "prow-job-analyze-resource" skill, with these additional must-gather-specific cases:
Must-gather not available
gcloud storage ls returns 404 for must-gather.tar, this is expected (not all jobs have must-gather)test_name was providedMust-gather extraction fails
test_name was providedAnalysis scripts not found
find command returns empty (no scripts found), warn the usertest_name was providedPartial analysis script failures
Empty analysis results
Follow the instructions in "Performance Considerations" in "prow-job-analyze-resource" skill
testing
Snapshot OpenShift payload data (release controller, PR diffs, comments, CI jobs, JUnit results, regression tracking) to a local directory for offline analysis
research
Shared engine for analyzing Jira issue activity and generating status summaries
tools
This skill should be used before any Snowflake command to verify MCP connectivity, guide users through access provisioning, and set the session context. Invoke this skill proactively whenever a command needs Snowflake data access.
development
Analyze a payload snapshot to identify root causes of blocking job failures, score candidate PRs, and produce an HTML report with revert recommendations