skills/ci-status-checker/SKILL.md
Interprets GitHub Actions run status and logs, diagnoses CI failure patterns, and suggests targeted fixes. Handles dependency install failures, test timeouts, build OOM, flaky tests, and workflow misconfigurations. Activate on: 'CI failing', 'build broken', 'workflow error', 'GitHub Actions debug', 'flaky CI', 'pipeline timeout', 'CI red'. NOT for: writing new workflows from scratch (use github-actions-pipeline-builder), CI caching strategy (use ci-cache-optimizer), deployment orchestration (use devops-automator).
Diagnoses CI/CD pipeline failures by reading GitHub Actions logs, identifying failure patterns, and producing actionable fix recommendations. This skill does not write workflows from scratch -- it triages what went wrong with existing ones.
Activate on: "CI failing", "build broken", "workflow error", "GitHub Actions debug", "flaky CI", "pipeline timeout", "CI red", "actions run failed", "check failed on PR", "why is CI broken"
NOT for: Writing new workflows --> github-actions-pipeline-builder | Caching strategy --> ci-cache-optimizer | Deployment pipelines --> devops-automator | Performance testing --> performance-profiler
# Check most recent run status
gh run list --limit 5
# Get details for a specific failed run
gh run view <run-id>
# Get the full log for a failed run
gh run view <run-id> --log-failed
Always start with --log-failed rather than --log to avoid pulling megabytes of passing step output. Only escalate to full logs if the failed step output is insufficient.
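The escalation rule above can be sketched as a small shell helper. This is a minimal sketch, not a definitive implementation: `fetch_failure_log` is a hypothetical name, and it treats "insufficient" as "empty output", which is only the simplest case.

```shell
# fetch_failure_log (hypothetical helper): print the failed-step log for a run,
# falling back to the full log only when --log-failed returns nothing.
# Assumes the gh CLI is installed and authenticated.
fetch_failure_log() {
  run_id="$1"
  # Try the small, targeted log first.
  log="$(gh run view "$run_id" --log-failed)" || log=""
  if [ -z "$log" ]; then
    # Escalate to the full log only when the targeted one is empty.
    log="$(gh run view "$run_id" --log)"
  fi
  printf '%s\n' "$log"
}
```

A call such as `fetch_failure_log <run-id> > failure.log` keeps the log on disk for the later steps.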
Parse the log output to find the first step that failed. CI logs contain noise -- focus on:
- ##[error] annotations

Signals of a dependency install failure:
- npm ERR! or yarn error in install step
- ERESOLVE unable to resolve dependency tree
- 404 Not Found for a package
- ENOENT for lockfile
- pip install failures with version conflicts
- Could not find a version that satisfies the requirement

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| ERESOLVE peer dependency conflict | Incompatible package versions | npm install --legacy-peer-deps or fix version constraints |
| 404 Not Found for package | Package unpublished or registry down | Check npm status, pin working version |
| ENOENT package-lock.json | Lockfile not committed | Commit lockfile, or use npm install instead of npm ci |
| engine "node" incompatible | Wrong Node version in CI | Match node-version in workflow to engines in package.json |
| ENOMEM during install | Runner out of memory | Use larger runner or reduce parallelism |
| ETARGET no matching version | Version range resolves to nothing | Pin specific version, check if package was yanked |
Signals of a build failure:
- TS2304, TS2339, TS2345, etc.
- Module not found or Cannot find module
- SyntaxError: Unexpected token

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| TS2307: Cannot find module | Missing type declarations | Install @types/* package or add to tsconfig.json paths |
| TS2345: Argument of type | Type mismatch introduced in PR | Fix the type error in source code |
| Module not found: 'X' | Import path wrong or package missing | Check import paths, verify package in dependencies |
| JavaScript heap out of memory | Build exceeds Node memory limit | Add NODE_OPTIONS=--max-old-space-size=4096 to env |
| SIGKILL during build | OOM killer terminated process | Use larger runner (ubuntu-latest-4-cores) or optimize build |
Signals of a test failure:
- FAIL markers in Jest/Vitest output
- Expected X but received Y

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| Assertion failure with clear diff | Real bug introduced in PR | Fix the code (not the test, unless test is wrong) |
| Timeout on async test | Missing await or slow operation | Add proper await, increase timeout, or mock slow calls |
| ECONNREFUSED in test | Test needs running service | Use service containers or mock the dependency |
| Snapshot mismatch | Intentional UI change | Update snapshots with -u (--updateSnapshot in Jest, --update in Vitest) if change is correct |
| Different results on CI vs local | Environment difference | Check timezone, locale, file ordering, parallelism |
Signals of a timeout:
- The job running on runner ... has exceeded the maximum execution time
- ##[error]The operation was canceled

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| Job timeout (6h default) | Infinite loop or hung process | Set explicit timeout-minutes per job, investigate hang |
| Step hangs with no output | Interactive prompt waiting for input | Add -y or --yes flags, pipe yes, set CI=true env |
| E2E test timeout | Browser not starting or page not loading | Check playwright install --with-deps, add retry |
| Docker build timeout | Layer cache miss on large image | Use docker/build-push-action with cache-from |
Signals of a permission failure:
- Error: Resource not accessible by integration
- Error: Input required and not supplied: token
- fatal: could not read Username
- 403 Forbidden on API calls

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| Resource not accessible | Insufficient GITHUB_TOKEN permissions | Add permissions: block to workflow |
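| Input required: token | Secret not available in fork PRs | Use a conditional step, or pull_request_target with care (it exposes secrets to workflows triggered by fork PRs, so avoid checking out untrusted code) |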
| could not read Username | Private repo access without token | Use actions/checkout with token: ${{ secrets.PAT }} |
| 403 on deployment | Deploy key or token expired | Rotate secret in repo settings |
Signals of a workflow configuration error:
- Invalid workflow file before any step runs
- Unexpected value in YAML parsing

Common causes and fixes:
| Error Pattern | Likely Cause | Fix |
|---------------|-------------|-----|
| Invalid workflow file | YAML syntax error | Validate with actionlint locally |
| Unexpected value 'X' | Wrong key name or indentation | Check GitHub Actions schema for correct key names |
| Unable to resolve action | Action not found or wrong version | Verify action exists, use @v4 not @master |
| Context access might be invalid | Typo in ${{ }} expression | Check variable names against available contexts |
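The signal lists above can be folded into a rough first-pass classifier. The sketch below is an illustration only: `classify_ci_failure` is a hypothetical name, the patterns are a small sample of the signals listed in each category, and the clause order is an assumption (earlier clauses win when a log matches more than one).

```shell
# classify_ci_failure (hypothetical helper): read a CI log on stdin and print
# one category: config, permission, timeout, dependency, build, test, unknown.
# The first matching clause wins, so clause order acts as the tie-breaker.
classify_ci_failure() {
  log="$(cat)"
  case "$log" in
    *"Invalid workflow file"*|*"Unable to resolve action"*)
      echo config ;;
    *"Resource not accessible by integration"*|*"could not read Username"*|*"Input required and not supplied"*)
      echo permission ;;
    *"exceeded the maximum execution time"*|*"The operation was canceled"*)
      echo timeout ;;
    *ERESOLVE*|*ETARGET*|*"Could not find a version that satisfies"*)
      echo dependency ;;
    *TS[0-9][0-9][0-9][0-9]*|*"Module not found"*|*"Cannot find module"*|*"heap out of memory"*)
      echo build ;;
    *"FAIL "*|*ECONNREFUSED*)
      echo test ;;
    *)
      echo unknown ;;
  esac
}
```

Piping `gh run view <run-id> --log-failed` into `classify_ci_failure` gives a starting category to check against the tables. A log that prints `unknown`, or that matches several patterns, still needs a manual read.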
Flaky tests are tests that pass and fail nondeterministically. They erode trust in CI. Diagnose by pattern:
# Get last 20 runs for the test job
gh run list --workflow=tests.yml --limit 20 --json status,conclusion,startedAt
# Compare pass/fail ratio
gh run list --workflow=tests.yml --limit 50 --json conclusion | \
jq 'group_by(.conclusion) | map({conclusion: .[0].conclusion, count: length})'
If the same workflow alternates between success and failure without code changes, you have flaky tests.
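The alternation described above can be counted rather than eyeballed. The following is a minimal sketch under stated assumptions: `count_flips` is a hypothetical helper, it expects one conclusion per line on stdin, and it ignores anything other than success and failure (cancelled or skipped runs, for example). It does not check whether the code changed between runs, so a high count is a hint, not proof.

```shell
# count_flips (hypothetical helper): read one run conclusion per line and
# print how many times consecutive runs switch between success and failure.
count_flips() {
  awk '
    $1 == "success" || $1 == "failure" {
      if (prev != "" && $1 != prev) flips++
      prev = $1
    }
    END { print flips + 0 }
  '
}
```

One way to feed it is `gh run list --workflow=tests.yml --limit 50 --json conclusion --jq '.[].conclusion' | count_flips`, where `tests.yml` is the same example workflow name used above.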
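Common causes of flakiness:
- Shared state leaking between tests. Fix: beforeEach reset.
- Tests competing when run in parallel. Fix: isolate with --shard.
- Timezone-dependent assertions. Fix: TZ=UTC in test env.
- File ordering: readdir returns different order on different OS. Fix: sort results before comparing.

If a flaky test cannot be fixed immediately: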
- Quarantine it: move it to a flaky/ directory or tag with .skip + a TODO comment with issue link

Before a workflow runs in CI, validate it locally:
# Install actionlint (the gold standard for workflow linting)
brew install actionlint # macOS
# or: go install github.com/rhysd/actionlint/cmd/actionlint@latest
# Lint all workflow files
actionlint .github/workflows/*.yml
# Check a specific file with verbose output
actionlint -verbose .github/workflows/ci.yml
Common mistakes to check for:
- on: triggers: on: push works, but on: [push, pull_request] needs the brackets
- runs-on: Every job needs it
- steps: must be at the right level
- Use ${{ github.event.pull_request.number }}, not ${ github.event... }
- Matrix include adds combinations, it does not filter -- use exclude to remove

Symptom: Re-running failed workflow until it passes
Why wrong: Masks real issues, wastes compute minutes, teaches team to ignore CI
Fix: Investigate every failure. If genuinely flaky, quarantine the test and fix the root cause.
Symptom: Removing status checks from branch protection because they fail
Why wrong: Opens main branch to broken code
Fix: Fix the check, or mark it as continue-on-error: true with a tracking issue.
Symptom: "CI is slow/broken, works on my machine"
Why wrong: CI environment differences are the test -- if it fails there, the code has hidden assumptions about its environment
Fix: Match CI environment locally. Use act or Docker to reproduce. Add CI=true to local test runs.
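A local run can be nudged toward the CI environment with a thin wrapper. This is a sketch only: `run_like_ci` is a hypothetical name, and the two variables are the ones this document mentions (CI=true here, TZ=UTC in the flaky test fixes). It does not reproduce the runner image, so act or Docker is still needed for anything deeper.

```shell
# run_like_ci (hypothetical helper): run any command with CI-like environment
# variables set, without changing the current shell's environment.
run_like_ci() {
  env CI=true TZ=UTC "$@"
}
```

For example, `run_like_ci npm test` runs the test script with both variables set for that one command.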
Symptom: Pasting entire CI log into an issue or PR comment
Why wrong: No one reads 10,000 lines of logs
Fix: Extract the relevant 10-20 lines around the failure. Include the step name, error message, and exit code.
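Trimming a log down to the lines around the failure can be scripted. The sketch below assumes the failure is marked by a `##[error]` annotation, as in the signals listed earlier. `error_context` is a hypothetical helper, and its default of 10 lines on each side is an arbitrary choice.

```shell
# error_context (hypothetical helper): print the first line containing
# "##[error]" plus N lines before and after it (default 10).
# Returns non-zero when the log has no error annotation.
error_context() {
  ctx="${1:-10}"
  awk -v ctx="$ctx" '
    { lines[NR] = $0 }
    first == 0 && index($0, "##[error]") { first = NR }
    END {
      if (first == 0) exit 1
      start = first - ctx; if (start < 1) start = 1
      stop = first + ctx; if (stop > NR) stop = NR
      for (i = start; i <= stop; i++) print lines[i]
    }
  '
}
```

Used as `gh run view <run-id> --log-failed | error_context 10`, it yields an excerpt short enough to paste into an issue. The step name and exit code may still need to be added by hand.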
Symptom: CI passes but with deprecation warnings, security advisories, or linting warnings
Why wrong: Warnings become errors in the next major version. Technical debt compounds.
Fix: Treat warnings as errors in CI (--max-warnings 0 for ESLint) and enable strict type checking for TypeScript (--strict).
Symptom: Single 500-line workflow file that does lint + test + build + deploy
Why wrong: One failure blocks everything, hard to debug, impossible to re-run selectively
Fix: Split into separate workflows or use reusable workflows with workflow_call.
Check https://www.githubstatus.com/ first. If Actions shows degraded:
If infrastructure is healthy, the failure is yours to own.
If only node-18, ubuntu-latest fails but node-20, ubuntu-latest passes:
- Check that engines.node matches the matrix

[ ] Identified the exact failing step (not just "CI failed")
[ ] Classified failure into correct category (dependency/build/test/timeout/permission/config)
[ ] Reproduced locally when possible (or explained why not)
[ ] Fix addresses root cause, not symptoms (no blind retries)
[ ] Flaky tests either fixed or quarantined with tracking issue
[ ] Workflow YAML validates with actionlint after changes
[ ] CI minutes impact considered (larger runner = more cost)
[ ] Secrets verified as available for the trigger type (push vs pull_request vs fork)
[ ] Timeout values set explicitly on long-running jobs
[ ] Status check passes on the fix PR before merging
# List recent workflow runs
gh run list --limit 10
# View a specific run (shows jobs and steps)
gh run view <run-id>
# Get only failed step logs
gh run view <run-id> --log-failed
# Re-run failed jobs only (not the whole workflow)
gh run rerun <run-id> --failed
# Watch a running workflow
gh run watch <run-id>
# List workflow files
ls .github/workflows/
# Validate workflow syntax locally
actionlint .github/workflows/*.yml
# Check GitHub Actions service status
open https://www.githubstatus.com/
# Run workflow manually (if workflow_dispatch enabled)
gh workflow run <workflow-name>
Covers: GitHub Actions diagnosis | Failure classification | Flaky test analysis | Workflow validation | Permission debugging | Timeout investigation
Use with: github-actions-pipeline-builder (create workflows) | ci-cache-optimizer (fix cache issues) | test-automation-expert (fix failing tests) | site-reliability-engineer (production CI/CD)