plugins/buildkite/skills/investigating-builds/SKILL.md
--- name: investigating-builds description: Use when the user is working with an existing Buildkite CI run and wants to understand, diagnose, or act on it. Strong signals: a buildkite.com URL, the word "buildkite", the `bk` CLI, the `bktide` tool, or a reference to a specific pipeline, build, or job. Covers intents like: "why did this build fail", "what's flaking", "summarize pass/fail rates across recent builds", "pull logs for this job", "figure out the exact command CI ran so I can reproduce
This skill provides workflows and tools for working with Buildkite CI builds. It covers checking status, investigating failures, and reproducing issues locally rather than creating or configuring pipelines. Use this skill when working with Buildkite builds, especially for PR workflows, post-push monitoring, failure investigation, and local reproduction.
One command, one URL, gets you everything: build metadata, annotations, and logs for failed steps, all saved to local files you can grep and re-read without burning API calls. The other tools require you to piece together multiple calls and keep track of job UUIDs vs step IDs.
Use this skill when:
CRITICAL: Always use Buildkite-native tools. Never fall back to GitHub tools (gh pr view, GitHub API, etc.) - they only show summaries and lose critical information (annotations, logs, real-time updates, state distinctions).
Use tools in this priority order:
Command: npx bktide@latest snapshot <buildkite-url>
The snapshot command is the preferred approach for investigating builds. It:
npx bktide@latest snapshot https://buildkite.com/org/pipeline/builds/123
Run bktide snapshot --help for all options, or bktide prime for detailed LLM-friendly usage guidance.
Output structure:
./tmp/bktide/snapshots/<org>/<pipeline>/<build>/
├── manifest.json # Step index with states and exit codes
├── build.json # Full build metadata
├── annotations.json # Build annotations
└── steps/
    ├── 01-step-name/
    │   ├── log.txt # Full log output
    │   └── step.json # Step metadata
    └── 02-another-step/
        └── ...
Useful follow-up commands (shown by snapshot):
# List failures
jq -r '.steps[] | select(.state == "failed") | "\(.id): \(.label)"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json
# View a log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt
# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/
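The follow-up commands above all repeat the same snapshot path. A minimal sketch of deriving that path from a build URL, assuming the directory names match the org and pipeline slugs in the URL exactly as in the layout shown earlier. `snapshot_dir` is a hypothetical helper, not part of bktide:

```shell
# Hypothetical helper: map a Buildkite build URL to the snapshot directory.
# Assumes URLs of the form
#   https://buildkite.com/<org>/<pipeline>/builds/<number>[/...][?query]
# and that bktide names directories after the URL slugs (an assumption).
snapshot_dir() (
  path=${1#https://buildkite.com/}   # drop scheme and host
  path=${path%%\?*}                  # drop any query string (e.g. ?sid=...)
  org=${path%%/*}
  rest=${path#*/}
  pipeline=${rest%%/*}
  rest=${rest#*/builds/}
  build=${rest%%/*}                  # drop trailing /steps/... segments
  printf './tmp/bktide/snapshots/%s/%s/%s\n' "$org" "$pipeline" "$build"
)

SNAP=$(snapshot_dir "https://buildkite.com/org/pipeline/builds/123")
echo "$SNAP"   # prints ./tmp/bktide/snapshots/org/pipeline/123
```

With `SNAP` set, the commands above shorten to forms like `cat "$SNAP/manifest.json"`.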
For quick queries without full snapshot:
npx bktide@latest pipelines <org> # List pipelines
npx bktide@latest builds <org>/<pipeline> # List builds
npx bktide@latest build <org>/<pipeline>#<build> # Get build details
npx bktide@latest annotations <org>/<pipeline>#<build> # Show annotations
When: bktide unavailable, or need programmatic access (wait_for_build, unblock)
Available MCP tools:
- buildkite:get_build - Get detailed build information
- buildkite:list_builds - List builds for a pipeline
- buildkite:list_annotations - Get annotations for a build
- buildkite:get_pipeline - Get pipeline configuration
- buildkite:list_pipelines - List all pipelines in an org
- buildkite:wait_for_build - Wait for a build to complete (useful for monitoring)
- buildkite:get_logs - Retrieve job logs
- buildkite:get_logs_info - Get log metadata
- buildkite:list_artifacts - List build artifacts

| Capability | bktide snapshot | bktide CLI | MCP Tools |
|------------|-----------------|------------|-----------|
| Parse any BK URL | ✅ | ❌ | ❌ |
| Get build details | ✅ | ✅ | ✅ |
| Get annotations | ✅ | ✅ | ✅ |
| Retrieve logs | ✅ | ❌ | ✅ |
| Save to files | ✅ | ❌ | ❌ |
| Wait for build | ❌ | ❌ | ✅ |
| Unblock jobs | ❌ | ❌ | ✅ |
This tool preference order can be overridden via ~/.config/pickled-claude-plugins/buildkite.yml. A PreToolUse hook enforces your preference by intercepting bk CLI commands that overlap with bktide capabilities.
If bktide fails:
If MCP tools fail:
Critical: One tool failing does NOT mean the entire skill is invalid. Move to fallback tools, don't abandon Buildkite tools.
When a user provides a Buildkite URL for a failing build, use bktide snapshot to gather all context.
Step 1: Capture the build snapshot
npx bktide@latest snapshot <buildkite-url>
This works with any Buildkite URL format:
- https://buildkite.com/org/pipeline/builds/12345
- https://buildkite.com/org/pipeline/builds/12345/steps/canvas?sid=019a5f...

The snapshot command will:

- Save its output under ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/

Step 2: Review the summary
The command output shows:
Step 3: Filter to root failures
Most builds have 1-3 root failures and dozens-to-hundreds of dependent BROKEN steps. Counting BROKEN/RUNNING steps as failures sends you down rabbit holes investigating noise.
First, see the state distribution:
jq -r '.steps[].state' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json | sort | uniq -c | sort -rn
Only these are real failures:
- FINISHED (or FAILED) AND non-zero exit_status

These are NOT failures; do not count or investigate them:

- BROKEN and RUNNING steps (dependents of a root failure, or still in progress)
Find actual root failures:
jq -r '.steps[] | select((.state == "FINISHED" or .state == "FAILED") and .exit_status != null and .exit_status != 0) | "\(.id): \(.label) (exit \(.exit_status))"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json
Example: A summary reads "466 steps: 43 passed, 397 failed, 361 running" — filtering reveals 1 actual failure (e.g. codeownership validation) and 396 BROKEN dependents. Fix the root and everything else passes.
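The root-failure rule can also be written as a small predicate for use in scripts. This is a sketch under the assumption that state names are uppercase as in the jq filter above. `is_root_failure` is a hypothetical name, not something bktide provides:

```shell
# Sketch of the rule above: a step is a root failure only when its state is
# FINISHED or FAILED and its exit status is present and non-zero.
is_root_failure() {
  state=$1
  exit_status=$2
  case "$state" in
    FINISHED|FAILED) ;;
    *) return 1 ;;            # BROKEN, RUNNING, etc. are noise
  esac
  case "$exit_status" in
    ''|null|0) return 1 ;;    # missing or zero exit status: not a failure
    *) return 0 ;;
  esac
}

is_root_failure FINISHED 1 && echo "investigate"
is_root_failure BROKEN "" || echo "skip"
```

Checking the exit status as well as the state mirrors the filter: a FINISHED step with exit status 0 passed, so state alone is not enough.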
Step 4: Read the failing step's log
# View a specific log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt
# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/
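When several step logs were captured, it can help to see which ones contain the most matches before reading any of them. A rough sketch, using the same three patterns as the grep above. `rank_logs_by_errors` is a hypothetical helper, and it assumes file paths contain no colons:

```shell
# Hypothetical helper: list log files by number of matching lines, highest
# first. grep -c counts matching lines per file; awk drops files with zero
# matches; sort orders by the count after the colon.
rank_logs_by_errors() {
  grep -rcE 'Error|Failed|Exception' "$1" |
    awk -F: '$NF > 0' |
    sort -t: -k2,2nr
}

# Usage (replace the placeholders with real values):
#   rank_logs_by_errors "./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/"
```

The counts are matching lines, not distinct errors, so treat the ranking as a hint about where to look first.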
Step 5: Analyze error output
Look for:
Step 6: Reproduce locally
Follow the "Reproducing Build Failures Locally" workflow below to:
Preferred: Use bktide snapshot (see workflow 1)
The snapshot command automatically downloads logs for failed/broken steps. After running snapshot:
# View a specific step's log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt
# Search across all captured logs
grep -r "pattern" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/
Fallback: MCP tools
If you need logs without running snapshot (e.g., for a specific job, or snapshot isn't available):
Step 1: Get the job UUID
mcp__MCPProxy__call_tool('buildkite:get_build', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
detail_level: 'detailed',
job_state: 'failed',
});
Find the job by matching the label field. Extract its uuid field.
Step 2: Retrieve logs
mcp__MCPProxy__call_tool('buildkite:get_logs', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
job_id: '<job-uuid>',
});
Common Issues:
- Step IDs from URLs (sid=019a5f...) are not job UUIDs. Job UUIDs come from get_build API responses.
- Check the job's state field first.

This is the most common workflow when working on a branch:
Step 1: Identify the pipeline and branch
Determine which pipeline(s) run on PRs for this repository. Common patterns:
Step 2: Find builds for the current branch
Use MCP tools to list recent builds:
mcp__MCPProxy__call_tool('buildkite:list_builds', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
branch: '<branch-name>',
detail_level: 'summary',
});
Or use bktide:
npx bktide@latest builds --format json <org>/<pipeline>
Step 3: Progressive disclosure of status
Follow this pattern when examining builds:
- Build state: passed, failed, running, blocked, or canceled?

After pushing code, follow this workflow to monitor the CI build:
Step 1: Find builds for the pushed commit
Use the find-commit-builds script:
~/.claude/skills/buildkite-status/scripts/find-commit-builds.js <org> <commit-sha>
Or manually search using MCP tools with commit filter.
Step 2: Monitor the build
Option A (Preferred): Use MCP wait_for_build tool
mcp__MCPProxy__call_tool('buildkite:wait_for_build', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
timeout: 1800,
poll_interval: 30,
});
This will:
- Poll the build at a fixed interval (set by poll_interval)
- Give up after a maximum wait (set by timeout)

Option B (Fallback): Use wait-for-build.js script
If you prefer background execution:
~/.claude/skills/buildkite-status/scripts/wait-for-build.js <org> <pipeline> <build-number> --timeout 1800 --interval 30
If the script fails (e.g., bktide dependency error), use Option A - the MCP tool is more reliable.
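Both options rely on the same two parameters, a timeout and a poll interval. The loop below is only an illustration of how those two values interact. It is not how the MCP tool or wait-for-build.js is implemented, and `poll_until` is a hypothetical name:

```shell
# poll_until <timeout-seconds> <interval-seconds> <command...>
# Re-runs the command every interval until it succeeds (returns 0) or the
# elapsed time reaches the timeout (returns 1).
poll_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Usage, where build_finished is a hypothetical command that exits 0 once
# the build has reached a terminal state:
#   poll_until 1800 30 build_finished
```

With the values used above (1800 and 30), the check runs at most about 60 times before giving up.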
Step 3: Check on progress
Periodically check the background job or wait for it to complete. When it finishes, check the exit code:
Step 4: Investigate failures
If the build failed, follow the "### 1. Investigating a Build from URL" workflow above.
Note: This workflow is deprecated. Use "### 1. Investigating a Build from URL" and "### 2. Retrieving Job Logs" instead for a more complete investigation process.
When a build has failed, use this systematic approach:
Step 1: Get build overview
mcp__MCPProxy__call_tool('buildkite:get_build', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
detail_level: 'detailed',
job_state: 'failed', // Only show failed jobs
});
This gives you:
Step 2: Check annotations
Some projects put test failures in annotations:
mcp__MCPProxy__call_tool('buildkite:list_annotations', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
});
Look for annotations with style: "error" or style: "warning".
Important: Not all projects use annotations. See references/annotation-patterns.md for project-specific patterns.
Step 3: Examine failed jobs
For each failed job (not "broken" - see state reference below):
Step 4: Understand "broken" vs "failed"
Critical: A job showing as "broken" is often NOT a failure. It typically means:
See references/buildkite-states.md for complete state explanations.
Example: In large monorepos, many jobs show "broken" because they were skipped due to file changes not affecting them. This is normal and expected.
When a build is in blocked state, it's waiting for manual approval:
Step 1: Identify the block step
Get the build with detail_level: "detailed" and look for jobs with state: "blocked".
Step 2: Review what's being blocked
Block steps typically have a label describing what approval is needed (e.g., "Deploy to Production").
Step 3: Unblock if appropriate
Use the MCP tool to unblock:
mcp__MCPProxy__call_tool('buildkite:unblock_job', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
job_id: '<job-id>',
fields: {}, // Optional form fields if the block step has inputs
});
After investigating a failed build (workflows 1-2), use this workflow to reproduce the failure locally for debugging.
Goal: Discover exactly what CI ran - the command, environment, and context.
Step 1: Get the job logs
Use workflow "### 2. Retrieving Job Logs" to get logs for the failed job.
Step 2: Find the actual command
Look early in the log output for the command execution line. Common patterns:
- :docker: Running /bin/sh -e -c '<command>' in service <service>
- Lines prefixed with + when shell trace is enabled

Example log snippet:
Running plugin docker-compose command hook
:docker: Found a pre-built image for app
:docker: Creating docker-compose override file for prebuilt services
:docker: Pulling services app
:docker: Starting dependencies
:docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app
The actual command here is bin/rspec spec/models/user_spec.rb.
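For long logs, the command line can be pulled out mechanically. A minimal sketch, assuming the log uses the docker-compose plugin line format shown in the snippet above. `extract_ci_command` is a hypothetical helper, and the sample log is a shortened copy of that snippet:

```shell
# Hypothetical helper: print the quoted command from lines of the form
#   :docker: Running /bin/sh -e -c '<command>' in service <service>
extract_ci_command() {
  sed -n "s|.*Running /bin/sh -e -c '\(.*\)' in service .*|\1|p" "$1"
}

# Demonstrate on a small sample log.
log=$(mktemp)
cat > "$log" <<'EOF'
Running plugin docker-compose command hook
:docker: Starting dependencies
:docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app
EOF
extract_ci_command "$log"   # prints bin/rspec spec/models/user_spec.rb
rm -f "$log"
```

Logs from steps that do not use the docker-compose plugin will not match this pattern, so an empty result means "look manually", not "no command ran".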
Step 3: Identify environment variables
Check multiple sources in order:
- .buildkite/pipeline.yml or buildkite-builder DSL for env: blocks
- Standard CI variables: CI=true, BUILDKITE_BRANCH, etc.

Step 4: Note the execution context
Record:
Goal: Convert the CI command to something runnable locally.
Step 1: Try the direct command first
Run the extracted command as-is in your local environment:
bin/rspec spec/models/user_spec.rb
This often works for:
- Test runners (rspec, jest, pytest, etc.)

Step 2: If direct fails, try with docker-compose
When the command ran in a Docker context in CI, replicate that locally:
docker-compose run <service> <command>
Example - if CI showed:
:docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app
Try locally:
docker-compose run app bin/rspec spec/models/user_spec.rb
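The same translation can be scripted when there are many log lines to convert. This is a sketch that assumes the log line format from the example above and a service name made of letters, digits, dots, underscores, or hyphens. `to_compose_run` is a hypothetical helper:

```shell
# Hypothetical helper: read log lines on stdin and rewrite each
#   ... Running /bin/sh -e -c '<command>' in service <service>
# as
#   docker-compose run <service> <command>
to_compose_run() {
  sed -n "s|.*Running /bin/sh -e -c '\(.*\)' in service \([A-Za-z0-9_.-]*\).*|docker-compose run \2 \1|p"
}

printf '%s\n' ":docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app" |
  to_compose_run
# prints docker-compose run app bin/rspec spec/models/user_spec.rb
```

The output is a suggestion to review, not something to pipe straight into a shell, since CI commands may reference paths or flags that do not exist locally.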
Step 3: Set relevant environment variables
If the command behaves differently, add env vars discovered in Phase 1:
CI=true RAILS_ENV=test bin/rspec spec/models/user_spec.rb
Or with docker-compose:
docker-compose run -e CI=true -e RAILS_ENV=test app bin/rspec spec/models/user_spec.rb
Step 4: Handle common translation patterns
| CI Pattern | Local Translation |
|------------|-------------------|
| --parallel 4 | --parallel 1 or remove flag |
| --format buildkite | --format progress or remove flag |
| CI-specific artifact paths | Use local paths |
| buildkite-agent artifact download | Download manually or skip |
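The first two rows of the table are plain text substitutions, so they can be sketched with sed. The flags are the illustrative ones from the table; which flags a given test runner accepts is project-specific, and `localize_ci_command` is a hypothetical helper:

```shell
# Hypothetical helper: apply the first two table rows to a CI command string.
#   --parallel <n>      -> removed
#   --format buildkite  -> --format progress
localize_ci_command() {
  printf '%s\n' "$1" |
    sed -e 's/ --parallel [0-9][0-9]*//' \
        -e 's/--format buildkite/--format progress/'
}

localize_ci_command "bin/rspec --parallel 4 --format buildkite spec/models"
# prints bin/rspec --format progress spec/models
```

Artifact paths and buildkite-agent calls are left alone here because they need a human decision about what to substitute.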
Goal: When local reproduction isn't feasible, determine the best alternative.
Decision: Can this be reproduced locally?
Local reproduction is likely NOT feasible when:
Note: Many Buildkite plugins don't block local reproduction - plugins for artifacts, notifications, or caching are CI orchestration concerns, not execution blockers.
Alternative 1: Trigger a test build with debugging changes
Push a branch with modifications to aid debugging:
- Add verbosity flags (--verbose, -vvv)
- Add echo statements or print debugging
- Dump the environment with env | sort or printenv

Alternative 2: Inspect artifacts
Download artifacts from the failed build:
mcp__MCPProxy__call_tool('buildkite:list_artifacts', {
org_slug: '<org>',
pipeline_slug: '<pipeline>',
build_number: '<build-number>',
});
Look for:
Alternative 3: Analyze the failure in place
Sometimes reproduction isn't needed - the logs plus artifacts contain enough information to understand and fix the issue without running it locally.
Buildkite has several states that can be confusing. Here's a quick reference:
Build states:

- passed - All jobs completed successfully ✅
- failed - One or more jobs failed ❌
- running - Build is currently executing 🔄
- blocked - Waiting for manual approval 🚫
- canceled - Build was canceled ⛔

Job states:

- passed - Job succeeded ✅
- failed - Job failed with non-zero exit ❌
- broken - MISLEADING: Usually means skipped due to pipeline logic, NOT a failure ⚠️
- soft_failed - Failed but marked as non-blocking 〰️
- skipped - Job was skipped ⏭️

For complete state reference and project-specific patterns, read references/buildkite-states.md.
Always follow this pattern when checking build status:
Don't immediately jump to logs - the build state and annotations often tell you what you need to know.
Don't: Use gh pr view, gh pr checks, or GitHub API to check Buildkite status
Why: GitHub shows Buildkite check summary only. You lose:
Reality: Always use Buildkite tools. GitHub summarizes; Buildkite is the source of truth.
Don't: "The script failed, so I'll use GitHub tools instead"
Why: The skill documents MULTIPLE tool tiers:
Reality: One tool failing doesn't invalidate the skill. Follow the fallback hierarchy - move to MCP tools, don't abandon Buildkite entirely.
Don't: "This is urgent, I don't have time to follow the skill"
Why: Skills exist ESPECIALLY for high-pressure situations. Disciplined workflows prevent mistakes when you're rushed. Making wrong tool choices under pressure wastes MORE time debugging.
Reality: Following the skill is FASTER than recovering from wrong decisions. Taking 2 minutes to use the right tool saves 20 minutes of confusion.
Don't: "I already know gh pr view works, why learn Buildkite tools?"
Why: Familiarity ≠ effectiveness. You'll spend more time working around GitHub's limitations than learning the proper tools.
Reality: Invest 2 minutes learning Buildkite MCP tools once. Save hours across all future builds.
If you catch yourself thinking ANY of these thoughts, you're about to violate this skill:
These are rationalizations. Stop. Follow the tool hierarchy. Use Buildkite MCP tools.
- Use job_state: "failed" to focus on actual failures
- Monitor with the wait_for_build tool or background scripts
- Use detail_level: "summary" to reduce data

Run scripts with --help for usage information.