Working with Buildkite Builds

Overview

This skill provides workflows and tools for working with Buildkite CI builds. It covers checking status, investigating failures, and reproducing issues locally; it does not cover creating or configuring pipelines. Use it for PR workflows, post-push monitoring, failure investigation, and local reproduction.

Why bktide snapshot?

One command, one URL, gets you everything: build metadata, annotations, and logs for failed steps, all saved to local files you can grep and re-read without burning API calls. The other tools require you to piece together multiple calls and keep track of job UUIDs vs step IDs.

When to Use This Skill

Use this skill when:

Checking CI status for the current branch or PR
Investigating why a build failed
Monitoring builds after a git push
Waiting for builds to complete
Checking build status across multiple repos/PRs
Understanding what "broken" or other Buildkite states mean

Tool Hierarchy and Selection

CRITICAL: Always use Buildkite-native tools. Never fall back to GitHub tools (gh pr view, GitHub API, etc.) - they only show summaries and lose critical information (annotations, logs, real-time updates, state distinctions).

Use tools in this priority order:

Primary: bktide snapshot (Use This First)

Command: npx bktide@latest snapshot <buildkite-url>

The snapshot command is the preferred approach for investigating builds. It:

Parses any Buildkite URL automatically (build URL, step URL, etc.)
Downloads build metadata, annotations, and logs for failed steps
Saves everything to structured files for easy analysis
Provides actionable next-step commands

npx bktide@latest snapshot https://buildkite.com/org/pipeline/builds/123

Run bktide snapshot --help for all options, or bktide prime for detailed LLM-friendly usage guidance.

Output structure:

./tmp/bktide/snapshots/<org>/<pipeline>/<build>/
├── manifest.json      # Step index with states and exit codes
├── build.json         # Full build metadata
├── annotations.json   # Build annotations
└── steps/
    ├── 01-step-name/
    │   ├── log.txt    # Full log output
    │   └── step.json  # Step metadata
    └── 02-another-step/
        └── ...

Useful follow-up commands (shown by snapshot):

# List failures
jq -r '.steps[] | select(.state == "failed") | "\(.id): \(.label)"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json

# View a log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Secondary: Other bktide Commands

For quick queries without full snapshot:

npx bktide@latest pipelines <org>                    # List pipelines
npx bktide@latest builds <org>/<pipeline>            # List builds
npx bktide@latest build <org>/<pipeline>#<build>     # Get build details
npx bktide@latest annotations <org>/<pipeline>#<build>  # Show annotations

Tertiary: MCP Tools (Fallback)

When: bktide is unavailable, or you need programmatic access (wait_for_build, unblock_job)

Available MCP tools:

buildkite:get_build - Get detailed build information
buildkite:list_builds - List builds for a pipeline
buildkite:list_annotations - Get annotations for a build
buildkite:get_pipeline - Get pipeline configuration
buildkite:list_pipelines - List all pipelines in an org
buildkite:wait_for_build - Wait for a build to complete (useful for monitoring)
buildkite:get_logs - Retrieve job logs
buildkite:get_logs_info - Get log metadata
buildkite:list_artifacts - List build artifacts
buildkite:unblock_job - Unblock a blocked job (see workflow 6)

Tool Capability Matrix

| Capability        | bktide snapshot | bktide CLI | MCP Tools |
| ----------------- | --------------- | ---------- | --------- |
| Parse any BK URL  | ✅              | ❌         | ❌        |
| Get build details | ✅              | ✅         | ✅        |
| Get annotations   | ✅              | ✅         | ✅        |
| Retrieve logs     | ✅              | ❌         | ✅        |
| Save to files     | ✅              | ❌         | ❌        |
| Wait for build    | ❌              | ❌         | ✅        |
| Unblock jobs      | ❌              | ❌         | ✅        |

This tool preference order can be overridden via ~/.config/pickled-claude-plugins/buildkite.yml. A PreToolUse hook enforces your preference by intercepting bk CLI commands that overlap with bktide capabilities.

When Tools Fail: Fallback Hierarchy

If bktide fails:

✅ Use equivalent MCP tool
❌ Do NOT fall back to GitHub tools

If MCP tools fail:

✅ Check MCP server connection status
✅ Restart MCP connection
✅ Report the MCP failure to your human partner
❌ Do NOT fall back to GitHub tools

Critical: One tool failing does NOT mean the entire skill is invalid. Move to fallback tools, don't abandon Buildkite tools.

Core Workflows

1. Investigating a Build from URL (Most Common)

When a user provides a Buildkite URL for a failing build, use bktide snapshot to gather all context.

Step 1: Capture the build snapshot

npx bktide@latest snapshot <buildkite-url>

This works with any Buildkite URL format:

Build URL: https://buildkite.com/org/pipeline/builds/12345
Step URL: https://buildkite.com/org/pipeline/builds/12345/steps/canvas?sid=019a5f...

The snapshot command will:

Parse the URL automatically
Download build metadata, annotations, and logs
Save everything to ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/
Show a summary and helpful next-step commands
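
As a rough illustration of what parsing the URL involves, the sketch below splits the two URL shapes listed above into their parts and derives the snapshot directory. It is not bktide's actual implementation: the function names parseBuildkiteUrl and snapshotDir are made up for this example, and it assumes the host is always buildkite.com.

```javascript
// Hypothetical helper, not part of bktide. Splits a Buildkite build or step
// URL into org, pipeline, build number, and (if present) the step ID.
function parseBuildkiteUrl(url) {
  const parsed = new URL(url);
  // Path looks like /<org>/<pipeline>/builds/<number>[/steps/...]
  const match = parsed.pathname.match(/^\/([^/]+)\/([^/]+)\/builds\/(\d+)(?:\/|$)/);
  if (parsed.hostname !== 'buildkite.com' || !match) {
    throw new Error(`Not a Buildkite build URL: ${url}`);
  }
  const [, org, pipeline, build] = match;
  return {
    org,
    pipeline,
    build: Number(build),
    // Step URLs carry a step ID (not a job UUID) in the sid query parameter.
    stepId: parsed.searchParams.get('sid'),
  };
}

// Directory layout from the "Output structure" listing.
function snapshotDir({ org, pipeline, build }) {
  return `./tmp/bktide/snapshots/${org}/${pipeline}/${build}/`;
}

const parsedBuild = parseBuildkiteUrl('https://buildkite.com/org/pipeline/builds/12345');
console.log(parsedBuild, snapshotDir(parsedBuild));
```

Note that stepId is a step ID taken from the URL, which is not interchangeable with the job UUID that the MCP log tools expect.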

Step 2: Review the summary

The command output shows:

Build state (passed/failed/running)
Step counts (how many passed/failed/broken)
Snapshot location

Step 3: Filter to root failures

Most builds have 1-3 root failures and dozens-to-hundreds of dependent BROKEN steps. Counting BROKEN/RUNNING steps as failures sends you down rabbit holes investigating noise.

First, see the state distribution:

jq -r '.steps[].state' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json | sort | uniq -c | sort -rn

Only these are real failures:

Steps with state FINISHED (or FAILED) AND non-zero exit_status

These are NOT failures — do not count or investigate them:

BROKEN — Downstream dependencies that never ran. Auto-resolve when the root failure is fixed.
RUNNING — Still in progress.
WAITING / SCHEDULED — Haven't started yet.
CANCELED — Manually or automatically canceled.

Find actual root failures:

jq -r '.steps[] | select((.state == "FINISHED" or .state == "FAILED") and .exit_status != null and .exit_status != 0) | "\(.id): \(.label) (exit \(.exit_status))"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json

Example: A summary reads "466 steps: 43 passed, 397 failed, 361 running". Filtering shows that only 1 of the 397 "failed" steps is an actual failure (e.g. codeownership validation); the other 396 are BROKEN dependents. Fix the root failure and the dependents should pass on the next build.
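
The same filtering can be written in JavaScript when jq is not at hand. This is only a sketch: it assumes manifest.json has the shape implied by the jq queries above (a steps array whose entries carry id, label, state, and exit_status), and the sample data is invented.

```javascript
// Invented sample mirroring the fields the jq filters above rely on.
const manifest = {
  steps: [
    { id: '01-lint', label: 'Lint', state: 'FINISHED', exit_status: 0 },
    { id: '02-codeownership', label: 'Codeownership', state: 'FINISHED', exit_status: 1 },
    { id: '03-unit-tests', label: 'Unit tests', state: 'BROKEN', exit_status: null },
    { id: '04-deploy', label: 'Deploy', state: 'BROKEN', exit_status: null },
    { id: '05-docs', label: 'Docs', state: 'RUNNING', exit_status: null },
  ],
};

// Count steps per state, like `sort | uniq -c`.
function countStates(steps) {
  const counts = {};
  for (const step of steps) {
    counts[step.state] = (counts[step.state] || 0) + 1;
  }
  return counts;
}

// A root failure ran to completion and exited non-zero. BROKEN, RUNNING,
// WAITING, SCHEDULED, and CANCELED steps never satisfy this condition.
function rootFailures(steps) {
  return steps.filter(
    (step) =>
      (step.state === 'FINISHED' || step.state === 'FAILED') &&
      step.exit_status != null &&
      step.exit_status !== 0
  );
}

console.log(countStates(manifest.steps));
for (const step of rootFailures(manifest.steps)) {
  console.log(`${step.id}: ${step.label} (exit ${step.exit_status})`);
}
```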

Step 4: Read the failing step's log

# View a specific log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Step 5: Analyze error output

Look for:

Stack traces
Test failure messages
Exit codes and error messages
File paths and line numbers

Step 6: Reproduce locally

Follow the "Reproducing Build Failures Locally" workflow below to:

Extract the exact command CI ran (visible in the log)
Translate it to a local equivalent
Triage if local reproduction isn't feasible

2. Retrieving Job Logs

Preferred: Use bktide snapshot (see workflow 1)

The snapshot command automatically downloads logs for failed/broken steps. After running snapshot:

# View a specific step's log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search across all captured logs
grep -r "pattern" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Fallback: MCP tools

If you need logs without running snapshot (e.g., for a specific job, or snapshot isn't available):

Step 1: Get the job UUID

mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed',
});

Find the job by matching the label field. Extract its uuid field.

Step 2: Retrieve logs

mcp__MCPProxy__call_tool('buildkite:get_logs', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  job_id: '<job-uuid>',
});

Common Issues:

"job not found" error: You likely provided a step ID instead of a job UUID. Step IDs come from URLs (sid=019a5f...). Job UUIDs come from get_build API responses.
Empty logs: The job may not have started yet. Check the job's state field first.
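
Matching by label and reading the uuid can look like the sketch below. The response shape is an assumption (a jobs array whose entries have label, uuid, and state); check the actual get_build output before relying on it. The UUIDs and the helper name jobUuidForLabel are invented.

```javascript
// Assumed shape of a get_build response. Only label, uuid, and state are
// named in this document; the jobs array name is a guess.
const buildResponse = {
  jobs: [
    { uuid: '00000000-0000-0000-0000-000000000001', label: 'Lint', state: 'passed' },
    { uuid: '00000000-0000-0000-0000-000000000002', label: 'RSpec', state: 'failed' },
  ],
};

// Return the job UUID for a label, or throw listing the labels that exist.
function jobUuidForLabel(response, label) {
  const job = response.jobs.find((candidate) => candidate.label === label);
  if (!job) {
    const known = response.jobs.map((candidate) => candidate.label).join(', ');
    throw new Error(`No job labelled "${label}". Known labels: ${known}`);
  }
  return job.uuid;
}

console.log(jobUuidForLabel(buildResponse, 'RSpec'));
```

Failing loudly on an unknown label makes the step ID versus job UUID mix-up visible before the get_logs call is made.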

3. Checking Current Branch/PR Status

This is the most common workflow when working on a branch:

Step 1: Identify the pipeline and branch

Determine which pipeline(s) run on PRs for this repository. Common patterns:

Repository name matches pipeline slug
Monorepo may have pipeline named after the main repo

Step 2: Find builds for the current branch

Use MCP tools to list recent builds:

mcp__MCPProxy__call_tool('buildkite:list_builds', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  branch: '<branch-name>',
  detail_level: 'summary',
});

Or use bktide:

npx bktide@latest builds --format json <org>/<pipeline>

Step 3: Progressive disclosure of status

Follow this pattern when examining builds:

Overall state - Is it passed, failed, running, blocked, or canceled?
Job summary - How many jobs passed/failed/broken?
Annotations (if present) - Check for test failures, warnings, or errors
Failed job details - Get logs for actually failed jobs (not just "broken")

4. Post-Push Monitoring Workflow

After pushing code, follow this workflow to monitor the CI build:

Step 1: Find builds for the pushed commit

Use the find-commit-builds script:

~/.claude/skills/buildkite-status/scripts/find-commit-builds.js <org> <commit-sha>

Or manually search using MCP tools with commit filter.

Step 2: Monitor the build

Option A (Preferred): Use MCP wait_for_build tool

mcp__MCPProxy__call_tool('buildkite:wait_for_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  timeout: 1800,
  poll_interval: 30,
});

This will:

Poll every 30 seconds (configurable with poll_interval)
Report status changes
Complete when build reaches terminal state (passed/failed/canceled)
Timeout after 30 minutes (configurable with timeout)
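
For intuition, the polling behavior described above amounts to a loop like the one below. This is not the MCP tool's implementation: waitForBuild and the injected getState and sleep functions are hypothetical stand-ins so the sketch runs without network access.

```javascript
const TERMINAL_STATES = new Set(['passed', 'failed', 'canceled']);

// Poll getState() until the build reaches a terminal state or the timeout
// elapses. timeout and pollInterval are in seconds, as in the call above.
async function waitForBuild(getState, { timeout = 1800, pollInterval = 30, sleep } = {}) {
  const pause =
    sleep || ((seconds) => new Promise((resolve) => setTimeout(resolve, seconds * 1000)));
  let elapsed = 0;
  let previous = null;
  while (true) {
    const state = await getState();
    if (state !== previous) {
      console.log(`state: ${state}`); // report status changes only
      previous = state;
    }
    if (TERMINAL_STATES.has(state)) return state;
    // Give up if the next poll would land past the deadline.
    if (elapsed + pollInterval > timeout) return 'timeout';
    await pause(pollInterval);
    elapsed += pollInterval;
  }
}

// Demo with a canned sequence of states and a no-op sleep.
const cannedStates = ['running', 'running', 'passed'];
const demo = waitForBuild(async () => cannedStates.shift(), { sleep: async () => {} });
demo.then((finalState) => console.log(`finished: ${finalState}`));
```

A blocked build is not in the terminal set, so a loop like this would run until the timeout if the build is waiting on manual approval.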

Option B (Fallback): Use wait-for-build.js script

If you prefer background execution:

~/.claude/skills/buildkite-status/scripts/wait-for-build.js <org> <pipeline> <build-number> --timeout 1800 --interval 30

If the script fails (e.g., bktide dependency error), use Option A - the MCP tool is more reliable.

Step 3: Check on progress

Periodically check the background job or wait for it to complete. When it finishes, check the exit code:

0 = passed
1 = failed
2 = canceled
3 = timeout
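
A wrapper that reacts to these exit codes might start from a lookup like the following sketch. The mapping comes from the list above; describeExit is a hypothetical name, and any code outside 0 to 3 is treated here as an unknown script error.

```javascript
// Exit codes of wait-for-build.js as listed above.
const EXIT_OUTCOMES = { 0: 'passed', 1: 'failed', 2: 'canceled', 3: 'timeout' };

// Translate an exit code into an outcome and flag whether the failure
// investigation workflow applies.
function describeExit(code) {
  const outcome = EXIT_OUTCOMES[code] ?? 'unknown';
  return { code, outcome, investigate: outcome === 'failed' };
}

console.log(describeExit(1));
```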

Step 4: Investigate failures

If the build failed, follow the "1. Investigating a Build from URL" workflow above.

5. Investigating Failures (Deprecated)

Note: This workflow is deprecated. Use "1. Investigating a Build from URL" and "2. Retrieving Job Logs" instead for a more complete investigation process.

When a build has failed, use this systematic approach:

Step 1: Get build overview

mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed', // Only show failed jobs
});

This gives you:

Overall build state
Job summary (how many failed vs broken)
List of failed jobs only

Step 2: Check annotations

Some projects put test failures in annotations:

mcp__MCPProxy__call_tool('buildkite:list_annotations', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
});

Look for annotations with style: "error" or style: "warning".

Important: Not all projects use annotations. See references/annotation-patterns.md for project-specific patterns.
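
Filtering annotations by style could look like this. The sketch assumes list_annotations returns an array of objects with a style field, as the text above suggests; the context and body fields and the sample entries are invented for illustration.

```javascript
// Invented sample. Only the style field is named in this document.
const annotations = [
  { style: 'info', context: 'coverage', body: 'Coverage report uploaded' },
  { style: 'warning', context: 'lint', body: '2 deprecation warnings' },
  { style: 'error', context: 'rspec', body: '3 examples failed' },
];

// Keep error and warning annotations, errors first.
function actionableAnnotations(list) {
  const rank = new Map([
    ['error', 0],
    ['warning', 1],
  ]);
  return list
    .filter((annotation) => rank.has(annotation.style))
    .sort((a, b) => rank.get(a.style) - rank.get(b.style));
}

for (const annotation of actionableAnnotations(annotations)) {
  console.log(`[${annotation.style}] ${annotation.context}: ${annotation.body}`);
}
```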

Step 3: Examine failed jobs

For each failed job (not "broken" - see state reference below):

Get the job details from the build data
Check the job's log output
Look for stack traces, error messages, or test failures

Step 4: Understand "broken" vs "failed"

Critical: A job showing as "broken" is often NOT a failure. It typically means:

The job was skipped because an earlier job failed
The job's dependencies weren't met
Conditional pipeline logic determined the job wasn't needed

See references/buildkite-states.md for complete state explanations.

Example: In large monorepos, many jobs show "broken" because they were skipped due to file changes not affecting them. This is normal and expected.

6. Checking Blocked Builds

When a build is in blocked state, it's waiting for manual approval:

Step 1: Identify the block step

Get the build with detail_level: "detailed" and look for jobs with state: "blocked".

Step 2: Review what's being blocked

Block steps typically have a label describing what approval is needed (e.g., "Deploy to Production").

Step 3: Unblock if appropriate

Use the MCP tool to unblock:

mcp__MCPProxy__call_tool('buildkite:unblock_job', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  job_id: '<job-id>',
  fields: {}, // Optional form fields if the block step has inputs
});

7. Reproducing Build Failures Locally

After investigating a failed build (workflows 1-2), use this workflow to reproduce the failure locally for debugging.

Phase 1: Extract

Goal: Discover exactly what CI ran - the command, environment, and context.

Step 1: Get the job logs

Use workflow "2. Retrieving Job Logs" to get logs for the failed job.

Step 2: Find the actual command

Look early in the log output for the command execution line. Common patterns:

Docker-compose plugin: :docker: Running /bin/sh -e -c '<command>' in service <service>
Shell trace: Lines starting with + when shell trace is enabled
Direct command steps: Command appears after "Running command" or similar

Example log snippet:

Running plugin docker-compose command hook
:docker: Found a pre-built image for app
:docker: Creating docker-compose override file for prebuilt services
:docker: Pulling services app
:docker: Starting dependencies
:docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app

The actual command here is bin/rspec spec/models/user_spec.rb.
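
Pulling the command out of a docker-compose plugin log can be automated with a regular expression. The sketch below is keyed to the exact line format shown above and will not match shell-trace or direct command steps; real logs may also carry timestamps or ANSI codes that need stripping first. extractDockerCommand is a hypothetical helper.

```javascript
// Matches: :docker: Running /bin/sh -e -c '<command>' in service <service>
const DOCKER_RUN_LINE = /:docker: Running \/bin\/sh -e -c '(.+)' in service (\S+)/;

// Return { command, service } for the first matching line, or null.
function extractDockerCommand(logText) {
  for (const line of logText.split('\n')) {
    const match = line.match(DOCKER_RUN_LINE);
    if (match) return { command: match[1], service: match[2] };
  }
  return null;
}

const sampleLog = [
  'Running plugin docker-compose command hook',
  ':docker: Pulling services app',
  ':docker: Starting dependencies',
  ":docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app",
].join('\n');

console.log(extractDockerCommand(sampleLog));
```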

Step 3: Identify environment variables

Check multiple sources in order:

Log output - Some pipelines print env vars at job start
Pipeline config - Check .buildkite/pipeline.yml or buildkite-builder DSL for env: blocks
Buildkite defaults - See references/buildkite-environment-variables.md for standard vars like CI=true, BUILDKITE_BRANCH, etc.
Discover through failure - If local run fails differently, a missing env var may be the cause

Step 4: Note the execution context

Record:

Was it running in Docker? Which service/image?
What working directory?
Any pre-command setup visible in the log?

Phase 2: Translate

Goal: Convert the CI command to something runnable locally.

Step 1: Try the direct command first

Run the extracted command as-is in your local environment:

bin/rspec spec/models/user_spec.rb

This often works for:

Test commands (rspec, jest, pytest, etc.)
Linting/formatting tools
Simple scripts

Step 2: If direct fails, try with docker-compose

When the command ran in a Docker context in CI, replicate that locally:

docker-compose run <service> <command>

Example - if CI showed: :docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app

Try locally:

docker-compose run app bin/rspec spec/models/user_spec.rb

Step 3: Set relevant environment variables

If the command behaves differently, add env vars discovered in Phase 1:

CI=true RAILS_ENV=test bin/rspec spec/models/user_spec.rb

Or with docker-compose:

docker-compose run -e CI=true -e RAILS_ENV=test app bin/rspec spec/models/user_spec.rb

Step 4: Handle common translation patterns

| CI Pattern                        | Local Translation                |
|-----------------------------------|----------------------------------|
| --parallel 4                      | --parallel 1 or remove flag      |
| --format buildkite                | --format progress or remove flag |
| CI-specific artifact paths        | Use local paths                  |
| buildkite-agent artifact download | Download manually or skip        |
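
Putting Steps 2 to 4 together, a local command can be assembled mechanically. The sketch below handles only the --format buildkite row of the table and the docker-compose form shown earlier; toLocalCommand is a hypothetical helper, the env values are examples, and no shell quoting is applied to them.

```javascript
// Build a docker-compose invocation from the pieces gathered in Phase 1.
function toLocalCommand({ command, service, env = {} }) {
  // Swap the CI-only formatter for a local one (one row of the table above).
  const localCommand = command.replace('--format buildkite', '--format progress');
  const envFlags = Object.entries(env).map(([name, value]) => `-e ${name}=${value}`);
  return ['docker-compose', 'run', ...envFlags, service, localCommand].join(' ');
}

console.log(
  toLocalCommand({
    command: 'bin/rspec spec/models/user_spec.rb',
    service: 'app',
    env: { CI: 'true', RAILS_ENV: 'test' },
  })
);
```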

Phase 3: Triage

Goal: When local reproduction isn't feasible, determine the best alternative.

Decision: Can this be reproduced locally?

Local reproduction is likely NOT feasible when:

Command needs infrastructure not available locally (specific databases, internal APIs, secrets)
Environment differences are too significant (specific Linux dependencies, network topology)
Failure is tied to CI-specific state (parallelization across agents, artifact dependencies from earlier steps)

Note: Many Buildkite plugins don't block local reproduction - plugins for artifacts, notifications, or caching are CI orchestration concerns, not execution blockers.

Alternative 1: Trigger a test build with debugging changes

Push a branch with modifications to aid debugging:

Add verbose flags to the failing command (e.g., --verbose, -vvv)
Insert echo statements or print debugging
Add env var dumping: env | sort or printenv
Modify environment variables in pipeline config
Temporarily simplify the command to isolate the issue

Alternative 2: Inspect artifacts

Download artifacts from the failed build:

mcp__MCPProxy__call_tool('buildkite:list_artifacts', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
});

Look for:

Test output files (JUnit XML, coverage reports)
Log files written during execution
Screenshots or other debug outputs

Alternative 3: Analyze the failure in place

Sometimes reproduction isn't needed - the logs plus artifacts contain enough information to understand and fix the issue without running it locally.

Understanding Buildkite States

Buildkite has several states that can be confusing. Here's a quick reference:

Build States

passed - All jobs completed successfully ✅
failed - One or more jobs failed ❌
running - Build is currently executing 🔄
blocked - Waiting for manual approval 🚫
canceled - Build was canceled ⛔

Job States

passed - Job succeeded ✅
failed - Job failed with non-zero exit ❌
broken - MISLEADING: Usually means skipped due to pipeline logic, NOT a failure ⚠️
soft_failed - Failed but marked as non-blocking 〰️
skipped - Job was skipped ⏭️

For complete state reference and project-specific patterns, read references/buildkite-states.md.

Progressive Disclosure Pattern

Always follow this pattern when checking build status:

Start broad: Overall build state (passed/failed/running)
Check summary: Job counts (how many passed/failed/broken)
Check annotations: If present, they often contain key information
Drill into failures: Only examine failed jobs (not broken)
Read logs: Get actual error messages and stack traces

Don't immediately jump to logs - the build state and annotations often tell you what you need to know.

Project-Specific Patterns

Large Projects / Monorepos

Use annotations heavily: Test failures are usually summarized in annotations
Many "broken" jobs: Normal due to conditional execution
Complex job graphs: Jobs have dependencies and conditional logic
Check annotations first: They save time vs reading all logs

Small Projects

No annotations: All information is in job logs
Simpler job structure: Fewer dependencies and conditions
"Broken" is unusual: May indicate an actual problem
Read logs directly: No annotations to summarize failures

Anti-Patterns: What NOT to Do

❌ Falling Back to GitHub Tools

Don't: Use gh pr view, gh pr checks, or GitHub API to check Buildkite status

Why: GitHub shows Buildkite check summary only. You lose:

Real-time build logs and output
Annotations with test failure details
Job-level breakdown and states
Ability to distinguish "broken" (skipped) from "failed"
Direct build monitoring and waiting
Proper state information

Reality: Always use Buildkite tools. GitHub summarizes; Buildkite is the source of truth.

❌ Abandoning Skill on Tool Failure

Don't: "The script failed, so I'll use GitHub tools instead"

Why: The skill documents MULTIPLE tool tiers:

bktide snapshot (primary, use this first)
Other bktide commands (secondary, quick queries)
MCP tools (tertiary, fallback and programmatic access)
Bundled scripts (helpers for specific workflows)

Reality: One tool failing doesn't invalidate the skill. Follow the fallback hierarchy - move to MCP tools, don't abandon Buildkite entirely.

❌ Emergency Override Rationalization

Don't: "This is urgent, I don't have time to follow the skill"

Why: Skills exist ESPECIALLY for high-pressure situations. Disciplined workflows prevent mistakes when you're rushed. Making wrong tool choices under pressure wastes MORE time debugging.

Reality: Following the skill is FASTER than recovering from wrong decisions. Taking 2 minutes to use the right tool saves 20 minutes of confusion.

❌ "I Already Know X" Rationalization

Don't: "I already know gh pr view works, why learn Buildkite tools?"

Why: Familiarity ≠ effectiveness. You'll spend more time working around GitHub's limitations than learning the proper tools.

Reality: Invest 2 minutes learning the Buildkite-native tools (bktide and the MCP tools) once. Save hours across all future builds.

Red Flags - STOP

If you catch yourself thinking ANY of these thoughts, you're about to violate this skill:

"The script failed, so the skill doesn't apply"
"This is an emergency, no time for the skill"
"I already know gh pr view works"
"GitHub tools show the same information"
"I'll just check GitHub quickly"
"One tool failed, so I'll use what I know"
"The skill is for normal situations, not emergencies"
"I don't have time to learn new tools right now"

These are rationalizations. Stop. Follow the tool hierarchy. Use Buildkite-native tools (bktide first, then MCP tools).

Common Mistakes to Avoid

Treating "broken" as "failed": Broken usually means skipped, not failed
Ignoring annotations: They often contain the most actionable information
Not filtering by state: Use job_state: "failed" to focus on actual failures
Missing blocked builds: A blocked build won't progress without manual intervention
Polling in foreground: Use MCP wait_for_build tool or background scripts

Tips for Efficient Status Checking

Use detail levels: Start with detail_level: "summary" to reduce data
Filter by job state: Request only failed jobs when investigating
Background monitoring: Run wait-for-build.js in background after pushing
Check annotations first: For projects that use them, they're faster than logs
Trust the scripts: The bundled scripts handle polling, timeouts, and edge cases

Resources

References

buildkite-states.md - Complete guide to Buildkite states, including the misleading "broken" state and project-specific patterns
annotation-patterns.md - How different projects use annotations and when to check them
tool-capabilities.md - Comprehensive capability matrix for MCP tools, bktide, and scripts
url-parsing.md - Understanding Buildkite URLs, step IDs vs job UUIDs
troubleshooting.md - Common errors, solutions, and decision tree for when stuck
buildkite-environment-variables.md - Standard Buildkite environment variables for local reproduction

Scripts

wait-for-build.js - Background monitoring with timeout and polling
find-commit-builds.js - Find builds for a specific commit
get-build-logs.js - Helper for log retrieval with UUID resolution (placeholder)
parse-buildkite-url.js - Extract components from Buildkite URLs

Run scripts with --help for usage information.

Working with Buildkite Builds

Overview

Why bktide snapshot?

When to Use This Skill

Use this skill when:

Checking CI status for the current branch or PR
Investigating why a build failed
Monitoring builds after a git push
Waiting for builds to complete
Checking build status across multiple repos/PRs
Understanding what "broken" or other Buildkite states mean

Tool Hierarchy and Selection

Use tools in this priority order:

Primary: bktide snapshot (Use This First)

Command: npx bktide@latest snapshot <buildkite-url>

The snapshot command is the preferred approach for investigating builds. It:

Parses any Buildkite URL automatically (build URL, step URL, etc.)
Downloads build metadata, annotations, and logs for failed steps
Saves everything to structured files for easy analysis
Provides actionable next-step commands

npx bktide@latest snapshot https://buildkite.com/org/pipeline/builds/123

Run bktide snapshot --help for all options, or bktide prime for detailed LLM-friendly usage guidance.

Output structure:

./tmp/bktide/snapshots/<org>/<pipeline>/<build>/
├── manifest.json      # Step index with states and exit codes
├── build.json         # Full build metadata
├── annotations.json   # Build annotations
└── steps/
    ├── 01-step-name/
    │   ├── log.txt    # Full log output
    │   └── step.json  # Step metadata
    └── 02-another-step/
        └── ...

Useful follow-up commands (shown by snapshot):

# List failures
jq -r '.steps[] | select(.state == "failed") | "\(.id): \(.label)"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json

# View a log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Secondary: Other bktide Commands

For quick queries without full snapshot:

npx bktide@latest pipelines <org>                    # List pipelines
npx bktide@latest builds <org>/<pipeline>            # List builds
npx bktide@latest build <org>/<pipeline>#<build>     # Get build details
npx bktide@latest annotations <org>/<pipeline>#<build>  # Show annotations

Tertiary: MCP Tools (Fallback)

When: bktide unavailable, or need programmatic access (wait_for_build, unblock)

Available MCP tools:

buildkite:get_build - Get detailed build information
buildkite:list_builds - List builds for a pipeline
buildkite:list_annotations - Get annotations for a build
buildkite:get_pipeline - Get pipeline configuration
buildkite:list_pipelines - List all pipelines in an org
buildkite:wait_for_build - Wait for a build to complete (useful for monitoring)
buildkite:get_logs - Retrieve job logs
buildkite:get_logs_info - Get log metadata
buildkite:list_artifacts - List build artifacts

Tool Capability Matrix

This tool preference order can be overridden via ~/.config/pickled-claude-plugins/buildkite.yml. A PreToolUse hook enforces your preference by intercepting bk CLI commands that overlap with bktide capabilities.

When Tools Fail: Fallback Hierarchy

If bktide fails:

✅ Use equivalent MCP tool
❌ Do NOT fall back to GitHub tools

If MCP tools fail:

✅ Check MCP server connection status
✅ Restart MCP connection
✅ Report the MCP failure to your human partner
❌ Do NOT fall back to GitHub tools

Critical: One tool failing does NOT mean the entire skill is invalid. Move to fallback tools, don't abandon Buildkite tools.

Core Workflows

1. Investigating a Build from URL (Most Common)

When a user provides a Buildkite URL for a failing build, use bktide snapshot to gather all context.

Step 1: Capture the build snapshot

npx bktide@latest snapshot <buildkite-url>

This works with any Buildkite URL format:

Build URL: https://buildkite.com/org/pipeline/builds/12345
Step URL: https://buildkite.com/org/pipeline/builds/12345/steps/canvas?sid=019a5f...

The snapshot command will:

Parse the URL automatically
Download build metadata, annotations, and logs
Save everything to ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/
Show a summary and helpful next-step commands

Step 2: Review the summary

The command output shows:

Build state (passed/failed/running)
Step counts (how many passed/failed/broken)
Snapshot location

Step 3: Filter to root failures

Most builds have 1-3 root failures and dozens-to-hundreds of dependent BROKEN steps. Counting BROKEN/RUNNING steps as failures sends you down rabbit holes investigating noise.

First, see the state distribution:

jq -r '.steps[].state' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json | sort | uniq -c | sort -rn

Only these are real failures:

Steps with state FINISHED (or FAILED) AND non-zero exit_status

These are NOT failures — do not count or investigate them:

BROKEN — Downstream dependencies that never ran. Auto-resolve when the root failure is fixed.
RUNNING — Still in progress.
WAITING / SCHEDULED — Haven't started yet.
CANCELED — Manually or automatically canceled.

Find actual root failures:

jq -r '.steps[] | select((.state == "FINISHED" or .state == "FAILED") and .exit_status != null and .exit_status != 0) | "\(.id): \(.label) (exit \(.exit_status))"' ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/manifest.json

Step 4: Read the failing step's log

# View a specific log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search for errors across all logs
grep -r "Error\|Failed\|Exception" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Step 5: Analyze error output

Look for:

Stack traces
Test failure messages
Exit codes and error messages
File paths and line numbers

Step 6: Reproduce locally

Follow the "Reproducing Build Failures Locally" workflow below to:

Extract the exact command CI ran (visible in the log)
Translate it to a local equivalent
Triage if local reproduction isn't feasible

2. Retrieving Job Logs

Preferred: Use bktide snapshot (see workflow 1)

The snapshot command automatically downloads logs for failed/broken steps. After running snapshot:

# View a specific step's log
cat ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/<step-id>/log.txt

# Search across all captured logs
grep -r "pattern" ./tmp/bktide/snapshots/<org>/<pipeline>/<build>/steps/

Fallback: MCP tools

If you need logs without running snapshot (e.g., for a specific job, or snapshot isn't available):

Step 1: Get the job UUID

mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed',
});

Find the job by matching the label field. Extract its uuid field.

Step 2: Retrieve logs

mcp__MCPProxy__call_tool('buildkite:get_logs', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  job_id: '<job-uuid>',
});

Common Issues:

"job not found" error: You likely provided a step ID instead of a job UUID. Step IDs come from URLs (sid=019a5f...). Job UUIDs come from get_build API responses.
Empty logs: The job may not have started yet. Check the job's state field first.

3. Checking Current Branch/PR Status

This is the most common workflow when working on a branch:

Step 1: Identify the pipeline and branch

Determine which pipeline(s) run on PRs for this repository. Common patterns:

Repository name matches pipeline slug
Monorepo may have pipeline named after the main repo

Step 2: Find builds for the current branch

Use MCP tools to list recent builds:

mcp__MCPProxy__call_tool('buildkite:list_builds', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  branch: '<branch-name>',
  detail_level: 'summary',
});

Or use bktide:

npx bktide@latest builds --format json <org>/<pipeline>

Step 3: Progressive disclosure of status

Follow this pattern when examining builds:

Overall state - Is it passed, failed, running, blocked, or canceled?
Job summary - How many jobs passed/failed/broken?
Annotations (if present) - Check for test failures, warnings, or errors
Failed job details - Get logs for actually failed jobs (not just "broken")

4. Post-Push Monitoring Workflow

After pushing code, follow this workflow to monitor the CI build:

Step 1: Find builds for the pushed commit

Use the find-commit-builds script:

~/.claude/skills/buildkite-status/scripts/find-commit-builds.js <org> <commit-sha>

Or manually search using MCP tools with commit filter.

Step 2: Monitor the build

Option A (Preferred): Use MCP wait_for_build tool

mcp__MCPProxy__call_tool('buildkite:wait_for_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  timeout: 1800,
  poll_interval: 30,
});

This will:

Poll every 30 seconds (configurable with poll_interval)
Report status changes
Complete when build reaches terminal state (passed/failed/canceled)
Time out after 30 minutes (configurable with timeout)
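
For intuition, the polling behavior listed above could look roughly like the sketch below. It is not the wait_for_build implementation: waitForBuild and fetchState are hypothetical names, and fetchState stands in for whatever call returns the current build state.

```javascript
// Terminal build states named in this document.
const TERMINAL_STATES = new Set(['passed', 'failed', 'canceled']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchState is a caller-supplied async function that returns the current
// build state as a string (an assumption made for this sketch).
async function waitForBuild(fetchState, { timeoutSec = 1800, pollIntervalSec = 30 } = {}) {
  const deadline = Date.now() + timeoutSec * 1000;
  let lastState = null;
  while (Date.now() < deadline) {
    const state = await fetchState();
    if (state !== lastState) {
      console.log(`build state: ${state}`); // report status changes only
      lastState = state;
    }
    if (TERMINAL_STATES.has(state)) return state;
    await sleep(pollIntervalSec * 1000);
  }
  return 'timeout';
}
```

Injecting fetchState keeps the loop independent of any particular API client, so it can be exercised with a stub.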

Option B (Fallback): Use wait-for-build.js script

If you prefer background execution:

~/.claude/skills/buildkite-status/scripts/wait-for-build.js <org> <pipeline> <build-number> --timeout 1800 --interval 30

If the script fails (e.g., bktide dependency error), use Option A - the MCP tool is more reliable.

Step 3: Check on progress

Periodically check the background job or wait for it to complete. When it finishes, check the exit code:

0 = passed
1 = failed
2 = canceled
3 = timeout
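
If you script around wait-for-build.js, the exit codes above can be turned into readable outcomes with a lookup like this one. The names EXIT_CODE_MEANINGS and describeExitCode are hypothetical; only the four code values come from the list above.

```javascript
// Exit codes documented above for wait-for-build.js.
const EXIT_CODE_MEANINGS = { 0: 'passed', 1: 'failed', 2: 'canceled', 3: 'timeout' };

// Anything outside the documented set is reported as unknown rather than guessed.
function describeExitCode(code) {
  return EXIT_CODE_MEANINGS[code] ?? `unknown (exit code ${code})`;
}
```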

Step 4: Investigate failures

If the build failed, follow the "1. Investigating a Build from URL" workflow above.

5. Investigating Failures (Deprecated)

Note: This workflow is deprecated. Use "1. Investigating a Build from URL" and "2. Retrieving Job Logs" instead for a more complete investigation process.

When a build has failed, use this systematic approach:

Step 1: Get build overview

mcp__MCPProxy__call_tool('buildkite:get_build', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  detail_level: 'detailed',
  job_state: 'failed', // Only show failed jobs
});

This gives you:

Overall build state
Job summary (how many failed vs broken)
List of failed jobs only

Step 2: Check annotations

Some projects put test failures in annotations:

mcp__MCPProxy__call_tool('buildkite:list_annotations', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
});

Look for annotations with style: "error" or style: "warning".

Important: Not all projects use annotations. See references/annotation-patterns.md for project-specific patterns.
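
As a rough illustration of the style filter, the sketch below keeps only error and warning annotations and lists errors first. The annotation shape ({ style, context }) and the name pickProblemAnnotations are assumptions; check the actual list_annotations response before relying on these field names.

```javascript
// Hypothetical annotation shape: { style: string, context: string }.
const STYLE_RANK = { error: 0, warning: 1 };

function pickProblemAnnotations(annotations) {
  return annotations
    .filter((annotation) => annotation.style in STYLE_RANK)
    .sort((a, b) => STYLE_RANK[a.style] - STYLE_RANK[b.style]);
}
```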

Step 3: Examine failed jobs

For each failed job (not "broken" - see state reference below):

Get the job details from the build data
Check the job's log output
Look for stack traces, error messages, or test failures

Step 4: Understand "broken" vs "failed"

Critical: A job showing as "broken" is often NOT a failure. It typically means:

The job was skipped because an earlier job failed
The job's dependencies weren't met
Conditional pipeline logic determined the job wasn't needed

See references/buildkite-states.md for complete state explanations.

Example: In large monorepos, many jobs show "broken" because they were skipped when the changed files did not affect them. This is normal and expected.

6. Checking Blocked Builds

When a build is in blocked state, it's waiting for manual approval:

Step 1: Identify the block step

Get the build with detail_level: "detailed" and look for jobs with state: "blocked".

Step 2: Review what's being blocked

Block steps typically have a label describing what approval is needed (e.g., "Deploy to Production").

Step 3: Unblock if appropriate

Use the MCP tool to unblock:

mcp__MCPProxy__call_tool('buildkite:unblock_job', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
  job_id: '<job-id>',
  fields: {}, // Optional form fields if the block step has inputs
});

7. Reproducing Build Failures Locally

After investigating a failed build (workflows 1-2), use this workflow to reproduce the failure locally for debugging.

Phase 1: Extract

Goal: Discover exactly what CI ran - the command, environment, and context.

Step 1: Get the job logs

Use workflow "2. Retrieving Job Logs" to get logs for the failed job.

Step 2: Find the actual command

Look early in the log output for the command execution line. Common patterns:

Docker-compose plugin: :docker: Running /bin/sh -e -c '<command>' in service <service>
Shell trace: Lines starting with + when shell trace is enabled
Direct command steps: Command appears after "Running command" or similar

Example log snippet:

Running plugin docker-compose command hook
:docker: Found a pre-built image for app
:docker: Creating docker-compose override file for prebuilt services
:docker: Pulling services app
:docker: Starting dependencies
:docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app

The actual command here is bin/rspec spec/models/user_spec.rb.
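
Pulling the command out of a saved log can be automated for the docker-compose pattern. The sketch below assumes the line format shown in the example snippet; extractDockerCommand is a hypothetical name, and other plugins or plugin versions may print a different line.

```javascript
// Matches the docker-compose plugin line shown in the log snippet above.
const DOCKER_RUN_LINE = /:docker: Running \/bin\/sh -e -c '(.+)' in service (\S+)/;

// Returns the first matching command and service, or null if none is found.
function extractDockerCommand(logText) {
  for (const line of logText.split('\n')) {
    const match = DOCKER_RUN_LINE.exec(line);
    if (match) return { command: match[1], service: match[2] };
  }
  return null;
}
```

A null result is a cue to fall back to the other patterns, such as shell trace lines starting with +.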

Step 3: Identify environment variables

Check multiple sources in order:

Log output - Some pipelines print env vars at job start
Pipeline config - Check .buildkite/pipeline.yml or buildkite-builder DSL for env: blocks
Buildkite defaults - See references/buildkite-environment-variables.md for standard vars like CI=true, BUILDKITE_BRANCH, etc.
Discover through failure - If local run fails differently, a missing env var may be the cause

Step 4: Note the execution context

Record:

Was it running in Docker? Which service/image?
What working directory?
Any pre-command setup visible in the log?

Phase 2: Translate

Goal: Convert the CI command to something runnable locally.

Step 1: Try the direct command first

Run the extracted command as-is in your local environment:

bin/rspec spec/models/user_spec.rb

This often works for:

Test commands (rspec, jest, pytest, etc.)
Linting/formatting tools
Simple scripts

Step 2: If direct fails, try with docker-compose

When the command ran in a Docker context in CI, replicate that locally:

docker-compose run <service> <command>

Example - if CI showed: :docker: Running /bin/sh -e -c 'bin/rspec spec/models/user_spec.rb' in service app

Try locally:

docker-compose run app bin/rspec spec/models/user_spec.rb

Step 3: Set relevant environment variables

If the command behaves differently, add env vars discovered in Phase 1:

CI=true RAILS_ENV=test bin/rspec spec/models/user_spec.rb

Or with docker-compose:

docker-compose run -e CI=true -e RAILS_ENV=test app bin/rspec spec/models/user_spec.rb
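
When several env vars are involved, assembling the docker-compose line by hand gets error-prone. The following sketch builds the same command string from its parts; dockerComposeRun and shellQuote are hypothetical helpers, and the quoting assumes a POSIX shell.

```javascript
// Leave simple values bare; wrap anything else in single quotes (POSIX shells).
function shellQuote(value) {
  const text = String(value);
  if (/^[A-Za-z0-9_.\/:-]+$/.test(text)) return text;
  return "'" + text.replace(/'/g, "'\\''") + "'";
}

// Build: docker-compose run [-e KEY=VALUE ...] <service> <command>
function dockerComposeRun(service, command, env = {}) {
  const envFlags = Object.entries(env).map(([key, value]) => `-e ${key}=${shellQuote(value)}`);
  return ['docker-compose', 'run', ...envFlags, service, command].join(' ');
}
```

The command argument is passed through unchanged, since it is copied verbatim from the CI log.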

Step 4: Handle common translation patterns

Phase 3: Triage

Goal: When local reproduction isn't feasible, determine the best alternative.

Decision: Can this be reproduced locally?

Local reproduction is likely NOT feasible when:

Command needs infrastructure not available locally (specific databases, internal APIs, secrets)
Environment differences are too significant (specific Linux dependencies, network topology)
Failure is tied to CI-specific state (parallelization across agents, artifact dependencies from earlier steps)

Note: Many Buildkite plugins don't block local reproduction - plugins for artifacts, notifications, or caching are CI orchestration concerns, not execution blockers.

Alternative 1: Trigger a test build with debugging changes

Push a branch with modifications to aid debugging:

Add verbose flags to the failing command (e.g., --verbose, -vvv)
Insert echo statements or print debugging
Add env var dumping: env | sort or printenv
Modify environment variables in pipeline config
Temporarily simplify the command to isolate the issue

Alternative 2: Inspect artifacts

Download artifacts from the failed build:

mcp__MCPProxy__call_tool('buildkite:list_artifacts', {
  org_slug: '<org>',
  pipeline_slug: '<pipeline>',
  build_number: '<build-number>',
});

Look for:

Test output files (JUnit XML, coverage reports)
Log files written during execution
Screenshots or other debug outputs
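
One way to scan a long artifact list for these is to bucket paths by file extension. This is a heuristic sketch only: the artifact shape ({ path }), the extension-to-category mapping, and the name groupArtifacts are all assumptions, not part of the list_artifacts tool.

```javascript
// Assumed mapping from extension to the categories listed above.
const CATEGORY_BY_EXTENSION = {
  '.xml': 'test-output',
  '.log': 'log',
  '.png': 'screenshot',
  '.jpg': 'screenshot',
};

// Hypothetical artifact shape: { path: string }.
function groupArtifacts(artifacts) {
  const groups = {};
  for (const artifact of artifacts) {
    const dot = artifact.path.lastIndexOf('.');
    const ext = dot >= 0 ? artifact.path.slice(dot).toLowerCase() : '';
    const category = CATEGORY_BY_EXTENSION[ext] ?? 'other';
    if (!groups[category]) groups[category] = [];
    groups[category].push(artifact.path);
  }
  return groups;
}
```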

Alternative 3: Analyze the failure in place

Sometimes reproduction isn't needed - the logs plus artifacts contain enough information to understand and fix the issue without running it locally.

Understanding Buildkite States

Buildkite has several states that can be confusing. Here's a quick reference:

Build States

passed - All jobs completed successfully ✅
failed - One or more jobs failed ❌
running - Build is currently executing 🔄
blocked - Waiting for manual approval 🚫
canceled - Build was canceled ⛔

Job States

passed - Job succeeded ✅
failed - Job failed with non-zero exit ❌
broken - MISLEADING: Usually means skipped due to pipeline logic, NOT a failure ⚠️
soft_failed - Failed but marked as non-blocking 〰️
skipped - Job was skipped ⏭️

For complete state reference and project-specific patterns, read references/buildkite-states.md.

Progressive Disclosure Pattern

Always follow this pattern when checking build status:

Start broad: Overall build state (passed/failed/running)
Check summary: Job counts (how many passed/failed/broken)
Check annotations: If present, they often contain key information
Drill into failures: Only examine failed jobs (not broken)
Read logs: Get actual error messages and stack traces

Don't immediately jump to logs - the build state and annotations often tell you what you need to know.

Project-Specific Patterns

Large Projects / Monorepos

Use annotations heavily: Test failures are usually summarized in annotations
Many "broken" jobs: Normal due to conditional execution
Complex job graphs: Jobs have dependencies and conditional logic
Check annotations first: They save time vs reading all logs

Small Projects

No annotations: All information is in job logs
Simpler job structure: Fewer dependencies and conditions
"Broken" is unusual: May indicate an actual problem
Read logs directly: No annotations to summarize failures

Anti-Patterns: What NOT to Do

❌ Falling Back to GitHub Tools

Don't: Use gh pr view, gh pr checks, or GitHub API to check Buildkite status

Why: GitHub shows only the Buildkite check summary. You lose:

Real-time build logs and output
Annotations with test failure details
Job-level breakdown and states
Ability to distinguish "broken" (skipped) from "failed"
Direct build monitoring and waiting
Proper state information

Reality: Always use Buildkite tools. GitHub summarizes; Buildkite is the source of truth.

❌ Abandoning Skill on Tool Failure

Don't: "The script failed, so I'll use GitHub tools instead"

Why: The skill documents MULTIPLE tool tiers:

bktide snapshot (primary, use this first)
Other bktide commands (secondary)
MCP tools (tertiary, fallback)
Scripts (helpers)

Reality: One tool failing doesn't invalidate the skill. Follow the fallback hierarchy - move to MCP tools, don't abandon Buildkite entirely.

❌ Emergency Override Rationalization

Don't: "This is urgent, I don't have time to follow the skill"

Why: Skills exist ESPECIALLY for high-pressure situations. Disciplined workflows prevent mistakes when you're rushed. Making wrong tool choices under pressure wastes MORE time debugging.

Reality: Following the skill is FASTER than recovering from wrong decisions. Taking 2 minutes to use the right tool saves 20 minutes of confusion.

❌ "I Already Know X" Rationalization

Don't: "I already know gh pr view works, why learn Buildkite tools?"

Why: Familiarity ≠ effectiveness. You'll spend more time working around GitHub's limitations than learning the proper tools.

Reality: Invest 2 minutes learning Buildkite MCP tools once. Save hours across all future builds.

Red Flags - STOP

If you catch yourself thinking ANY of these thoughts, you're about to violate this skill:

"The script failed, so the skill doesn't apply"
"This is an emergency, no time for the skill"
"I already know gh pr view works"
"GitHub tools show the same information"
"I'll just check GitHub quickly"
"One tool failed, so I'll use what I know"
"The skill is for normal situations, not emergencies"
"I don't have time to learn new tools right now"

These are rationalizations. Stop. Follow the tool hierarchy. Use Buildkite MCP tools.

Common Mistakes to Avoid

Treating "broken" as "failed": Broken usually means skipped, not failed
Ignoring annotations: They often contain the most actionable information
Not filtering by state: Use job_state: "failed" to focus on actual failures
Missing blocked builds: A blocked build won't progress without manual intervention
Polling in foreground: Use MCP wait_for_build tool or background scripts

Tips for Efficient Status Checking

Use detail levels: Start with detail_level: "summary" to reduce data
Filter by job state: Request only failed jobs when investigating
Background monitoring: Run wait-for-build.js in background after pushing
Check annotations first: For projects that use them, they're faster than logs
Trust the scripts: The bundled scripts handle polling, timeouts, and edge cases

Resources

References

buildkite-states.md - Complete guide to Buildkite states, including the misleading "broken" state and project-specific patterns
annotation-patterns.md - How different projects use annotations and when to check them
tool-capabilities.md - Comprehensive capability matrix for MCP tools, bktide, and scripts
url-parsing.md - Understanding Buildkite URLs, step IDs vs job UUIDs
troubleshooting.md - Common errors, solutions, and decision tree for when stuck
buildkite-environment-variables.md - Standard Buildkite environment variables for local reproduction

Scripts

wait-for-build.js - Background monitoring with timeout and polling
find-commit-builds.js - Find builds for a specific commit
get-build-logs.js - Helper for log retrieval with UUID resolution (placeholder)
parse-buildkite-url.js - Extract components from Buildkite URLs

Run scripts with --help for usage information.
