skills/worker-image-investigation/SKILL.md
Investigate Taskcluster task failures caused by worker images — extract image versions, compare passing vs failing tasks, find pass/fail cliffs across branches, debug Azure VMs via `az` CLI, and spin up throwaway debug VMs from the same image a pool uses. Use when a CI failure looks image-related. DO NOT USE FOR triggering a new image build (use worker-image-build).
Investigate Taskcluster task failures by comparing worker images, extracting SBOM info, and debugging Azure VMs.
Run via the installed skill path:
WII=~/.claude/skills/worker-image-investigation/scripts/investigate.py
- taskcluster CLI: `brew install taskcluster`
- az CLI (for VM debugging): `brew install azure-cli && az login`
- uv for running scripts

# Investigate a failing task - get worker pool, image version, status
uv run "$WII" investigate <TASK_ID>
uv run "$WII" investigate https://firefox-ci-tc.services.mozilla.com/tasks/<TASK_ID>
# Compare two tasks (e.g., passing vs failing on same revision)
uv run "$WII" compare <PASSING_TASK_ID> <FAILING_TASK_ID>
# List running workers in a pool (for Azure VM access)
uv run "$WII" workers gecko-t/win11-64-24h2
# Get SBOM/image info for a worker pool
uv run "$WII" sbom gecko-t/win11-64-24h2
# Get Windows build and GenericWorker version from Azure VM
uv run "$WII" vm-info <VM_NAME> <RESOURCE_GROUP>
# Batch compare all failed tasks in a task group
uv run "$WII" batch-compare U0vOaaW-T-i5nN79edugYA
# Filter to specific alpha pool
uv run "$WII" batch-compare U0vOaaW-T-i5nN79edugYA \
--alpha-pool gecko-t/win11-64-24h2-alpha
# Find tasks that likely failed due to image changes
uv run "$WII" find-image-regressions U0vOaaW-T-i5nN79edugYA
# Generate sheriff-friendly markdown report
uv run "$WII" sheriff-report <TASK_ID>
When investigating image upgrades, analyze all failures at once.
The find-image-regressions command identifies failures that are likely caused by image changes:
| Signal | Meaning |
|--------|---------|
| Task on alpha pool | Using new/staging image |
| Different image versions | Alpha has different version than production |
| likelyImageRegression: true | High confidence this is image-related |
{
"taskGroupId": "U0vOaaW-T-i5nN79edugYA",
"likelyImageRegressions": 3,
"regressions": [
{
"taskId": "Xcac5C8gRqiOT13YsVRX8A",
"taskLabel": "mochitest-chrome-1proc",
"alphaPool": "gecko-t/win11-64-24h2-alpha",
"alphaImageVersion": "1.0.9",
"productionPool": "gecko-t/win11-64-24h2",
"productionImageVersion": "1.0.8",
"versionDiffers": true,
"likelyImageRegression": true
}
]
}
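The report above can be reduced to a short list of tasks worth a closer look. A minimal Python sketch, assuming the output has exactly the shape of the sample (field names are copied from it); `summarize_regressions` is a hypothetical helper, not part of investigate.py:

```python
import json


def summarize_regressions(report_json: str) -> list[str]:
    """Return one summary line per entry flagged likelyImageRegression."""
    report = json.loads(report_json)
    lines = []
    for entry in report.get("regressions", []):
        # Skip entries the report did not flag as likely image regressions.
        if not entry.get("likelyImageRegression"):
            continue
        lines.append(
            f'{entry["taskId"]} {entry["taskLabel"]}: '
            f'{entry["productionImageVersion"]} -> {entry["alphaImageVersion"]}'
        )
    return lines
```

Feed it the captured stdout of the find-image-regressions command; each returned line pairs a task with its production and alpha image versions.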
Generate a markdown summary suitable for sharing with sheriffs:
uv run "$WII" sheriff-report Xcac5C8gRqiOT13YsVRX8A
## Sheriff Triage Summary
**Task**: `Xcac5C8gRqiOT13YsVRX8A`
**Test**: `mochitest-chrome-1proc`
**Status**: failed
### Worker Pool Comparison
| Property | Value |
|----------|-------|
| **Failing Pool** | `gecko-t/win11-64-24h2-alpha` |
| **Failing Image Version** | 1.0.9 |
| **Production Pool** | `gecko-t/win11-64-24h2` |
| **Production Image Version** | 1.0.8 |
| **Version Differs** | Yes |
### Verdict: **IMAGE REGRESSION**
Image version differs between alpha and production pools
| Verdict | Meaning |
|---------|---------|
| IMAGE REGRESSION | Alpha has different image version than production |
| NEEDS INVESTIGATION | Same image version - could be code or intermittent |
| PRODUCTION FAILURE | Failure on production pool - likely code issue |
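The three verdicts can be read as a small decision rule. This is a sketch of one plausible reading, not the logic investigate.py actually uses; `triage_verdict` and both of its parameters are hypothetical names:

```python
def triage_verdict(on_alpha_pool: bool, version_differs: bool) -> str:
    """Map the two signals onto the verdict names used in the report."""
    if not on_alpha_pool:
        # Failure on the production pool: likely a code issue.
        return "PRODUCTION FAILURE"
    if version_differs:
        # Alpha runs a different image version than production.
        return "IMAGE REGRESSION"
    # Same image version: could be code or an intermittent.
    return "NEEDS INVESTIGATION"
```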
# Get task info including worker pool and image version
uv run "$WII" investigate <FAILING_TASK_ID>
Output includes: taskId, taskLabel, workerPool, workerId, imageVersion, status.
treeherder-cli --similar-history takes a Treeherder job ID, not a Taskcluster task ID.
Resolve it from the task ID:
REPO=autoland # or try, mozilla-central, etc.
curl -s "https://treeherder.mozilla.org/api/project/${REPO}/jobs/?task_id=<TASK_ID>&count=5" \
| jq '.results[0] | {id, job_type_name, result, platform}'
Save the id field as JOB_ID.
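The same lookup can be done from Python. A minimal sketch, assuming the response carries a `results` array whose entries have an `id` field, as the jq filter above implies; `jobs_url` and `first_job_id` are hypothetical helper names:

```python
import json
from urllib.parse import urlencode

TREEHERDER = "https://treeherder.mozilla.org"


def jobs_url(repo: str, task_id: str, count: int = 5) -> str:
    """Build the jobs query URL for a Taskcluster task ID."""
    query = urlencode({"task_id": task_id, "count": count})
    return f"{TREEHERDER}/api/project/{repo}/jobs/?{query}"


def first_job_id(response_body: str):
    """Return the Treeherder job ID of the first result, or None if there is none."""
    results = json.loads(response_body).get("results", [])
    return results[0]["id"] if results else None
```

Fetch the body with `urllib.request.urlopen(jobs_url("autoland", task_id)).read()` and pass it to `first_job_id`. A `None` result usually means the task ran on a different repo than the one queried.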
Run --similar-history with a large count on the branch where failures are observed.
A sudden flip from passing to all-failing indicates a regression — possibly an image update.
treeherder-cli --similar-history <JOB_ID> --similar-count 200 --repo autoland --json \
| jq '[.[] | {result, job_type_name, push_timestamp}]'
Look for a timestamp where results flip from success to testfailed/busted.
Correlate that date against when the image was rolled out.
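Scanning 200 rows by eye is error-prone, so the flip can also be located programmatically. A minimal sketch, assuming each row has the `result` and `push_timestamp` fields selected by the jq filter above and that `push_timestamp` is a sortable number; `find_cliff` is a hypothetical helper:

```python
from typing import Optional

PASS = {"success"}
FAIL = {"testfailed", "busted"}


def find_cliff(jobs: list[dict]) -> Optional[int]:
    """Return the push_timestamp of the first failure after the last pass.

    Returns None when the window has no pass at all, or when nothing
    failed after the most recent pass.
    """
    ordered = sorted(jobs, key=lambda j: j["push_timestamp"])
    last_pass_index = -1
    for i, job in enumerate(ordered):
        if job["result"] in PASS:
            last_pass_index = i
    if last_pass_index == -1:
        return None
    later_failures = [j for j in ordered[last_pass_index + 1:] if j["result"] in FAIL]
    return later_failures[0]["push_timestamp"] if later_failures else None
```

Anchoring on the last pass rather than the first failure keeps earlier intermittent failures from being mistaken for the cliff.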
If autoland shows all failures, check mozilla-central (which may be on an older image) to confirm whether the job type still passes elsewhere:
job_type="<job_type_name from step 2>"
enc=$(jq -nr --arg v "$job_type" '$v|@uri')
for repo in autoland mozilla-central mozilla-beta; do
curl -s "https://treeherder.mozilla.org/api/project/${repo}/jobs/?job_type_name=${enc}&result=success&count=50" \
| jq -r --arg repo "$repo" '
if (.results|length)==0 then "\($repo)\tNO_PASSING_RUNS"
else (.results|last) as $r | "\($repo)\t\($r.last_modified)\t\($r.id)"
end'
done
If mozilla-central shows recent passes but autoland does not, the failure is branch/image-specific.
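That rule can be applied mechanically to the loop's output. A minimal sketch, assuming the tab-separated lines the jq filter above prints (`repo`, then either `NO_PASSING_RUNS` or a timestamp and job ID); both function names are hypothetical:

```python
def repos_with_passes(loop_output: str) -> dict[str, bool]:
    """Map each repo in the loop output to whether it has a passing run."""
    status = {}
    for line in loop_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # A repo passes when its second column is anything but the sentinel.
        status[fields[0]] = len(fields) > 1 and fields[1] != "NO_PASSING_RUNS"
    return status


def is_branch_specific(status: dict[str, bool],
                       failing_repo: str = "autoland",
                       reference_repo: str = "mozilla-central") -> bool:
    """True when the reference repo passes but the failing repo does not."""
    return (not status.get(failing_repo, False)) and status.get(reference_repo, False)
```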
Grab a passing task ID from step 4 (e.g., from a mozilla-central run) and compare:
uv run "$WII" compare <PASSING_TASK_ID> <FAILING_TASK_ID>
Look for differences in imageVersion (e.g., 1.0.8 vs 1.0.9).
A version bump that aligns with the pass/fail cliff confirms the image caused the regression.
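When the two `investigate` outputs are already saved, the differing fields can be pulled out directly. A minimal sketch, assuming each output is a JSON object with the fields listed earlier (`workerPool`, `imageVersion`, `status`); `diff_tasks` is a hypothetical helper and does not reproduce what the `compare` command does internally:

```python
FIELDS = ("workerPool", "imageVersion", "status")


def diff_tasks(passing: dict, failing: dict) -> dict[str, tuple]:
    """Return {field: (passing_value, failing_value)} for fields that differ."""
    return {
        field: (passing.get(field), failing.get(field))
        for field in FIELDS
        if passing.get(field) != failing.get(field)
    }
```

An `imageVersion` entry in the result is the signal to check against the pass/fail cliff.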
# Find running workers
uv run "$WII" workers gecko-t/win11-64-24h2
# Get VM details - extract VM name from workerId (e.g., vm-xyz...)
uv run "$WII" vm-info vm-xyz RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION
When a failure looks like it happened before the task was ever claimed (no worker, claim
timeout, mysterious termination), check what Taskcluster's worker-manager and worker-scanner
saw with tc-logview:
# Lifecycle for the pool over a window
tc-logview query -e fx-ci --type worker-removed \
--where 'workerPoolId="gecko-t/win11-64-24h2"' --since 24h --json
# Trace one VM end-to-end across worker-manager events
tc-logview query -e fx-ci --service worker-manager --since 2h \
--filter '"vm-abc123"' --raw
# Hunt Azure-side errors (preemption, quota, capacity)
tc-logview query -e fx-ci --service worker-manager --since 2h \
--filter '"OperationPreempted"' --json
Install: go install github.com/taskcluster/tc-logview@latest. See the /taskcluster skill's
references/tc-logview.md for the full guide.
For deeper investigation, use Azure CLI directly:
# Get Windows build number
az vm run-command invoke --resource-group RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION \
--name <VM_NAME> --command-id RunPowerShellScript \
--scripts "(Get-ItemProperty 'HKLM:\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion').CurrentBuild"
# Get GenericWorker version
az vm run-command invoke --resource-group RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION \
--name <VM_NAME> --command-id RunPowerShellScript \
--scripts "Get-Content C:\\generic-worker\\generic-worker-info.json"
# Get recent Windows updates
az vm run-command invoke --resource-group RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION \
--name <VM_NAME> --command-id RunPowerShellScript \
--scripts "Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 10"
# Check for file system filters (AppLocker, etc.)
az vm run-command invoke --resource-group RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION \
--name <VM_NAME> --command-id RunPowerShellScript \
--scripts "fltMC"
Spin up a throwaway Azure VM from the same image a worker pool uses.
- Name the VM with a dbg- prefix (e.g., dbg-1a2b3c).
- Use the /taskcluster skill to look up the worker pool configuration and extract the vmSize from launchConfigs. Never hardcode or assume a VM size.
- Use Password1! as the admin password.

1. Use the /taskcluster skill to get the worker pool config for the target pool (e.g., gecko-t/win11-64-24h2). Extract the vmSize and image reference from config.launchConfigs.

az vm create \
--resource-group jmoss-win11 \
--name "dbg-abc123" \
--image "<image-id-from-step-1>" \
--size "<vmSize-from-step-1>" \
--admin-username azureuser \
--admin-password "Password1!" \
--location eastus
az vm run-command invoke \
--resource-group jmoss-win11 \
--name "dbg-abc123" \
--command-id RunPowerShellScript \
--scripts "<POWERSHELL_COMMAND>"
az vm delete --resource-group jmoss-win11 \
--name "dbg-abc123" --yes
See references/azure-commands.md for the full command reference.
- RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION
- RG-TASKCLUSTER-WORKER-MANAGER-STAGING

| Pool | Description |
|------|-------------|
| gecko-t/win11-64-24h2 | Windows 11 24H2 64-bit production |
| gecko-t/win11-64-24h2-alpha | Windows 11 24H2 64-bit alpha (os-integration) |
| gecko-t/win11-32-24h2 | Windows 11 24H2 32-bit |
RG-TASKCLUSTER-WORKER-MANAGER-PRODUCTION contains all Taskcluster Azure Windows 10, Windows 11, and Windows Server machines.

For CI tasks on fxci-config PRs:
uv run "$WII" --root-url https://stage.taskcluster.nonprod.cloudops.mozgcp.net \
investigate <TASK_ID>
All commands return JSON for easy parsing with jq:
uv run "$WII" investigate <TASK_ID> | jq '.imageVersion'
uv run "$WII" workers gecko-t/win11-64-24h2 | jq '.workers[0].workerId'
Some Windows worker SBOM markdown artifacts are UTF-16LE encoded. If text looks garbled, decode before parsing:
curl -sL <SBOM_URL> | iconv -f UTF-16LE -t UTF-8
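When the artifact is fetched from a script instead of the shell, the same decoding can be done in Python. A minimal sketch using a simple heuristic (a UTF-16LE byte order mark, or NUL bytes in the payload); `decode_sbom` is a hypothetical helper and the heuristic is an assumption, not something the skill specifies:

```python
import codecs


def decode_sbom(raw: bytes) -> str:
    """Decode SBOM bytes as UTF-16LE when they look like it, else as UTF-8."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        # Strip the byte order mark before decoding.
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if b"\x00" in raw:
        # Mostly-ASCII text encoded as UTF-16LE has a NUL after each character.
        return raw.decode("utf-16-le")
    return raw.decode("utf-8")
```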
tc-logview

- Name debug VMs like dbg-1a2b3c.
- Never hardcode vmSize for debug VMs. Pull it from the worker pool config (launchConfigs via the taskcluster skill) so you match what real workers use.
- Some Windows worker SBOM markdown artifacts are UTF-16LE encoded. Decode them with iconv -f UTF-16LE -t UTF-8.
- treeherder-cli --similar-history takes a Treeherder job ID (numeric), not a Taskcluster task ID. Resolve via /api/project/{repo}/jobs/?task_id=... before passing it in.
- For failures that happen before a task is claimed, workers and vm-info are useless; go straight to tc-logview for the worker-manager view.
- Use Password1! for throwaway debug VMs. Don't try to pass complex passwords on the az vm create command line; quoting is unforgiving.

fxci-config/worker-images/development