skills/cluster-blame/SKILL.md
Audit active or recent Slurm queue state to find likely job-shape misconfigurations that strand shared cluster capacity (CPU, memory, GPU) and block scheduling for others. Use when users ask why resources appear idle, who may be blocking allocation, which jobs/users look misconfigured, or when preparing evidence for neutral outreach. Keep the workflow strictly read-only: inspect and report only, never cancel, edit, reprioritize, or otherwise mutate jobs or cluster state.
Inspect Slurm scheduler state and node packing to identify probable resource-stranding job submissions while avoiding false accusations.
Treat output as evidence-backed candidate attribution, not certainty: label findings by impact and confidence, separate policy effects from user-level misfit, and produce neutral follow-up language.
- `cluster-monitor`: default to quick-status first for operational questions, return concrete units and timestamps, make scope/identity explicit, and treat legitimate queue waiting (`Priority`, `Resources`, dependencies) as normal scheduler behavior until fit/fragmentation evidence shows avoidable stranding.
- `investigate`: build explicit hypotheses, try to falsify attribution, and report coverage gaps and uncertainty instead of overstating certainty.
- `summarize`: lead with high-impact findings and clearly separate facts, inferences, and unknowns.

- … `rg -> find/grep`, `python -> python3`, alternate repo-native scripts).
- … `plan/`, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
- `investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review`; `cleanup scan -> prioritize -> clean -> verify -> re-scan`; `docs audit -> update -> verify -> re-audit`.
- … `organise-docs` frequently during execution to capture durable decisions and learnings, not only at the end.
- … `git-commit` when changes are commit-eligible, checks are green, and repo policy permits commits.
- … `docs/` accurate and up to date, and promote durable learnings and decisions from work into docs.
- … `git-commit` to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
- … `organise-docs` whenever durable learnings/decisions appear, and prune stale `plan/` scratch artifacts.

The skill is complete only when all of the following are true:
- … `blocked` with concrete blocker evidence.
- … `done`, `blocked`, or `not-applicable`, with brief evidence or rationale.

Stop only after this terminal contract is satisfied; otherwise continue iterating.
- `done`: requested outcome is delivered and required checks are completed (for example expected artifact/report produced and required validation command(s) passed).
- `blocked`: progress cannot continue after bounded retries because of a concrete dependency or access issue; blocker evidence and exact unblock action are reported.
- `not-applicable`: an optional step is explicitly skipped with reason (for example no remote configured, so push step is marked `not-applicable`).

When the user asks for a fast status answer (for example `who is blocking compute right now`), run quick-scan mode first.
Quick-scan mode requirements:
- … `squeue`, `sinfo`, `scontrol show node`, `scontrol show job -d`, `sprio`).
- … `CPU used/total`, `GPU used/total`, `mem used/total`).
- … `likely blocker`, `possible blocker`, and `policy-driven` explicitly.

Use quick-scan mode for prompts like:
- `who is blocking resources right now`
- `why are gpus idle`
- `which users are stranding capacity`
- `check current cluster blockers`

Use deep-attribution mode for prompts like:
- `deeply research why resources are not fully allocated`
- `separate policy effects from misconfiguration`
- `rank top blockers with confidence`
- `draft outreach messages`

- … `cwd`, verify required tools with `command -v`, and verify referenced files/directories exist before reading or searching them.
- … `scancel`, `scontrol update`, `scontrol hold`, `scontrol release`, `srun`, `sbatch`, `salloc`, or any command that mutates scheduler or job state.
- … `candidate`, `possible`, `likely`) and include confidence levels.

Determine and record:
- `cluster_user`: analyst identity for command scope.
- `cluster_host`: cluster endpoint used for evidence.
- `partition_scope`: analyzed partitions (for example `training`, `gpu`).
- `analysis_window`: live snapshot time and any historical range.

Unless explicitly asked otherwise, analyze all users in scope because the goal is shared-capacity attribution, not only self-jobs.
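
As a minimal sketch of the scope fields and the read-only rule above, the following Python records the analysis scope and checks a command string against an allowlist before anything is run. The `AnalysisScope` class, the `is_read_only` helper, and the allowlist are illustrative assumptions, not part of the skill; the allowlist only mirrors the inspection commands this document names and is deliberately conservative.

```python
import shlex
from dataclasses import dataclass

# Assumption: allowlist built from the inspection commands named in this skill.
READ_ONLY_COMMANDS = {"squeue", "sinfo", "sprio", "sacct"}
# Assumption: for these tools only the `show` subcommand is treated as read-only.
SHOW_ONLY_COMMANDS = {"scontrol", "sacctmgr"}
# Refuse anything that could chain or redirect commands in a shell.
SHELL_METACHARS = set(";|&`$<>")


@dataclass(frozen=True)
class AnalysisScope:
    """Hypothetical record of the four scope fields listed above."""
    cluster_user: str
    cluster_host: str
    partition_scope: tuple
    analysis_window: str
    all_users: bool = True  # default: analyze every user in scope


def is_read_only(command: str) -> bool:
    """Return True only for allowlisted commands; everything else is refused."""
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if not tokens:
        return False
    tool = tokens[0]
    if tool in READ_ONLY_COMMANDS:
        return True
    if tool in SHOW_ONLY_COMMANDS:
        # Conservative: `show` must be the first argument.
        return len(tokens) > 1 and tokens[1] == "show"
    return False
```

An allowlist is used here instead of a denylist because the rule forbids "any command that mutates scheduler or job state", which a fixed list of forbidden names cannot cover. The check does not unwrap `ssh` invocations; a wrapper would need to validate the remote command separately.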
- … quick-scan mode for fast answers (for example `who is blocking resources right now`).
- … deep-attribution mode for root-cause separation, confidence scoring, and outreach-ready evidence.
- … quick-scan mode is selected, run only quick-scan workflow and report; do not run deeper steps unless asked.

- … `squeue`),
- … `sinfo` + `scontrol show node`),
- … `ReqTRES`/`AllocTRES` (`scontrol show job -d`),
- … `sprio`, `sacctmgr show qos`) only if attribution depends on it.

- … `pwd`, identity, and connectivity to Slurm host.
- … `ssh` + `squeue`/`sacct`/`scontrol`.
- … `ssh`, `squeue`/`sacct` via remote if local tools absent, `rg`, `python3`).

- … `squeue` for running/pending jobs with user, partition, reason, and allocated node list.
- … `sinfo` and `scontrol show node` for node states and free/allocated CPU, memory, GPU.
- … `scontrol show job -d` for suspected jobs to get `ReqTRES`, `AllocTRES`, and constraints.
- … `sprio` and `sacctmgr show qos` when priority policy may explain waiting.
- … `sacct` records for trend confirmation.

Flag only with evidence. Common high-signal patterns:
- `gpu=0` jobs on GPU nodes requesting near-full node memory and stranding multiple GPUs.

For each candidate, capture:
- … `ReqTRES` and `AllocTRES`,
- … `7 GPUs idle on node X while mem free is 0`).

Before attributing to user misconfiguration, test policy explanations:
- … `DefMemPerGPU`, `DefCpuPerGPU`) that inflate allocations.
- … `afterany`, arrays) and reservation constraints.

Apply explicit falsification checks:
- … `policy-likely` unless job-specific evidence contradicts.
- … `mixed` and avoid single-user blame language.

Classify each finding as one of:
- `submission-likely`: likely user-level request misfit.
- `policy-likely`: mostly scheduler/QoS/defaults behavior.
- `mixed`: both policy and submission shape contribute.

Use an explicit score:
- `impact`: amount of stranded capacity and expected queue delay contribution.
- `confidence`: strength of attribution after policy checks.

Confidence rubric:
- `high`: direct node/job evidence plus failed falsification checks.
- `medium`: strong indicators but at least one unresolved policy confounder.
- `low`: plausible but not well-separated from policy or transient effects.

Prioritize only high-impact findings. Avoid naming low-confidence users as primary blockers.
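
The classification labels, confidence rubric, and prioritization rule above could be encoded roughly as follows. This is a sketch under stated assumptions: the `Finding` class, the `confidence_level` and `primary_blockers` helpers, the numeric `impact` value (here a count of stranded GPUs), and the `min_impact` threshold are all hypothetical, and the rubric mapping is one possible reading of the text, not a rule the skill defines.

```python
from dataclasses import dataclass

ATTRIBUTIONS = {"submission-likely", "policy-likely", "mixed"}
CONFIDENCE_ORDER = {"high": 2, "medium": 1, "low": 0}


def confidence_level(direct_evidence: bool,
                     survived_falsification: bool,
                     unresolved_policy_confounders: int) -> str:
    """One possible reading of the confidence rubric."""
    if (direct_evidence and survived_falsification
            and unresolved_policy_confounders == 0):
        return "high"
    if direct_evidence and unresolved_policy_confounders >= 1:
        return "medium"
    return "low"


@dataclass(frozen=True)
class Finding:
    user: str
    attribution: str   # one of ATTRIBUTIONS
    impact: float      # assumption: stranded GPUs as a stand-in for impact
    confidence: str    # "high", "medium", or "low"

    def __post_init__(self):
        if self.attribution not in ATTRIBUTIONS:
            raise ValueError(f"unknown attribution: {self.attribution}")
        if self.confidence not in CONFIDENCE_ORDER:
            raise ValueError(f"unknown confidence: {self.confidence}")


def primary_blockers(findings, min_impact: float):
    """Keep high-impact, non-low-confidence findings, highest impact first."""
    kept = [f for f in findings
            if f.impact >= min_impact and f.confidence != "low"]
    return sorted(
        kept,
        key=lambda f: (f.impact, CONFIDENCE_ORDER[f.confidence]),
        reverse=True,
    )
```

The attribution label stays attached to each finding, so a report built from `primary_blockers` can still phrase `policy-likely` and `mixed` entries without single-user blame language.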
Keep the report short by default.
In most cases include only:
Add the source ledger, full policy-vs-submission table, or the evidence runbook only when the user asks or when they are needed to support the answer.
Source ledger entry requirements:
- `where_found`: exact system/file/output location.
- `link`: workspace path or command context.
- `confidence`: `high`, `medium`, or `low` with brief reason.
- `relevance`: why this source supports the attribution.

Quick-scan report (short form):
- … `ReqTRES`, node free/used, pending reason).

- `[$cluster-blame] quick-scan: identify likely users/jobs currently stranding CPU/GPU/memory, with evidence and confidence.`
- `[$cluster-blame] deep-attribution: explain why resources are idle despite pending jobs, separate policy effects from likely submission misconfiguration, and rank top blockers.`
- `[$cluster-blame] deep-attribution: generate a neutral outreach draft for the top 3 high-confidence blocking candidates.`

When a decision is required, always provide:
When you establish an important attribution rule, threshold, or classification convention, capture the rationale in a durable place (docs, runbooks, or tests for parser/analysis logic). Do not rely only on plan/ scratch notes.
- … `no material change` and keep prior ranking with updated timestamp.