aops-core/skills/supervisor/SKILL.md
The single authoritative supervision process for any delegate-and-verify work — at every scale: one epic, a release spanning many epics (portfolio), or conversational orchestration of background workers (`/goal` "don't get involved yourself, make sure it gets done", `/dogfood`). Stateless tick driven by `/loop`; cross-tick state lives in the task body. Junior MUST invoke this skill for supervision; never hand-roll it inline.
This skill is the framework's supervision process, at every scale. The discipline below is identical across all three contexts; only the unit of state changes:
- One epic: `/loop` ticks; cross-tick state lives in the epic body.
- Portfolio (a release spanning many epics): `## Constituent Epics`, `## Escalations`. See Portfolio / Release Supervision.
- Conversational orchestration (`/goal` "don't get involved yourself, make sure it gets done", `/dogfood`): still open a task node for the ledger (chat is not durable state).

There are no deterministic halt brakes or merge-gate mechanics in this process: you are a trusted agent. Halt, escalate, and promote by judgment, on the proof discipline below — not by row counters. Merge gating is owned by infrastructure (branch protection + Nic's per-SHA approval); you never simulate or manage it.
Junior (and any orchestrator) MUST run supervision through this skill — never hand-rolled in
the main conversation — whenever delegating work and verifying it gets done. This includes the
conversational orchestrator case: a /goal that says "delegate this, don't get involved
yourself, make sure it actually gets done", a /dogfood run, or any delegate-and-verify loop
over background Agent() workers. "I'm just the conversational orchestrator" is not an
exemption — that is exactly when this skill is required. Hand-rolling supervision inline is how
confident-but-unproofed verdicts and single-part PRs reach the user.
This is the supervisor's core discipline, and it applies in every mode, on every tick — a single epic, a release of many epics, or running as the main conversation agent who delegates everything and verifies it. It is not an optional extra and not a separate read. Your value is not trusting any single agent: proof claims, isolate confounds, and never relay a conclusion you have not made falsifiable — applied to the workers' claims and to your own. It is dispatch-surface independent — identical whether workers are polecat containers or Agent-tool background subagents; polecat mechanics elsewhere in this skill are one surface's implementation of the generic step.
Posture: supervise, don't do. "Don't get involved yourself" is literal — delegate the work (investigate, code, QA) to workers; your context is a scarce, principal-facing resource. Hold the conclusion, not the file dumps: read a deliverable through its output file (grep/Read the parts you need) and hand anything bulky to the cheap summarizer agent (§7) — never absorb a 30k-token narrative to lift a one-line verdict. This is the single biggest context leak.
§1 — Orient before the FIRST dispatch (mandatory, no exceptions). Dispatching before you have the map costs full QA cycles and gets briefs killed and re-issued. Four steps:
1. PKB search.
2. Prior-art PR/branch sweep (`gh pr list --state all --search "<terms>"` + the branch list); a merged fix or in-flight branch rewrites the brief.
3. Sanctioned-harness identification; if none is recorded, HALT and `[ATTN]` Nic to designate one — a worker never invents the gate it is judged by.
4. Vendor-docs fetch for cross-vendor surfaces.

§2 — Proof, not claims; state the acceptance gate up front. A change is not a fix until a runtime observation confirms the user-facing behaviour; code edits, green unit tests, "the router emits X" are floor, not ceiling. Before dispatching, state the falsifiable acceptance gate in the brief — the observable that must be true in a real run, and what would prove it false. "Tests pass" is never the gate for a behaviour bug. A worker that reports success without exercising the gate has not finished.
§2a — Capstone = done. Final acceptance is ONE check with all clauses true at once: the
exact previously-failing user-facing runtime check (the supervisor supplies it from the epic
ledger — the capstone agent does not reconstruct "what failing meant"); on a fresh
instance/session; by an agent who is NOT the implementer; with the sanctioned harness;
hallucination ruled out by byte-matching observed output to source (content that could only
come from the system under test, not echoed from the prompt). On the single-PR-epic surface this
is the one cumulative marsha pass at promotion (brief composition: marsha — Verify; marsha's own [[../verify/SKILL.md]] enforces the
fresh-instance / non-implementer / source-trace posture). Only this justifies promoting the PR to
ready; a miss means it is not done — record it in the ledger and send it back, never promote.
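The byte-match clause of §2a can be illustrated with a small check. This is a minimal sketch, not the framework's harness: the function name `echo_ruled_out`, the `min_len` threshold, and the line-by-line comparison are all assumptions introduced here for illustration.

```python
def echo_ruled_out(observed: str, source: str, prompt: str, min_len: int = 12) -> bool:
    """Return True if some line of `observed` byte-matches `source`
    and does not appear anywhere in `prompt`.

    Such a line could only have come from the system under test,
    not from the model echoing its own instructions.
    """
    for line in observed.splitlines():
        candidate = line.strip()
        if len(candidate) < min_len:
            continue  # too short to be distinctive evidence
        if candidate in source and candidate not in prompt:
            return True
    return False
```

A capstone whose observed output fails this kind of check could have been produced by the prompt alone, which §2a treats as a miss rather than a pass.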
§3 — The confound rule (the headline). A verdict that blames anything you don't own — "platform," "upstream," "external blocker," "agy/library/OS does X" — is not believable and must not be relayed until a differential control has ruled out our own code/config:
- `hooks.json` shape. One vanilla control flipped it instantly.)
- `CONFOUND CHECK: NOT RUN` is not relayed — note it in the ledger and commission the control first.

§4 — Don't trust convergence. Independently QA each worker's strongest claim, not its summary
— a "green" journal of the wrong evidence (PreToolUse allow records) does not prove the thing
in question (PreInvocation injection). When two agents contradict, do not pick one;
adjudicate with methodology-independent evidence (sentinel files + strace -f follow-forks),
naming the exact trap (strace without -f misses forked children). Treat a tidy, confident
narrative as a prompt to find the missing control, not as closure.
§5 — Catch mis-briefed workers early; never pre-seed skip permission. A worker re-deriving known intelligence (a recorded harness, a merged fix) is wasted context — stop it and relaunch with a surgical brief. You usually cannot steer a running background worker, so front-load the brief (gate + known intelligence + "escalate, don't fake-pass" + handback contract); the brief is your only steering wheel. State every assumption as a testable hypothesis ("check whether X; if yes, run the check") — never as licence to skip ("you likely can't test X, so escalate"). A stale "no-auth" assumption once made a worker punt the one check that mattered.
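The front-loaded brief in §5 (gate, known intelligence, "escalate, don't fake-pass", handback contract) can be sketched as a small data structure. This is an illustrative assumption, not the skill's actual brief format: the `Brief` class, its field names, and the rendered layout are hypothetical; only the handback field names come from §7.

```python
from dataclasses import dataclass, field

# Field names copied from the capped handback in §7.
HANDBACK_FIELDS = ("VERDICT", "CLAIM", "GATE", "EVIDENCE", "CONFIDENCE", "CONFOUND CHECK")


@dataclass
class Brief:
    """Hypothetical container for the parts of a front-loaded brief."""

    task: str
    gate: str          # observable that must be true in a real run
    falsified_by: str  # what would prove the gate false
    known_intelligence: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)  # "check whether X; if yes, ..."

    def render(self) -> str:
        """Render the brief as plain text, ending with the handback skeleton."""
        lines = [
            f"TASK: {self.task}",
            f"ACCEPTANCE GATE: {self.gate}",
            f"FALSIFIED BY: {self.falsified_by}",
            "KNOWN INTELLIGENCE:",
            *[f"- {item}" for item in self.known_intelligence],
            "HYPOTHESES TO TEST:",
            *[f"- {item}" for item in self.hypotheses],
            "If the gate cannot be exercised: escalate, don't fake-pass.",
            "End with this handback and nothing after it:",
            *[f"{name}: ..." for name in HANDBACK_FIELDS],
        ]
        return "\n".join(lines)
```

Keeping hypotheses as their own list makes it harder to phrase an assumption as permission to skip a check, since each entry has to read as something to test.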
§6 — Report up honestly. Every claim to the principal carries a source and confidence level — "high confidence" is a promise you proofed it (spend it only after §3–§4). Correct your own prior conclusions out loud and supersede the record (PKB note/memory) so no agent inherits a stale verdict. Escalate genuine frontiers; never fake-pass — hand over the exact one-line check instead of manufacturing a green.
§7 — Context-economy contract (mandatory, every mode). The orchestrator's context is the bottleneck (the motivating interactive session burned ~170k tokens):
Capped structured handback, every brief — the worker ends with this and you read that, not the narrative:
VERDICT: <PASS | FAIL | BLOCKED | NEEDS-PRINCIPAL>
CLAIM: <one sentence — the conclusion>
GATE: <the acceptance gate, and the observed result against it>
EVIDENCE: <pointers — session id, log path, line refs — NOT pasted dumps>
CONFIDENCE: <high|med|low> + <what single control/test would falsify this>
CONFOUND CHECK: <did a clean-room/differential control run? result? — or "NOT RUN">
CONFOUND CHECK is mandatory whenever the verdict blames what we don't own; NOT RUN ⇒ do not
relay, commission the control (§3).
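The handback format and the relay rule above can be sketched as a parser plus a guard. This is a minimal sketch under assumptions: `parse_handback` and `may_relay` are hypothetical names, and whether a verdict "blames what we don't own" remains the supervisor's judgment, so it is passed in as a flag and not inferred.

```python
FIELDS = ("VERDICT", "CLAIM", "GATE", "EVIDENCE", "CONFIDENCE", "CONFOUND CHECK")


def parse_handback(text: str) -> dict:
    """Extract the six handback fields; raise if any is missing."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in FIELDS:
            found[key] = value.strip()
    missing = [name for name in FIELDS if name not in found]
    if missing:
        raise ValueError(f"handback is missing fields: {missing}")
    return found


def may_relay(handback: dict, blames_external: bool) -> bool:
    """Apply the §3 rule: an external-blame verdict with no control is not relayed."""
    control = handback["CONFOUND CHECK"].strip('"').upper()
    if blames_external and control.startswith("NOT RUN"):
        return False
    return True
```

When `may_relay` returns False, the next step in this process is to note it in the ledger and commission the control, not to soften the wording of the report.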
Cheap summarizer agent for all bulk reading (large bodies, transcripts, log dumps): a
haiku/sonnet general-purpose Agent-tool dispatch (or its jr/polecat equivalent), briefed
"read <pointer>, return the ≤N facts relevant to <question>." It reads the bulk so your
context never does.
The ledger lives in the epic body — always open an epic node, even when supervising from an
interactive conversation with no pre-existing epic (chat context is not durable state).
Mechanics: mcp__pkb__create_task type=epic seeded with the ## Work Items / ## Pattern Memory / ## Ledger skeleton (see Pattern Memory Format); capture
ORIENT findings into ## Ledger and the failing observable into ## Work Items on tick 1 —
that is where the capstone (§2a) later reads the "exact previously-failing check."
Capped chat updates — one short paragraph (verdict + next action) between phases, never a
transcript replay. Preload predictable tool schemas once (task get/update, memory create,
stop/monitor) to avoid ToolSearch / parameter-retry churn.
One-line test before you report a conclusion: Have I proofed this against a falsifiable gate, and — if it blames anything I don't own — has a clean-room control ruled out our code as the confound? If not, I am relaying a claim, not a finding.
When you reach this skill from a /goal / /dogfood "delegate this, don't get involved
yourself, make sure it gets done" — there is no epic task or polecat. The discipline above is
unchanged; only these mechanics differ:
- `Agent(subagent_type=…, run_in_background=True)` calls (`general-purpose` for build/investigate, `marsha` for runtime QA); results arrive as `<task-notification>`. The §7 context-economy contract still binds — front-load every brief (§5) because you cannot steer a running worker, and require the capped handback.
- `needs_task` being off means you are not required to be handed one, not that state may live in chat. Chat context is not durable state.

When the goal spans many epics ("ready the release", "drive <project>"), you are the
top-level coordinator. The proof discipline above is unchanged; you simply operate one level up,
and you do not micromanage leaves — each epic runs its own supervision.
- `## Constituent Epics` (each epic + its status) and `## Escalations` (pending approvals, blocked epics, merge-ready PRs). Commit and push each tick. Surface only actionable items there — never worker threads or tool-call play-by-play.
- `## Constituent Epics`.
- `review`, write the N items to `## Escalations`, and stop. This is not a failure; it is the correct end of an autonomous loop.

Operate in decide-and-report mode. Exit in one of three states:

- `[ATTN]` block: Emit a single YAML block (see User Attention Notification) for decisions requiring explicit user authorization.

Escalate only if:

Execute the loop exactly once per tick:
- `mcp__pkb__get_task(<id>)`) and read the ledger. Before the first dispatch on a problem, run the orient-before-dispatch checklist (Holding Delegated Work to Proof §1): PKB search, prior-art PR/branch sweep, sanctioned-harness identification, and vendor-docs fetch for cross-vendor surfaces. Don't dispatch blind; if you can't complete orient, note it and escalate.
- `mcp__pkb__create_task`, promote, or exit).

Do not:

- `gh auth status` are permitted).
- `marsha`).

Anonymize PKB-derived information (titles, IDs, project names) before writing to public PRs, commits, issues, or verification briefs. Use priority class, due-date bucket, status, count, or masked identifiers (`task-XXXX`).

- `dispatch <worker> on <task-id> in <project>`
- `brief composed on <task-id>`
- `file fix-task <title> under <parent>`
- `halt: <reason>`
- `## Fitness Rubric`.
- `## Fitness Rubric` is missing for user-facing artifacts.
- `marsha` verification on separate PRs or individual work items as each intermediate worker finishes. Instead, intermediate tasks are verified using local outcome-based verification (checking remote commit existence and inspecting the diff on the shared branch). Once verified, they are transitioned to `merge_ready` to unblock dependent tasks. The supervisor invokes `marsha` to review exactly ONE cumulative PR when the final stage promotes it. That single cumulative pass IS the capstone verification (Holding Delegated Work to Proof §2a). The `marsha` brief the supervisor composes MUST carry the three capstone specifics from §2a — the sanctioned QA harness (identified at ORIENT, never invented; if none is recorded, HALT and `[ATTN]`), the exact previously-failing user-facing check (supplied by the supervisor from the epic ledger, not reconstructed by `marsha`), and the byte-match hallucination rule-out — while marsha's own [[../verify/SKILL.md]] enforces the fresh-instance / non-implementer / source-trace posture. A capstone the prompt could have produced without the system running is not a pass; record any miss in the ledger and send it back.

| Verdict | Action |
| :--------- | :--------------------------------------------------------- |
| PASS | Mark item merge_ready; checkpoint |
| FAIL | Call pauli (role=react, context=marsha-fail: <reason>) |
| REVISE | File verification subtask; checkpoint |
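The verdict table above maps naturally onto a lookup. The sketch below is illustrative only: `action_for` is a hypothetical helper, and the action strings simply restate the table rows; it does not call any real framework API.

```python
# Actions restated from the verdict table; {reason} is filled in for FAIL.
VERDICT_ACTIONS = {
    "PASS": "mark item merge_ready; checkpoint",
    "FAIL": "call pauli (role=react, context=marsha-fail: {reason})",
    "REVISE": "file verification subtask; checkpoint",
}


def action_for(verdict: str, reason: str = "") -> str:
    """Return the next action for a marsha verdict, or raise on an unknown one."""
    key = verdict.strip().upper()
    if key not in VERDICT_ACTIONS:
        # An unrecognised verdict is not acted on; the caller notes why and exits.
        raise ValueError(f"unrecognised verdict: {verdict!r}")
    return VERDICT_ACTIONS[key].format(reason=reason)
```

Raising on an unknown verdict, instead of defaulting to some action, matches the read-and-judge posture: a verdict that does not hold up is recorded and not acted on.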
- `brief composed on <task-id>`. The main agent must persist the brief, then invoke a fresh subagent context (`dispatch-agent`) to validate and emit the dispatch verdict.
- `dispatch` directly.

Before acting on a subagent's verdict, satisfy yourself it holds up: one coherent action, internally consistent, grounded in the actual task-body state. If it doesn't, don't act on it — note why in the ledger and exit. This is a read-and-judge, not a shape-validator.
The framework defaults to the cohesive single-PR-epic pattern for all epics whose subtasks are meant to land together. The only exception is when subtasks must genuinely ship and be deployed independently, in which case they keep the legacy branch-per-task behavior. This default pattern coordinates development on ONE shared branch backing ONE draft PR.
This pattern is executable today via the live shared-branch mechanism:
- `is_shared_branch` Detection: The manager automatically detects shared branches by looking for custom branch overrides. If the branch name does not match the default `polecat/task-<task-id>` pattern (e.g. `polecat/epic-<epic-id>`), it is treated as a shared branch.
- `git fetch` followed by `git rebase origin/<branch-name>`) to integrate other workers' in-flight commits rather than resetting to main.
- `--force-with-lease` to push changes to the shared branch, accepting a low-concurrency contract.
- `--branch polecat/epic-<epic-id>`.
- `depends_on: [<id>]` edges).

One epic ships as ONE pull request. No per-task / single-part PRs reach the merge pipeline or
the user — they spend review attention and CI for a fraction of an epic. Your single PR-state
action is the promotion at the end: flip it ready once all work items are done and the
capstone (the one cumulative marsha pass) is green. A PR
with outstanding work items is the normal mid-epic state — do not promote early to "show
progress".
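The shared-branch detection rule described earlier in this section (a branch override that does not match the default `polecat/task-<task-id>` name is treated as shared) can be sketched as follows. This is an assumption-laden illustration, not the manager's actual code: the function signature and the handling of a missing override are guesses.

```python
from typing import Optional


def is_shared_branch(branch_override: Optional[str], task_id: str) -> bool:
    """Sketch of the detection rule: any override other than the
    default per-task branch name counts as a shared branch."""
    if not branch_override:
        return False  # assumed: no override means the default per-task branch
    return branch_override != f"polecat/task-{task_id}"
```

For example, dispatching with `--branch polecat/epic-<epic-id>` would count as shared under this rule, while omitting `--branch` would not.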
You do not manage merge mechanics. The single PR materialises automatically when the first
worker on the shared branch finishes; workers never create PRs, and you never hand-create one.
Draft-vs-ready enforcement and the merge gate are infrastructure's job — branch protection
holds the line (no merge without Nic's per-SHA APPROVED), polecat handles draft creation. Don't
re-draft PRs, don't simulate approvals, don't add merge-gate banners to PR bodies. If a worker's
push conflicts on the shared branch it rebases and retries; if that can't resolve, set the task
blocked and escalate.
The discipline is dispatch-surface independent (see Holding Delegated Work to Proof). The commands below are the polecat surface's implementation; on the Agent-tool surface the same generic step (dispatch a worker against a task on the shared epic branch with a capped-handback brief) is a background subagent launch instead.
# Local dispatch (polecat surface)
uv run --project ~/src/academicOps polecat run -t <task-id> -p <project> --branch polecat/epic-<epic-id> --model <name>
`--model <name>` is the canonical flag. Use `--model claude` (config-default), `--model opus` (Claude family alias), or `--model gemini-3.1-pro-preview` for Gemini. `--opus` is not a valid flag and will error — use `--model opus`.

The ledger is your cross-tick memory, not a trigger. Append one row per tick (cap ~16, drop
oldest): the decision and its outcome, in plain terms, so the next tick — or a fresh you after a
/loop gap — can read what happened and judge what to do next. There is no fixed class
vocabulary and no row-counting brake; if a pattern of failure is building, you notice it on
read (Per-Tick step 2) and halt by judgment.
## Pattern Memory
| Tick (ISO) | Decision | Outcome / Notes |
| :------------------- | :-------------------------- | :--------------------------------------- |
| 2026-05-08T02:14:00Z | dispatch task-abc to claude | preflight clean |
| 2026-05-08T02:43:11Z | marsha FAIL on task-abc | tests red on docker — re-dispatching fix |
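The "one row per tick, cap ~16, drop oldest" rule can be sketched in a few lines. This is a minimal illustration that keeps rows in memory as tuples; the helper names are hypothetical, and reading or writing the task body through the PKB tools is deliberately left out.

```python
from datetime import datetime, timezone

LEDGER_CAP = 16  # approximate cap stated for the ledger


def append_ledger_row(rows, decision, outcome, now=None, cap=LEDGER_CAP):
    """Return a new row list with one row appended and the oldest rows dropped."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    tick = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return (rows + [(tick, decision, outcome)])[-cap:]


def render_ledger(rows):
    """Render rows as the Pattern Memory markdown table."""
    lines = [
        "## Pattern Memory",
        "",
        "| Tick (ISO) | Decision | Outcome / Notes |",
        "| :--- | :--- | :--- |",
    ]
    lines += [f"| {tick} | {decision} | {outcome} |" for tick, decision, outcome in rows]
    return "\n".join(lines)
```

The cap only bounds the size of the ledger. Nothing here counts rows to trigger a halt; that decision stays with the supervisor on read.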
- `## Pattern Memory`, `## Work Items`, `## Supervisor Log`).
- `priority` at the uncurated default band — never originate a non-default band from importance or urgency. Only Nic sets intent, by express per-request instruction. Canonical rule: [[framework-conventions-summary#intent-authority]].

| Phase | Subagent | Execution |
| :------------- | :------- | :-------------------------------------------------------------------------- |
| Orient | (none) | Read task body and ledger; judge whether to advance or halt; select phase. |
| Decompose | pauli | Propose subtasks; run RBG axiomcheck. Set superseded_by on retired tasks. |
| Review | (none) | Halt; await human promotion to queued. |
| Dispatch | pauli | Preflight brief, execute dispatch or chain compose/dispatch. |
| Pre-verify | pauli | Assemble minimal brief (artifact, goal, spec link). |
| Verify | marsha | Run validation. Return PASS, FAIL, or REVISE. |
| React | pauli | Recommend fix-task or halt after FAIL. |
| Halt | (none) | Terminal state reached; emit summary and exit. |
| Deliverable Type | Subworkflow                       | Status |
| :--------------- | :-------------------------------- | :----- |
| Code change      | [[instructions/code-deliverable]] | active |
Read-only projections. Do not write local JSON tracking files.
- `gh pr list` / `gh pr checks`
- `gh run list`
- `$AOPS_SESSIONS/tasks.json`
- `$AOPS_SESSIONS/state/pr-state.json`
- `halt` label
- `docker events`

Emit a single fenced YAML block for user attention when escalation conditions are met.
[ATTN]
---
id: <epic-id>:<tick-sequence>
urgency: now | today | whenever
action_required: decision | review | info
one_line: <=80-char summary
context_ref: <task-id | PR-url | issue-url>
dismiss_if: <one-line condition under which this no longer needs attention>
suggested_response: <the supervisor's default if user says "you decide">
---
All text fields (one_line, suggested_response) must use plain English. Push one_line to slack/discord/email only if urgency is now or today and action_required is decision.
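The field constraints and the push rule for the `[ATTN]` block can be sketched as two small checks. The helper names are hypothetical, and treating every field as required is an assumption; the allowed values, the 80-character limit, and the push condition are taken from the block and the sentence above.

```python
REQUIRED_KEYS = (
    "id", "urgency", "action_required", "one_line",
    "context_ref", "dismiss_if", "suggested_response",
)
URGENCIES = ("now", "today", "whenever")
ACTIONS = ("decision", "review", "info")


def attn_problems(attn: dict) -> list:
    """Return a list of human-readable problems; empty means the block looks valid."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if not attn.get(key)]
    if attn.get("urgency") and attn["urgency"] not in URGENCIES:
        problems.append("urgency must be one of: now, today, whenever")
    if attn.get("action_required") and attn["action_required"] not in ACTIONS:
        problems.append("action_required must be one of: decision, review, info")
    if len(attn.get("one_line", "")) > 80:
        problems.append("one_line exceeds 80 characters")
    return problems


def should_push(attn: dict) -> bool:
    """Push one_line externally only for urgent decisions."""
    return attn.get("urgency") in ("now", "today") and attn.get("action_required") == "decision"
```

Running `attn_problems` before emitting the block keeps malformed notifications out of the channel, and `should_push` keeps review and info items out of slack/discord/email.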
In interactive sessions, arm the Docker events Monitor on the first polecat dispatch to tick on event exits.
Monitor(
description: "polecat exits",
persistent: true,
command: "while true; do docker events --filter event=die --filter 'name=polecat-' --format '{{.Time}} {{.Actor.Attributes.name}} exit={{.Actor.Attributes.exitCode}}'; sleep 2; done"
)
Filter out crew containers by checking container env for POLECAT_CREW_NAME. Stop the monitor using TaskStop once in-flight tasks resolve.
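Each line the monitor emits follows the `--format` template above (time, container name, `exit=<code>`). A minimal parsing sketch follows; the names `PolecatExit` and `parse_exit_line` are hypothetical, the time field is kept as an opaque string, and the crew-container check via `POLECAT_CREW_NAME` is not implemented here because it needs a container inspection step.

```python
import re
from typing import NamedTuple, Optional

# Mirrors the monitor's format string: "<time> <name> exit=<code>".
EVENT_RE = re.compile(r"^(?P<time>\S+) (?P<name>\S+) exit=(?P<code>\d+)$")


class PolecatExit(NamedTuple):
    time: str       # kept as emitted; not interpreted
    name: str       # container name, expected to start with "polecat-"
    exit_code: int


def parse_exit_line(line: str) -> Optional[PolecatExit]:
    """Parse one monitor line; return None for anything that is not a polecat exit."""
    match = EVENT_RE.match(line.strip())
    if match is None or not match["name"].startswith("polecat-"):
        return None
    return PolecatExit(match["time"], match["name"], int(match["code"]))
```

A non-zero `exit_code` is only a prompt to tick and read the worker's handback; the exit code on its own is not a verdict.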
| Situation | Mechanism |
| :-------------------- | :----------------------------------------- |
| Single worker outcome | Bash run_in_background with polling loop |
| Async PR states | Monitor on gh pr checks |
| Idle / fallback | ScheduleWakeup (>= 1800s) |
| Interactive session | Monitor on docker events |
| Hook | Trigger | What it does |
| :------------ | :------------ | :--------------------------------------- |
| queue-drain | cron / manual | Starts supervisor session. |
| stale-check | cron / manual | Resets timed-out tasks. |
| pr-merge | James | James closes completed tasks post-merge. |
- `mcp__pkb__append` / `mcp__pkb__release_task`).
- `429 QUOTA_EXHAUSTED` is treated as a transient rate-limit (typically a 45-minute timeout), not a hard quota lockout.