skills/ep-operator-companion/SKILL.md
EP Operator Companion
npx skillsauth add arthur0824hao/skills ep-operator-companion
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
description: EP (Experiment Pipeline) canonical operator guide. Use when registering experiments, managing RunPod orders/workers, syncing payloads, debugging EP lesions, or understanding EP architecture. Authoritative after Finch reform 2026-05-19.
Canonical source: this skill is the EP operator-facing quick reference. Full contract details live in the three-book triad (see below). If this skill and any older doc disagree, the three-book .bh/books/ files win.
| Book | Path | Owner |
|---|---|---|
| Behavior (operator contract) | Project/ExperimentPipeline/.bh/books/ep-behavior.yaml | Aster |
| Structure (architecture) | Project/ExperimentPipeline/.bh/books/ep-structure.yaml | Lark |
| Wire (neural registry) | Project/ExperimentPipeline/.bh/books/ep-wire.yaml | Lark |
| Contract Surface Index | Project/ExperimentPipeline/docs/contract-surfaces.md | Lark |
| RunPod Manual | Project/ExperimentPipeline/docs/wren-ep-runpod-operator-manual.md | Aster |
| RunPod Image Book | Project/ExperimentPipeline/.bh/books/ep-runpod-images.yaml | Lark |
Deprecated paths: spec/behavior/*, spec/neural/*, docs/ep.*-book.md — not live truth unless contract-surfaces.md maps them as current aliases.
- DB (exp_registry.*) is runtime truth: experiments, attempts, heartbeats, settings, progress, provider instances, worker sessions
- ... extra jsonb)
- ep runpod gpu-select shows candidates; operator chooses (no auto-recommendation)

| Role | EP authority |
|---|---|
| Aster | Paid RunPod pod open + operator behavior owner |
| Wren | Prepare manifests, paper experiment spec |
| Corvus | Model code deep edit, experiment intent/spec |
| Lark | EP infra owner, worker/scheduler repair, structure+wire book |
| Finch | BH immune sensor for EP health (no direct EP mutation) |
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
"SELECT name, status FROM exp_registry.experiments WHERE name = '<NAME>';"
# Experiments registered via entry.json + experiment.yaml in:
# Study/GNN/FraudDetect/Experiment/experiments/<NAME>/
# Canonical register: ready_register.py --entry <json> --set-ready
1. Create experiment dir: experiment.yaml + entry.json + scripts/train.py
2. Register via ready_register or EP registration wrapper
3. Order create:
python3 Project/ExperimentPipeline/ep/cli/ep.py runpod order-create \
--experiment <NAME> --policy paper-train --worker-id runpod-1 \
--gpu-type-alias rtx-pro-4500 --json
4. Sync payload:
python3 Project/ExperimentPipeline/ep/cli/ep.py runpod sync --run <RUN_ID> --json
5. Preflight:
python3 Project/ExperimentPipeline/ep/cli/ep.py runpod preflight --run <RUN_ID> --json
6. Worker launch (Aster opens paid pod):
python3 Project/ExperimentPipeline/ep/cli/ep.py worker launch \
--nodes runpod-1 --gpu-type-alias rtx-pro-4500 --json
7. Worker claims experiment from DB (atomic with lease/attempt identity)
8. Training → artifacts under /workspace (= RunPod network volume/S3)
9. S3 result sync before pod delete — verify local download receipt
10. Stop:
python3 Project/ExperimentPipeline/ep/cli/ep.py runpod stop --run <RUN_ID> --reason <REASON>
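The order, sync, preflight, and stop commands in the steps above can be assembled programmatically, which helps avoid flag typos when scripting a run. The following is a minimal sketch, not part of EP itself: the helper names (`order_create_argv`, `run_step_argv`, `stop_argv`) are hypothetical, and it only builds argument lists from the flags shown above without executing anything.

```python
import shlex

# CLI entry point exactly as written in the workflow steps.
EP_CLI = ["python3", "Project/ExperimentPipeline/ep/cli/ep.py"]


def order_create_argv(experiment, policy="paper-train",
                      worker_id="runpod-1", gpu_type_alias="rtx-pro-4500"):
    """Build the argv for step 3 (order create). Defaults mirror the example."""
    return EP_CLI + [
        "runpod", "order-create",
        "--experiment", experiment,
        "--policy", policy,
        "--worker-id", worker_id,
        "--gpu-type-alias", gpu_type_alias,
        "--json",
    ]


def run_step_argv(step, run_id):
    """Build the argv for step 4 (sync) or step 5 (preflight)."""
    if step not in ("sync", "preflight"):
        raise ValueError(f"unsupported step: {step!r}")
    return EP_CLI + ["runpod", step, "--run", run_id, "--json"]


def stop_argv(run_id, reason):
    """Build the argv for step 10 (stop)."""
    return EP_CLI + ["runpod", "stop", "--run", run_id, "--reason", reason]


if __name__ == "__main__":
    # Placeholder values; substitute a real experiment name and run ID.
    for argv in (
        order_create_argv("<NAME>"),
        run_step_argv("sync", "<RUN_ID>"),
        run_step_argv("preflight", "<RUN_ID>"),
        stop_argv("<RUN_ID>", "<REASON>"),
    ):
        print(shlex.join(argv))
```

Returning lists rather than strings means the result can be passed to `subprocess.run(argv, check=True)` without shell quoting concerns; `shlex.join` is used only for display.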
# S3 endpoint: https://s3api-eu-ro-1.runpod.io
# Bucket = network volume ID: tgb8bx6fwl
# Profile: ~/.aws/credentials [runpod]
# Path mapping: pod /workspace/X → S3 key X
# Note: list_objects_v2 unreliable for POSIX files; use head_object
python3 Project/ExperimentPipeline/scripts/runpod/sync_results_s3.py download --experiment <NAME>
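The path mapping and the head_object advice above can be illustrated with a short sketch. It is not the code in sync_results_s3.py; the helper names (`pod_path_to_key`, `object_exists`, `make_client`) are hypothetical. It assumes boto3 is installed when `make_client` is called and that a missing object surfaces as an exception whose `response["Error"]["Code"]` is `"404"`, which is how botocore's `ClientError` reports a failed `head_object`. Some S3-compatible endpoints also need an explicit `region_name`; that is left out here.

```python
from pathlib import PurePosixPath

S3_ENDPOINT = "https://s3api-eu-ro-1.runpod.io"
BUCKET = "tgb8bx6fwl"  # bucket name = network volume ID
WORKSPACE = PurePosixPath("/workspace")


def pod_path_to_key(pod_path):
    """Map a pod path /workspace/X to the S3 key X."""
    # relative_to raises ValueError if the path is not under /workspace.
    key = PurePosixPath(pod_path).relative_to(WORKSPACE).as_posix()
    if key == ".":
        raise ValueError("pod path must point below /workspace")
    return key


def object_exists(client, key, bucket=BUCKET):
    """Check a known key with head_object instead of listing the bucket."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # botocore.exceptions.ClientError carries the error code in .response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # anything else (auth, network) should not be swallowed


def make_client():
    """Create an S3 client from the [runpod] profile (requires boto3)."""
    import boto3  # imported lazily so the helpers above work without it
    session = boto3.Session(profile_name="runpod")
    return session.client("s3", endpoint_url=S3_ENDPOINT)
```

A typical check would be `object_exists(make_client(), pod_path_to_key("/workspace/<path>/results.json"))`, where the path under /workspace is whatever the experiment actually wrote.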
# All experiments
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
"SELECT name, status, result_f1, result_auc FROM exp_registry.experiments ORDER BY name;"
# Running
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
"SELECT name, worker_id, started_at FROM exp_registry.experiments WHERE status = 'RUNNING';"
# Worker heartbeats
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
"SELECT worker_id, last_seen, running_jobs FROM exp_registry.worker_heartbeats;"
REQUESTED → PROVISIONED → SYNCED → VERIFIED → RUNNER_ONLINE → DRAINING → DELETED
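The lifecycle above can be encoded as an ordered sequence so that scripts can reject out-of-order transitions. This is a minimal sketch under one loud assumption: transitions are strictly linear, exactly as the chain is drawn. The real EP state machine may allow shortcuts (for example, an early move to DRAINING on failure) that this does not model, and the function names are hypothetical.

```python
# Order lifecycle states, in the sequence given above.
ORDER_STATES = (
    "REQUESTED",
    "PROVISIONED",
    "SYNCED",
    "VERIFIED",
    "RUNNER_ONLINE",
    "DRAINING",
    "DELETED",
)


def next_state(current):
    """Return the state that follows `current`, or None if it is terminal."""
    idx = ORDER_STATES.index(current)  # raises ValueError for unknown states
    return ORDER_STATES[idx + 1] if idx + 1 < len(ORDER_STATES) else None


def is_valid_transition(current, target):
    """True only when `target` is the immediate successor of `current`."""
    return next_state(current) == target


if __name__ == "__main__":
    print(is_valid_transition("SYNCED", "VERIFIED"))    # adjacent states
    print(is_valid_transition("REQUESTED", "SYNCED"))   # skips PROVISIONED
```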
- Image: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
- GPU type alias: rtx-pro-4500
- Data center: EU-RO-1
- Worker ID: runpod-1

| Gate | Enforces |
|---|---|
| order-intake-gate | Validates entry.json, experiment.yaml, train script, launch_contract |
| worker-affinity-intake-gate | runtime_target=ep for ordinary; preferred_worker + target_node_type for paid |
| runnable-admission-gate | Worker match, dependencies, capacity, retry state |
| db-claim-atomic-gate | RUNNING with run_id, worker_id, pid, gpu_id, lease fencing |
| attempt-artifact-isolation-gate | New run_id/attempt per retry; no prior artifact overwrite |
| runpod-full-control-gate | EP controls create/bind/sync/preflight/dispatch/monitor/terminalize/stop-delete |
| runpod-order-contract-gate | Order durable before paid launch; Aster authority; active-jobs-restart-guard |
| runpod-capability-sensor-gate | GraphQL gpuTypes truth; GPU class is not rejected by old A100/Blackwell policy |
| sync-payload-contract-gate | Payload locations + excludes from structure book |
| split-brain-ownership-gate | DB/heartbeat/pid/GPU must agree; no co-ownership |
| terminal-artifact-gate | results.json + resource_usage.json agree with DB terminal state |
| checkpoint-retention-gate | Retention class declared before launch |
| runtime-settings-hot-reload-gate | mtime change → validate → hot-apply without restart |

| Failure domain | Route to |
|---|---|
| Order/gene invalid | Lark (EP infra) |
| Study spec invalid | Corvus (research) |
| Worker claim failure | Lark |
| Paid resource authority | Aster |
| Split-brain duplicate | Aster containment → Lark repair |
| Raw SSH bypass | Aster/Wren receipt → Finch immune hardening |

| Lesion | Symptom | Fix |
|---|---|---|
| preflight blocked | no runnable workload | Check DB status (should be NEEDS_RERUN) |
| stale binding | reconcile_action=cleared_stale_binding | workerctl bind --allow-name-mismatch or wait for a fresh pod |
| manual pod no volume | networkVolume=null | Use EP-created pods only |
| S3 list empty | list_objects_v2 returns 0 | Use head_object with known paths |
| terminalization divergence | FAILED_STATUS_DIVERGENCE | Experiment already completed; check terminal state first |
| schema drift | Missing column | Add idempotent migration under pipeline/migrations/ |
| payload mtime drift | Train script rejects payload | Prefer sha256 over mtime for S3/network-volume sync |
| policy name drift | Selector/order-create mismatch | Use canonical order policy paper-train |
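
The "payload mtime drift" fix above recommends comparing content hashes rather than modification times, since S3 and network-volume copies do not reliably preserve mtime. The sketch below shows one way to do that with the standard library; `sha256_of` and `payload_matches` are hypothetical helper names, not functions from the EP codebase.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_matches(local_path, synced_path):
    """Compare two payload files by content; mtime is ignored entirely."""
    return sha256_of(local_path) == sha256_of(synced_path)


if __name__ == "__main__":
    # Demo: identical bytes with different mtimes still count as a match.
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "local.bin"
        synced = Path(tmp) / "synced.bin"
        local.write_bytes(b"payload")
        synced.write_bytes(b"payload")
        os.utime(synced, (0, 0))  # force a very different mtime
        print(payload_matches(local, synced))
```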