EP Operator Companion

description: EP (Experiment Pipeline) canonical operator guide. Use when registering experiments, managing RunPod orders/workers, syncing payloads, debugging EP lesions, or understanding EP architecture. Authoritative after Finch reform 2026-05-19.

Canonical source: this skill is the EP operator-facing quick reference. Full contract details live in the three canonical books (see below). If this skill and any older doc disagree, the files under .bh/books/ win.

Three-Book Canonical Paths

| Book | Path | Owner |
|---|---|---|
| Behavior (operator contract) | Project/ExperimentPipeline/.bh/books/ep-behavior.yaml | Aster |
| Structure (architecture) | Project/ExperimentPipeline/.bh/books/ep-structure.yaml | Lark |
| Wire (neural registry) | Project/ExperimentPipeline/.bh/books/ep-wire.yaml | Lark |
| Contract Surface Index | Project/ExperimentPipeline/docs/contract-surfaces.md | Lark |
| RunPod Manual | Project/ExperimentPipeline/docs/wren-ep-runpod-operator-manual.md | Aster |
| RunPod Image Book | Project/ExperimentPipeline/.bh/books/ep-runpod-images.yaml | Lark |

Deprecated paths: spec/behavior/*, spec/neural/*, docs/ep.*-book.md — not live truth unless contract-surfaces.md maps them as current aliases.

EP Truth Model

DB (exp_registry.*) is runtime truth — experiments, attempts, heartbeats, settings, progress, provider instances, worker sessions
JSON snapshots sync DB → file; file is never authoritative over DB
experiment.yaml is experiment intent source of truth (not DB extra jsonb)

Core Operator Invariants (from behavior book)

ep-db-truth: DB is runtime truth; SSH/tmux/provider-UI are diagnostic until written back
attempt-artifact-immutability: Each retry gets new run_id/attempt_id; never overwrite prior artifacts
checkpoint-retention-governance: Declare retention class (thin/research/forensic) before launch
paid-runpod-gated: Paid RunPod use requires the full order/gene/sync/runtime/terminal closure chain
paid-runpod-gpu-class-open-policy: Arthur 2026-05-22 removed the old A100 ceiling and Blackwell/canonical-image admission blocker; GPU class is evaluated by EP selector, VRAM, price, availability, network-volume attach/preflight evidence, and Aster launch authority
ep-settings-text-dynamic-load: All settings/prompts in YAML/MD with mtime hot-reload; no hardcoded literals
runpod-selector-operator-choice: ep runpod gpu-select shows candidates; operator chooses (no auto-recommendation)
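
The attempt-artifact-immutability invariant above can be illustrated with a minimal sketch. It assumes attempts are stored as one directory per run under a local artifact root; the helper names, the directory layout, and the UUID-based run id are hypothetical and are not taken from the behavior book.

```python
import uuid
from pathlib import Path


def new_attempt_dir(artifact_root: Path, experiment: str) -> Path:
    """Create a fresh directory for one attempt (hypothetical layout)."""
    run_id = uuid.uuid4().hex  # assumption: any unique id scheme would do
    attempt_dir = Path(artifact_root) / experiment / run_id
    # exist_ok=False turns reuse of a prior attempt's directory into an error.
    attempt_dir.mkdir(parents=True, exist_ok=False)
    return attempt_dir


def write_artifact(attempt_dir: Path, name: str, data: bytes) -> Path:
    """Write one artifact, refusing to overwrite an existing file."""
    target = Path(attempt_dir) / name
    # Mode "xb" raises FileExistsError instead of truncating an existing file.
    with open(target, "xb") as fh:
        fh.write(data)
    return target
```

Each retry calls `new_attempt_dir` again, so prior artifacts stay untouched, and a second write to the same artifact name fails loudly rather than silently replacing it.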

Authority Boundaries

| Role | EP authority |
|---|---|
| Aster | Paid RunPod pod open + operator behavior owner |
| Wren | Prepare manifests, paper experiment spec |
| Corvus | Model code deep edit, experiment intent/spec |
| Lark | EP infra owner, worker/scheduler repair, structure+wire book |
| Finch | BH immune sensor for EP health (no direct EP mutation) |

Common Operations

1. Register Experiment

# Check whether <NAME> is already registered, and its current status
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
  "SELECT name, status FROM exp_registry.experiments WHERE name = '<NAME>';"

# Experiments registered via entry.json + experiment.yaml in:
# Study/GNN/FraudDetect/Experiment/experiments/<NAME>/
# Canonical register: ready_register.py --entry <json> --set-ready

2. RunPod Order → Launch Sequence (Aster authority required)

1. Create experiment dir: experiment.yaml + entry.json + scripts/train.py
2. Register via ready_register or EP registration wrapper
3. Order create:
   python3 Project/ExperimentPipeline/ep/cli/ep.py runpod order-create \
     --experiment <NAME> --policy paper-train --worker-id runpod-1 \
     --gpu-type-alias rtx-pro-4500 --json
4. Sync payload:
   python3 Project/ExperimentPipeline/ep/cli/ep.py runpod sync --run <RUN_ID> --json
5. Preflight:
   python3 Project/ExperimentPipeline/ep/cli/ep.py runpod preflight --run <RUN_ID> --json
6. Worker launch (Aster opens paid pod):
   python3 Project/ExperimentPipeline/ep/cli/ep.py worker launch \
     --nodes runpod-1 --gpu-type-alias rtx-pro-4500 --json
7. Worker claims experiment from DB (atomic with lease/attempt identity)
8. Training → artifacts under /workspace (= RunPod network volume/S3)
9. S3 result sync before pod delete — verify local download receipt
10. Stop:
    python3 Project/ExperimentPipeline/ep/cli/ep.py runpod stop --run <RUN_ID> --reason <REASON>
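
The pre-launch part of the sequence (steps 3 to 5) can be assembled as argument lists for review before anything is run. The sketch below uses only the subcommands and flags shown above; the function names are hypothetical, and because this guide does not say how `<RUN_ID>` is obtained, it is passed in as a parameter. Nothing is executed here, and the paid launch in step 6 is deliberately left out since it needs Aster authority.

```python
import shlex

EP_CLI = ["python3", "Project/ExperimentPipeline/ep/cli/ep.py"]


def order_create_cmd(experiment, policy="paper-train",
                     worker_id="runpod-1", gpu_alias="rtx-pro-4500"):
    """Build the step 3 command with the defaults quoted in this guide."""
    return EP_CLI + [
        "runpod", "order-create",
        "--experiment", experiment,
        "--policy", policy,
        "--worker-id", worker_id,
        "--gpu-type-alias", gpu_alias,
        "--json",
    ]


def run_scoped_cmd(action, run_id):
    """Build the step 4 (sync) or step 5 (preflight) command."""
    if action not in ("sync", "preflight"):
        raise ValueError(f"unsupported pre-launch action: {action}")
    return EP_CLI + ["runpod", action, "--run", run_id, "--json"]


def pre_launch_plan(experiment, run_id):
    """Return steps 3-5 in order, as argv lists."""
    return [
        order_create_cmd(experiment),
        run_scoped_cmd("sync", run_id),
        run_scoped_cmd("preflight", run_id),
    ]


def render(plan):
    """Render a plan as copy-pasteable shell lines."""
    return [shlex.join(argv) for argv in plan]
```

Building argv lists rather than shell strings avoids quoting mistakes if an experiment name contains spaces; `render` is only for display.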

3. S3 / Network Volume

# S3 endpoint: https://s3api-eu-ro-1.runpod.io
# Bucket = network volume ID: tgb8bx6fwl
# Profile: ~/.aws/credentials [runpod]
# Path mapping: pod /workspace/X → S3 key X
# Note: list_objects_v2 is unreliable for POSIX files; use head_object with known paths

python3 Project/ExperimentPipeline/scripts/runpod/sync_results_s3.py download --experiment <NAME>
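
The path mapping and the head_object advice can be sketched as follows. This is an illustration, not the contents of sync_results_s3.py: the function names are hypothetical, boto3 is assumed to be installed, and the endpoint and profile are the values quoted above. The client is passed in so the probe can be exercised without network access.

```python
from pathlib import PurePosixPath

WORKSPACE = PurePosixPath("/workspace")
NOT_FOUND_CODES = {"404", "NoSuchKey", "NotFound"}


def pod_path_to_s3_key(pod_path: str) -> str:
    """Map pod path /workspace/X to S3 key X."""
    # relative_to raises ValueError for paths outside /workspace.
    rel = PurePosixPath(pod_path).relative_to(WORKSPACE)
    if str(rel) == ".":
        raise ValueError("path must name an object under /workspace")
    return str(rel)


def object_exists(client, bucket: str, key: str) -> bool:
    """Probe one known key with head_object instead of listing the bucket."""
    try:
        client.head_object(Bucket=bucket, Key=key)
    except client.exceptions.ClientError as err:
        code = err.response.get("Error", {}).get("Code")
        if code in NOT_FOUND_CODES:
            return False
        raise  # permission or transport errors should not read as "missing"
    return True


def make_client():
    """Build an S3 client from the [runpod] profile (requires boto3)."""
    import boto3  # imported lazily so the helpers above work without it

    session = boto3.Session(profile_name="runpod")
    return session.client("s3", endpoint_url="https://s3api-eu-ro-1.runpod.io")
```

For example, `object_exists(make_client(), "tgb8bx6fwl", pod_path_to_s3_key("/workspace/<NAME>/results.json"))` would check a single expected result file; the exact artifact layout under /workspace is an assumption here.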

4. DB Quick Checks

# All experiments
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
  "SELECT name, status, result_f1, result_auc FROM exp_registry.experiments ORDER BY name;"

# Running
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
  "SELECT name, worker_id, started_at FROM exp_registry.experiments WHERE status = 'RUNNING';"

# Worker heartbeats
psql -h 127.0.0.1 -p 5432 -d "ExperimentPipeline-database" -c \
  "SELECT worker_id, last_seen, running_jobs FROM exp_registry.worker_heartbeats;"

5. Provider Session Lifecycle

REQUESTED → PROVISIONED → SYNCED → VERIFIED → RUNNER_ONLINE → DRAINING → DELETED
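
The lifecycle above can be read as a small state machine. The sketch below assumes the chain is strictly linear, so each state may only move to the next one listed; whether EP also permits shortcuts (for example straight to DELETED after a failed provision) is not stated here, so treat that restriction as an assumption. The function names are hypothetical.

```python
LIFECYCLE = [
    "REQUESTED",
    "PROVISIONED",
    "SYNCED",
    "VERIFIED",
    "RUNNER_ONLINE",
    "DRAINING",
    "DELETED",
]
_INDEX = {state: i for i, state in enumerate(LIFECYCLE)}


def can_transition(current: str, target: str) -> bool:
    """True only for a move to the immediately following state."""
    if current not in _INDEX or target not in _INDEX:
        raise ValueError(f"unknown state: {current!r} or {target!r}")
    return _INDEX[target] == _INDEX[current] + 1


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the move skips or reverses a step."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```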

6. RunPod Defaults (from behavior book)

Image: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
GPU default: rtx-pro-4500
Datacenter: EU-RO-1
Worker slot: runpod-1

Wire Book: Key Gates (17 gates total)

| Gate | Enforces |
|---|---|
| order-intake-gate | Validates entry.json, experiment.yaml, train script, launch_contract |
| worker-affinity-intake-gate | runtime_target=ep for ordinary; preferred_worker + target_node_type for paid |
| runnable-admission-gate | Worker match, dependencies, capacity, retry state |
| db-claim-atomic-gate | RUNNING with run_id, worker_id, pid, gpu_id, lease fencing |
| attempt-artifact-isolation-gate | New run_id/attempt per retry; no prior artifact overwrite |
| runpod-full-control-gate | EP controls create/bind/sync/preflight/dispatch/monitor/terminalize/stop-delete |
| runpod-order-contract-gate | Order durable before paid launch; Aster authority; active-jobs-restart-guard |
| runpod-capability-sensor-gate | GraphQL gpuTypes truth; GPU class is not rejected by old A100/Blackwell policy |
| sync-payload-contract-gate | Payload locations + excludes from structure book |
| split-brain-ownership-gate | DB/heartbeat/pid/GPU must agree; no co-ownership |
| terminal-artifact-gate | results.json + resource_usage.json agree with DB terminal state |
| checkpoint-retention-gate | Retention class declared before launch |
| runtime-settings-hot-reload-gate | mtime change → validate → hot-apply without restart |
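
The runtime-settings-hot-reload-gate row describes a three-step flow: detect an mtime change, validate, then apply without a restart. A minimal sketch of that flow is shown below. It is not EP's implementation: the class name is hypothetical, the settings file is JSON rather than YAML so the example needs only the standard library, and the validator is a caller-supplied function returning True or False.

```python
import json
from pathlib import Path


class HotSettings:
    """Reload a settings file when its mtime changes (illustrative only)."""

    def __init__(self, path, validate):
        self.path = Path(path)
        self.validate = validate  # callable: parsed settings -> bool
        self.value = None         # last settings that passed validation
        self._mtime_ns = None
        self.refresh()

    def refresh(self) -> bool:
        """Return True if new settings were applied, False otherwise."""
        mtime_ns = self.path.stat().st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return False  # file unchanged since the last check
        self._mtime_ns = mtime_ns  # remember this version even if rejected
        try:
            candidate = json.loads(self.path.read_text(encoding="utf-8"))
        except ValueError:
            return False  # unparseable: keep the last good settings
        if not self.validate(candidate):
            return False  # invalid: keep the last good settings
        self.value = candidate  # hot-apply; no restart involved
        return True
```

A worker loop would call `refresh()` between jobs or on a timer and read `value` afterwards. Keeping the last good value on a failed validation means a bad edit never takes effect mid-run.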

Failure Routing

| Failure domain | Route to |
|---|---|
| Order/gene invalid | Lark (EP infra) |
| Study spec invalid | Corvus (research) |
| Worker claim failure | Lark |
| Paid resource authority | Aster |
| Split-brain duplicate | Aster containment → Lark repair |
| Raw SSH bypass | Aster/Wren receipt → Finch immune hardening |

Common Lesions

| Lesion | Symptom | Fix |
|---|---|---|
| preflight blocked | no runnable workload | Check DB status (should be NEEDS_RERUN) |
| stale binding | reconcile_action=cleared_stale_binding | Run workerctl bind --allow-name-mismatch, or wait for a fresh pod |
| manual pod no volume | networkVolume=null | Use EP-created pods only |
| S3 list empty | list_objects_v2 returns 0 | Use head_object with known paths |
| terminalization divergence | FAILED_STATUS_DIVERGENCE | Experiment already completed; check terminal state first |
| schema drift | Missing column | Add idempotent migration under pipeline/migrations/ |
| payload mtime drift | Train script rejects payload | Prefer sha256 over mtime for S3/network-volume sync |
| policy name drift | Selector/order-create mismatch | Use canonical order policy paper-train |
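
The "payload mtime drift" fix prefers content hashes because a copy through S3 or a network volume can change a file's modification time without changing its bytes. The sketch below shows a content-based comparison; the function names are hypothetical, and where the expected digest is recorded (for example in a sync manifest) is an assumption, not something this guide specifies.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 so large payloads need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def payload_matches(local_path, expected_sha256: str) -> bool:
    """Compare by content; modification times play no part in the result."""
    return sha256_of(local_path) == expected_sha256.strip().lower()
```

Two copies of the same payload compare equal under `payload_matches` even when their mtimes differ, which is the failure mode the lesion describes.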

Policy Summary

No manual provider pod creation — always EP-created with network volume
Paid RunPod = Aster authority — Wren/Corvus prepare, Aster opens
S3 result sync before pod delete — verify local download receipt
experiment.yaml is intent truth — DB is runtime truth
No GPU-class ceiling — Arthur removed the old A100 ceiling and Blackwell/canonical-image admission blocker; paid GPU use is governed by EP behavior gate, local-first probe, operator choice, image/volume/preflight evidence, and Aster launch authority

