Name: eval
Author: overthinkos

Eval — unified declarative + AI-iteration evaluation

kind: eval beds + `charly eval run <bed>` — config-driven R10 acceptance

The disposable R10 test beds are kind: eval entities in the eval: block of the project charly.yml. A kind: eval entity is a deploy-shaped node folded into the Deploy map at load time, so every deploy verb resolves it by name; charly eval run <bed> drives the full R10 sequence on it:

This is the ecosystem-wide rule: every repo-shipped disposable test bed — the main repo's beds AND every box/<distro> submodule's beds (the arch / cachyos / debian / ubuntu / fedora bootstrap-VM and pacstrap/debootstrap beds) — is a kind: eval entity, in that repo's config (its charly.yml + per-kind sibling files). Repos ship NO kind: deploy test beds. The lone kind: deploy exception is an operator profile, not a test bed (the cachyos submodule's charly-cachyos workstation profile); operator deployments otherwise live in the per-host ~/.config/charly/charly.yml. disposable: true is the sole authorization for the unattended destroy + rebuild on any of them.

A bed is a candybox (CLAUDE.md "Candyboxing"): a disposable, secured container / VM stocked with the FULL toolset — the entire charly eval probe surface (cdp/wl/dbus/vnc/mcp/adb/appium/k8s), nested podman, real package managers, the whole layer/image library — so the AI can build, deploy, and prove the real composition inside it (the substance of RDD), and disposable: true makes tearing it down and rebuilding fearless. The box is restricted at the boundary, never by stripping the candy.

charly box build <image> --dev-local-pkg — build the test artifact (pod beds only). The eval-bed runner sets --dev-local-pkg AUTOMATICALLY, so a bed bakes the in-development charly toolchain (any localpkg: candy built from local source) — never a stale published release. A bed thus always tests the code under development; a production box build downloads the released package. See /charly-tools:charly "Eval-vs-production binary source" + /charly-internals:install-plan "Eval-vs-production charly toolchain".
charly eval box <image> — box-section + baked candy-section probes.
charly deploy add <bed> <ref> — apply the bed (or charly vm create for vm beds).
(pod beds) charly config <bed> + charly start <bed>.
charly eval live <bed> — full three-section live probe pass.
charly update <bed> — fresh-rebuild re-verification (R10 acceptance gate).
charly remove <bed> (or charly vm destroy) — leave the host clean.

Exclusive-resource preemption wraps the sequence. When a bed declares requires_exclusive: [token...] (e.g. a GPU-passthrough bed needing the one physical card), runEvalBed acquires a resource-arbitration lease BEFORE step 1 and defer-releases it after teardown: any running preemptible holder of the token (e.g. the operator's GPU workstation VM) is gracefully stopped — and the arbiter waits until it actually powers off so the device is freed — then restored when the bed finishes (even on failure, for restore: always). A holder is never left stopped: charly preempt status / charly preempt restore recover a crashed run. Triggering this — running a requires_exclusive: bed that stops the operator's preemptible GPU holder — is STANDING-authorized (no per-run confirmation): it is reversible by design (graceful stop + guaranteed restore), not a destroy. See /charly-internals:disposable "The resource-arbitration axis" + /charly-core:deploy.

A GPU bed runs on a passthrough host where host nvidia-smi FAILS — that is the READY state, not a blocker. A requires_exclusive: [nvidia-gpu] bed needs the card bound to the vfio-pci host driver so the guest VM can claim it via VFIO; host-side nvidia-smi / rocm-smi failing is therefore EXPECTED and does NOT mean the bed can't run — the GPU comes alive INSIDE the guest. Gauge readiness with charly vm gpu status (or charly vm gpu list / lspci -nnk → Kernel driver in use: vfio-pci), NEVER host nvidia-smi. See /charly-vm:vm "GPU passthrough (VFIO)".

charly eval run --all-beds runs every bed name-sorted. Flags: --keep (don't tear down), --no-rebuild (skip step 6) — both scope-shrinking, governed by "Flag discipline" below. Per-run logs land in .eval/<bed>/<calver>/ (per-step .log files + summary.yml).

Flag discipline — the score/bed config IS the test specification

The kind: score / kind: eval config in the project's eval: block IS the test spec; the operator authorizes overrides, not Claude. Passing ANY scope-shrinking flag to charly eval run (or charly eval live) without the user explicitly naming that flag in the SAME conversation turn is the same fraud class as dry-run-as-R10 (CLAUDE.md R10 flag-override clause). The catalog: --plateau-iteration, --max-scenario, --tag, --skip-rebuild, --on-pod / --on-vm / --on-host, --keep-repo, --dry-run, and the bed flags --no-rebuild (skips the R10 fresh-rebuild gate) and --keep. (--all-beds is scope-EXPANDING, not shrinking: it is in-spec WITHOUT authorization when "R10 gate by change class" mandates it for a cross-cutting change, and needs authorization only as a SUBSTITUTE for the class-mandated gate.) Internal-voice triggers — "tractable wall-clock", "for the canary", "to fit session bounds", "shorten this run", "skip the heavy leg", "faster iteration cycle" — are confessions, not defences. The score's plateau_iteration and the AI's progress_no_improvement_timeout together define the AI's recovery budget per phase; narrow neither without explicit authorization. Speed levers that do NOT shrink scope (e.g. --podman-jobs, --jobs) need no authorization — the lever catalog is /charly-internals:agents "Speed levers".

Run retention (defaults.keep_eval_runs). After charly eval run (any path — bed / --all-beds / score), .eval/<bed|score>/ is trimmed to the newest N run artifacts (CalVer run dirs, runs/<id>/ dirs, result-<calver>.yml) so harness output doesn't accumulate unbounded. NOTES.md (the durable Syncthing-replicated memory) is ALWAYS preserved. Set defaults.keep_eval_runs in charly.yml (0/absent disables); apply on demand with charly clean --eval. See /charly-core:clean.

The ecosystem's `kind: eval` beds

| Bed | Target | Ref | Surface | |---|---|---|---| | eval-sway-browser-vnc-pod | pod | box: sway-browser-vnc | cdp/wl/vnc/dbus/mcp/record + pod-side file/service/port/process/http | | eval-k3s-vm | vm | vm: k3s-vm | k8s (all 13 methods) + guest-side file/service/port/process, http via port-forward, VmDeployTarget end-to-end | | eval-pod | pod | box: eval-pod | combined mechanism bed: kind: box build + kind: candy composition order + kind: pod runtime (nc :18794 + supervisord) + every DeployTarget rendering path | | eval-local | local | local: eval-local | kind: local layer apply via ShellExecutor | | eval-jupyter-pod | pod | box: jupyter | jupyter-mcp regression coverage | | eval-jupyter-ml-pod | pod | box: jupyter-ml | jupyter-ml spacy/quarto + GPU MCP probes | | eval-versa-pod | pod | box: versa | versa OSM analytics + vector-tile + marimo MCP | | eval-android-emulator-pod | pod | box: android-emulator | Android 14 emulator (/dev/kvm) + adb/appium | | eval-charly-vm | vm | vm: charly-vm | charly toolchain localpkg deploy witness (opencharly-git pacman install + dep auto-resolve) on the cloud VM |

Bed HOMES: the main repo's charly.yml owns eval-k3s-vm, eval-local, and eval-charly-vm; the pod beds above live in the box/<distro> submodules' eval: blocks (eval-pod, eval-jupyter-pod, eval-jupyter-ml-pod, eval-sway-browser-vnc-pod in box/fedora; eval-versa-pod, eval-android-emulator-pod in box/cachyos) and run from that submodule (e.g. charly -C box/fedora eval run eval-pod).

Naming: eval-<descriptor>-<kind>, dropping a redundant suffix when the descriptor already equals the kind AND the short form is free (eval-local, eval-pod). The harness AI-sandbox pod (the score pod: target) is named eval-sandbox, kept disjoint from these beds. Nothing in the charly Go code hardcodes that name — it flows from the score's pod: field through ResolveScoreTarget, and prompts reference it via the ${TARGET_NAME} substitution token. The sandbox is an OPERATOR-PROVISIONED per-host deploy — the runner restarts-but-never-creates it: provision once with charly deploy add eval-sandbox <ref> --disposable + charly start eval-sandbox (the ref must provide charly + nested podman + the configured AI CLI; the per-host overlay never ships with the repo). On a host without the entry, charly eval run <score> fails fast with exactly that remediation. The supporting vm: k3s-vm + k8s: vm-k3s-vm entities live in the project charly.yml alongside its beds. disposable: true is the sole authorization for the unattended destroy+rebuild (see /charly-internals:disposable). Two load-time guards back the beds: validateEvalBeds enforces target ∈ {pod, vm, local, android}, a resolvable cross-ref, and disposable: true; foldEvalBeds enforces the name-disjointness from kind: deploy. Neither checks host ports — disjoint ports across beds are the AUTHOR's responsibility (an overlap fails the second bed at deploy via CheckPortAvailability).

Approximate wall-clock (10-CPU 32-GB reference host)

eval-pod ~110s (one build → deploy → eval → fresh-update → teardown cycle covering all four mechanisms) · eval-local ~45s · eval-k3s-vm ~5–7 min · the heavy feature beds (eval-sway-browser-vnc-pod ~14 min incl. image build) longer. charly eval run --all-beds runs beds STRICTLY SEQUENTIALLY (a plain loop — no concurrency in charly), so its wall-clock ≈ the SUM. To collapse that to ≈ the slowest single bed, parallelize at the AGENT layer — one agent/teammate per bed, N concurrent charly eval run <bed> processes (/verify-beds and an agent team both do this; see /charly-internals:agents "Speed levers"). The dominant cost is the step-1 charly box build: a pod bed builds the image ONCE — the "fresh charly update" R10 step is a systemctl restart onto the already-built image, not a second build — and same-base beds share cached layers, so pre-warming a shared base once makes every sibling bed's build incremental.

Handling a long-running bed — by mechanism, not by who owns it. A VM/emulator bed's charly eval run orchestrator runs for minutes-to-tens-of-minutes AND the libvirt domain / emulator it spawns outlives a single turn. (1) Launch it as a harness-tracked background task (run_in_background) — never foreground (the Bash tool's timeout, 120s default / 600s max, kills the call mid-vm-create, orphaning the domain) and never a sleep/poll loop (the R4 bandaid). (2) Let the completion notification drive the next step — the harness re-invokes the LAUNCHING session when the run exits, so the launcher must SURVIVE to completion to be notified: the persistent main session does; an ephemeral sub-agent (returns synchronously) and an idle teammate (torn down on idle) do NOT, and orphan the bed. EVERY full charly eval run <bed> is a run_in_background task owned by a PERSISTENT owner that survives across turns to be notified — the main session, a background agent, or (interactive tmux) a split-pane teammate; an in-process teammate CANNOT. Duration-independent; the 600s is a Bash FOREGROUND cap, irrelevant to a backgrounded bed. Teammates/sub-agents do bed-local edits + short foreground checks (charly eval image), never the full run. (3) Reconnect via durable state — .eval/<bed>/<calver>/summary.yml + the live domain ARE the truth: "done" = summary.yml present; "alive" = the orchestrator is in the process table; clean up an orphan (running domain, no live orchestrator) with charly vm destroy <entity> (or charly remove <name> for a pod) before re-running. See /charly-internals:agents "The binding rule".

Prereq for the vm bed

eval-k3s-vm depends on the libvirt user-session daemon. Enable once:

systemctl --user enable --now virtqemud.service     # libvirt >= 8 (modular)
# OR (older monolithic):
systemctl --user enable --now libvirtd.service

(The host libvirt daemon is host infrastructure, not a charly-managed resource — enabling it via systemctl sits outside the charly-CLI-only mandate's scope.)

charly eval run eval-k3s-vm best-effort starts the unit before charly vm create (via startLibvirtUserSession() in charly/vm.go), and resolveVmBackend() now spawns it too before probing the socket — so a cold socket is never mistaken for "libvirt absent" (Arch/CachyOS ship no persistent virtqemud.socket; the socket appears only after an autospawn). Never gauge libvirt readiness with systemctl is-active virtqemud.service — a socket-activated / autospawn daemon reports the service inactive while libvirt works fine; check the socket ($XDG_RUNTIME_DIR/libvirt/virtqemud-sock) or virsh -c qemu:///session list instead. resolveVmBackend() surfaces a clear error when libvirt is absent. The k3s-vm: vm template pins backend: libvirt explicitly so the silent auto -> qemu fallback can't mask a missing daemon. See /charly-vm:vm "Prereq", /charly-eval:libvirt.

Why this is config-driven

Beds are kind: eval config; the runner reads the bed node directly. This means the runner works for ANY bed an operator defines — charly eval run <name> resolves the bed by name through the same path every deploy verb uses, with no hardcoded bed table to keep in sync.

R10 gate by change class — pick the gate that exercises the change

The R10 principle is EXERCISE, not ceremony: a gate that cannot fail on the change proves nothing (wasted verification), and a change whose gate never executed is unproven (the fraud class CLAUDE.md R10 bans). Pick the SMALLEST gate that genuinely exercises every changed code path — and run it in full. CLAUDE.md R10 carries the mandate; this matrix is the authoritative detail.

| Change class | Pre-flight | The R10 gate | Tier on a clean pass | Explicitly NOT required | |---|---|---|---|---| | Documentation-only change class — *.md (CLAUDE.md, plugins/**/SKILL.md, READMEs, CHANGELOG), comment-only code edits, or a submodule pointer bump to an all-documentation submodule commit; zero behavior change | markdown integrity, link checks | The non-runtime standards: adversarial consistency review, the R5 grep self-test, cross-reference validation, the pre-commit-gate.sh/pre-push-gate.sh gates | documentation reviewed | ANY bed run or image build — beds cannot fail on prose | | Hook / workflow scripts — .claude/hooks/*.sh, .claude/workflows/*.js | bash -n / async-body parse | Execute the changed script live: run the hook directly (paste its output); a workflow whose CONTROL FLOW changed runs against ONE bed matching the change. Prompt-string-only workflow edits: parse + the non-runtime standards | fully tested and validated | The full bed fan-out | | charly Go code | go test ./... + go vet + task build:charly (R9 freshness + charly version check) | charly eval run <bed> for EACH bed whose kind matches a touched code path: box/candy/pod/DeployTarget mechanism → eval-pod; target: local → eval-local; VM / k8s → eval-k3s-vm; a feature surface → its feature bed. Cross-cutting loader / resolver / IR / unified-schema changes → --all-beds (in-spec for that class, not a scope override) | fully tested and validated | Beds whose substrate the change cannot reach | | Candy / box / pod / vm / k8s / local / android config | charly box validate | Build + run a bed that COMPOSES the changed entity (a candy edit → a bed whose image stacks that candy); when no bed composes it, the R7 sequence on a disposable deploy: build → charly eval box → deploy → charly eval live → fresh charly update | fully tested and validated | Beds that do not compose the changed entity | | Score / recipe / ai eval config | charly box validate | The affected score / bed run AS SPECIFIED (see "Flag discipline") | fully tested and validated | Unrelated beds |

Tier on a clean pass — the class fixes the gate; the gate fixes the tier. Passing a class's full gate earns the tier in that column. analysed on a live system and syntax check only are partial-completion WAYPOINTS for the RUNTIME classes only — a live invocation ran but no fresh-rebuild R10 → analysed on a live system; only compile / validate / dry-run ran → syntax check only (which FORBIDS commit). The Documentation-only change class has no live runner, so it has no waypoint: its non-runtime standards either all pass (→ documentation reviewed) or the change is not ready. The tier DEFINITIONS live in CLAUDE.md "AI Attribution"; this column names which tier each gate earns, it does not redefine them.

Mixed changes take the UNION of their classes' gates (docs ride along with the code class in the same commit, at that code class's runtime tier — never documentation reviewed). Class assignment is honest, not aspirational: a "comment" edit that changes a prompt string an agent executes is script-text, not docs; a YAML comment is docs, but a YAML field is config. A submodule pointer bump is documentation only when the bumped submodule commit is itself all-documentation (pre-commit-gate.sh recurses into its old..new diff); a bump integrating submodule code is a code class, at a runtime tier.

The 10 Testing Standards (READ FIRST — referenced by CLAUDE.md R1–R10)

An earlier agent claimed "cutover complete, all tests pass" after green go test ./... runs, then built an image that crash-looped at startup because a Containerfile stage was silently dropped. The unit tests didn't notice — they only exercised YAML loaders. Unit tests are NOT a substitute for running the service.

These are the 10 standards referenced in CLAUDE.md's AI attribution tier ("fully tested and validated"). Each is keyed to a CLAUDE.md R-rule. Apply them whenever a change could affect Containerfile generation, OCI labels, init systems, service startup, or deploy code.

Prove every HIGH-RISK assumption BEFORE you edit (RDD — Risk Driven Development) — the proactive bookend to Standard 10's fresh-rebuild gate. Low-risk orientation ("what does layer X do") is a skill lookup (R0, zero risk); every high-risk assumption — including any a skill or the code merely asserts, and above all whether this layer composition at its latest available versions builds / deploys / runs TOGETHER — is proven on a disposable: true bed FIRST (charly eval it). Never accept docs or code as ground truth for a high-risk decision; if the bed disagrees with a skill, the skill is stale — fix it. Standard 0 (validate forward, riskiest-first) and Standard 10 (re-verify on a fresh rebuild) are the two ends of the same loop.
Build a real artifact (R7) — charly box build <image> / go build / charly vm build <vm>. Not just go test. Not just charly box generate.
Verify the emitted artifact's content (R8) — grep -c supervisord-conf .build/<image>/Containerfile for any image that uses supervisord; charly eval libvirt domain-xml <vm> for a VM.
Verify critical OCI / capability labels post-build (R8) — charly box labels <ref> --format init (or any ai.opencharly.<key> shorthand) prints the BUILT ref's label and exits non-zero when absent; charly box labels <ref> lists the whole capability contract (/charly-internals:capabilities). Empty / missing label → the detection path silently returned nil → regression.
Deploy to a DISPOSABLE target (R10) — NEVER experiment on a resource that doesn't carry disposable: true. If no suitable disposable target exists, create one first (charly deploy add <name> <ref> --disposable or mark a VM in vm.yml and charly vm create). The setup is part of the task. On a disposable target: charly update <name> (unattended). On anything else: confirm with the user before any irreversible destroy — EXCEPT preempting a declared-preemptible: holder, which is standing-authorized (reversible: graceful stop + guaranteed restore).
Target must reach steady-state — charly status <image> → running; charly eval libvirt info <vm> → state running; SPICE socket file exists and accepts a handshake. If the service start-limit is hit, the container is crashing — read charly logs <image> and reproduce in a disposable shell (charly shell <image>, running the service command manually).
Run the declarative test suite — charly eval live <image> full three-section pass against the live container (or --uri / --host remote equivalent for a remote target).
Verify the deployed binary is the one you built (R9) — charly version on the target matches the expected CalVer, and charly status <image> (detail Image ref: line; JSON image_ref) shows the running container's image ref carries THIS build's CalVer tag, not the prior one. Source-only changes (Syncthing, git push) do NOT update the deployed binary; you must build AND deploy on the target host.
Verify runtime deps are installed via package management (R9) — on the HOST: charly doctor (dependency status), or the host package manager directly (rpm -q <pkg> / pacman -Q <pkg> — host packages are not charly-managed resources, so the package manager IS their interface); inside a container: charly cmd <box> 'rpm -q <pkg>'. Manual installs do NOT count — they won't survive a fresh install on a synced host. Every runtime dep must live in setup.sh + pkg/arch/PKGBUILD.
Leave the target healthy, not paused/errored — the final charly status (and charly eval libvirt info <vm> for a VM) is healthy. If the target is in a broken state during exploration, charly update it back to the committed config before continuing — never layer experiments on broken state.
Re-verify on a FRESH rebuild after committing the source-level fix (R10) — charly update <disposable-target> one more time from clean, with the new source applied. Run standards 1–9 again against this fresh rebuild. THIS IS THE ACCEPTANCE GATE. A fix that works on a hand-patched target but not on a fresh rebuild is a regression waiting for the next unrelated rebuild to wipe your patch. Paste BOTH the exploratory-pass output AND the fresh-rebuild-pass output into the conversation — the user sees both.

`charly eval live parent.child` reaches the actual leaf

charly eval live <parent>.<child> walks the dotted deployment path through ResolveDeployChain (charly/deploy_chain.go) and constructs a multi-hop DeployExecutor chain that lands probes inside the leaf's actual venue — command: id for a pod-in-VM leaf returns the inner pod's user, not the parent VM's. Live-eval and AI-iteration-scoring chain construction go through the same code path charly deploy add uses. For deeply-nested paths, each segment adds one hop:

charly eval live eval-vm                            # → SSHExecutor (1 hop)
charly eval live eval-vm.inner                      # → SSH + podman exec charly-eval-vm_inner (2 hops)
charly eval live eval-vm.inner.deeper               # → SSH + 2× podman exec (3 hops)

Same chain primitive (NestedExecutor) used everywhere — charly deploy add, charly eval live, charly eval run (AI iteration scoring).

Exception — a POD leaf nested in a VM delegates its eval to the guest. Because the host-vantage probes (the protocol verbs cdp/wl/dbus/vnc/mcp, which shell out to a host charly eval <verb> subprocess resolving the container on the HOST's podman; and ${HOST_PORT} addr/http, resolved via HOST podman inspect) cannot reach a container living in the guest, charly eval live <vm>.<pod> does NOT chain-dispatch those per-check (they would SKIP). Instead runVm runs charly eval live <pod> IN the guest over SSH (guestNestedEvalCmd → SSHExecutor.RunCapture), where the nested pod is a DIRECT pod — guest-local podman, ports on guest localhost, the guest charly (installed by EnsureCharlyInGuest) — so cdp/wl/mcp + ${HOST_PORT} run natively, zero skips. The guest reads the SAME baked checks from the cp-box'd pod image, so the check set is identical; the host propagates the guest's report + exit code. A pod-in-pod or vm-in-vm leaf (no host↔guest container boundary) still uses the multi-hop chain above. See "ONE primitive: scenario.pod (flat or dotted)" below for the symmetric authoring surface in eval recipes (kind: recipe).

Specific anti-patterns observed and banned

"Unit tests pass → cutover done" — no. Build + deploy + run + test, every time.
"Retested after update → still passing!" but the pre-update test was against the old running image and the post-update container failed to come up — see the is not running error and conclude the update broke it, not that "tests pass in aggregate."
"Service start failed, probably a transient" — no. A dependency job for X failed + immediate exit is a real error. Read charly logs <image>; charly shell <image> reproduces it in a disposable container.
"Lifecycle: dev implies disposable" — no. disposable: true is the ONE authorization for autonomous destroy + rebuild. Lifecycle tags are human-facing metadata; they do not authorize anything. See /charly-internals:disposable.
"This is a dev box so I can just nuke it" — no. The only authorization for autonomous destroy is the explicit disposable: true field on a specific deploy. Everything else requires user confirmation, regardless of hostname.
"I tested on the VM I've been patching all afternoon, looks fine" — incomplete. Run charly update <disposable-target> once more from clean and re-verify before claiming success. Without the fresh-rebuild re-verification, your "fix" may be latent on hand-patched state.
"I'll test it later / Phase 2" — no. If the plan said "clean cutover in one PR", don't invent a Phase 2.
"The skill/code says so, so I don't need to test it" — for a HIGH-RISK assumption, no: docs drift and code has bugs; prove it on a disposable bed FIRST (Standard 0). Conversely, do NOT bed-test a LOW-risk orientation fact a skill already states — that's wasted effort. Risk, not documentation status, is the trigger.

If the container needs state that's only available in deploy (volumes, env, tunnel), author the test at scope: deploy. If it needs something at build only (binary path, package presence), author at scope: build. Both scopes must pass for the cutover to be real.

Confidence tier mapping: The "fully tested and validated" confidence level in CLAUDE.md's AI-attribution table requires ALL 10 standards met — including Standard 10, the fresh-rebuild re-verification. Anything short of that ships at a lower confidence tier.

Overview

charly ships a goss-inspired declarative testing framework built into the CLI. Eval checks are authored inline under eval: (or deploy_eval:) in a candy/box charly.yml, or charly.yml. They are embedded as a three-section OCI label (ai.opencharly.eval → {candy, box, deploy}) so any pulled image is self-testable without its source repo. A local charly.yml overlay can add checks or override baked ones by id:.

The runner resolves deploy-time variables (actual host port mappings, volume backings, env vars, container IP, DNS) at execution time, so a check written once works unchanged when charly.yml remaps ports.

Quick Reference

| Action | Command | Description | |--------|---------|-------------| | Pure-box eval (disposable, build-scope) | charly eval box <image> | Candy + box sections only, in podman run --rm (no host port mappings, no volumes attached) | | Live full-stack eval (running deployment) | charly eval live <name> [-i instance] | All three sections run via podman exec / SSH / nested chain, with full runtime variable resolution | | R10 bed (full sequence) | charly eval run <bed> / charly eval run --all-beds | Build → eval image → deploy → eval live → fresh update → tear down on a kind: eval disposable bed. Canonical R10 gate | | AI iteration loop | charly eval run <score> | Drives an AI through plateau-bounded iterations against kind: score | | Validate authored tests at config time | charly box validate | Schema, scope/variable consistency, id uniqueness | | Inspect effective spec | charly box inspect <image> | JSON includes merged eval structure | | Filter by verb | charly eval live <name> --filter file --filter port | Repeatable | | Filter by section | charly eval live <name> --section deploy | One of: candy / box / deploy | | Output format | charly eval live <name> --format json\|tap\|text | Default text |

Exit codes

charly eval uses a goss/pytest-style three-way exit convention so automation (and charly eval run <bed>) can tell "the thing under test is broken" apart from "the eval couldn't run":

| Code | Meaning | |---|---| | 0 | All checks passed. | | 1 | Command / usage / infra error — bad args, undeclared deploy, container not running, image build/deploy/vm-create failed. The eval never produced a pass/fail verdict. | | 2 | The eval RAN and one or more checks FAILED (EvalCheckFailExitCode). |

charly eval box / charly eval live return 2 directly when checks fail. charly eval run <bed> propagates 2 when the bed's eval step (eval-image / eval-live) fails on checks, but 1 when an infra step (build / deploy / vm-create) fails — so a broken bed image is distinguishable from a genuine test failure. charly eval run --all-beds returns 2 only when every failing bed failed on checks; any infra failure makes it 1. Implementation: EvalFailedError + EvalCheckFailExitCode (charly/eval_cmd.go), mapped to the process exit code in main() via errors.As.

Image preflight (host-target runs only)

When charly eval run --on-host <name> (or any score whose target resolves to TargetKindHost) dispatches the runner, an image preflight runs FIRST: walk the score's recipes, collect every distinct scenario.pod plus score.target_image, deduplicate, and ensure each image is present in local podman storage BEFORE handing off to the host runner.

For each discovered image, the algorithm is:

LocalImageExists(podman, ref) — short-circuit if already present. Idempotent on re-runs.
charly box pull <ref> — preferred path. Resolves short names via cfg.Images[<name>], full registry refs pass-through, remote @github.com/... refs walk through ResolveRemoteImage.
charly box build <name> — fallback when pull fails AND the identifier is a short name resolvable via the project's charly.yml. Build fallback is local-only; for any non-short identifier that fails to pull, the preflight aborts with an actionable error.

Failures abort the eval BEFORE any scenario runs, so operators see problems early rather than mid-scenario.

This is the only image-fetch surface in the system: deploys (any target — local, pod, vm, k8s) emit zero image-pull steps. A kind: local template has no image: field; image preflight lives on this verb. Operators with legacy YAML run charly migrate. See /charly-local:local-spec "What the deploy does NOT do" and CLAUDE.md "Deploy fetches NOTHING speculative".

Lives in charly/eval_image_preflight.go (EnsureImagePresent, ensureScoreImages); wired into EvalRunCmd.Run for the case TargetKindHost: arm. Pod / VM / k8s targets carry their own image inside their respective deploy schema and never trigger the preflight.

Three primary modes (`charly eval box` / `live` / `run`)

The surface is three orthogonal verbs, each named for what it evaluates:

charly eval box <image> — evaluates the image artifact in isolation. Spawns a disposable container via podman run --rm <image-ref> (ImageExecutor). Only eval:-block checks at scope: build run; deploy-scope checks are skipped with a clear message. The right choice for build-time invariants ("did this binary land at this path?", "is this package installed?") where a running service would only add noise.
charly eval live <name> — evaluates a running deployment (pod / vm / host / k8s; auto-resolved by <name> against charly.yml). Uses ContainerExecutor / SSHExecutor / NestedExecutor (for dotted-path children) + ResolveEvalVarsRuntime so deploy-scope checks see real supervisord state, real HOST_PORT:<N> mappings (including host-networked containers via HostConfig.NetworkMode detection), and real in-container env. Same dispatcher that powers charly eval live parent.child for pod-in-vm topologies.
charly eval run <score> — drives an AI runner through plateau-bounded iterations against a kind: score. See "AI iteration loop semantics" below.

The mode is explicit in the verb; there is no autodetect or implicit fallback. Choose the mode by picking the right verb.

Which verb/bed proves what (CLAUDE.md R7): charly eval box passes on zero-content stages too — it is NOT a substitute for the generated-artifact checks (R8). For the R10 gate, pick the kind: eval bed whose kind matches what you changed (eval-pod for the combined box/candy/pod/DeployTarget mechanism, eval-local, eval-k3s-vm, or a feature bed). charly eval run <score> is the multi-hour AI-iteration benchmark, never a quick gate — the same verb dispatches by the kind the name resolves to.

The banner after charly eval box reports the image ref:

Image: ghcr.io/overthinkos/fedora-coder:latest

The meta.Box short-name (from the ai.opencharly.box OCI label) is used by charly eval live for the charly-<image> container-name lookup — full image refs like ghcr.io/overthinkos/fedora-coder:latest are correctly mapped to charly-fedora-coder. Implementation: charly/eval_cmd.go EvalBoxCmd.Run() and EvalLiveCmd.Run().

Agent Driven Evaluation (ADE) — `charly box/eval feature run` + the agent grader

ADE runs an entity's OWN baked Gherkin scenarios (the description.scenario blocks, shipped in the ai.opencharly.description OCI label) as acceptance tests. It is the canonical Gherkin acceptance pattern, named for the agent that drives the evaluation: a prose-only step is graded by an AI agent, and the agent's verdicts drive the red→green loop. This complements the flat eval: checks — charly eval box / charly eval live run the eval: list; charly box/eval feature run run the description.scenario list.

The six-stage loop

| Stage | Who | Command | What | |---|---|---|---| | Specify | you / an agent | charly candy add-scenario <layer> <name> --given/--when/--then (idempotent; MCP tool candy.add-scenario) | Author the goal (feature:/narrative:) + scenarios on the CANDY that provides the behaviour (it bakes into every box that composes the candy — R3). | | Bind | author (implicit) | charly feature pending <entity> lists the gaps | Embed a check verb → deterministic; leave the step prose → agent-graded. | | Run | you / CI | charly box feature run <image> / charly eval feature run <deployment> | Execute the scenarios; per-scenario pass/fail + grader evidence. | | Iterate | you / an agent | edit + re-run, OR charly eval run <score> | Drive red→green by hand, or let the plateau-bounded AI loop write the code until scenarios pass. | | Bake | the build | charly box build | Goal + scenarios bake into ai.opencharly.description; runs source-less against a pulled image. | | Gate | R10 | charly eval run <bed> | Runs the bed image's deterministic scenarios — every candy ships them (ADE authoring is mandatory; charly box validate enforces it). The live agent grader via charly eval feature run stays opt-in. |

The BIND contract — a step binds to its verifier BY SHAPE

description:
  scenario:
    - name: dashboard-loads
      step:
        - then: the page title is Dashboard
          cdp: eval
          expression: document.title        # DETERMINISTIC — embedded check verb
        - then: the chart looks populated, not empty   # PROSE-ONLY — agent-graded

A step with an embedded check verb (file/http/command/cdp/mcp/…) runs that check. A step with only Gherkin prose (then: and no verb) binds to the agent grader. charly feature pending lists every prose-only step so an author sees the still-unbound work.

The two run verbs

charly box feature run <image> — BUILD scope. A disposable container (podman run --rm per check), deterministic steps only. A prose-only step has no stable target to probe, so it reports unbound (advisory skip; --strict fails it). Resolves the image against local storage (never charly.yml), like charly eval box.
charly eval feature run <deployment> — DEPLOY scope. Against a running image-backed (pod) deployment; deterministic steps run their checks and prose-only steps bind to the agent grader (unless --no-agent). Flags: -i/--instance, --format text|json|tap|junit, --tag <expr>, --agent <name>, --timeout <dur>, --no-agent, --strict.

Exit codes follow the 0/1/2 convention (a scenario fail → exit 2 via EvalFailedError; an infra error → 1). No scenarios baked → a no-op pass.

The agent grader contract

For a prose-only step (when grading is enabled), the runner spawns the configured kind: agent CLI ONCE (bounded — never the plateau loop), handing it the scenario goal (feature/narrative), the step's Given/When/Then, the live target name, and the instruction that it MAY probe via charly cmd <target> …, charly eval mcp/cdp/wl <target> …, charly status <target>. The agent returns a single-line JSON verdict {"verdict":"pass|fail","evidence":"…"}; the runner parses it (plain or stream-json) into a pass/fail with evidence. An unparseable / timed-out / launch-failed grader is a FAIL with the raw output — never a silent pass. The grader wall-clock cap is --timeout → the ai entry's timeout: → a 5m default. Which agent: --agent <name> → the sole configured agent: entry → an explicit "specify --agent" error.

--no-agent is the deterministic-only mode (CI): prose-only steps report unbound (visible, never silently green). The unattended charly eval run <bed> gate runs charly eval feature run <bed> --no-agent so a bed stays deterministic and free; exercise the live grader with an explicit charly eval feature run <name>.

Worked example (end to end)

# 1. Specify two scenarios on the layer that provides the behaviour.
charly candy add-scenario web-layer dashboard-loads \
  --then "the dashboard renders without a 500"          # prose → agent-graded
$EDITOR candy/web-layer/charly.yml                        # add a deterministic step with cdp: eval

# 2. See what's still prose-only (the authoring gaps).
charly feature pending layer:web-layer

# 3. Build → run build-scope acceptance (deterministic steps).
charly box build web
charly box feature run web

# 4. Deploy, then run deploy-scope acceptance with the agent grader.
charly start web
charly eval feature run web                # prose steps agent-graded against the live pod
charly eval feature run web --no-agent     # deterministic-only (CI): prose reports unbound

Implementation: charly/eval_feature_run.go (the verbs), charly/eval_feature_grader.go (AgentGrader + RunAIOnce + parseVerdict), the Runner.Grader dispatch in charly/description_run.go, and charly/description_cmd.go (charly feature list/pending/validate). The scenario engine (RunScenarios) and target resolution are shared with the harness loop and charly eval box/live (R3).

Subcommands: live-container verbs

charly eval live is both the declarative test runner AND the grouping point for the interactive verbs that drive a running service. Kong's default:"withargs" tag means charly eval live <image> still dispatches to the runner — only the explicit subcommand names below take over when matched.

| Subcommand | Skill | Purpose | |-----------|-------|---------| | charly eval cdp … | /charly-eval:cdp | Chrome DevTools Protocol — open/list/close tabs, click, type, eval, screenshot, axtree | | charly eval wl … | /charly-eval:wl | Wayland desktop input/windows/clipboard + sway IPC | | charly eval dbus … | /charly-eval:dbus | D-Bus calls, introspection, desktop notifications | | charly eval vnc … | /charly-eval:vnc | VNC framebuffer screenshot + click/key/type/passwd | | charly eval mcp … | (this skill) | MCP client — ping/list-tools/list-resources/list-prompts/call/read/servers against any mcp_provide endpoint. Speaks github.com/modelcontextprotocol/go-sdk (Streamable HTTP by default, SSE when transport: sse). | | charly eval record … | /charly-eval:record | Recording sessions — start/stop/list/cmd. Terminal (asciinema) or desktop (pixelflux/wf-recorder). Container-only. | | charly eval spice … | /charly-eval:spice | SPICE wire client for VMs — handshake, native-SPICE framebuffer screenshot, input injection. VM-only. | | charly eval libvirt … | /charly-eval:libvirt | libvirt-RPC test commands for VMs — info, domain XML, QMP, qemu-guest-agent, snapshots, events. VM-only. |

These eight verbs live under charly eval because every one of them is a "probe or drive a running service" operation — the same surface the declarative test runner composes when it executes checks.

Reserved image names: because subcommand names take priority when matched, an image literally named cdp, wl, dbus, vnc, mcp, record, spice, or libvirt cannot be run via charly eval live <name> — use the explicit charly eval live <name> form or rename the image. No such images currently exist in charly.yml.

Gotcha — stale container-baked charly binary: charly eval dbus notify and charly eval dbus call delegate to the container's own charly binary (see charly/notify.go:20, charly/dbus.go:195,229). If the container bakes an charly binary too old to know the charly eval <verb> subcommand path, the delegation fails. Fix by rebuilding and redeploying any image that bakes charly (grep charly.yml for - charly$ to find them). Test runner itself is unaffected — this only bites the host→container delegation paths.

Authoring: the `eval:` list

Every check is a list entry with exactly one verb discriminator plus shared modifiers and verb-specific attributes. This mirrors the task: pattern in charly.yml.

Gold-standard pattern (redis candy)

# candy/redis/charly.yml
eval:
  # Build-scope — run inside the built image via `podman run --rm`.
  - id: redis-binary
    file: /usr/bin/redis-server
    exists: true
  - id: redis-cli-binary
    file: /usr/bin/redis-cli
    exists: true
  - id: redis-package
    package: valkey-compat-redis     # see "Authoring Gotchas" on Fedora 43 renames
    installed: true

  # Deploy-scope — HOST_PORT:N resolves to the effective port mapping, so
  # the same test works whether the deploy publishes 6379:6379 or 16379:6379.
  - id: redis-responds
    scope: deploy
    command: redis-cli -h 127.0.0.1 -p ${HOST_PORT:6379} ping
    stdout: PONG
    in_container: false    # run redis-cli from the host
  - id: redis-port-open
    scope: deploy
    addr: 127.0.0.1:${HOST_PORT:6379}
    reachable: true

The six entries cover: binary existence, package identity, a functional live probe, and raw TCP reachability. Adopt this shape for any primary service layer.

Cross-distro package names (`package_map:`)

The package: verb chains rpm -q || dpkg -s || pacman -Q with a single literal name, so a test that works on Fedora (openssh-server) fails on Arch (where the package is just openssh). package_map: is the authoring hook: the first key that matches any of the image's distro: tags wins; otherwise the package: scalar is used as the fallback. Source: charly/testrun_verbs.go:resolvePackageName + the Runner.Distros field wired from meta.Distro at both test entry points.

# candy/sshd/charly.yml — cross-distro authoring
- id: openssh-server-package
  package: openssh-server              # Fedora / Debian default
  package_map:
    arch: openssh                 # Arch ships the metapackage as 'openssh'
    fedora: openssh-server
    fedora:43: openssh-server          # explicit version tag — matches before 'fedora'
  installed: true

Tag priority: entries in distro: are in priority order (e.g. ["fedora:43", fedora]), so package_map: can be keyed on either the version-specific or the family tag, and the more-specific tag wins naturally. An empty-string map value falls through to the next tag (see TestResolvePackageName in charly/testrun_verbs_test.go).

Verb catalog (goss-parity feature set)

| Verb | Typical attributes | Notes | |------|---------------------|-------| | file | exists, mode, owner, group_of, filetype, contains, sha256 | group_of (not group) avoids the verb collision | | package | installed, versions | Tries rpm / dpkg / pacman — exact match on installed name (see gotchas) | | service | running, enabled | Supervisord first, then systemd | | port | listening, ip, reachable | listening needs ss/netstat inside the container — absent from minimal images | | process | running | Needs pgrep — absent from minimal images | | command | exit_status, stdout, stderr, in_container | Default runs via podman exec; set in_container: false to run from host | | http | status (int, single value), body, headers, method, request_body, allow_insecure, no_follow_redirects, ca_file, timeout | Host-side under charly eval live; curl-in-container under charly eval box | | dns | resolvable, addrs, server | Host resolver for charly eval live; getent hosts for charly eval box | | user | uid, gid, home, shell | getent passwd | | group | gid, groups | getent group | | interface | mtu, addrs | ip -o addr show | | kernel-param | value (scalar or matcher) | sysctl -n | | mount | mount_source, filesystem, opts | findmnt | | addr | reachable, timeout | Pure Go-native net.DialTimeout for charly eval live; nc for charly eval box | | matching | contains | Pure in-process value matching — no target probe | | cdp | Method name (status/list/eval/text/html/url/axtree/screenshot/open/click/type/raw/wait/coords + spa-) + method-specific modifiers (tab, expression, url, selector, text, artifact, x, y) + shared stdout/stderr/exit_status/artifact_min_bytes | Deploy-scope only. Wraps charly eval cdp <method>. See Live-container verb catalog below. | | wl | Method name (screenshot/status/click/type/key/key-combo/mouse/scroll/drag/clipboard/toplevel/windows/focus/close/geometry/xprop/atspi/exec/resolution + overlay-/sway-*) + method-specific modifiers (x, y, text, key, combo, target, action, artifact) + shared matchers | Deploy-scope only. Wraps charly eval wl <method> (including sway and overlay nested subgroups). See Live-container verb catalog below. | | dbus | Method name (list/call/introspect/notify) + method-specific modifiers (dest, path, method, args, text) + shared matchers | Deploy-scope only. Wraps charly eval dbus <method>. | | vnc | Method name (status/screenshot/click/mouse/type/key/rfb/passwd) + method-specific modifiers (x, y, text, key, artifact) + shared matchers | Deploy-scope only. Wraps charly eval vnc <method>. | | mcp | Method name (ping/servers/list-tools/list-resources/list-prompts/call/read) + method-specific modifiers (tool, uri, input, mcp_name) + shared matchers | Deploy-scope only. Speaks github.com/modelcontextprotocol/go-sdk to any mcp_provide endpoint. See "Method allowlist — mcp" below. |

Shared modifiers

| Field | Purpose | |-------|---------| | id | Optional stable identifier. Enables charly.yml to override by id. Unique per section per image. | | description | Human-readable label for reports. | | skip: true | Always skip this check (reported but doesn't fail the run). | | exclude_distros: [<tag>, ...] | Skip the check when any of the image's distro: tags matches an entry. Use for probes that only apply on some distros (e.g. file: /usr/bin/fastfetch is valid on Fedora/Arch/Debian but fastfetch is dropped from Ubuntu 24.04's noble main). Matched against the image's full distro list (["ubuntu:24.04", "ubuntu", "debian"]), so either ubuntu:24.04 or ubuntu matches. See charly/testspec.go:Check.ExcludeDistros and charly/testrun.go:runOne. | | timeout: "5s" | Per-check timeout (http, addr). | | scope: build\|deploy | Default build at layer/image level, deploy at deploy level. |

`exclude_distros:` worked example

- id: fastfetch-binary
  file: /usr/bin/fastfetch
  exists: true
  exclude_distros:
    - ubuntu:24.04

On ghcr.io/overthinkos/ubuntu-coder:latest this reports skipped (excluded on distro "ubuntu:24.04") instead of failed. On any other image it runs normally. Prefer this over dropping the test entirely — it keeps the guard in place for Fedora/Arch/Debian and documents why Ubuntu is special.

Compared to package_map: — package_map: changes what the test probes for per distro (same semantic, different package name). exclude_distros: skips the check entirely — use it when the functionality genuinely doesn't exist on a given distro.

Matcher forms

Any MatcherList attribute (stdout, stderr, body, headers, contains, opts, value) accepts three shapes — picked by yaml.v3 at parse time:

stdout: PONG                         # scalar  → [{equals: PONG}]
stdout: [PONG, READY]                # list    → [{equals: PONG}, {equals: READY}]
stdout:
  - equals: PONG                     # operator map
  - contains: ["ready", "ok"]
  - matches: "^[A-Z]+$"
  - not_contains: "error"

Supported operators: equals, not_equals, contains, not_contains, matches, not_matches, plus lt/le/gt/ge (numeric, added as verbs need them).

status: is NOT a MatcherList — it's a plain int on the http verb. One code per test. See Authoring Gotcha #2.

Live-container verb catalog: `cdp`, `wl`, `dbus`, `vnc`, `mcp`

These five verbs wrap the corresponding charly eval <verb> <method> CLI subcommands so every live-container operation (browser automation, Wayland input/screenshot, D-Bus calls, VNC framebuffer capture, MCP protocol probes) is authorable as a declarative check. All five are deploy-scope only — they need a running container with port mappings; charly box validate rejects them in build scope, and charly eval box skips them at runtime with a clear message.

Assertion semantics: subprocess delegation — the runner executes charly eval <verb> <method> <image> <args…> [-i <instance>] on the host, captures stdout/stderr/exit, and feeds the output through the existing matcher pipeline. Queries (status/list/eval/text/html/url/axtree/screenshot/…) produce assertable output. Side-effect actions (click/type/open/close/…) pass when they exit 0 — follow them with a query check to verify the effect.

No state chaining (Phase 1): there's no framework-managed variable to carry a tab ID or window handle from one check to the next. Authors rely on conventions (new tabs are usually ID 1) and capture state manually via follow-up cdp: list checks if needed. Phase 2 (deferred) will add capture: + ${CAPTURED:name} for stateful flows.

Method allowlist — `cdp` (15 methods + 6 SPA-nested)

Queries: status, list, url, text, html, eval, axtree, coords, raw, wait, screenshot. Actions: open, close, click, type. SPA-nested (selkies coordinate-scaling / passthrough input): spa-status, spa-click, spa-type, spa-key, spa-key-combo, spa-mouse.

- id: cdp-up
  cdp: status
  stdout:
    equals: ok

- id: cdp-page-title
  cdp: eval
  tab: "1"
  expression: "document.title"
  stdout: "Dashboard"

- id: cdp-screenshot-valid
  cdp: screenshot
  tab: "1"
  artifact: /tmp/cdp.png
  artifact_min_bytes: 10000       # PNG must be non-empty

Method allowlist — `wl` (22 top-level + 4 overlay + 12 sway)

Queries: screenshot, status, toplevel, windows, geometry, xprop, atspi, clipboard. Actions: click, double-click, mouse, scroll, drag, type, key, key-combo, focus, close, fullscreen, minimize, exec, resolution. Overlay-nested: overlay-list, overlay-status, overlay-show, overlay-hide. Sway-nested: sway-tree, sway-workspaces, sway-outputs, sway-msg, sway-focus, sway-move, sway-resize, sway-kill, sway-floating, sway-layout, sway-workspace, sway-reload.

- id: wl-desktop-captured
  wl: screenshot
  artifact: /tmp/wl.png
  artifact_min_bytes: 10000

- id: wl-has-windows
  wl: toplevel
  stdout:
    matches: "."        # at least one line of output

- id: wl-sway-has-workspace-1
  wl: sway-workspaces
  stdout:
    contains: '"name":"1"'

Method allowlist — `dbus` (4 methods)

Queries: list, call, introspect. Actions: notify.

- id: dbus-notifications-registered
  dbus: list
  stdout:
    contains: "org.freedesktop.Notifications"

- id: dbus-get-capabilities
  dbus: call
  dest: org.freedesktop.Notifications
  path: /org/freedesktop/Notifications
  method: org.freedesktop.Notifications.GetCapabilities
  args: []
  stdout:
    contains: body

Method allowlist — `vnc` (8 methods)

Queries: status, screenshot, rfb. Actions: click, mouse, type, key, passwd.

- id: vnc-up
  vnc: status
  stdout:
    equals: ok

- id: vnc-framebuffer-captured
  vnc: screenshot
  artifact: /tmp/vnc.png
  artifact_min_bytes: 5000

Method allowlist — `mcp` (7 methods)

Queries: ping, servers, list-tools, list-resources, list-prompts, read. Actions: call.

The mcp verb speaks the Model Context Protocol to any server advertised via mcp_provide. URL resolution is automatic — the runner reads the image's ai.opencharly.mcp_provide OCI label, substitutes {{.ContainerName}}, runs the entries through podAwareMCPProvides, and then rewrites the host portion to 127.0.0.1:<published-host-port> using the same port-mapping data that powers ${HOST_PORT:N}. No URL argument is needed in YAML.

Transport dispatch: transport: http (or empty) → Streamable HTTP; transport: sse → SSE. Anything else is rejected at dial time.

Disambiguation: if an image declares multiple mcp_provide entries, add mcp_name: <server> on the check; otherwise the single entry auto-picks.

- id: mcp-ping
  scope: deploy
  mcp: ping
  timeout: 10s                # optional; default 30s

- id: mcp-list-tools
  scope: deploy
  mcp: list-tools
  stdout:
    - contains: insert_cell   # plaintext "name\tdescription" per line
    - contains: execute_cell

- id: mcp-call-tool
  scope: deploy
  mcp: call
  tool: list_notebooks        # required
  input: "{}"                 # optional JSON arg blob
  exit_status: 0              # assert no IsError

- id: mcp-read-resource
  scope: deploy
  mcp: read
  uri: file:///tmp/example.txt
  stdout:
    - matches: "."            # non-empty body

Ad-hoc CLI equivalents (each leaf also supports --json for programmatic use and --name <server> for multi-server images):

charly eval mcp ping       <image> [-i <instance>] [--name <server>]
charly eval mcp servers    <image> [-i <instance>]
charly eval mcp list-tools <image> [-i <instance>] [--name <server>]
charly eval mcp call       <image> <tool> '<json-args>' [-i <instance>]
charly eval mcp read       <image> <uri>            [-i <instance>]

Output format: tab-separated plaintext (one record per line) so matchers can contains: without JSON decoding. list-tools emits <name>\t<description>; list-resources emits <uri>\t<name>\t<mime>; call emits the concatenated TextContent payloads; ping emits ok.

Artifact-validation modifiers

For artifact-producing methods (cdp: screenshot, wl: screenshot, vnc: screenshot, libvirt: screenshot, spice: screenshot, record: stop), set artifact: <path> to tell charly where to write the output AND any combination of the modifiers below to assert the artifact's correctness post-run.

artifact_min_bytes: <N> — assert the file is at least N bytes after the run. Guards against zero-byte files. Cheap (os.Stat only).
artifact_min_dimensions: WxH — assert decoded image width and height are each at least the given values. Reads only the PNG/JPEG header via image.DecodeConfig, so essentially free. Catches "the screenshot ran but the compositor produced 320×240 instead of the expected 1920×1080".
artifact_not_uniform: true — assert the image is not uniformly one color. Decodes the full image and samples 100 pixels at deterministic stride positions; fails if every sampled pixel has the same RGBA. Catches the failure mode artifact_min_bytes is blind to: a 100KB all-black PNG passes the byte check but fails this one. Use on every screenshot probe where "the image has real content" matters.
artifact_min_cast_events: <N> — for asciinema .cast artifacts (e.g. record: stop in terminal mode): validates the first line is the asciinema v2 header and counts at least N event lines after it. Catches "the recording started and stopped but captured nothing because nothing was typed".

# Combined screenshot validity assertion — bytes + dimensions + content.
- id: cdp-screenshot-real
  cdp: screenshot
  tab: "1"
  artifact: /tmp/cdp.png
  artifact_min_bytes: 5000
  artifact_min_dimensions: 800x600
  artifact_not_uniform: true

# Recording validity — bytes + event count.
- id: record-cast-has-events
  record: stop
  artifact: /tmp/session.cast
  artifact_min_bytes: 200
  artifact_min_cast_events: 5

The four modifiers are independent — set just the one you need or combine all four for the strongest "the artifact is real" assertion. Failure messages identify the specific artifact and the specific threshold that wasn't met.

Gotcha — stale container-baked `charly` binary

The dbus: verb invokes the container's charly binary via delegation. If the container bakes an charly binary too old to know the charly eval <verb> subcommand path, the delegation fails. Rebuild and redeploy any image that bakes charly (grep charly.yml for - charly$). The test runner itself is unaffected; this only bites the host→container delegation paths in the dbus verb.

Authoring Gotchas (learned the hard way)

These are the non-obvious issues that surface only when you run the tests against real containers. Skim before writing new checks.

1. Host-side deploy tests: `127.0.0.1:${HOST_PORT:N}`, NOT `${CONTAINER_IP}:${HOST_PORT:N}`

In rootless podman, the container's pod-network IP (e.g. 10.89.0.x) is not routable from the host. ${HOST_PORT:N} resolves to a port on the host's loopback that forwards into the container. Combining them is nonsense:

# ❌ WRONG — 10.89.0.x:HOST_PORT is unreachable from either side.
addr: ${CONTAINER_IP}:${HOST_PORT:8080}

# ✓ RIGHT — host-side access via the forwarded port.
addr: 127.0.0.1:${HOST_PORT:8080}

# ✓ ALSO RIGHT — in-container, using the container port directly.
command: curl -fsS http://127.0.0.1:8080/health
in_container: true

Use ${CONTAINER_IP} only from inside another container in the same pod.

2. `http` status is a single int — no lists

# ❌ WRONG — yaml unmarshals a list to int fails with "cannot unmarshal !!seq"
status: [200, 301, 302]

# ✓ RIGHT — pick the most-likely code. Or author multiple tests.
status: 200

For endpoints that return 400/401 when probed without auth (MCP servers, API endpoints), status: 400 is a legitimate "endpoint exists and refuses empty input" check.

3. `pgrep`, `ss`, `nc` are NOT in minimal images

The process:, port: listening, and nc-based command tests silently fail on images that skip procps-ng / iproute / ncat. Portable alternatives:

| Intent | Portable alternative | |--------|----------------------| | "supervisord is running" | command: supervisorctl pid; exit_status: 0 (in_container) | | "some program under supervisord" | service: <name>; running: true | | "port N listening in-container" | command: curl -fsS http://127.0.0.1:N/<route>; in_container: true | | "port reachable from host" | addr: 127.0.0.1:${HOST_PORT:N}; reachable: true (Go-native dial) |

4. `supervisorctl status` exits 3 when ANY program is non-RUNNING

Layers like hermes declare autostart=false programs (e.g. hermes-whatsapp), which leaves them STOPPED. supervisorctl status returns exit 3 — fails exit_status: 0 checks.

# ❌ BRITTLE — fails if any program is autostart=false.
command: supervisorctl status
exit_status: 0

# ✓ ROBUST — `pid` asks for supervisord's own PID, 0 iff the socket
# responds.
command: supervisorctl pid
exit_status: 0
in_container: true

5. Know which stream a `--version`-style command writes to

charly version writes to stdout (via fmt.Println) — the canonical candy/charly/charly.yml test asserts a stdout: matcher.

- id: charly-version
  command: /usr/local/bin/charly version
  exit_status: 0
  stdout:
    - matches: "[0-9]{4}\\.[0-9]+"

But other tools genuinely write to stderr — ssh -V, openssl version, some *-config --version scripts. Always probe first (charly cmd <image> '<cmd> 2>&1 >/dev/null' — if the output shows up here it was stderr; if silent, it's stdout). See also the ssh-version check in candy/ssh-client/charly.yml which IS a real stderr case.

6. Drop `$` anchor in `matches:` regexes on command output

Command stdout typically ends with \n. Go's regexp treats $ as end-of-input (not end-of-line by default), so \.ipynb$ won't match path.ipynb\n. Drop the anchor:

# ❌ WRONG — fails because trailing newline
stdout:
  - matches: "\\.ipynb$"

# ✓ RIGHT — substring match is usually what you want
stdout:
  - matches: "\\.ipynb"

7. `${VAR:-default}` is NOT supported by charly's resolver

charly's variable resolver accepts ${IDENT} only — no bash-style defaults, no parameter expansion. An unresolved variable is reported as a unresolved variables: <name> SKIP.

# ❌ Parsed as a single var name; reported as unresolved if not set
command: psql -U ${POSTGRES_USER:-postgres}

# ✓ Plain env-var reference; resolves from the container's environment
# at deploy scope. For a true default, set it in the layer's `env:`.
command: psql -U ${POSTGRES_USER}

8. Fedora 43 package renames — query the installed name, not the dnf request

The package: verb does exact string equality against the rpm database, not the dnf install request. When dnf resolves a virtual provides, the test must name the actual installed package.

| dnf request | Installed name on Fedora 43 | |-------------|-----------------------------| | redis | valkey-compat-redis | | liberation-fonts | liberation-sans-fonts (+ -serif-fonts, -mono-fonts) |

Always verify with charly shell <image> -c 'rpm -qf /path/to/binary'.

9. Volume paths don't exist at build-scope

/workspace, ${VOLUME_CONTAINER_PATH:name}, and similar paths are provisioned by charly config at deploy time. A build-scope test on them fails in charly eval box because the mount isn't attached.

# ❌ Fails under `charly eval box` — workspace isn't mounted
- id: workspace-dir
  file: /workspace
  filetype: directory

# ✓ Scope to deploy so it runs only against live containers
- id: workspace-dir
  scope: deploy
  file: /workspace
  filetype: directory

10. Disposable tests run as the image's default USER — not root

charly eval box runs podman run --rm <image> with the image's USER directive (typically uid 1000 for containers). Root-only files (e.g. /etc/sudoers.d/* is root:root 0750) report as "missing" because the user can't traverse the parent directory.

# ❌ Reports "exists=false" even though the file IS there (0750 dir)
- id: sudoers-charly-user
  file: /etc/sudoers.d/charly-user
  exists: true

# ✓ Verify the semantic (can I sudo?) instead of the file bit
- id: sudoers-charly-user
  command: sudo -n -l
  exit_status: 0
  stdout:
    - contains: "NOPASSWD"

11. Bootc images keep USER=root — tests must cover both modes

Gotcha 10 flips for bootc. charly/generate.go deliberately omits the final USER <uid> directive on bootc images because systemd (PID 1 in a bootc VM) manages user sessions via login, so the container's own USER directive is irrelevant. Result: the same charly eval box that runs as uid 1000 in a container runs as uid 0 in a bootc image.

That breaks any test that assumes the user-context default. A sudo -n -l check that expects NOPASSWD in the output works for non-bootc (USER=1000 → sudo lists the user's NOPASSWD rule) but fails for bootc (USER=0 → sudo prints root's Defaults block, which doesn't contain the literal string NOPASSWD). Same failure mode for any test -f ~/.config/foo that depends on $HOME=/home/user.

The fix: drop into user explicitly when running as root, stay as-is otherwise. Use runuser -u user -- <cmd> as the bridge — it's portable across util-linux versions and preserves stdout:

# Dual-mode: container (USER=1000) AND bootc (USER=root) both pass
- id: sudoers-charly-user
  command: |
    if [ "$(id -u)" = "0" ]; then
      runuser -u user -- sudo -n -l
    else
      sudo -n -l
    fi
  exit_status: 0
  stdout:
    - contains: "NOPASSWD"

Don't use runuser -l user -s /bin/bash -c '…'. On Arch's util-linux (2.42+), runuser -l … -c swallows the wrapped command's stdout when the shell is a login shell — reproduced: runuser -l user -s /bin/bash -c 'sudo -n -l' prints nothing and exits 0; runuser -u user -- sudo -n -l prints the full NOPASSWD listing. This bit the earlier shipping form of the test; the sshd candy now uses -u … --.

Worked example: /charly-coder:sshd ships exactly the -u user -- pattern. Alternative: make the check scope: deploy so it runs against a live container where you control the user-switch externally.

12. `charly eval box <short-name>` is ambiguous with multiple CalVer tags — use the full registry ref

charly eval box resolves its positional argument against local podman storage, not charly.yml. When the host has accumulated many CalVer tags for the same image (a normal consequence of iterative charly box build runs), the short form errors out:

charly: error: ambiguous short name "openclaw-desktop" in local storage;
           candidates: ghcr.io/overthinkos/openclaw-desktop:latest,
           ghcr.io/overthinkos/openclaw-desktop:2026.109.1418,
           ... Re-run with a full ref.

Use the fully-qualified registry ref:

charly eval box ghcr.io/overthinkos/openclaw-desktop:latest

This is different from charly box inspect, charly box build, and charly eval live (the live-service runner), which key off charly.yml and accept short names unambiguously. Only the disposable-container runner has this restriction because it does not consult charly.yml at all.

Three levels, three sections

| Section | Authored in | When it runs | |---------|-------------|--------------| | candy | eval: in candy/<name>/charly.yml (scope:"build") | charly eval box + charly eval live | | box | eval: in charly.yml per box (scope:"build") | charly eval box + charly eval live | | deploy | eval: with scope: deploy, or deploy_eval: in charly.yml, or local charly.yml eval: | charly eval live <name> only (deploy-scope checks need a running deployment with port mappings, volumes, and resolved runtime variables) |

The build label ai.opencharly.eval contains all three sections with origin: annotations (candy:<name>, box:<name>, deploy-default, deploy-local). CollectEval walks the base-image chain with a visited-image guard: cycles are reported by validateBoxDAG at validate time, but the collector itself terminates cleanly even if called on a pathological config.

Verb routing (which executor runs each check)

Every verb dispatches through one of two executors depending on the run mode — this is what makes the same eval: list work both against a disposable container (charly eval box) and a running service (charly eval live):

| Verb + attributes | Under charly eval live (running service) | Under charly eval box (disposable) | |-------------------|-----------------------------------|------------------------------------| | file, package, service, process, user, group, interface, kernel-param, mount | ContainerExecutor (podman exec) | ImageExecutor (podman run --rm) | | port with listening | ContainerExecutor (ss/netstat inside) | ImageExecutor | | port with reachable | Host-side net.DialTimeout | Skipped (no host port binding on a disposable run) | | command (in_container: true, default) | ContainerExecutor | ImageExecutor | | command (in_container: false) | Host-side exec.Command | Skipped | | http, dns, addr | Host-side (from the charly process) | In-container curl / getent hosts / nc | | matching | In-process matcher eval | Same |

The routing table lives in charly/testrun.go (runOne switch) and charly/testrun_verbs.go. When a check is unroutable (e.g. port: reachable under charly eval box), the runner reports it as skipped with a reason rather than failing the run.

in-container `command:` stdin is guarded

An in-container command: script (in_container: true, the default) is delivered to the pod shell over a stdin heredoc (NestedExecutor.wrapWithJump, "stdin-attached exec"). The runner wraps every such script in { <script>; } </dev/null (wrapContainerCommand, charly/evalrun.go) so a subcommand that reads stdin — adb shell, ssh, read, cat — cannot consume the rest of the heredoc (the not-yet-run script lines, which would otherwise truncate the check to its first command). Authors write plain multi-line scripts; no per-call-site </dev/null is needed. Host-side command: (in_container: false) runs via sh -c argv and is unaffected. For Android UI readiness, prefer the typed adb: wait-ui-settled / current-focus / keyevent verbs (/charly-eval:adb) over shell entirely.

Runtime variables

charly eval live resolves these via podman inspect on the running container before any check executes. ${NAME:arg} is parameterized form.

| Variable | Source | Scope | |----------|--------|-------| | ${USER}, ${HOME}, ${UID}, ${GID} | Image metadata (OCI labels) | build + deploy | | ${IMAGE} | Image metadata | build + deploy | | ${DNS}, ${ACME_EMAIL} | Image metadata + charly.yml overlay | build + deploy | | ${INSTANCE} | --instance flag / charly.yml key | deploy | | ${CONTAINER_NAME}, ${CONTAINER_IP} | podman inspect | deploy | | ${HOST_PORT:N} | Port mapping for container port N | deploy | | ${VOLUME_PATH:name} | Host path backing the named volume (bind source, encrypted mount, or _data dir) | deploy | | ${VOLUME_CONTAINER_PATH:name} | In-container mount path for a volume | deploy | | ${ENV_NAME} | Effective env var value on the running container | deploy | | ${PEER_HOST:name} | A SEPARATE deployment's container DNS name on the shared charly net (charly-<name>); cross-deployment addressing (see below) | deploy | | ${PEER_ENDPOINT:name:port} | A host-reachable 127.0.0.1:NNNN for a separate deployment's port (published port / VM ssh-forward); host-vantage cross-deployment addressing | deploy |

Build-scope checks may not reference deploy-scope variables — the validator flags this at charly box validate time.

No bash-style defaults: ${VAR:-fallback} is unsupported (see Authoring Gotcha #7). Plain ${VAR} only.

Cross-deployment probing — `on:` driver, `peer:` siblings, `${PEER_*}`

A check normally probes the deployment under test. Cross-deployment probing lets ONE deployment act as a test DRIVER against a SEPARATE deployment as the SUBJECT — the canonical case being a Chrome DRIVER pod that CDP-probes a separate web-server SUBJECT pod, so a real browser tests the subject without baking Chrome into the subject image. "A different kind of deployment tests another kind" — pod→pod, local→pod, and local→VM, all through one mechanism.

Three pieces compose it:

on: <driver> (the step modifier) — DISPATCH this probe against a named DRIVER deployment instead of the subject. For a cdp:/vnc:/mcp: check it connects to the driver's CDP/VNC/MCP endpoint (charly eval cdp <method> <driver>); for a command: check it runs in the driver's venue. Works in charly eval live, kind: eval beds, and recipes (the driver's own ${HOST_PORT}/${CONTAINER_IP} resolve too).
peer: siblings — bring the driver up ALONGSIDE the subject on the shared charly network (see /charly-core:deploy "Sibling peers"). A kind: eval bed (or a kind: deploy) declares the driver under peer:; it is brought up by the same deploy verbs and is NEVER eval-live'd (an instrument, not a subject).
${PEER_*} address variables — let the driven probe TARGET the subject:
- ${PEER_HOST:<name>} → the deployment's container DNS name on the shared charly net (charly-<name>; also verifies it is running). The pod→pod address: http://${PEER_HOST:<subject>}:8080. Reference the subject by its OWN deploy name (for a bed, that's the bed name — the container is charly-<bedname>).
- ${PEER_ENDPOINT:<name>:<port>} → a host-reachable 127.0.0.1:NNNN for that deployment's <port> (container published port, or an ssh -L forward for a VM/host subject). The host-vantage address a local/host driver uses to reach a pod/VM subject. (Both are runtime-only — a build-scope check can't use them.)
An unresolvable ${PEER_*} — the peer/subject is unreachable (not running, bad port, VM ssh-forward refused) — FAILS the referencing check, never SKIPs it. A skip on an unreachable dependency is a fake pass: it would let a bed go green even though the cross-deployment probe never reached its target. (Legitimate skips — skip: true, exclude_distros, a deploy-only var under build scope — are unaffected; only an unresolved mandatory ${PEER_*} is treated as a failure.)

Worked example — a Chrome pod CDP-probes a web-server pod (pod→pod)

# charly.yml eval: block — kind:eval bed: web SUBJECT (root) + chrome DRIVER (peer)
eval-cross-pod-cdp:
    target: pod
    box: web                       # the SUBJECT under test (nginx fixture on :8080)
    disposable: true
    # No port: — every inherited container port auto-allocates a free 127.0.0.1
    # host port (conflict-free across concurrent beds). Pin with `port:
    # ["H:C"]` only when a fixed host port is needed.
    peer:
        chrome:                    # the DRIVER, a sibling on the charly net
            target: pod
            box: chrome-headless   # headless Chrome + cdp-proxy (publishes 9222)
    eval:
        - id: web-fixture-up       # the subject serves its marker
          scope: deploy
          http: http://127.0.0.1:${HOST_PORT:8080}/
          status: 200
          body: [{contains: charly-fixture-web-content-marker}]
        - id: cdp-open-web-subject # DRIVE: chrome navigates to the subject over the charly net
          scope: deploy
          cdp: open
          on: chrome
          url: http://${PEER_HOST:eval-cross-pod-cdp}:8080
          eventually: 45s
          retry_interval: 3s
        - id: cdp-web-fixture-rendered  # ASSERT: the rendered page carries the marker
          scope: deploy
          cdp: text
          on: chrome
          tab: "1"                 # the first page (cdp `open` lands on tab 1)
          eventually: 30s
          retry_interval: 3s
          stdout: [{contains: charly-fixture-web-content-marker}]

bringUpPeers config+starts the chrome peer alongside the web root; the on: chrome checks dispatch charly eval cdp <method> chrome (connecting to chrome's published 9222) while ${PEER_HOST:eval-cross-pod-cdp} resolves to the web subject's charly-eval-cross-pod-cdp container, reached over the shared net.

Driver-box requirement (CDP): a headless Chrome CDP driver MUST launch with --remote-allow-origins='*' (Chrome 146+ rejects the CDP WebSocket upgrade otherwise — cdp open via HTTP works, but cdp text/eval/screenshot fail). The chrome-headless box sets it; see /charly-eval:cdp "Requirements".

Cross-kind: a host driver reaching a pod OR a VM subject (`${PEER_ENDPOINT}`)

${PEER_ENDPOINT} is kind-agnostic — it routes through the ONE host-vantage port resolver (resolveEvalEndpoint): a pod subject resolves to its auto- published port (podman port); a VM subject resolves to an ssh -L 127.0.0.1:<rand>:127.0.0.1:<guestport> forward over the managed charly-<vm> alias (the SAME forward eval-k3s-vm uses to reach the cluster API host-side). So the identical check — command: curl …${PEER_ENDPOINT:<subject>:<port>} dispatched on: a kind: local host driver — proves a pod subject (eval-cross-local-http) and a VM subject (eval-cross-vm-http) with ZERO kind-specific eval code; only the subject kind differs.

The VM subject is driven from a host-side (target: local) peer, NOT a pod driver. ${PEER_ENDPOINT} is a host-vantage 127.0.0.1:NNNN address; a pod driver can't reach it (that loopback is the pod's own netns), and a rootless qemu:///session VM shares no L2 bridge with rootless pods. The generic, no-hack VM cell therefore uses a local driver peer — the host where both the published port and the ssh-forward live. (${PEER_HOST} — pod-net container DNS — is the pod→pod address and does NOT reach a VM.)

Deploy.yml overlay rules

A local charly.yml can contribute its own eval: list per image. Merge rules applied by charly eval live:

Local entries with an id: matching a baked entry replace that entry.
Entries without a matching id: are appended.
To disable a baked check, reference it with id: and skip: true.

# ~/.config/charly/charly.yml
box:
  redis-ml:
    eval:
      - id: redis-responds                  # overrides image's baked check
        command: redis-cli -h 127.0.0.1 -p 16379 ping
        stdout: PONG
        in_container: false
      - id: external-tunnel                 # appended (no baked match)
        http: https://redis.tailnet.ts.net/health
        status: 200
      - id: old-probe                       # disable a legacy baked check
        skip: true

Typical workflow

# 1. Author tests in candy/<name>/charly.yml.
# 2. Validate schema + references.
charly box validate

# 3. Build the image (tests are auto-embedded as LabelEval).
#    LABELs are emitted LAST in the final stage — a test edit rebuilds
#    in ~2 sec (cache hits every upstream RUN/COPY).
charly box build redis-ml

# 4. Run against a disposable container (build-scope checks only).
charly eval box redis-ml

# 5. Start the service and test end-to-end.
charly start redis-ml
charly eval live redis-ml

# 6. Override a baked deploy check via local charly.yml, re-test.
$EDITOR ~/.config/charly/charly.yml    # add tests: entry with matching id
charly eval live redis-ml

Coverage snapshot (7 currently-tested images)

Reference numbers from the last end-to-end session:

| Image | Tests | Pass | Fail | Skipped | |-------|-------|------|------|---------| | filebrowser | 24 | 24 | 0 | 0 | | jupyter | 32 | 32 | 0 | 0 (includes 3 declarative mcp: checks) | | openwebui | 24 | 24 | 0 | 0 | | hermes | 50 | 50 | 0 | 0 | | immich-ml | 63 | 61 | 0 | 2 (redis port internal) | | selkies-desktop | 91 | 91 | 0 | 0 | | sway-browser-vnc | 92 | 92 | 0 | 0 (includes 7 declarative cdp/wl/dbus/vnc/mcp checks; 2 mcp tests come from the chrome-devtools-mcp layer) |

Total: 376 checks across 7 images (0 failing).

The "skipped" entries are intentional — they reference ports that aren't mapped in the containing image's port: block. Skipping them is correct behavior; they'd pass in an image that does expose those ports.

Realistic run-time expectations

charly eval run default is a multi-phase AI-iteration loop. Representative per-phase durations (measured solving all 92/92 across 9 iterations on a disposable: true eval-sandbox):

| Phase | Recipe | Typical iter1 duration | Notes | |---|---|---|---| | 1 | single-pod-system-state | ~10 min | trivial pod deploy | | 2 | network-and-http | ~few min (cumulative scoring keeps phase 1 in scope) | nginx + curl in fedora:43 | | 3 | cross-pod-nonce-traffic | ~few min | redis + redis-client; EVAL_NONCE_KEY/VALUE substituted per iter | | 4 | mcp-protocol-probe | ~14 min | jupyter image build (~6 min) + 32 scenarios | | 5 | kubernetes-cluster | ~20-30 min | k3s/kwok cluster bring-up | | 6 | desktop-display-and-input | ~50 min iter1, ~10 min iter2 | sway-browser-vnc image build (~14 min) + 13 cdp/wl/vnc scenarios; commonly takes 2 iters due to Chrome SIGBUS instability | | 7 | vm-control-libvirt-spice | ~40-50 min | libvirt domain (cloud-init, qemu-guest-agent) + 9 scenarios | | 8 | nested-deployment-chain | ~40 min | socket-passthrough nested pods + 9 cross-hop scenarios |

Total wall-clock for a fresh charly eval run default: typically 5-7 hours on dev-class hardware. Set expectations accordingly — this is NOT a quick smoke. For a fast canary that exercises the loader + scoring paths end-to-end without real AI work, run charly eval run scaffolding-selftest (~5 min, single recipe composition-import-selftest).

The plateau bound is score.<name>.plateau_iteration (default 3). In progressive mode, plateau ENDS the whole run — it does not advance to the next phase, so an AI that stalls on one phase doesn't collect easy wins on later phases it never engaged with.

Issue 52328 (claude-code Bash + pgrep self-match deadlock) defense

The Claude Code runtime has a known issue documented in this project's own .eval/ISSUE-claude-code-bash-pgrep-self-match-deadlock.md: when the AI uses Bash{run_in_background: true} with a until ! pgrep -f "<pattern>" poll-loop, the literal pattern string is preserved in the spawned bash's argv (because the wrapper uses bash -c '… eval "<command>" …'). pgrep -f then matches itself, the loop spins forever, the parent claude -p process can't exit because it waits for all background bash subprocesses, and the harness orchestrator deadlocks waiting on claude.

The harness defends against this pattern between iterations: at every iter end (in eval_loop.go's commitIterationBestEffort), the orchestrator runs podman exec charly-eval-sandbox pkill -f "while true.*sleep [0-9]+" to clean up orphaned keepalive bashes left dangling by the AI's TaskOutput-timeout pattern, and logs the kill count.

Workaround for the AI inside the loop: prefer kill -0 <PID>-style checks (track the original PID via cmd & echo $!) over pgrep -f when the pattern would also match the wrapping bash. The harness defense is a safety net, not a license to use the broken pattern deliberately.

Related skills

Live-container probe verbs under charly eval — /charly-eval:cdp, /charly-eval:wl, /charly-eval:dbus, /charly-eval:vnc, /charly-build:charly-mcp-cmd, /charly-eval:record, /charly-eval:spice, /charly-eval:libvirt, /charly-kubernetes:eval-k8s are dispatched as charly eval cdp|wl|dbus|vnc|mcp|record|spice|libvirt|k8s. See the Subcommands section above.
/charly-image:layer — layer authoring; eval: field is part of every charly.yml.
/charly-image:image — image-level eval: and deploy_eval: at composition time.
/charly-core:deploy — local charly.yml overlay rules and the eval: merge.
/charly-build:validate — static schema + cross-scope variable checks; the first gate before charly box build.
/charly-build:build — how eval entries are embedded into the OCI label at build time; LABELs-at-end cache behavior.
/charly-build:inspect — view the merged 3-section eval structure as JSON.
/charly-build:migrate — charly migrate brings legacy configs up to the current schema (the eval: block in charly.yml).
/charly-internals:go — implementation map: evalspec.go, evalvars.go, evalrun.go, evalrun_verbs.go, evalrun_charly_verbs.go, evalcollect.go, eval_cmd.go, eval_runner_cmd.go, eval_loop.go, eval_runner_live.go, eval_watchdog.go, validate_eval.go, mcp.go, mcp_client.go, plus the LabelEval constant in labels.go.
/charly-internals:generate-source — how LabelEval is written into the Containerfile via writeJSONLabel, and why the LABEL block lives at the end of the final stage.
/charly-internals:capabilities — the LabelEval three-section OCI label is part of the same capability contract as LabelService.
/charly-internals:agents — the sub-agents (eval-bed-runner, deploy-verifier) and dynamic workflows (/verify-beds, /audit-deploy-configs) that DRIVE these beds, and the R10/disposable/ paste-proof rules that bind any agent or workflow running charly eval run.

When to Use This Skill

MUST be invoked before authoring, running, or debugging eval behavior at any level. Triggers: charly eval (any subcommand), charly eval run, charly eval box, charly eval live, the eval: / deploy_eval: YAML fields, the ai.opencharly.eval OCI label, kind: agent / kind: recipe / kind: score / kind: eval in the eval: block, charly eval run <bed> / --all-beds (kind:eval R10 beds), or any check verb by name (file/port/http/command/package/service/cdp/wl/dbus/vnc/mcp/...).

Eval — unified declarative + AI-iteration evaluation

kind: eval beds + `charly eval run <bed>` — config-driven R10 acceptance

charly box build <image> --dev-local-pkg — build the test artifact (pod beds only). The eval-bed runner sets --dev-local-pkg AUTOMATICALLY, so a bed bakes the in-development charly toolchain (any localpkg: candy built from local source) — never a stale published release. A bed thus always tests the code under development; a production box build downloads the released package. See /charly-tools:charly "Eval-vs-production binary source" + /charly-internals:install-plan "Eval-vs-production charly toolchain".
charly eval box <image> — box-section + baked candy-section probes.
charly deploy add <bed> <ref> — apply the bed (or charly vm create for vm beds).
(pod beds) charly config <bed> + charly start <bed>.
charly eval live <bed> — full three-section live probe pass.
charly update <bed> — fresh-rebuild re-verification (R10 acceptance gate).
charly remove <bed> (or charly vm destroy) — leave the host clean.

Flag discipline — the score/bed config IS the test specification

The ecosystem's `kind: eval` beds

Approximate wall-clock (10-CPU 32-GB reference host)

Prereq for the vm bed

eval-k3s-vm depends on the libvirt user-session daemon. Enable once:

systemctl --user enable --now virtqemud.service     # libvirt >= 8 (modular)
# OR (older monolithic):
systemctl --user enable --now libvirtd.service

(The host libvirt daemon is host infrastructure, not a charly-managed resource — enabling it via systemctl sits outside the charly-CLI-only mandate's scope.)

Why this is config-driven

R10 gate by change class — pick the gate that exercises the change

The 10 Testing Standards (READ FIRST — referenced by CLAUDE.md R1–R10)

Prove every HIGH-RISK assumption BEFORE you edit (RDD — Risk Driven Development) — the proactive bookend to Standard 10's fresh-rebuild gate. Low-risk orientation ("what does layer X do") is a skill lookup (R0, zero risk); every high-risk assumption — including any a skill or the code merely asserts, and above all whether this layer composition at its latest available versions builds / deploys / runs TOGETHER — is proven on a disposable: true bed FIRST (charly eval it). Never accept docs or code as ground truth for a high-risk decision; if the bed disagrees with a skill, the skill is stale — fix it. Standard 0 (validate forward, riskiest-first) and Standard 10 (re-verify on a fresh rebuild) are the two ends of the same loop.
Build a real artifact (R7) — charly box build <image> / go build / charly vm build <vm>. Not just go test. Not just charly box generate.
Verify the emitted artifact's content (R8) — grep -c supervisord-conf .build/<image>/Containerfile for any image that uses supervisord; charly eval libvirt domain-xml <vm> for a VM.
Verify critical OCI / capability labels post-build (R8) — charly box labels <ref> --format init (or any ai.opencharly.<key> shorthand) prints the BUILT ref's label and exits non-zero when absent; charly box labels <ref> lists the whole capability contract (/charly-internals:capabilities). Empty / missing label → the detection path silently returned nil → regression.
Deploy to a DISPOSABLE target (R10) — NEVER experiment on a resource that doesn't carry disposable: true. If no suitable disposable target exists, create one first (charly deploy add <name> <ref> --disposable or mark a VM in vm.yml and charly vm create). The setup is part of the task. On a disposable target: charly update <name> (unattended). On anything else: confirm with the user before any irreversible destroy — EXCEPT preempting a declared-preemptible: holder, which is standing-authorized (reversible: graceful stop + guaranteed restore).
Target must reach steady-state — charly status <image> → running; charly eval libvirt info <vm> → state running; SPICE socket file exists and accepts a handshake. If the service start-limit is hit, the container is crashing — read charly logs <image> and reproduce in a disposable shell (charly shell <image>, running the service command manually).
Run the declarative test suite — charly eval live <image> full three-section pass against the live container (or --uri / --host remote equivalent for a remote target).
Verify the deployed binary is the one you built (R9) — charly version on the target matches the expected CalVer, and charly status <image> (detail Image ref: line; JSON image_ref) shows the running container's image ref carries THIS build's CalVer tag, not the prior one. Source-only changes (Syncthing, git push) do NOT update the deployed binary; you must build AND deploy on the target host.
Verify runtime deps are installed via package management (R9) — on the HOST: charly doctor (dependency status), or the host package manager directly (rpm -q <pkg> / pacman -Q <pkg> — host packages are not charly-managed resources, so the package manager IS their interface); inside a container: charly cmd <box> 'rpm -q <pkg>'. Manual installs do NOT count — they won't survive a fresh install on a synced host. Every runtime dep must live in setup.sh + pkg/arch/PKGBUILD.
Leave the target healthy, not paused/errored — the final charly status (and charly eval libvirt info <vm> for a VM) is healthy. If the target is in a broken state during exploration, charly update it back to the committed config before continuing — never layer experiments on broken state.
Re-verify on a FRESH rebuild after committing the source-level fix (R10) — charly update <disposable-target> one more time from clean, with the new source applied. Run standards 1–9 again against this fresh rebuild. THIS IS THE ACCEPTANCE GATE. A fix that works on a hand-patched target but not on a fresh rebuild is a regression waiting for the next unrelated rebuild to wipe your patch. Paste BOTH the exploratory-pass output AND the fresh-rebuild-pass output into the conversation — the user sees both.

`charly eval live parent.child` reaches the actual leaf

charly eval live eval-vm                            # → SSHExecutor (1 hop)
charly eval live eval-vm.inner                      # → SSH + podman exec charly-eval-vm_inner (2 hops)
charly eval live eval-vm.inner.deeper               # → SSH + 2× podman exec (3 hops)

Same chain primitive (NestedExecutor) used everywhere — charly deploy add, charly eval live, charly eval run (AI iteration scoring).

Specific anti-patterns observed and banned

"Unit tests pass → cutover done" — no. Build + deploy + run + test, every time.
"Retested after update → still passing!" but the pre-update test was against the old running image and the post-update container failed to come up — see the is not running error and conclude the update broke it, not that "tests pass in aggregate."
"Service start failed, probably a transient" — no. A dependency job for X failed + immediate exit is a real error. Read charly logs <image>; charly shell <image> reproduces it in a disposable container.
"Lifecycle: dev implies disposable" — no. disposable: true is the ONE authorization for autonomous destroy + rebuild. Lifecycle tags are human-facing metadata; they do not authorize anything. See /charly-internals:disposable.
"This is a dev box so I can just nuke it" — no. The only authorization for autonomous destroy is the explicit disposable: true field on a specific deploy. Everything else requires user confirmation, regardless of hostname.
"I tested on the VM I've been patching all afternoon, looks fine" — incomplete. Run charly update <disposable-target> once more from clean and re-verify before claiming success. Without the fresh-rebuild re-verification, your "fix" may be latent on hand-patched state.
"I'll test it later / Phase 2" — no. If the plan said "clean cutover in one PR", don't invent a Phase 2.
"The skill/code says so, so I don't need to test it" — for a HIGH-RISK assumption, no: docs drift and code has bugs; prove it on a disposable bed FIRST (Standard 0). Conversely, do NOT bed-test a LOW-risk orientation fact a skill already states — that's wasted effort. Risk, not documentation status, is the trigger.

Overview

Quick Reference

Exit codes

charly eval uses a goss/pytest-style three-way exit convention so automation (and charly eval run <bed>) can tell "the thing under test is broken" apart from "the eval couldn't run":

Image preflight (host-target runs only)

For each discovered image, the algorithm is:

LocalImageExists(podman, ref) — short-circuit if already present. Idempotent on re-runs.
charly box pull <ref> — preferred path. Resolves short names via cfg.Images[<name>], full registry refs pass-through, remote @github.com/... refs walk through ResolveRemoteImage.
charly box build <name> — fallback when pull fails AND the identifier is a short name resolvable via the project's charly.yml. Build fallback is local-only; for any non-short identifier that fails to pull, the preflight aborts with an actionable error.

Failures abort the eval BEFORE any scenario runs, so operators see problems early rather than mid-scenario.

Three primary modes (`charly eval box` / `live` / `run`)

The surface is three orthogonal verbs, each named for what it evaluates:

charly eval box <image> — evaluates the image artifact in isolation. Spawns a disposable container via podman run --rm <image-ref> (ImageExecutor). Only eval:-block checks at scope: build run; deploy-scope checks are skipped with a clear message. The right choice for build-time invariants ("did this binary land at this path?", "is this package installed?") where a running service would only add noise.
charly eval live <name> — evaluates a running deployment (pod / vm / host / k8s; auto-resolved by <name> against charly.yml). Uses ContainerExecutor / SSHExecutor / NestedExecutor (for dotted-path children) + ResolveEvalVarsRuntime so deploy-scope checks see real supervisord state, real HOST_PORT:<N> mappings (including host-networked containers via HostConfig.NetworkMode detection), and real in-container env. Same dispatcher that powers charly eval live parent.child for pod-in-vm topologies.
charly eval run <score> — drives an AI runner through plateau-bounded iterations against a kind: score. See "AI iteration loop semantics" below.

The mode is explicit in the verb; there is no autodetect or implicit fallback. Choose the mode by picking the right verb.

The banner after charly eval box reports the image ref:

Image: ghcr.io/overthinkos/fedora-coder:latest

Agent Driven Evaluation (ADE) — `charly box/eval feature run` + the agent grader

The six-stage loop

The BIND contract — a step binds to its verifier BY SHAPE

description:
  scenario:
    - name: dashboard-loads
      step:
        - then: the page title is Dashboard
          cdp: eval
          expression: document.title        # DETERMINISTIC — embedded check verb
        - then: the chart looks populated, not empty   # PROSE-ONLY — agent-graded

The two run verbs

charly box feature run <image> — BUILD scope. A disposable container (podman run --rm per check), deterministic steps only. A prose-only step has no stable target to probe, so it reports unbound (advisory skip; --strict fails it). Resolves the image against local storage (never charly.yml), like charly eval box.
charly eval feature run <deployment> — DEPLOY scope. Against a running image-backed (pod) deployment; deterministic steps run their checks and prose-only steps bind to the agent grader (unless --no-agent). Flags: -i/--instance, --format text|json|tap|junit, --tag <expr>, --agent <name>, --timeout <dur>, --no-agent, --strict.

Exit codes follow the 0/1/2 convention (a scenario fail → exit 2 via EvalFailedError; an infra error → 1). No scenarios baked → a no-op pass.

The agent grader contract

Worked example (end to end)

# 1. Specify two scenarios on the layer that provides the behaviour.
charly candy add-scenario web-layer dashboard-loads \
  --then "the dashboard renders without a 500"          # prose → agent-graded
$EDITOR candy/web-layer/charly.yml                        # add a deterministic step with cdp: eval

# 2. See what's still prose-only (the authoring gaps).
charly feature pending layer:web-layer

# 3. Build → run build-scope acceptance (deterministic steps).
charly box build web
charly box feature run web

# 4. Deploy, then run deploy-scope acceptance with the agent grader.
charly start web
charly eval feature run web                # prose steps agent-graded against the live pod
charly eval feature run web --no-agent     # deterministic-only (CI): prose reports unbound

Subcommands: live-container verbs

Authoring: the `eval:` list

Every check is a list entry with exactly one verb discriminator plus shared modifiers and verb-specific attributes. This mirrors the task: pattern in charly.yml.

Gold-standard pattern (redis candy)

# candy/redis/charly.yml
eval:
  # Build-scope — run inside the built image via `podman run --rm`.
  - id: redis-binary
    file: /usr/bin/redis-server
    exists: true
  - id: redis-cli-binary
    file: /usr/bin/redis-cli
    exists: true
  - id: redis-package
    package: valkey-compat-redis     # see "Authoring Gotchas" on Fedora 43 renames
    installed: true

  # Deploy-scope — HOST_PORT:N resolves to the effective port mapping, so
  # the same test works whether the deploy publishes 6379:6379 or 16379:6379.
  - id: redis-responds
    scope: deploy
    command: redis-cli -h 127.0.0.1 -p ${HOST_PORT:6379} ping
    stdout: PONG
    in_container: false    # run redis-cli from the host
  - id: redis-port-open
    scope: deploy
    addr: 127.0.0.1:${HOST_PORT:6379}
    reachable: true

The six entries cover: binary existence, package identity, a functional live probe, and raw TCP reachability. Adopt this shape for any primary service layer.

Cross-distro package names (`package_map:`)

# candy/sshd/charly.yml — cross-distro authoring
- id: openssh-server-package
  package: openssh-server              # Fedora / Debian default
  package_map:
    arch: openssh                 # Arch ships the metapackage as 'openssh'
    fedora: openssh-server
    fedora:43: openssh-server          # explicit version tag — matches before 'fedora'
  installed: true

Verb catalog (goss-parity feature set)

Shared modifiers

`exclude_distros:` worked example

- id: fastfetch-binary
  file: /usr/bin/fastfetch
  exists: true
  exclude_distros:
    - ubuntu:24.04

Matcher forms

Any MatcherList attribute (stdout, stderr, body, headers, contains, opts, value) accepts three shapes — picked by yaml.v3 at parse time:

stdout: PONG                         # scalar  → [{equals: PONG}]
stdout: [PONG, READY]                # list    → [{equals: PONG}, {equals: READY}]
stdout:
  - equals: PONG                     # operator map
  - contains: ["ready", "ok"]
  - matches: "^[A-Z]+$"
  - not_contains: "error"

Supported operators: equals, not_equals, contains, not_contains, matches, not_matches, plus lt/le/gt/ge (numeric, added as verbs need them).

status: is NOT a MatcherList — it's a plain int on the http verb. One code per test. See Authoring Gotcha #2.

Live-container verb catalog: `cdp`, `wl`, `dbus`, `vnc`, `mcp`

Method allowlist — `cdp` (15 methods + 6 SPA-nested)

- id: cdp-up
  cdp: status
  stdout:
    equals: ok

- id: cdp-page-title
  cdp: eval
  tab: "1"
  expression: "document.title"
  stdout: "Dashboard"

- id: cdp-screenshot-valid
  cdp: screenshot
  tab: "1"
  artifact: /tmp/cdp.png
  artifact_min_bytes: 10000       # PNG must be non-empty

Method allowlist — `wl` (22 top-level + 4 overlay + 12 sway)

- id: wl-desktop-captured
  wl: screenshot
  artifact: /tmp/wl.png
  artifact_min_bytes: 10000

- id: wl-has-windows
  wl: toplevel
  stdout:
    matches: "."        # at least one line of output

- id: wl-sway-has-workspace-1
  wl: sway-workspaces
  stdout:
    contains: '"name":"1"'

Method allowlist — `dbus` (4 methods)

Queries: list, call, introspect. Actions: notify.

- id: dbus-notifications-registered
  dbus: list
  stdout:
    contains: "org.freedesktop.Notifications"

- id: dbus-get-capabilities
  dbus: call
  dest: org.freedesktop.Notifications
  path: /org/freedesktop/Notifications
  method: org.freedesktop.Notifications.GetCapabilities
  args: []
  stdout:
    contains: body

Method allowlist — `vnc` (8 methods)

Queries: status, screenshot, rfb. Actions: click, mouse, type, key, passwd.

- id: vnc-up
  vnc: status
  stdout:
    equals: ok

- id: vnc-framebuffer-captured
  vnc: screenshot
  artifact: /tmp/vnc.png
  artifact_min_bytes: 5000

Method allowlist — `mcp` (7 methods)

Queries: ping, servers, list-tools, list-resources, list-prompts, read. Actions: call.

Transport dispatch: transport: http (or empty) → Streamable HTTP; transport: sse → SSE. Anything else is rejected at dial time.

Disambiguation: if an image declares multiple mcp_provide entries, add mcp_name: <server> on the check; otherwise the single entry auto-picks.

- id: mcp-ping
  scope: deploy
  mcp: ping
  timeout: 10s                # optional; default 30s

- id: mcp-list-tools
  scope: deploy
  mcp: list-tools
  stdout:
    - contains: insert_cell   # plaintext "name\tdescription" per line
    - contains: execute_cell

- id: mcp-call-tool
  scope: deploy
  mcp: call
  tool: list_notebooks        # required
  input: "{}"                 # optional JSON arg blob
  exit_status: 0              # assert no IsError

- id: mcp-read-resource
  scope: deploy
  mcp: read
  uri: file:///tmp/example.txt
  stdout:
    - matches: "."            # non-empty body

Ad-hoc CLI equivalents (each leaf also supports --json for programmatic use and --name <server> for multi-server images):

charly eval mcp ping       <image> [-i <instance>] [--name <server>]
charly eval mcp servers    <image> [-i <instance>]
charly eval mcp list-tools <image> [-i <instance>] [--name <server>]
charly eval mcp call       <image> <tool> '<json-args>' [-i <instance>]
charly eval mcp read       <image> <uri>            [-i <instance>]

Artifact-validation modifiers

artifact_min_bytes: <N> — assert the file is at least N bytes after the run. Guards against zero-byte files. Cheap (os.Stat only).
artifact_min_dimensions: WxH — assert decoded image width and height are each at least the given values. Reads only the PNG/JPEG header via image.DecodeConfig, so essentially free. Catches "the screenshot ran but the compositor produced 320×240 instead of the expected 1920×1080".
artifact_not_uniform: true — assert the image is not uniformly one color. Decodes the full image and samples 100 pixels at deterministic stride positions; fails if every sampled pixel has the same RGBA. Catches the failure mode artifact_min_bytes is blind to: a 100KB all-black PNG passes the byte check but fails this one. Use on every screenshot probe where "the image has real content" matters.
artifact_min_cast_events: <N> — for asciinema .cast artifacts (e.g. record: stop in terminal mode): validates the first line is the asciinema v2 header and counts at least N event lines after it. Catches "the recording started and stopped but captured nothing because nothing was typed".

# Combined screenshot validity assertion — bytes + dimensions + content.
- id: cdp-screenshot-real
  cdp: screenshot
  tab: "1"
  artifact: /tmp/cdp.png
  artifact_min_bytes: 5000
  artifact_min_dimensions: 800x600
  artifact_not_uniform: true

# Recording validity — bytes + event count.
- id: record-cast-has-events
  record: stop
  artifact: /tmp/session.cast
  artifact_min_bytes: 200
  artifact_min_cast_events: 5

Gotcha — stale container-baked `charly` binary

Authoring Gotchas (learned the hard way)

These are the non-obvious issues that surface only when you run the tests against real containers. Skim before writing new checks.

1. Host-side deploy tests: `127.0.0.1:${HOST_PORT:N}`, NOT `${CONTAINER_IP}:${HOST_PORT:N}`

# ❌ WRONG — 10.89.0.x:HOST_PORT is unreachable from either side.
addr: ${CONTAINER_IP}:${HOST_PORT:8080}

# ✓ RIGHT — host-side access via the forwarded port.
addr: 127.0.0.1:${HOST_PORT:8080}

# ✓ ALSO RIGHT — in-container, using the container port directly.
command: curl -fsS http://127.0.0.1:8080/health
in_container: true

Use ${CONTAINER_IP} only from inside another container in the same pod.

2. `http` status is a single int — no lists

# ❌ WRONG — yaml unmarshals a list to int fails with "cannot unmarshal !!seq"
status: [200, 301, 302]

# ✓ RIGHT — pick the most-likely code. Or author multiple tests.
status: 200

For endpoints that return 400/401 when probed without auth (MCP servers, API endpoints), status: 400 is a legitimate "endpoint exists and refuses empty input" check.

3. `pgrep`, `ss`, `nc` are NOT in minimal images

The process:, port: listening, and nc-based command tests silently fail on images that skip procps-ng / iproute / ncat. Portable alternatives:

4. `supervisorctl status` exits 3 when ANY program is non-RUNNING

Layers like hermes declare autostart=false programs (e.g. hermes-whatsapp), which leaves them STOPPED. supervisorctl status returns exit 3 — fails exit_status: 0 checks.

# ❌ BRITTLE — fails if any program is autostart=false.
command: supervisorctl status
exit_status: 0

# ✓ ROBUST — `pid` asks for supervisord's own PID, 0 iff the socket
# responds.
command: supervisorctl pid
exit_status: 0
in_container: true

5. Know which stream a `--version`-style command writes to

charly version writes to stdout (via fmt.Println) — the canonical candy/charly/charly.yml test asserts a stdout: matcher.

- id: charly-version
  command: /usr/local/bin/charly version
  exit_status: 0
  stdout:
    - matches: "[0-9]{4}\\.[0-9]+"

6. Drop `$` anchor in `matches:` regexes on command output

Command stdout typically ends with \n. Go's regexp treats $ as end-of-input (not end-of-line by default), so \.ipynb$ won't match path.ipynb\n. Drop the anchor:

# ❌ WRONG — fails because trailing newline
stdout:
  - matches: "\\.ipynb$"

# ✓ RIGHT — substring match is usually what you want
stdout:
  - matches: "\\.ipynb"

7. `${VAR:-default}` is NOT supported by charly's resolver

charly's variable resolver accepts ${IDENT} only — no bash-style defaults, no parameter expansion. An unresolved variable is reported as a unresolved variables: <name> SKIP.

# ❌ Parsed as a single var name; reported as unresolved if not set
command: psql -U ${POSTGRES_USER:-postgres}

# ✓ Plain env-var reference; resolves from the container's environment
# at deploy scope. For a true default, set it in the layer's `env:`.
command: psql -U ${POSTGRES_USER}

8. Fedora 43 package renames — query the installed name, not the dnf request

The package: verb does exact string equality against the rpm database, not the dnf install request. When dnf resolves a virtual provides, the test must name the actual installed package.

Always verify with charly shell <image> -c 'rpm -qf /path/to/binary'.

9. Volume paths don't exist at build-scope

# ❌ Fails under `charly eval box` — workspace isn't mounted
- id: workspace-dir
  file: /workspace
  filetype: directory

# ✓ Scope to deploy so it runs only against live containers
- id: workspace-dir
  scope: deploy
  file: /workspace
  filetype: directory

10. Disposable tests run as the image's default USER — not root

# ❌ Reports "exists=false" even though the file IS there (0750 dir)
- id: sudoers-charly-user
  file: /etc/sudoers.d/charly-user
  exists: true

# ✓ Verify the semantic (can I sudo?) instead of the file bit
- id: sudoers-charly-user
  command: sudo -n -l
  exit_status: 0
  stdout:
    - contains: "NOPASSWD"

11. Bootc images keep USER=root — tests must cover both modes

The fix: drop into user explicitly when running as root, stay as-is otherwise. Use runuser -u user -- <cmd> as the bridge — it's portable across util-linux versions and preserves stdout:

# Dual-mode: container (USER=1000) AND bootc (USER=root) both pass
- id: sudoers-charly-user
  command: |
    if [ "$(id -u)" = "0" ]; then
      runuser -u user -- sudo -n -l
    else
      sudo -n -l
    fi
  exit_status: 0
  stdout:
    - contains: "NOPASSWD"

12. `charly eval box <short-name>` is ambiguous with multiple CalVer tags — use the full registry ref

charly: error: ambiguous short name "openclaw-desktop" in local storage;
           candidates: ghcr.io/overthinkos/openclaw-desktop:latest,
           ghcr.io/overthinkos/openclaw-desktop:2026.109.1418,
           ... Re-run with a full ref.

Use the fully-qualified registry ref:

charly eval box ghcr.io/overthinkos/openclaw-desktop:latest

Three levels, three sections

Verb routing (which executor runs each check)

in-container `command:` stdin is guarded

Runtime variables

charly eval live resolves these via podman inspect on the running container before any check executes. ${NAME:arg} is parameterized form.

Build-scope checks may not reference deploy-scope variables — the validator flags this at charly box validate time.

No bash-style defaults: ${VAR:-fallback} is unsupported (see Authoring Gotcha #7). Plain ${VAR} only.

Cross-deployment probing — `on:` driver, `peer:` siblings, `${PEER_*}`

Three pieces compose it:

on: <driver> (the step modifier) — DISPATCH this probe against a named DRIVER deployment instead of the subject. For a cdp:/vnc:/mcp: check it connects to the driver's CDP/VNC/MCP endpoint (charly eval cdp <method> <driver>); for a command: check it runs in the driver's venue. Works in charly eval live, kind: eval beds, and recipes (the driver's own ${HOST_PORT}/${CONTAINER_IP} resolve too).
peer: siblings — bring the driver up ALONGSIDE the subject on the shared charly network (see /charly-core:deploy "Sibling peers"). A kind: eval bed (or a kind: deploy) declares the driver under peer:; it is brought up by the same deploy verbs and is NEVER eval-live'd (an instrument, not a subject).
${PEER_*} address variables — let the driven probe TARGET the subject:
- ${PEER_HOST:<name>} → the deployment's container DNS name on the shared charly net (charly-<name>; also verifies it is running). The pod→pod address: http://${PEER_HOST:<subject>}:8080. Reference the subject by its OWN deploy name (for a bed, that's the bed name — the container is charly-<bedname>).
- ${PEER_ENDPOINT:<name>:<port>} → a host-reachable 127.0.0.1:NNNN for that deployment's <port> (container published port, or an ssh -L forward for a VM/host subject). The host-vantage address a local/host driver uses to reach a pod/VM subject. (Both are runtime-only — a build-scope check can't use them.)
An unresolvable ${PEER_*} — the peer/subject is unreachable (not running, bad port, VM ssh-forward refused) — FAILS the referencing check, never SKIPs it. A skip on an unreachable dependency is a fake pass: it would let a bed go green even though the cross-deployment probe never reached its target. (Legitimate skips — skip: true, exclude_distros, a deploy-only var under build scope — are unaffected; only an unresolved mandatory ${PEER_*} is treated as a failure.)

Worked example — a Chrome pod CDP-probes a web-server pod (pod→pod)

# charly.yml eval: block — kind:eval bed: web SUBJECT (root) + chrome DRIVER (peer)
eval-cross-pod-cdp:
    target: pod
    box: web                       # the SUBJECT under test (nginx fixture on :8080)
    disposable: true
    # No port: — every inherited container port auto-allocates a free 127.0.0.1
    # host port (conflict-free across concurrent beds). Pin with `port:
    # ["H:C"]` only when a fixed host port is needed.
    peer:
        chrome:                    # the DRIVER, a sibling on the charly net
            target: pod
            box: chrome-headless   # headless Chrome + cdp-proxy (publishes 9222)
    eval:
        - id: web-fixture-up       # the subject serves its marker
          scope: deploy
          http: http://127.0.0.1:${HOST_PORT:8080}/
          status: 200
          body: [{contains: charly-fixture-web-content-marker}]
        - id: cdp-open-web-subject # DRIVE: chrome navigates to the subject over the charly net
          scope: deploy
          cdp: open
          on: chrome
          url: http://${PEER_HOST:eval-cross-pod-cdp}:8080
          eventually: 45s
          retry_interval: 3s
        - id: cdp-web-fixture-rendered  # ASSERT: the rendered page carries the marker
          scope: deploy
          cdp: text
          on: chrome
          tab: "1"                 # the first page (cdp `open` lands on tab 1)
          eventually: 30s
          retry_interval: 3s
          stdout: [{contains: charly-fixture-web-content-marker}]

Cross-kind: a host driver reaching a pod OR a VM subject (`${PEER_ENDPOINT}`)

Deploy.yml overlay rules

A local charly.yml can contribute its own eval: list per image. Merge rules applied by charly eval live:

Local entries with an id: matching a baked entry replace that entry.
Entries without a matching id: are appended.
To disable a baked check, reference it with id: and skip: true.

# ~/.config/charly/charly.yml
box:
  redis-ml:
    eval:
      - id: redis-responds                  # overrides image's baked check
        command: redis-cli -h 127.0.0.1 -p 16379 ping
        stdout: PONG
        in_container: false
      - id: external-tunnel                 # appended (no baked match)
        http: https://redis.tailnet.ts.net/health
        status: 200
      - id: old-probe                       # disable a legacy baked check
        skip: true

Typical workflow

# 1. Author tests in candy/<name>/charly.yml.
# 2. Validate schema + references.
charly box validate

# 3. Build the image (tests are auto-embedded as LabelEval).
#    LABELs are emitted LAST in the final stage — a test edit rebuilds
#    in ~2 sec (cache hits every upstream RUN/COPY).
charly box build redis-ml

# 4. Run against a disposable container (build-scope checks only).
charly eval box redis-ml

# 5. Start the service and test end-to-end.
charly start redis-ml
charly eval live redis-ml

# 6. Override a baked deploy check via local charly.yml, re-test.
$EDITOR ~/.config/charly/charly.yml    # add tests: entry with matching id
charly eval live redis-ml

Coverage snapshot (7 currently-tested images)

Reference numbers from the last end-to-end session:

Total: 376 checks across 7 images (0 failing).

Realistic run-time expectations

charly eval run default is a multi-phase AI-iteration loop. Representative per-phase durations (measured solving all 92/92 across 9 iterations on a disposable: true eval-sandbox):

Issue 52328 (claude-code Bash + pgrep self-match deadlock) defense

Related skills

Live-container probe verbs under charly eval — /charly-eval:cdp, /charly-eval:wl, /charly-eval:dbus, /charly-eval:vnc, /charly-build:charly-mcp-cmd, /charly-eval:record, /charly-eval:spice, /charly-eval:libvirt, /charly-kubernetes:eval-k8s are dispatched as charly eval cdp|wl|dbus|vnc|mcp|record|spice|libvirt|k8s. See the Subcommands section above.
/charly-image:layer — layer authoring; eval: field is part of every charly.yml.
/charly-image:image — image-level eval: and deploy_eval: at composition time.
/charly-core:deploy — local charly.yml overlay rules and the eval: merge.
/charly-build:validate — static schema + cross-scope variable checks; the first gate before charly box build.
/charly-build:build — how eval entries are embedded into the OCI label at build time; LABELs-at-end cache behavior.
/charly-build:inspect — view the merged 3-section eval structure as JSON.
/charly-build:migrate — charly migrate brings legacy configs up to the current schema (the eval: block in charly.yml).
/charly-internals:go — implementation map: evalspec.go, evalvars.go, evalrun.go, evalrun_verbs.go, evalrun_charly_verbs.go, evalcollect.go, eval_cmd.go, eval_runner_cmd.go, eval_loop.go, eval_runner_live.go, eval_watchdog.go, validate_eval.go, mcp.go, mcp_client.go, plus the LabelEval constant in labels.go.
/charly-internals:generate-source — how LabelEval is written into the Containerfile via writeJSONLabel, and why the LABEL block lives at the end of the final stage.
/charly-internals:capabilities — the LabelEval three-section OCI label is part of the same capability contract as LabelService.
/charly-internals:agents — the sub-agents (eval-bed-runner, deploy-verifier) and dynamic workflows (/verify-beds, /audit-deploy-configs) that DRIVE these beds, and the R10/disposable/ paste-proof rules that bind any agent or workflow running charly eval run.

Adoption

overthinkos/eval

$ install --global

Security Scan Results

SKILL.md

Eval — unified declarative + AI-iteration evaluation

kind: eval beds + charly eval run <bed> — config-driven R10 acceptance

Flag discipline — the score/bed config IS the test specification

The ecosystem's kind: eval beds

Approximate wall-clock (10-CPU 32-GB reference host)

Prereq for the vm bed

Why this is config-driven

R10 gate by change class — pick the gate that exercises the change

The 10 Testing Standards (READ FIRST — referenced by CLAUDE.md R1–R10)

charly eval live parent.child reaches the actual leaf

Specific anti-patterns observed and banned

Overview

Quick Reference

Exit codes

Image preflight (host-target runs only)

Three primary modes (charly eval box / live / run)

Agent Driven Evaluation (ADE) — charly box/eval feature run + the agent grader

The six-stage loop

The BIND contract — a step binds to its verifier BY SHAPE

The two run verbs

The agent grader contract

Worked example (end to end)

Subcommands: live-container verbs

Authoring: the eval: list

Gold-standard pattern (redis candy)

Cross-distro package names (package_map:)

Verb catalog (goss-parity feature set)

Shared modifiers

exclude_distros: worked example

Matcher forms

Live-container verb catalog: cdp, wl, dbus, vnc, mcp

Method allowlist — cdp (15 methods + 6 SPA-nested)

Method allowlist — wl (22 top-level + 4 overlay + 12 sway)

Method allowlist — dbus (4 methods)

Method allowlist — vnc (8 methods)

Method allowlist — mcp (7 methods)

Artifact-validation modifiers

Gotcha — stale container-baked charly binary

Authoring Gotchas (learned the hard way)

1. Host-side deploy tests: 127.0.0.1:${HOST_PORT:N}, NOT ${CONTAINER_IP}:${HOST_PORT:N}

2. http status is a single int — no lists

3. pgrep, ss, nc are NOT in minimal images

4. supervisorctl status exits 3 when ANY program is non-RUNNING

5. Know which stream a --version-style command writes to

6. Drop $ anchor in matches: regexes on command output

7. ${VAR:-default} is NOT supported by charly's resolver

8. Fedora 43 package renames — query the installed name, not the dnf request

9. Volume paths don't exist at build-scope

10. Disposable tests run as the image's default USER — not root

11. Bootc images keep USER=root — tests must cover both modes

12. charly eval box <short-name> is ambiguous with multiple CalVer tags — use the full registry ref

Three levels, three sections

Verb routing (which executor runs each check)

in-container command: stdin is guarded

Runtime variables

Cross-deployment probing — on: driver, peer: siblings, ${PEER_*}

Worked example — a Chrome pod CDP-probes a web-server pod (pod→pod)

Cross-kind: a host driver reaching a pod OR a VM subject (${PEER_ENDPOINT})

Deploy.yml overlay rules

Typical workflow

Coverage snapshot (7 currently-tested images)

Realistic run-time expectations

Issue 52328 (claude-code Bash + pgrep self-match deadlock) defense

Related skills

When to Use This Skill

Related Skills

overthinkos/plugin

overthinkos/cue

overthinkos/egress

overthinkos/check-k8s

overthinkos/eval

$ install --global

Security Scan Results

SKILL.md

Eval — unified declarative + AI-iteration evaluation

kind: eval beds + `charly eval run <bed>` — config-driven R10 acceptance

The ecosystem's `kind: eval` beds

`charly eval live parent.child` reaches the actual leaf

Three primary modes (`charly eval box` / `live` / `run`)

Agent Driven Evaluation (ADE) — `charly box/eval feature run` + the agent grader

Authoring: the `eval:` list

Cross-distro package names (`package_map:`)

`exclude_distros:` worked example

Live-container verb catalog: `cdp`, `wl`, `dbus`, `vnc`, `mcp`

Method allowlist — `cdp` (15 methods + 6 SPA-nested)

Method allowlist — `wl` (22 top-level + 4 overlay + 12 sway)

Method allowlist — `dbus` (4 methods)

Method allowlist — `vnc` (8 methods)

Method allowlist — `mcp` (7 methods)

Gotcha — stale container-baked `charly` binary

1. Host-side deploy tests: `127.0.0.1:${HOST_PORT:N}`, NOT `${CONTAINER_IP}:${HOST_PORT:N}`

2. `http` status is a single int — no lists

3. `pgrep`, `ss`, `nc` are NOT in minimal images

4. `supervisorctl status` exits 3 when ANY program is non-RUNNING

5. Know which stream a `--version`-style command writes to

6. Drop `$` anchor in `matches:` regexes on command output

7. `${VAR:-default}` is NOT supported by charly's resolver

12. `charly eval box <short-name>` is ambiguous with multiple CalVer tags — use the full registry ref

in-container `command:` stdin is guarded

Cross-deployment probing — `on:` driver, `peer:` siblings, `${PEER_*}`

Cross-kind: a host driver reaching a pod OR a VM subject (`${PEER_ENDPOINT}`)

kind: eval beds + `charly eval run <bed>` — config-driven R10 acceptance

The ecosystem's `kind: eval` beds

`charly eval live parent.child` reaches the actual leaf

Three primary modes (`charly eval box` / `live` / `run`)

Agent Driven Evaluation (ADE) — `charly box/eval feature run` + the agent grader

Authoring: the `eval:` list

Cross-distro package names (`package_map:`)

`exclude_distros:` worked example

Live-container verb catalog: `cdp`, `wl`, `dbus`, `vnc`, `mcp`

Method allowlist — `cdp` (15 methods + 6 SPA-nested)

Method allowlist — `wl` (22 top-level + 4 overlay + 12 sway)

Method allowlist — `dbus` (4 methods)

Method allowlist — `vnc` (8 methods)

Method allowlist — `mcp` (7 methods)

Gotcha — stale container-baked `charly` binary

1. Host-side deploy tests: `127.0.0.1:${HOST_PORT:N}`, NOT `${CONTAINER_IP}:${HOST_PORT:N}`

2. `http` status is a single int — no lists

3. `pgrep`, `ss`, `nc` are NOT in minimal images

4. `supervisorctl status` exits 3 when ANY program is non-RUNNING

5. Know which stream a `--version`-style command writes to

6. Drop `$` anchor in `matches:` regexes on command output

7. `${VAR:-default}` is NOT supported by charly's resolver

12. `charly eval box <short-name>` is ambiguous with multiple CalVer tags — use the full registry ref

in-container `command:` stdin is guarded

Cross-deployment probing — `on:` driver, `peer:` siblings, `${PEER_*}`

Cross-kind: a host driver reaching a pod OR a VM subject (`${PEER_ENDPOINT}`)