hermes-skills/arifos-container-diagnosis/SKILL.md
Diagnose broken arifOS Federation containers: wrong image, wrong entry point, port mismatches, import chain failures. Use when a container won't start, serves wrong content, or shows as unhealthy despite having a valid image.
Images like arifos-geox:v1.0.0 often contain MULTIPLE possible entry points:
- arifosmcp/runtime/server.py: arifOS kernel
- geox_mcp/server.py: GEOX FastMCP
- geox_unified_mcp_server.py: GEOX unified shim
- arifosmcp/geox/legacy_servers/
The container CMD in the Dockerfile may be wrong, but the actual server files usually exist inside.
docker exec <container> cat /proc/1/cmdline | tr '\0' ' '
docker inspect <container> --format '{{.Config.Cmd}}'
docker exec <container> find / -name "*.py" -path "*/server.py" 2>/dev/null | grep -v __pycache__
docker exec <container> find /usr/src/app -name "*mcp_server.py" 2>/dev/null
docker exec <container> find /app -name "*mcp_server.py" 2>/dev/null
docker inspect <container> --format '{{json .Mounts}}'
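Empty Mounts = no bind mounts or volumes, so /app lives only in the container's overlayfs writable layer (ephemeral). Edits under /app survive docker restart but are lost when the container is removed and recreated (docker rm, or a redeploy from a rebuilt image).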
If overlayfs: Find the host source path by matching the container's CMD to a host workspace file:
docker inspect <container> --format '{{.Config.Cmd}}'
# e.g. [python control_plane/fastmcp/server.py]
find /root -type d -name "control_plane" 2>/dev/null | head -5
# → /root/geox/control_plane
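Test persistence: edit a file, recreate the container (docker rm, then docker run), and check whether the edit survived. If gone → the path was in the overlayfs writable layer, not on a mount. A plain docker restart keeps the writable layer, so it cannot tell the two cases apart.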
Never use /tmp/ for fixtures inside overlayfs containers — wiped on every restart:
docker exec <container> mkdir -p /data/fixtures
docker cp fixtures/BOKOR_1_demo.las <container>:/data/fixtures/
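The mount check above can be scripted so that a specific path is tested rather than eyeballing the JSON. The following is a minimal sketch, assuming the input is the text printed by docker inspect <container> --format '{{json .Mounts}}'; the helper name mount_for_path and the sample volume name are hypothetical.

```python
import json


def mount_for_path(mounts_json: str, path: str):
    """Return the mount entry that covers `path`, or None if no mount does.

    `mounts_json` is assumed to be the output of
    docker inspect <container> --format '{{json .Mounts}}'.
    """
    mounts = json.loads(mounts_json) or []
    best = None
    for mount in mounts:
        dest = mount.get("Destination", "").rstrip("/") or "/"
        prefix = dest if dest.endswith("/") else dest + "/"
        if path == dest or path.startswith(prefix):
            # Prefer the most specific (longest) destination.
            if best is None or len(dest) > len(best["Destination"]):
                best = mount
    return best


if __name__ == "__main__":
    # Sample data only; the volume name is a placeholder.
    sample = json.dumps([{
        "Type": "volume",
        "Name": "vault999-data",
        "Destination": "/app/VAULT999",
    }])
    for candidate in ("/app/VAULT999/SEALED_EVENTS.jsonl", "/app/internal/monolith.py"):
        hit = mount_for_path(sample, candidate)
        print(candidate, "->", hit["Type"] if hit else "writable layer (not mounted)")
```

A None result means the path is not covered by any mount, so edits there last only as long as the container itself.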
docker exec <container> python3 -c "
import sys
sys.path.insert(0, '/path/to/parent/dir')
try:
    from module.path import mcp
    print('OK - mcp imported')
    app = mcp.streamable_http_app()
    print('OK - app created')
except Exception as e:
    print('ERROR:', type(e).__name__, str(e)[:200])
"
docker exec <container> python3 -c "
import socket
# Probe common service ports on the container's loopback interface
for port in [5000, 5001, 8000, 8001, 8080, 8081, 8082, 9000]:
    s = socket.socket()
    s.settimeout(0.3)
    try:
        s.connect(('127.0.0.1', port))
        print(f'Port {port}: OPEN')
    except OSError:
        print(f'Port {port}: closed')
    finally:
        s.close()
"
docker exec <container> python3 -c "import sys; print(sys.path[:3])"
# arifOS runs on 8080, GEOX needs 8081
# Override CMD with:
python3 -c "import uvicorn; from geox_mcp.server import mcp; uvicorn.run(mcp.streamable_http_app(), host='0.0.0.0', port=8081)"
qdrant image has no curl and python3 is not available via CMD-SHELL. If healthcheck fails with "executable file not found", either remove the healthcheck (service is fine) or use a host-side check. Do NOT assume CMD-SHELL runs on the host.
# WEALTH: uvicorn inside 8082, compose mapped 8000:8000 (wrong) → 8000:8082 (correct)
ports:
- "127.0.0.1:8000:8082"
FastMCP's streamable_http_app() exposes only /mcp, not /health. GET requests return 406.
The working healthcheck uses curl with POST + proper Accept headers:
healthcheck:
  test: ["CMD-SHELL", "curl -sf -X POST http://localhost:8081/mcp -H 'Content-Type: application/json' -H 'Accept: application/json, text/event-stream' -d '{\"jsonrpc\":\"2.0\",\"method\":\"initialize\",\"id\":1,\"params\":{\"protocolVersion\":\"2024-11-05\",\"capabilities\":{},\"clientInfo\":{\"name\":\"health\",\"version\":\"1.0\"}}}' --max-time 3"]
  interval: 30s
  timeout: 10s
  retries: 3
The python3 urllib approach fails because FastMCP returns 406 on GET. Only POST with SSE Accept headers works.
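For images that ship python3 but not curl, the same POST can in principle be issued from the standard library. This is a sketch under assumptions, not a tested replacement for the curl healthcheck: the payload and headers are copied from the curl command above, the helper names build_mcp_probe and probe are hypothetical, and it has not been verified against the live FastMCP servers.

```python
import json
import urllib.error
import urllib.request


def build_mcp_probe(url: str) -> urllib.request.Request:
    """Build a POST 'initialize' request with the Accept headers used by the curl check."""
    payload = {
        "jsonrpc": "2.0",
        "method": "initialize",
        "id": 1,
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "health", "version": "1.0"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",  # a GET is what triggers the 406 described above
    )


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers the POST with a 2xx status."""
    try:
        with urllib.request.urlopen(build_mcp_probe(url), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("healthy" if probe("http://localhost:8081/mcp") else "unhealthy")
```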
geox:
  image: arifos-geox:v1.0.0
  container_name: geox
  restart: unless-stopped
  command: >
    /bin/sh -c "python3 -c 'import uvicorn; from geox_mcp.server import mcp;
    uvicorn.run(mcp.streamable_http_app(), host=\"0.0.0.0\", port=8081)'"
  environment:
    PYTHONPATH: /usr/src/app/arifosmcp/geox
    PORT: "8081"
  ports:
    - "127.0.0.1:8081:8081"
  networks: [arifos_core]
  healthcheck:
    test: ["CMD-SHELL", "curl -sf -X POST http://localhost:8081/mcp -H 'Content-Type: application/json' -H 'Accept: application/json, text/event-stream' -d '{\"jsonrpc\":\"2.0\",\"method\":\"initialize\",\"id\":1,\"params\":{\"protocolVersion\":\"2024-11-05\",\"capabilities\":{},\"clientInfo\":{\"name\":\"health\",\"version\":\"1.0\"}}}' --max-time 3"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 10s
Three distinct root causes found 2026-05-04. Diagnose in order:
Caddy and arifOS MCP may be on different Docker bridge networks after docker restart.
Symptom: Caddy logs show empty response; arifosmcp health endpoint inside container works fine.
Test: docker exec caddy wget -O- http://arifosmcp:8080/health
If "bad address" → network isolation.
Fix: docker network connect arifos_core_network arifosmcp
(Caddy is on arifos_core_network; arifOS MCP may be on default bridge.)
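The network-isolation check can also be done without exec'ing into Caddy, by comparing the networks each container is attached to. A minimal sketch, assuming each argument is the text printed by docker inspect <container> --format '{{json .NetworkSettings.Networks}}' (a JSON object keyed by network name); the function name shared_networks is hypothetical.

```python
import json


def shared_networks(networks_json_a: str, networks_json_b: str) -> set:
    """Return the Docker network names that both containers are attached to."""
    a = json.loads(networks_json_a) or {}
    b = json.loads(networks_json_b) or {}
    return set(a) & set(b)


if __name__ == "__main__":
    # Sample data mirroring the failure described above.
    caddy = '{"arifos_core_network": {}}'
    arifosmcp = '{"bridge": {}}'
    common = shared_networks(caddy, arifosmcp)
    if common:
        print("shared networks:", sorted(common))
    else:
        print("no shared network: name resolution between the containers will fail")
```

An empty result matches the "bad address" symptom: the two containers cannot resolve each other by name until one of them is connected to the other's network.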
docker build without --build-arg DEPLOY_GIT_COMMIT produces image labeled unknown.
Symptom: /health shows "version: kanon-unknown" and "git_commit: unknown"
but docker ps shows the correct tag (e.g., de038a0f).
Container may also show wrong entrypoint (see TWO Dockerfiles below).
Fix — two parts:
(a) Use correct Dockerfile (root-level, not arifosmcp/Dockerfile):
cd /root/arifOS && docker build -f Dockerfile -t ghcr.io/ariffazil/arifos:<tag> .
(b) Pass git metadata at run time:
docker run -d ... -e DEPLOY_GIT_COMMIT=<sha> ghcr.io/ariffazil/arifos:<tag>
Or bake in at build time — add to Dockerfile:
ARG DEPLOY_GIT_COMMIT=unknown
ENV DEPLOY_GIT_COMMIT=${DEPLOY_GIT_COMMIT}
Then rebuild with:
docker build --build-arg DEPLOY_GIT_COMMIT=$(git rev-parse --short HEAD) \
-f Dockerfile -t ghcr.io/ariffazil/arifos:<tag> .
snarkjs installed on VPS host is NOT inside the Docker image.
Symptom: _snarkjs_available() returns False inside container.
VPS host has snarkjs but container does not.
Fix: Add to Dockerfile runtime stage:
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - && \
apt-get install -y nodejs && \
npm install -g snarkjs && \
rm -rf /var/lib/apt/lists/*
Rebuild and redeploy.
VAULT999/SEALED_EVENTS.jsonl is NOT in the container filesystem — it's a named volume.
Container path: /app/VAULT999/ (empty inside container image)
Host volume: /var/lib/docker/volumes/<vault999-data>/_data/
Verify: docker volume ls | grep vault999
docker exec <container> python3 -c "
import sys; sys.path.insert(0,'/app')
from arifos.security.zkpc_v2 import _snarkjs_available, _groth16_verify
import json
proof = json.load(open('/app/arifos/security/zkp_artifacts/proof.json'))
public = json.load(open('/app/arifos/security/zkp_artifacts/public.json'))
v, out = _groth16_verify(proof, public)
print('snarkjs available:', _snarkjs_available())
print('groth16 verified:', v)
"
Expected: snarkjs available: True, groth16 verified: True (with real proof artifacts)
This is the #1 failure mode. Every docker build + docker push creates a NEW image digest, even if the tag is the same. The running container does NOT automatically update.
# BAD assumption:
docker build -t ghcr.io/ariffazil/arifos:2026.05.04 .
docker push ghcr.io/ariffazil/arifos:2026.05.04
# → Container still running OLD image. Tag looks right. Content is wrong.
# ALWAYS verify the running container's actual image ID:
docker inspect arifosmcp --format '{{.Image}}'
# → Compare that sha256 ID against the locally tagged images:
docker images ghcr.io/ariffazil/arifos --no-trunc --format '{{.Tag}}\t{{.ID}}'
# If the ID differs from the tag you expect → the running container has an older image
# (docker ps only prints the image name/tag, which looks right even when the content is stale)
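Comparing IDs by eye is error-prone because docker inspect prints a full sha256:-prefixed ID while docker images prints a 12-character short form by default. The following is a small sketch of a tolerant comparison; the function name same_image and the 12-character minimum are assumptions made for illustration.

```python
def same_image(container_image_id: str, local_image_id: str) -> bool:
    """Compare two Docker image IDs, tolerating 'sha256:' prefixes and truncation."""

    def normalize(value: str) -> str:
        value = value.strip().lower()
        return value[len("sha256:"):] if value.startswith("sha256:") else value

    a, b = normalize(container_image_id), normalize(local_image_id)
    if not a or not b:
        return False
    short, long_ = sorted((a, b), key=len)
    # Require at least the usual 12-character short ID to avoid accidental matches.
    return len(short) >= 12 and long_.startswith(short)


if __name__ == "__main__":
    running = "sha256:" + "ab" * 32   # as printed by docker inspect (sample value)
    tagged = "ab" * 6                 # as printed by docker images (sample value)
    print("up to date" if same_image(running, tagged) else "STALE: recreate the container")
```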
Correct restart sequence:
docker stop <container>
docker rm <container>
docker pull ghcr.io/ariffazil/arifos:<tag> # pull latest digest
docker run ... --name <container> ghcr.io/ariffazil/arifos:<tag>
docker diff + diff against local source
When you suspect the running container has different code than both (a) the latest pushed image and (b) the local source, use this two-step diff:
# Step 1: docker diff — what has changed inside the container vs image
docker diff <container>
# Step 2: diff running container's file against LOCAL source
diff <(docker exec <container> cat /app/internal/monolith.py) /root/WEALTH/internal/monolith.py
What this reveals:
- docker diff shows files changed inside the container (overlayfs), which confirms the container has been modified
- diff <(docker exec cat) shows exactly which lines differ between the running container and the local source
- If the local source has code the container lacks (e.g. briefing_handler exists locally but not in container) → container image is OLDER than local
Example output showing container is stale:
4565a4566,4567
> BRIEFING_PATH = "/root/arif-sites/..."
>
4568a4571,4581
> async def briefing_handler(request):
> ...
> Route("/briefing", briefing_handler, methods=["GET"]),
The > lines exist in LOCAL but not in running container → container is running an older image.
Example output showing container has been hot-patched:
< some_old_function()
---
> some_new_function()
The < lines exist in container but not in local source → overlayfs patch applied directly to container.
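The reading rule for the two example outputs can be sketched as a small classifier over normal-format diff output, where the container file is the first argument to diff and the local file the second. The function name classify_diff and the three labels are hypothetical; the verdicts are hints, not proof, since a local checkout that is behind the image would also produce < lines.

```python
def classify_diff(diff_text: str) -> str:
    """Classify output of: diff <(docker exec <container> cat FILE) LOCAL_FILE

    '<' lines exist only in the container copy, '>' lines only in the local copy.
    """
    lines = diff_text.splitlines()
    container_only = sum(1 for line in lines if line.startswith("<"))
    local_only = sum(1 for line in lines if line.startswith(">"))
    if container_only == 0 and local_only == 0:
        return "identical"
    if container_only == 0:
        # Local has additions the container lacks: container likely runs an older image.
        return "container-older"
    # The container holds lines the local source does not: possibly hot-patched.
    return "container-modified"


if __name__ == "__main__":
    stale = "4568a4571,4581\n> async def briefing_handler(request):\n> ...\n"
    patched = "10c10\n< some_old_function()\n---\n> some_new_function()\n"
    print(classify_diff(stale))
    print(classify_diff(patched))
```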
Symptoms:
- curl https://mcp.arif-fazil.com/briefing → HTTP 301 redirect to arifos.arif-fazil.com/briefing
- curl http://127.0.0.1:8082/briefing → 404 Not Found
- docker exec wealth-organ grep briefing_handler /app/internal/monolith.py → no output
- docker diff wealth-organ → /app/internal/monolith.py CHANGED
Three root causes; always check in order:
1. /briefing handler doesn't exist in the running container's code. Fix: rebuild and redeploy the WEALTH image.
2. BRIEFING_PATH=/root/arif-sites/... but the container has no volume mount to that path. docker inspect <container> --format '{{json .Mounts}}' returns []. Fix: add a mount to docker-compose:
volumes:
  - /root/arif-sites/sites/arif-fazil.com/public/data/wealth:/app/data/wealth:ro
And update BRIEFING_PATH to /app/data/wealth/latest.json (container-internal path).
3. mcp.arif-fazil.com/briefing falls through to the catch-all redirect instead of proxying to wealth-organ:8082. Fix: add an explicit handle /briefing route to the Caddyfile under the mcp.arif-fazil.com block.
Diagnostic sequence:
# 1. Check container has the route
docker exec wealth-organ grep -c "briefing_handler" /app/internal/monolith.py
# 0 = route missing in container
# 2. Check mount exists
docker inspect wealth-organ --format '{{json .Mounts}}'
# [] = no mounts
# 3. Check Caddy routing
curl -sI https://mcp.arif-fazil.com/briefing | grep -E "HTTP|Location"
# 301 to arifos = Caddyfile catch-all redirect
# 4. Check internal route works (bypass Caddy)
curl http://127.0.0.1:8082/briefing
# 404 despite container running = route missing in container code
Verify filesystem content, not image labels:
# Don't trust: docker ps shows the "right" tag
# DO trust: actual filesystem inside running container
docker exec <container> find /app -name 'zkpc_v2.py' # check code files exist
docker exec <container> node --version # check binaries exist
docker exec <container> python3 -c "import sys; print(sys.path[:2])"
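The correct redeploy sequence (always in this order):
docker build -t ghcr.io/ariffazil/arifos:latest . # build new image
docker push ghcr.io/ariffazil/arifos:latest # push to registry
docker stop <container> && docker rm <container> # remove the old container
docker run ... --name <container> ghcr.io/ariffazil/arifos:latest # ONLY THEN recreate from the new image
NOT: docker restart, at any point in the sequence. It restarts the existing container on the image it was created from and pulls nothing new, so the container keeps running the old image.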
Verify container is actually running the new image:
docker inspect <container> --format '{{.Image}}'
# Compare against (--no-trunc prints the full sha256 ID instead of the 12-character short form):
docker images ghcr.io/ariffazil/arifos:latest --no-trunc --format '{{.ID}}'
# If the image IDs match → container is running the newly built image
Volume mounts survive docker rm — network connections do NOT:
docker network connect arifos_core_network arifosmcp # runtime only
# After docker rm + docker run: must re-run network connect
rest_routes.py May Be a Package, Not a File
The rest_routes.py module may have been refactored into a package directory:
arifosmcp/runtime/rest_routes/ ← package directory
arifosmcp/runtime/rest_routes/__init__.py
arifosmcp/runtime/rest_routes/rest_routes.py ← actual file
If editing rest_routes.py directly doesn't take effect, check:
docker exec <container> find /app -name "rest_routes.py" 2>/dev/null
# If it returns /app/arifosmcp/runtime/rest_routes/rest_routes.py → it's a package
The old path arifosmcp/runtime/rest_routes.py would be stale.
Path(__file__) parent depth is DIFFERENT inside the package:
- /app/arifosmcp/runtime/rest_routes/rest_routes.py → parents[0] = rest_routes/, parents[1] = runtime/, parents[2] = arifosmcp/
- /app/arifosmcp/runtime/rest_routes.py (old monolithic path) → parents[0] = runtime/, parents[1] = arifosmcp/, parents[2] = /app
Always verify depth inside the running container before committing path-construction code.
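The depth shift is easy to confirm with pathlib before touching any path-construction code. This sketch uses PurePosixPath so it runs anywhere without the files existing; the helper name parents_of is hypothetical.

```python
from pathlib import PurePosixPath


def parents_of(path: str, depth: int = 3) -> list:
    """Return the first `depth` ancestors of `path`, nearest first."""
    p = PurePosixPath(path)
    return [str(p.parents[i]) for i in range(depth)]


packaged = parents_of("/app/arifosmcp/runtime/rest_routes/rest_routes.py")
monolithic = parents_of("/app/arifosmcp/runtime/rest_routes.py")

print(packaged)    # ['/app/arifosmcp/runtime/rest_routes', '/app/arifosmcp/runtime', '/app/arifosmcp']
print(monolithic)  # ['/app/arifosmcp/runtime', '/app/arifosmcp', '/app']
```

Code that used parents[2] to reach /app in the monolithic layout lands on /app/arifosmcp once the file moves into the package, so every index needs to grow by one.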
_JUDGE_STATE_REGISTRY and _JUDGE_CHAIN_REGISTRY (used by vault seal / arif_vault_seal) are plain in-memory dicts (lines 987-988 of runtime/tools.py). They are NOT persisted to the session store.
Symptom: After a container restart, vault seal returns:
status: HOLD, verdict: HOLD
reason: "judge contract required — irreversible execution requires a prior judge packet"
failed_floors: ["F11"]
Why: The pre-restart _JUDGE_STATE_REGISTRY[judge_state_hash] and _JUDGE_CHAIN_REGISTRY[constitutional_chain_id] are gone. Both lookups return None → _resolve_judge_contract returns a HOLD.
Diagnosis:
# Check if session store is the only thing surviving restart
docker exec <container> python3 -c "
from arifosmcp.runtime.tools import _JUDGE_STATE_REGISTRY, _JUDGE_CHAIN_REGISTRY
print('JUDGE_STATE size:', len(_JUDGE_STATE_REGISTRY))
print('JUDGE_CHAIN size:', len(_JUDGE_CHAIN_REGISTRY))
"
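# Caveat: docker exec starts a separate Python process, so this prints that process's own
# freshly imported dicts, not the live server's memory. Expect 0 here unless the registries
# are loaded from disk at import time. Inside the server process both should be > 0 during
# normal operation; 0 after a restart → registry wipe.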
Fix: Wire judge registries into the existing _FileSessionStore (same pattern already used for sessions). Add _load_judge_registries() on startup and _save_judge_registry() on write.
arifOS has two Dockerfiles with different entrypoints:
/root/arifOS/Dockerfile ← CORRECT (root-level)
CMD ["python", "-m", "arifosmcp.runtime.server"] ✅ serves on :8080
/root/arifOS/arifosmcp/Dockerfile ← WRONG (sub-directory)
CMD ["python", "-m", "arifosmcp.runtime.__main__"] ❌ exits immediately
Symptom when using wrong Dockerfile: Container starts, runs, then immediately exits with code 0 and no logs. The __main__ module completes instantly without serving anything.
Diagnosis:
# Check what entrypoint the container actually used
docker inspect <container> --format '{{.Config.Cmd}}'
# [python -m arifosmcp.runtime.__main__] = wrong Dockerfile used
# Verify inside running container
docker exec <container> cat /proc/1/cmdline | tr '\0' ' '
# Should say: python -m arifosmcp.runtime.server
Fix: Always build from the root directory using root-level Dockerfile:
cd /root/arifOS
docker build --pull -f Dockerfile -t ghcr.io/ariffazil/arifos:<tag> .
# ^^^^^^^ NOT arifosmcp/Dockerfile
When a container entrypoint process fails immediately, Docker may show exit code 0 with no error output. This is the trickiest "it doesn't work" scenario.
Debugging technique — background process with watch patterns:
# Run the container attached and watch its startup output
docker run --rm -p 8080:8080 --name arifosmcp_test ghcr.io/ariffazil/arifos:<tag> 2>&1
# OR use Hermes terminal background=true with watch_patterns
docker run --rm -p 8080:8080 --name arifosmcp_test ghcr.io/ariffazil/arifos:<tag> 2>&1 &
sleep 10 && curl -s http://127.0.0.1:8080/health
# OR run foreground and capture exit code
docker run --rm ghcr.io/ariffazil/arifos:<tag> 2>&1; echo "EXIT_CODE=$?"
Other diagnostic commands:
# Get the container's actual PID 1 command
docker exec <container> cat /proc/1/cmdline | tr '\0' ' '
# Try running the server manually inside container
docker run --rm ghcr.io/ariffazil/arifos:<tag> python -m arifosmcp.runtime.server
# If this exits immediately → entrypoint mismatch (see TWO Dockerfiles above)
# Verify port is listening
docker exec <container> python3 -c "import socket; s=socket.socket(); s.connect(('127.0.0.1',8080)); print('OPEN')"
/health shows version: kanon-unknown, git_commit: unknown, build_commit: unknown.
Root cause: get_build_info() in rest_routes.py reads git commit from environment variable DEPLOY_GIT_COMMIT. Inside the Docker container, no .git directory is mounted and no env var is set → everything falls back to "unknown".
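The fallback behaviour can be illustrated with a small stand-in. This is not the real get_build_info() from rest_routes.py, whose body is not shown here; the function name read_build_env is hypothetical, and only the DEPLOY_GIT_COMMIT, DEPLOY_GIT_BRANCH and DEPLOY_BUILD_TIME variable names and the "unknown" fallback are taken from this document.

```python
import os


def read_build_env(env=None) -> dict:
    """Read build metadata from the environment, falling back to 'unknown'."""
    env = os.environ if env is None else env
    return {
        "git_commit": env.get("DEPLOY_GIT_COMMIT") or "unknown",
        "git_branch": env.get("DEPLOY_GIT_BRANCH") or "unknown",
        "build_time": env.get("DEPLOY_BUILD_TIME") or "unknown",
    }


if __name__ == "__main__":
    # With nothing set (the default inside the container), everything is "unknown".
    print(read_build_env({}))
    # With -e DEPLOY_GIT_COMMIT=<sha> passed to docker run, the commit is reported.
    print(read_build_env({"DEPLOY_GIT_COMMIT": "de038a0f"}))
```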
Two fixes — use both for complete coverage:
docker stop arifosmcp && docker rm arifosmcp
docker run -d \
--name arifosmcp \
--restart unless-stopped \
-p 8080:8080 \
-e DEPLOY_GIT_COMMIT=de038a0f \
ghcr.io/ariffazil/arifos:de038a0f
Add build args to the root Dockerfile:
ARG DEPLOY_GIT_COMMIT=unknown
ARG DEPLOY_GIT_BRANCH=main
ARG DEPLOY_BUILD_TIME=unknown
ENV DEPLOY_GIT_COMMIT=${DEPLOY_GIT_COMMIT}
ENV DEPLOY_GIT_BRANCH=${DEPLOY_GIT_BRANCH}
ENV DEPLOY_BUILD_TIME=${DEPLOY_BUILD_TIME}
Then at build time:
GIT_SHA=$(git rev-parse HEAD)
GIT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
docker build \
--build-arg DEPLOY_GIT_COMMIT=$GIT_SHA \
--build-arg DEPLOY_GIT_BRANCH=$GIT_BRANCH \
--build-arg DEPLOY_BUILD_TIME=$BUILD_TIME \
-f Dockerfile \
-t ghcr.io/ariffazil/arifos:$GIT_SHA .
Docker build fails with no space left on device when builder cache fills the disk. Signs:
- df -h / shows 100%
- docker system df shows large build cache
Fix:
docker builder prune -af
df -h /
Build cache (27GB+) gets reclaimed. Then rebuild.
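A pre-build guard can catch a nearly full disk before the build fails halfway. A minimal sketch using only the standard library; the function names and the 90% threshold are assumptions, and the percentage can differ slightly from df because df excludes reserved blocks from its calculation.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def should_prune(path: str = "/", threshold: float = 90.0) -> bool:
    """True when usage is at or above the (arbitrary) threshold."""
    return disk_used_percent(path) >= threshold


if __name__ == "__main__":
    pct = disk_used_percent("/")
    print(f"/ is {pct:.1f}% full")
    if should_prune("/"):
        print("run: docker builder prune -af")
```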
When docker run fails but docker exec works → the entry point is wrong, not the image.
The image is almost always fine. The CMD/command override is what's broken.
502 from Caddy = network isolation FIRST, then stale image (wrong container), then backend not listening.
Symptom: arifOS MCP returns HTTP 406 on /mcp when called without Accept: application/json header. WEALTH and GEOX return 200 with the same call.
Root cause: FastMCP's StreamableHTTPServerTransport._check_accept_headers enforces strict content negotiation. arifOS passes json_response=True but has no monkey-patch → rejects Accept: */*. GEOX and WELL both have the monkey-patch; arifOS doesn't.
The monkey-patch (GEOX/WELL pattern):
from mcp.server.streamable_http import StreamableHTTPServerTransport
_orig_check = StreamableHTTPServerTransport._check_accept_headers
def _patched_check(self, request):
    if getattr(self, "is_json_response_enabled", False):
        return  # accept anything when json_response=True
    return _orig_check(self, request)
StreamableHTTPServerTransport._check_accept_headers = _patched_check
Diagnosis:
# arifOS without Accept → 406
curl -s -o /dev/null -w "%{http_code}" -X POST http://127.0.0.1:8080/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{...}}'
# → 406
# arifOS with Accept → 200
curl -s -o /dev/null -w "%{http_code}" -X POST http://127.0.0.1:8080/mcp \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{...}}'
# → 200
# GEOX without Accept → 200 (has monkey-patch)
curl -s -o /dev/null -w "%{http_code}" -X POST http://127.0.0.1:8081/mcp \
-H 'Content-Type: application/json' \
-d '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{...}}'
# → 200
Two fixes (use both for defense in depth):
OpenClaw transport workaround (no code change needed):
Add headers: {"Accept": "application/json"} to the arifOS MCP server entry in /root/.openclaw/openclaw.json. This fixes OpenClaw→arifOS bundle calls without touching arifOS source.
arifOS server.py monkey-patch (defense in depth):
Add the monkey-patch above to arifosmcp/server.py before if __name__ == "__main__":. This makes arifOS fully match WEALTH/GEOX behavior for ALL callers, not just OpenClaw.
json_response=True alone is NOT sufficient. The parameter is necessary but the monkey-patch is what makes the server tolerant of Accept: */*. arifOS has json_response=True at line ~266 but lacks the patch → 406 for */* callers.
Verify GEOX/WELL monkey-patch location:
docker exec geox_eic grep -n '_patched_check' /app/server.py
# → lines 3525 (definition), 3529 (application)
docker exec well grep -n '_patched_check' /app/server.py
# → lines 3525 (definition), 3529 (application)
Three independent witnesses must all pass before claiming ZKPC is live:
# Witness 1: binary present inside container
docker exec <container> node --version
docker exec <container> npm list -g snarkjs --depth=0 | grep snarkjs
# Witness 2: Groth16 mathematical proof
docker exec <container> python3 -c "
import sys; sys.path.insert(0,'/app')
from arifos.security.zkpc_v2 import _snarkjs_available, _groth16_verify
import json
proof = json.load(open('/app/arifos/security/zkp_artifacts/proof.json'))
public = json.load(open('/app/arifos/security/zkp_artifacts/public.json'))
v, out = _groth16_verify(proof, public)
print('snarkjs:', _snarkjs_available())
print('groth16 verified:', v)
print('output:', out.strip() if out else None)
"
# Witness 3: verification key present
docker exec <container> cat /app/arifos/security/zkp_artifacts/verification_key.json | python3 -c "import sys,json; d=json.load(sys.stdin); print('protocol:', d.get('protocol'))"
# All three must PASS before claiming TriWitness complete
VAULT999 data is NOT in the container filesystem — it's a named Docker volume.
Container path: /app/VAULT999/ (empty at image build time)
Host volume: /var/lib/docker/volumes/<vault999-data>/_data/
Write to: /app/VAULT999/SEALED_EVENTS.jsonl (from inside container)
Read from host: /var/lib/docker/volumes/<vault999-data>/_data/SEALED_EVENTS.jsonl
Verify: docker volume ls | grep vault999
VAULT999 is intentionally gitignored (runtime data, not source). Seal events are written from inside the container and persist in the named volume across container restarts.