hermes-skills/arifos-mcp-vps-debug/SKILL.md
Debug why an arifOS MCP VPS runtime doesn't match git state. Use when: git has recent commits but the running container doesn't reflect them, /status.json returns wrong content, tool responses are stale, or any 'it was committed but not visible' symptom. This is the most common deployment trap in arifOS.
npx skillsauth add ariffazil/openclaw-workspace arifos-mcp-vps-debugInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Git commit ≠ running container. The container runs from a Docker image (ghcr.io/ariffazil/arifos:fb39550). A git commit updates source files on the host, but does NOT update the image. The container continues running the old image.
Files affected by this drift:
/usr/src/app/arifosmcp/ (from image, not host mounts)Files that ARE bind-mounted and always current:
Caddyfile (bind-mounted into container)□ Git has recent commits but /status.json still returns old HTML
□ A fix was applied to public_surface.py but not visible externally
□ /health works but /status.json returns catch-all HTML
□ Tools that were patched still fail in the running container
□ Version in /health doesn't match latest git commit
Caddyfile drift = external surface returns wrong route (308/HTML from Caddy catch-all). Container drift = external returns right route but wrong content from old Python.
# Check what /status.json actually returns
curl -s --max-time 8 -w "\nHTTP:%{http_code}" https://mcp.arif-fazil.com/status.json
# If it returns HTML text "arifOS MCP API — use /mcp" → Caddyfile catch-all (route not matched)
# If it returns JSON with old version → container still running old image
# Check local container health
curl -s http://localhost:8080/health | python3 -c "import json,sys; d=json.load(sys.stdin); print('version:', d.get('version')); print('git:', d.get('git_commit'))"
# Compare with git
cd /root/arifOS && git log --oneline -1
docker ps --format "{{.Names}}\t{{.Status}}" | grep arif
# Check image tag vs git commit
docker exec arifosmcp cat /usr/src/app/arifosmcp/server.py | head -5
# Or check build_time in /health
curl -s http://localhost:8080/health | python3 -c "import json,sys; d=json.load(sys.stdin); print('build:', d.get('build_time'))"
# Compare with git log
cd /root/arifOS && git log --oneline -1
# If build_time < git commit time → container is stale
# External /status.json returns HTML → Caddyfile missing route
# Local /status.json returns JSON → Caddyfile not routing correctly
# Test local endpoint directly
curl -s http://localhost:8080/status.json | python3 -c "import json,sys; print(json.dumps(json.load(sys.stdin), indent=2)[:200])"
# If local works but external doesn't → Caddyfile routing issue
# If local also broken → Python route not in image
Use hot-patch ONLY for urgent production fixes when rebuild is too slow.
# HOT-PATCH pattern (temporary — still need rebuild):
docker exec arifosmcp cat /usr/src/app/arifosmcp/core/constitution_kernel.py | grep -n "evaluate_intent" | head -3
# If method missing:
docker exec arifosmcp tee /usr/src/app/arifosmcp/core/constitution_kernel.py > /dev/null << 'PYEOF'
<patched content>
PYEOF
# Clear bytecode cache
docker exec arifosmcp find /usr/src/app -name "__pycache__" -exec rm -rf {} + 2>/dev/null
# Restart
docker restart arifosmcp
# VERIFY
sleep 3
curl -s http://localhost:8080/health | python3 -c "import json,sys; d=json.load(sys.stdin); print('ok:', d.get('status'))"
See arifos-container-hotpatch-guide for full hot-patch dataclass gotchas.
When git has the fix but container doesn't:
cd /root/arifOS
# 1. Verify what needs rebuilding
git log --oneline -3
# 2. Build and push new image
docker build -t ghcr.io/ariffazil/arifos:$(git rev-parse --short HEAD) .
docker push ghcr.io/ariffazil/arifos:$(git rev-parse --short HEAD)
# 3. Update docker-compose.yml to use new tag
# (edit image: ghcr.io/ariffazil/arifos:<new-sha>)
# 4. Pull and restart
docker compose pull
docker compose up -d --force-recreate arifosmcp
# 5. Verify
sleep 5
curl -s http://localhost:8080/health | python3 -c "import json,sys; d=json.load(sys.stdin); print('build:', d.get('build_time'))"
curl -s --max-time 8 https://mcp.arif-fazil.com/status.json | python3 -c "import json,sys; d=json.load(sys.stdin); print('status:', d.get('status', 'HTML?'))"
WELL state.json on host ≠ WELL container state. Always check container directly:
# Container is authoritative
curl -s http://localhost:8083/health
# Host state file is advisory / operator-level
cat /root/WELL/state.json | python3 -c "import json,sys; d=json.load(sys.stdin); print('backend_status:', d.get('backend_status')); print('safe_mode:', d.get('safe_mode')); print('contamination:', d.get('test_contamination'))"
# If host state shows DEGRADED but container is healthy:
# → The container is functioning correctly
# → The host state file reflects operator-level telemetry flags
# → WELL container healthcheck: curl localhost:8083/health → healthy = container is fine
cd /root/arifOS
git add <files>
git commit -m "description"
git push
git commit -m "description" --no-verify
# VAULT999 is intentionally ignored — do NOT force-add
# Instead, changes are auto-tracked by the deploy pipeline
# If "fatal: no submodule mapping found" appears:
# Check .gitmodules case
cat .gitmodules
# Fix if needed (GEOX vs geox):
git config --global submodule.recurse false
# Then edit .gitmodules to match actual path case
git add .gitmodules
git submodule sync
git submodule update --init <submodule>
arifOS-family MCP servers use streamable-http transport. Two endpoints required:
GET /mcp → SSE stream with data: endpoint /messages/?session_id=XXX (endpoint discovery)POST /messages/?session_id=XXX → JSON-RPC communicationRouting only /mcp gives 200 but MCP client cannot communicate.
For each MCP service, Caddyfile block MUST include ALL of:
service-name.arif-fazil.com {
import tls_origin
handle /mcp {
reverse_proxy CONTAINER_NAME:INTERNAL_PORT {
flush_interval -1
}
}
handle /mcp/* {
reverse_proxy CONTAINER_NAME:INTERNAL_PORT {
flush_interval -1
}
}
handle /messages {
reverse_proxy CONTAINER_NAME:INTERNAL_PORT {
flush_interval -1
}
}
handle /messages/* {
reverse_proxy CONTAINER_NAME:INTERNAL_PORT {
flush_interval -1
}
}
handle /health {
reverse_proxy CONTAINER_NAME:INTERNAL_PORT {
flush_interval -1
}
}
}
flush_interval -1 is REQUIRED — without it Caddy buffers SSE and MCP times out.
After edit: docker exec caddy caddy reload --config /etc/caddy/Caddyfile
Verify MCP endpoint:
# SSE endpoint discovery
curl -s -X GET -H "Accept: text/event-stream" http://localhost:INTERNAL_PORT/mcp
# Should return: data: endpoint /messages/?session_id=XXX
# JSON-RPC (need session_id from above)
curl -s -X POST -H "Content-Type: application/json" \
"http://localhost:INTERNAL_PORT/messages/?session_id=SESSION_ID" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
Problem: Rebuilding docker build -t image . && docker run loses runtime ENV vars from original container.
Symptom: Container starts, ports correct, but app crashes silently or returns wrong version.
Before rebuilding any prod container:
docker inspect CONTAINER --format='{{json .Config.Env}}'
Check for critical vars (tokens, keys, hardcoded version strings). Add as ENV KEY=VALUE in Dockerfile, OR add || 'fallback' in code.
AAA example: Original aaa-a2a container had A2A_TOKEN and A2A_API_KEY env vars. After rebuild, those vars were gone → server.js read undefined → silent crash. Fix: add || 'dev-fallback' in server.js.
Symptom: Container is Up in docker ps, service claims healthy in logs, but curl localhost:<port> returns connection reset or empty response.
This is NOT container drift — the service IS running inside the container.
# Step 1 — Confirm it's unreachable from host
curl -sv http://localhost:<PORT>/health 2>&1 | grep -E "Connection refused|Empty reply|HTTP/"
# Step 2 — Check what port is mapped (host side)
docker port <container> # e.g. docker port geox → 8081/tcp -> 127.0.0.1:8081
# Step 3 — Check what port the process listens on INSIDE the container
docker exec <container> python3 -c "
import socket
for line in socket.getaddrinfo('', 0, socket.AF_INET, socket.SOCK_STREAM):
print(line)
" 2>/dev/null
# OR check uvicorn/FastMCP config in source:
grep -n 'port.*=\s*int\|uvicorn.run\|8081\|8082\|8000' /path/to/server.py
# Step 4 — If inside-container port ≠ mapped host port → MISMATCH
# Example: WEALTH uvicorn runs on 8082 inside container
# but compose maps 8000:8000 (nothing listening on 8000 inside)
# Wrong — maps host 8000 to container 8000, but service on 8082
ports: ["127.0.0.1:8000:8000"]
# Correct — maps host 8000 to container 8082 where uvicorn actually listens
ports: ["127.0.0.1:8000:8082"]
Many arifOS-family MCPs (GEOX, WEALTH, WELL) use SSE-only transport. They only respond to GET with Accept: text/event-stream. POST requests return 405/406.
# Test with correct Accept header
curl -s -H "Accept: text/event-stream" http://localhost:<PORT>/mcp
# If HTML or 405 → SSE-only server, use SSE client library
| Service | Internal Port | Common Wrong Map | Correct Map |
|---------|---------------|-----------------|-------------|
| WEALTH monolith | 8082 | 8000:8000 | 8000:8082 |
| GEOX | 8081 | 8081:8081 | 8081:8081 ✅ |
| WELL | 8083 | varies | varies |
| arifOS | 8080 | 8080:8080 | 8080:8080 ✅ |
Rule: Always verify the port inside the container from the source code (uvicorn.run(port=N)) before trusting the compose mapping.
development
Governed intelligence skill for AAA as the abstraction, attestation, and abduction control plane across arifOS, APEX, A-FORGE, GEOX, WEALTH, WELL, and the ariffazil profile repository. Use when the user asks to explain or design AAA, route agentic work, reduce chaos/entropy in an arifOS federation task, create AREP/task declarations, classify risk, plan multi-repo changes, review governance boundaries, or translate human intent into evidence-backed, authority-safe, recursively agentic workflows. Provides deterministic F1-F13 floor checking, bounded abduction, and FederationReceipt composition.
development
Check every skill’s “use when” and “do not use when” clauses for collisions, missing negatives, and vague verbs like “help,” “assist,” or “improve.” Load when linting, reviewing, or validating trigger boundaries.
development
Bootstrap, design, and package new skills. Load when capturing user intent for a new skill or drafting its initial instruction framework.
content-media
Diagnose which federation services are up, down, or drifting. Produce a prioritized remediation plan.