skills/infrastructure/service-health-check/SKILL.md
Service health monitoring: Discover, Check, Report in 3 phases.
This skill provides deterministic service health monitoring using the Discover-Check-Report pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services.
Core principle: Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional — always cross-check against health files and port binding.
Phase 1: Discover
Goal: Identify all services to check before running any health probes.
Step 1: Locate service definitions
Search for service configuration in this order:
services.json in project root
Step 2: Build service manifest
For each service, establish:
## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |
Validation constraints:
Step 3: Validate manifest
Confirm each entry passes the constraints above. If a pattern is too broad, use ps aux | grep to identify distinguishing arguments, then update the pattern.
Gate: Service manifest complete with at least one service. Proceed only when gate passes.
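The manifest and its gate can be sketched in code. This is a minimal illustration, not the skill's own implementation: the `ServiceEntry` fields mirror the table columns above, while `BROAD_PATTERNS` and the specific checks in `validate_entry` are assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Hypothetical set of patterns treated as too broad; adjust per environment.
BROAD_PATTERNS = {"python", "python3", "node", "java", "ruby", "sh", "bash"}


@dataclass
class ServiceEntry:
    name: str
    process_pattern: str
    health_file: Optional[str] = None
    port: Optional[int] = None
    stale_threshold_s: Optional[int] = None


def validate_entry(entry: ServiceEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = []
    pattern = entry.process_pattern.strip()
    if not pattern:
        problems.append("process pattern is empty")
    elif pattern.lower() in BROAD_PATTERNS:
        problems.append("process pattern is too broad")
    if entry.port is not None and not 1 <= entry.port <= 65535:
        problems.append("port out of range")
    if entry.health_file is not None and entry.stale_threshold_s is None:
        problems.append("health file configured without a stale threshold")
    return problems


# The three rows from the manifest table above.
manifest = [
    ServiceEntry("api-server", "gunicorn.*app:app", "/tmp/api_health.json", 8000, 300),
    ServiceEntry("worker", "celery.*worker", "/tmp/worker_health.json", None, 300),
    ServiceEntry("cache", "redis-server", None, 6379, None),
]

# Gate: at least one service, and every entry passes validation.
gate_passes = bool(manifest) and all(not validate_entry(e) for e in manifest)
```

A bare interpreter name such as "python" fails validation here, which is the cue to inspect the process arguments and narrow the pattern.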
Phase 2: Check
Goal: Gather health signals for every service in the manifest. Always check process status independently of health file content: a running process and a healthy health file are separate signals.
Step 1: Check process status
For each service, run process check:
pgrep -f "<process_pattern>"
Record: running (true/false), PIDs, process count.
Rationale: Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient: the service may be hung, or it may have failed to bind to its port.
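The process check can be wrapped so that it records the three values listed above. This is a sketch under assumptions: `ProcessStatus`, `run_pgrep` and `check_process` are hypothetical names, and the `runner` parameter exists only so the parsing can be exercised without a live process table.

```python
import subprocess
from typing import Callable, NamedTuple


class ProcessStatus(NamedTuple):
    running: bool
    pids: list
    count: int


def run_pgrep(pattern: str) -> tuple:
    """Run `pgrep -f <pattern>` and return (exit code, stdout)."""
    result = subprocess.run(
        ["pgrep", "-f", pattern], capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout


def check_process(pattern: str, runner: Callable = run_pgrep) -> ProcessStatus:
    """Record running state, PIDs and process count for one pattern."""
    code, out = runner(pattern)
    # pgrep exits 0 when at least one process matched and 1 when none matched.
    pids = [int(token) for token in out.split()] if code == 0 else []
    return ProcessStatus(running=bool(pids), pids=pids, count=len(pids))
```

Calling `check_process("gunicorn.*app:app")` on a host with `pgrep` installed returns the live result. A count well above what is expected is a hint that the pattern is too broad.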
Step 2: Parse health files (if configured)
Read and parse JSON health files. Evaluate the timestamp against the stale threshold, along with the status and connection fields.
Critical constraint: Never trust health file content alone. The file could be stale from before a process crash. Always verify that the process is running and that the file's timestamp is within the stale threshold.
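A minimal sketch of health file parsing with a staleness check follows. The function names and the shape of the returned dictionary are assumptions made for illustration. It also assumes timestamps without an offset are UTC. Note that `datetime.fromisoformat` rejects a trailing `Z` on Python versions before 3.11, in which case the file is reported as unreadable.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def evaluate_health_text(raw: str, stale_threshold_s: int = 300,
                         now: Optional[datetime] = None) -> dict:
    """Parse health JSON and report its age; never raises on malformed input."""
    now = now or datetime.now(timezone.utc)
    try:
        data = json.loads(raw)
        stamp = datetime.fromisoformat(data["timestamp"])
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing timestamp, or a half-written file.
        return {"readable": False, "error": str(exc)}
    if stamp.tzinfo is None:
        # Assumption: naive timestamps are UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    age_s = (now - stamp).total_seconds()
    return {
        "readable": True,
        "status": data.get("status"),
        "age_s": age_s,
        "stale": age_s > stale_threshold_s,
    }


def read_health_file(path: str, stale_threshold_s: int = 300,
                     now: Optional[datetime] = None) -> dict:
    """Read a health file from disk, treating I/O errors as unreadable."""
    try:
        with open(path, encoding="utf-8") as fh:
            raw = fh.read()
    except (OSError, ValueError) as exc:
        return {"readable": False, "error": str(exc)}
    return evaluate_health_text(raw, stale_threshold_s, now)
```

The result says nothing about whether the process is alive, so it should always be combined with the process check rather than used on its own.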
Step 3: Probe ports (if configured)
Check if expected ports are listening:
ss -tlnp "sport = :<port>"
Rationale: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY.
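The port probe can be scripted around the same `ss` invocation. The sketch below is illustrative only: `run_ss` and `port_is_listening` are hypothetical helpers, and the parsing assumes the usual `ss` layout in which the local address column ends in `:<port>`. `run_ss` raises `FileNotFoundError` if `ss` is not installed.

```python
import subprocess
from typing import Callable


def run_ss(port: int) -> str:
    """Run `ss -tlnp "sport = :<port>"` and return its stdout."""
    result = subprocess.run(
        ["ss", "-tlnp", f"sport = :{port}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def port_is_listening(port: int, runner: Callable = run_ss) -> bool:
    """Return True if any output line shows a local address ending in :<port>."""
    suffix = f":{port}"
    for line in runner(port).splitlines():
        # The peer column of a listening socket ends in ":*", and the header
        # contains "Address:Port", so neither can match a numeric port suffix.
        if any(token.endswith(suffix) for token in line.split()):
            return True
    return False
```

A result of `False` for a service whose process is running is the "started but never bound" case described above.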
Step 4: Evaluate health per service
Apply this decision tree (constraints embedded in logic):
Gate: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes.
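One way to combine the three signals, consistent with the rationale in Steps 1 to 3, is sketched below. The ordering of the checks, the `health` dictionary keys (`readable`, `stale`, `status`) and the choice to map every non-healthy file status to WARNING are assumptions, not the skill's definitive logic.

```python
from typing import Optional


def evaluate(running: bool,
             port_listening: Optional[bool] = None,
             health: Optional[dict] = None) -> tuple:
    """Return (status, reason). None means that signal is not configured."""
    if not running:
        # A missing process always means DOWN, whatever the health file says.
        return "DOWN", "Process not found"
    if port_listening is False:
        # Running but not bound to its port is effectively DOWN.
        return "DOWN", "Process running but expected port not listening"
    if health is not None:
        if not health.get("readable", False):
            return "WARNING", "Health file missing or unreadable"
        if health.get("stale"):
            return "WARNING", "Health file stale"
        if health.get("status") != "healthy":
            return "WARNING", f"Health file reports {health.get('status')}"
    return "HEALTHY", "All configured signals passed"
```

Every returned status carries the reason that produced it, so no service receives a status without a supporting signal.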
Phase 3: Report
Goal: Produce a structured, actionable health report with specific remediation commands.
Step 1: Generate summary
SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N
RESULTS:
service-name [OK ] HEALTHY PID 12345, uptime 2d 4h
background-worker [WARN] WARNING Health file stale (15 min)
cache-service [DOWN] DOWN Process not found
RECOMMENDATIONS:
background-worker: Restart recommended - health file not updated in 900s
cache-service: Start service - process not running
SUGGESTED ACTIONS:
systemctl restart background-worker
systemctl start cache-service
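The summary header and results rows of the template above could be rendered as follows. This is a sketch: `format_report`, the `LABELS` mapping and the column widths are assumptions, and the recommendations and suggested actions are left out because they depend on how each service is managed.

```python
# Short labels used in the bracketed column of the results section.
LABELS = {"HEALTHY": "OK", "WARNING": "WARN", "DOWN": "DOWN"}


def format_report(results: list) -> str:
    """Render results given as (service name, status, detail) tuples."""
    healthy = sum(1 for _, status, _ in results if status == "HEALTHY")
    lines = [
        "SERVICE HEALTH REPORT",
        "=====================",
        f"Checked: {len(results)} services",
        f"Healthy: {healthy}/{len(results)}",
        "RESULTS:",
    ]
    for name, status, detail in results:
        # Pad each column so the rows line up in a terminal.
        lines.append(f"{name:<20} [{LABELS[status]:<4}] {status:<8} {detail}")
    return "\n".join(lines)


print(format_report([
    ("service-name", "HEALTHY", "PID 12345, uptime 2d 4h"),
    ("background-worker", "WARNING", "Health file stale (15 min)"),
    ("cache-service", "DOWN", "Process not found"),
]))
```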
Step 2: Set exit status
Step 3: Present to user
Gate: Report delivered with actionable recommendations for all non-healthy services.
User says: "Are all services up?" Actions:
User says: "The background worker seems stuck" Actions:
Cause: No services.json, docker-compose, or systemd units discovered Solution:
Cause: Pattern too broad (e.g., "python" matches all Python processes) Solution:
Use ps aux | grep to identify distinguishing arguments
Cause: Malformed JSON, permissions issue, or file being written during read Solution:
Check file permissions with ls -la
Services should write health files as:
{
  "timestamp": "ISO8601, updated every 30-60s",
  "status": "healthy|degraded|error",
  "connection": "connected|disconnected|reconnecting",
  "last_activity": "ISO8601 of last meaningful action",
  "running": true,
  "uptime_seconds": 12345,
  "metrics": {}
}
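On the service side, a writer for this format might look like the sketch below. `write_health_file` is a hypothetical helper. Writing to a temporary file and renaming it is one way to avoid the "file being written during read" failure, and the explicit `chmod` is an assumption that the checker may run as a different user.

```python
import json
import os
import tempfile
import time
from datetime import datetime, timezone

_START = time.monotonic()  # process start reference for uptime_seconds


def write_health_file(path: str, status: str = "healthy",
                      connection: str = "connected") -> dict:
    """Write the health payload atomically so readers never see a partial file."""
    now = datetime.now(timezone.utc).isoformat()
    payload = {
        "timestamp": now,
        "status": status,
        "connection": connection,
        "last_activity": now,  # assumption: a real service tracks this separately
        "running": True,
        "uptime_seconds": int(time.monotonic() - _START),
        "metrics": {},
    }
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as owner-only
    os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return payload
```

Calling this from a timer or main loop every 30 to 60 seconds keeps the timestamp well inside a 300s stale threshold.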
| Constraint | Rationale | Application |
|-----------|-----------|-------------|
| Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file |
| Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold |
| Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified |
| No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it |
| Narrow process patterns required | "python" matches every Python process, giving false matches | Use full paths or specific args; validate with ps aux \| grep |
| Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) |