Service Health Check Skill

Overview

This skill provides deterministic service health monitoring using the Discover-Check-Report pattern. It finds services, gathers health signals from multiple sources (process table, health files, port binding), and produces actionable reports identifying degraded or failed services.

Core principle: Health assessment is evidence-based. Never report a service healthy without verifying process status independently of health file content. Never assume a running process is functional — always cross-check against health files and port binding.

Reference Loading Table

| Signal | Load These Files | Why | |---|---|---| | Endpoint validation request | references/endpoint-validator.md | Full endpoint validation methodology | | Security header WARNs, HSTS/CSP/X-Frame issues | references/security-headers.md | Deep security header reference | | Config errors, hardcoded IPs, timeout problems | references/endpoint-config-preferred-patterns.md | Endpoint config patterns | | 401/403 failures, Bearer/API-key/cookie auth | references/auth-endpoint-patterns.md | Auth endpoint patterns | | CVE source audit request | references/cve-source-check.md | Full CVE source check methodology | | CVE registry schema questions | references/registry-schema.md | Registry shape and entry format | | CVE source URL verification | references/source-verification.md | HEAD-check semantics | | CVE report format questions | references/output-formats.md | JSON schema and Markdown sections |

Instructions

Phase 1: DISCOVER

Goal: Identify all services to check before running any health probes.

Step 1: Locate service definitions

Search for service configuration in this order:

services.json in project root
Docker/docker-compose files for service definitions
systemd unit files or process manager configs
User-provided service specification

Step 2: Build service manifest

For each service, establish:

## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |

Validation constraints:

Each process pattern must be specific enough to avoid false matches (e.g., "python" matches all Python processes—use full paths or arguments instead)
Health file paths must be absolute
Port numbers must be valid (1-65535)
Pattern specificity matters: narrow patterns with full command paths, distinguishing arguments, or specific binary names

Step 3: Validate manifest

Confirm each entry passes the constraints above. If a pattern is too broad, use ps aux | grep to identify distinguishing arguments, then update the pattern.

Gate: Service manifest complete with at least one service. Proceed only when gate passes.

Phase 2: CHECK

Goal: Gather health signals for every service in the manifest. Always check process status independently of health file content—a running process and a healthy health file are separate signals.

Step 1: Check process status

For each service, run process check:

pgrep -f "<process_pattern>"

Record: running (true/false), PIDs, process count.

Rationale: Process existence is the primary signal. A missing process always means the service is DOWN. A running process alone is insufficient—the service may have crashed or failed to bind to its port.

Step 2: Parse health files (if configured)

Read and parse JSON health files. Evaluate:

Does the file exist?
Does it parse as valid JSON?
How old is the timestamp (staleness)? Default stale threshold is 300 seconds.
What status does the service self-report?
What is the connection state?

Critical constraint: Never trust health file content alone. The file could be stale from before a process crash. Always verify:

Process is still running
Health file timestamp is fresh (within configured threshold)
Status field matches evidence (e.g., "error" requires restart)

Step 3: Probe ports (if configured)

Check if expected ports are listening:

ss -tlnp "sport = :<port>"

Rationale: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY.

Step 4: Evaluate health per service

Apply this decision tree (constraints embedded in logic):

Process not running → DOWN (definitive)
Process running + health file missing → WARNING (limited visibility, but process is alive)
Process running + health file stale (> threshold) → WARNING (file hasn't updated in configured time, suggests no activity or crash recovery in progress)
Process running + status=error → ERROR (restart recommended immediately)
Process running + disconnected > 30 minutes → WARNING (long disconnect suggests stuck state, restart recommended)
Process running + disconnected < 30 minutes → DEGRADED (allow reconnection window, monitor)
Process running + port not listening (when port is configured) → ERROR (process running but failed to bind port)
Process running + healthy → HEALTHY (all checks pass)
Process running + no health file configured → RUNNING (limited visibility, process verified only)

Gate: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes.

Phase 3: REPORT

Goal: Produce structured, actionable health report with specific remediation commands.

Step 1: Generate summary

SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name         [OK  ] HEALTHY     PID 12345, uptime 2d 4h
  background-worker    [WARN] WARNING     Health file stale (15 min)
  cache-service        [DOWN] DOWN        Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service

Step 2: Set exit status

All HEALTHY/RUNNING → exit 0
Any WARNING/DEGRADED/ERROR/DOWN → exit 1

Step 3: Present to user

Lead with the summary line (X/N healthy)
Highlight any services needing action
Provide copy-pasteable commands for remediation
Never auto-restart without explicit user flag. Always report findings first, let user decide.

Gate: Report delivered with actionable recommendations for all non-healthy services.

Examples

Example 1: Routine Health Check

User says: "Are all services up?" Actions:

Locate services.json, build manifest (DISCOVER)
Check each process, parse health files, probe ports (CHECK)
Output structured report showing 3/3 healthy (REPORT) Result: Clean report, no action needed

Example 2: Stale Worker Detection

User says: "The background worker seems stuck" Actions:

Identify worker service from config (DISCOVER)
Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree
Report WARNING with restart recommendation (REPORT) Result: Specific diagnosis with actionable command

Error Handling

Error: "No Service Configuration Found"

Cause: No services.json, docker-compose, or systemd units discovered Solution:

Ask user for service name and process pattern
Build minimal manifest from user input
Proceed with manual configuration

Error: "Process Pattern Matches Too Many PIDs"

Cause: Pattern too broad (e.g., "python" matches all Python processes) Solution:

Narrow pattern with full command path or arguments
Use ps aux | grep to identify distinguishing arguments
Update manifest with more specific pattern
Rationale: False positives hide real failures. Specificity is required to avoid misdiagnosis.

Error: "Health File Exists But Cannot Parse"

Cause: Malformed JSON, permissions issue, or file being written during read Solution:

Check file permissions with ls -la
Attempt raw read to inspect content
If mid-write, retry after 2-second delay
Report as WARNING with parse error details

References

Health File Format Reference

Services should write health files as:

{
    "timestamp": "ISO8601, updated every 30-60s",
    "status": "healthy|degraded|error",
    "connection": "connected|disconnected|reconnecting",
    "last_activity": "ISO8601 of last meaningful action",
    "running": true,
    "uptime_seconds": 12345,
    "metrics": {}
}

Key Constraints Summary

| Constraint | Rationale | Application | |-----------|-----------|-------------| | Process status verified independently of health file | Running process ≠ functional service | Always check process before trusting health file | | Health file staleness detected by timestamp freshness | File could be stale from before crash | Check timestamp against 300s (configurable) threshold | | Port binding verified when configured | Process running doesn't mean port is bound | Always verify expected port listening when port specified | | No auto-restart without explicit flag | Restart masks root cause | Report findings first; only execute restart if user flags it | | Narrow process patterns required | "python" matches all processes, giving false matches | Use full paths or specific args; validate with ps aux \| grep | | Evidence-based status only | Status must have supporting signal | No status without concrete evidence (process, health file, or port) |

Service Health Check Skill

Overview

Reference Loading Table

Instructions

Phase 1: DISCOVER

Goal: Identify all services to check before running any health probes.

Step 1: Locate service definitions

Search for service configuration in this order:

services.json in project root
Docker/docker-compose files for service definitions
systemd unit files or process manager configs
User-provided service specification

Step 2: Build service manifest

For each service, establish:

## Service Manifest
| Service | Process Pattern | Health File | Port | Stale Threshold |
|---------|----------------|-------------|------|-----------------|
| api-server | gunicorn.*app:app | /tmp/api_health.json | 8000 | 300s |
| worker | celery.*worker | /tmp/worker_health.json | - | 300s |
| cache | redis-server | - | 6379 | - |

Validation constraints:

Each process pattern must be specific enough to avoid false matches (e.g., "python" matches all Python processes—use full paths or arguments instead)
Health file paths must be absolute
Port numbers must be valid (1-65535)
Pattern specificity matters: narrow patterns with full command paths, distinguishing arguments, or specific binary names

Step 3: Validate manifest

Confirm each entry passes the constraints above. If a pattern is too broad, use ps aux | grep to identify distinguishing arguments, then update the pattern.

Gate: Service manifest complete with at least one service. Proceed only when gate passes.

Phase 2: CHECK

Goal: Gather health signals for every service in the manifest. Always check process status independently of health file content—a running process and a healthy health file are separate signals.

Step 1: Check process status

For each service, run process check:

pgrep -f "<process_pattern>"

Record: running (true/false), PIDs, process count.

Step 2: Parse health files (if configured)

Read and parse JSON health files. Evaluate:

Does the file exist?
Does it parse as valid JSON?
How old is the timestamp (staleness)? Default stale threshold is 300 seconds.
What status does the service self-report?
What is the connection state?

Critical constraint: Never trust health file content alone. The file could be stale from before a process crash. Always verify:

Process is still running
Health file timestamp is fresh (within configured threshold)
Status field matches evidence (e.g., "error" requires restart)

Step 3: Probe ports (if configured)

Check if expected ports are listening:

ss -tlnp "sport = :<port>"

Rationale: Verify ports are actually bound. A process can start but fail to bind to its configured port—that is effectively a DOWN state, not HEALTHY.

Step 4: Evaluate health per service

Apply this decision tree (constraints embedded in logic):

Process not running → DOWN (definitive)
Process running + health file missing → WARNING (limited visibility, but process is alive)
Process running + health file stale (> threshold) → WARNING (file hasn't updated in configured time, suggests no activity or crash recovery in progress)
Process running + status=error → ERROR (restart recommended immediately)
Process running + disconnected > 30 minutes → WARNING (long disconnect suggests stuck state, restart recommended)
Process running + disconnected < 30 minutes → DEGRADED (allow reconnection window, monitor)
Process running + port not listening (when port is configured) → ERROR (process running but failed to bind port)
Process running + healthy → HEALTHY (all checks pass)
Process running + no health file configured → RUNNING (limited visibility, process verified only)

Gate: All services evaluated with evidence-based status. No status is determined without concrete signal (process check, health file, or port probe). Proceed only when gate passes.

Phase 3: REPORT

Goal: Produce structured, actionable health report with specific remediation commands.

Step 1: Generate summary

SERVICE HEALTH REPORT
=====================
Checked: N services
Healthy: X/N

RESULTS:
  service-name         [OK  ] HEALTHY     PID 12345, uptime 2d 4h
  background-worker    [WARN] WARNING     Health file stale (15 min)
  cache-service        [DOWN] DOWN        Process not found

RECOMMENDATIONS:
  background-worker: Restart recommended - health file not updated in 900s
  cache-service: Start service - process not running

SUGGESTED ACTIONS:
  systemctl restart background-worker
  systemctl start cache-service

Step 2: Set exit status

All HEALTHY/RUNNING → exit 0
Any WARNING/DEGRADED/ERROR/DOWN → exit 1

Step 3: Present to user

Lead with the summary line (X/N healthy)
Highlight any services needing action
Provide copy-pasteable commands for remediation
Never auto-restart without explicit user flag. Always report findings first, let user decide.

Gate: Report delivered with actionable recommendations for all non-healthy services.

Examples

Example 1: Routine Health Check

User says: "Are all services up?" Actions:

Locate services.json, build manifest (DISCOVER)
Check each process, parse health files, probe ports (CHECK)
Output structured report showing 3/3 healthy (REPORT) Result: Clean report, no action needed

Example 2: Stale Worker Detection

User says: "The background worker seems stuck" Actions:

Identify worker service from config (DISCOVER)
Find process running but health file 20 minutes stale (CHECK) — triggers WARNING decision in tree
Report WARNING with restart recommendation (REPORT) Result: Specific diagnosis with actionable command

Error Handling

Error: "No Service Configuration Found"

Cause: No services.json, docker-compose, or systemd units discovered Solution:

Ask user for service name and process pattern
Build minimal manifest from user input
Proceed with manual configuration

Error: "Process Pattern Matches Too Many PIDs"

Cause: Pattern too broad (e.g., "python" matches all Python processes) Solution:

Narrow pattern with full command path or arguments
Use ps aux | grep to identify distinguishing arguments
Update manifest with more specific pattern
Rationale: False positives hide real failures. Specificity is required to avoid misdiagnosis.

Error: "Health File Exists But Cannot Parse"

Cause: Malformed JSON, permissions issue, or file being written during read Solution:

Check file permissions with ls -la
Attempt raw read to inspect content
If mid-write, retry after 2-second delay
Report as WARNING with parse error details

References

Health File Format Reference

Services should write health files as:

{
    "timestamp": "ISO8601, updated every 30-60s",
    "status": "healthy|degraded|error",
    "connection": "connected|disconnected|reconnecting",
    "last_activity": "ISO8601 of last meaningful action",
    "running": true,
    "uptime_seconds": 12345,
    "metrics": {}
}

Adoption

notque/service-health-check

$ install --global

Security Scan Results

SKILL.md

Service Health Check Skill

Overview

Reference Loading Table

Instructions

Phase 1: DISCOVER

Phase 2: CHECK

Phase 3: REPORT

Examples

Example 1: Routine Health Check

Example 2: Stale Worker Detection

Error Handling

Error: "No Service Configuration Found"

Error: "Process Pattern Matches Too Many PIDs"

Error: "Health File Exists But Cannot Parse"

References

Health File Format Reference

Key Constraints Summary

Related Skills

notque/shell-config

notque/kubernetes

notque/swift

notque/php

notque/service-health-check

$ install --global

Security Scan Results

SKILL.md

Service Health Check Skill

Overview

Reference Loading Table

Instructions

Phase 1: DISCOVER

Phase 2: CHECK

Phase 3: REPORT

Examples

Example 1: Routine Health Check

Example 2: Stale Worker Detection

Error Handling

Error: "No Service Configuration Found"

Error: "Process Pattern Matches Too Many PIDs"

Error: "Health File Exists But Cannot Parse"

References

Health File Format Reference

Key Constraints Summary

Related Skills

notque/shell-config

notque/kubernetes

notque/swift

notque/php