wallacedobbs428/monitoring-observability

Name: monitoring-observability
Author: wallacedobbs428

.claude/skills/monitoring-observability/SKILL.md

npx skillsauth add wallacedobbs428/thecalltaker monitoring-observability

Monitoring & Observability

Build reliable monitoring that catches real problems and avoids alert fatigue.

The 3 Pillars

1. Logs — What Happened

Structured, searchable records of events.

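import json
import logging
from datetime import datetime, timezone

class StructuredFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        log_entry = {
            # Use the record's own creation time as timezone-aware UTC;
            # datetime.utcnow() is deprecated since Python 3.12.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # "engine" and "command" are custom fields passed via extra={...}
            "engine": getattr(record, "engine", "unknown"),
            "command": getattr(record, "command", ""),
            "message": record.getMessage(),
        }
        if record.exc_info:
            log_entry["error"] = self.formatException(record.exc_info)
        return json.dumps(log_entry)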

Log levels:

DEBUG — Verbose tracing (never in production logs)
INFO — Normal operations (email sent, lead scored)
WARNING — Recoverable issues (API retry, rate limited)
ERROR — Failed operations (API 500, state corruption)
CRITICAL — System failure (engine crash, service down)
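A minimal sketch of wiring a JSON formatter into a logger and passing a custom field through `extra`. The compact `JsonFormatter` stand-in, the `build_logger` helper, and the logger name `max-engine` are assumptions for illustration, not part of the skill itself.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Compact stand-in for a structured formatter (illustrative only)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "engine": getattr(record, "engine", "unknown"),
            "message": record.getMessage(),
        })


def build_logger(stream, name="max-engine", level="INFO"):
    """Attach the JSON formatter to a named logger writing to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.debug("cache miss")                               # dropped at INFO level
log.info("email sent", extra={"engine": "outreach"})  # custom field via extra
lines = buffer.getvalue().splitlines()
```

Keeping the level at INFO in production is what enforces the "DEBUG never in production logs" rule without touching call sites.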

2. Metrics — How Much / How Fast

Numeric measurements over time.

import json
import os
from datetime import datetime

# Metric types
metrics = {
    "emails_sent": 0,        # Counter: only goes up
    "api_latency_ms": [],    # Histogram: distribution of samples
    "active_pilots": 3,      # Gauge: current value
    "queue_depth": 47,       # Gauge: current value
}

# Daily metrics file
def save_daily_metrics(metrics_dict):
    now = datetime.now()
    date = now.strftime("%Y-%m-%d")
    # open() fails if the directory does not exist yet, so create it first
    os.makedirs("logs/metrics", exist_ok=True)
    path = f"logs/metrics/{date}.json"
    with open(path, "w") as f:
        json.dump({
            "date": date,
            "collected_at": now.isoformat(),
            **metrics_dict
        }, f, indent=2)
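A histogram metric such as `api_latency_ms` is usually reported as percentiles rather than as raw samples. The sketch below shows one way to do that with the standard library; the function name `summarize_latency` and the sample values are made up for illustration.

```python
import statistics


def summarize_latency(samples_ms):
    """Reduce raw latency samples to count, mean, p50, p95, and max."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "count": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }


# Illustrative samples: mostly fast calls with two slow outliers
summary = summarize_latency([120, 95, 110, 480, 105, 98, 101, 2300, 99, 115])
```

The mean is pulled far above the median by the two outliers, which is why dashboards tend to show p50 and p95 side by side.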

3. Alerts — What Needs Attention

Notifications when metrics cross thresholds.

Alert priority matrix:

| Severity | Response Time | Channel | Example |
|----------|---------------|---------|---------|
| CRITICAL | < 5 min | Push + SMS | Engine crash, payment failure |
| HIGH | < 1 hour | Push notification | Hot lead, demo call |
| MEDIUM | Same day | Daily digest | Low send volume, high bounce |
| LOW | Next review | Dashboard only | Minor API errors, slow responses |
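The priority matrix can be encoded as a lookup table so that routing decisions live in one place. This is a sketch only: the `Route` type, the channel identifiers, and the `route_alert` helper are hypothetical names, and "same day" and "next review" are represented as `None` because they have no fixed deadline in minutes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    channels: Tuple[str, ...]              # where the alert is delivered
    respond_within_minutes: Optional[int]  # None means no fixed deadline


# Mirrors the severity rows of the priority matrix
ROUTES = {
    "critical": Route(("push", "sms"), 5),
    "high": Route(("push",), 60),
    "medium": Route(("daily_digest",), None),
    "low": Route(("dashboard",), None),
}


def route_alert(severity):
    """Return the delivery route for a severity, rejecting unknown values."""
    try:
        return ROUTES[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


critical_route = route_alert("CRITICAL")
```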

Health Check Pattern

Every service should expose a health check:

import json
import os
import time
from datetime import datetime

# STATE_FILE, ghl_get(), check_last_run(), and check_error_rate() are assumed
# to be defined elsewhere in the service. Each check returns a dict of the
# form {"status": "green" | "yellow" | "red", "message": str}.

def health_check():
    """Return overall health status with per-component details."""
    checks = {
        "state_file": check_state_file(),
        "ghl_api": check_ghl_api(),
        "last_run": check_last_run(),
        "error_rate": check_error_rate(),
    }

    # Overall status is the worst component status: red > yellow > green
    statuses = {check["status"] for check in checks.values()}
    if "red" in statuses:
        status = "red"
    elif "yellow" in statuses:
        status = "yellow"
    else:
        status = "green"

    return {
        "service": "max-engine",
        "status": status,  # green/yellow/red
        "checks": checks,
        "timestamp": datetime.now().isoformat(),
    }

def check_state_file():
    try:
        with open(STATE_FILE) as f:
            json.load(f)  # raises if the file is missing or not valid JSON
        age = time.time() - os.path.getmtime(STATE_FILE)
        if age > 7200:  # 2 hours
            return {"status": "yellow", "message": f"State file {age/3600:.1f}h old"}
        return {"status": "green", "message": "OK"}
    except Exception as e:
        return {"status": "red", "message": str(e)}

def check_ghl_api():
    try:
        start = time.monotonic()  # monotonic clock is safe for measuring durations
        ghl_get("/contacts?limit=1")
        latency = (time.monotonic() - start) * 1000
        if latency > 5000:
            return {"status": "yellow", "message": f"Slow: {latency:.0f}ms"}
        return {"status": "green", "message": f"{latency:.0f}ms"}
    except Exception as e:
        return {"status": "red", "message": str(e)}
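The health check above calls `check_error_rate()` without showing it. One possible shape for that logic is sketched below. It takes the call counts as arguments so it can be tested in isolation; the name `error_rate_status` and the 5% and 20% thresholds are assumptions, not values from the skill.

```python
def error_rate_status(total_calls, failed_calls, warn_pct=5.0, fail_pct=20.0):
    """Map an error rate onto the green/yellow/red health-check shape."""
    if total_calls <= 0:
        # No traffic in the window: nothing to judge, so report green
        return {"status": "green", "message": "No calls in window"}
    rate = 100.0 * failed_calls / total_calls
    message = f"{rate:.1f}% errors ({failed_calls}/{total_calls})"
    if rate >= fail_pct:
        return {"status": "red", "message": message}
    if rate >= warn_pct:
        return {"status": "yellow", "message": message}
    return {"status": "green", "message": message}


healthy = error_rate_status(total_calls=200, failed_calls=2)
degraded = error_rate_status(total_calls=200, failed_calls=20)
failing = error_rate_status(total_calls=200, failed_calls=50)
```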

Heartbeat Monitoring

For launchd services that run on schedules:

def write_heartbeat(service_name):
    """Write heartbeat file after successful run."""
    path = f"logs/{service_name}.heartbeat"
    with open(path, 'w') as f:
        json.dump({
            "service": service_name,
            "last_run": datetime.now().isoformat(),
            "status": "ok",
            "pid": os.getpid(),
        }, f)

def check_heartbeat(service_name, max_age_minutes=60):
    """Check if a service has run recently."""
    path = f"logs/{service_name}.heartbeat"
    if not os.path.exists(path):
        return {"status": "red", "message": "No heartbeat file"}

    age = time.time() - os.path.getmtime(path)
    if age > max_age_minutes * 60:
        return {"status": "red", "message": f"Last heartbeat {age/60:.0f}m ago"}
    return {"status": "green", "message": f"Last heartbeat {age/60:.0f}m ago"}
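Because the heartbeat check relies on the file's modification time, a reader could see a half-written file if the writer is interrupted. A hedged variant that writes to a temporary file and then renames it is sketched below; `write_heartbeat_atomic` and its `directory` parameter are hypothetical additions.

```python
import json
import os
import tempfile
from datetime import datetime


def write_heartbeat_atomic(service_name, directory="logs"):
    """Write the heartbeat to a temp file, then rename it into place."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{service_name}.heartbeat")
    payload = {
        "service": service_name,
        "last_run": datetime.now().isoformat(),
        "status": "ok",
        "pid": os.getpid(),
    }
    # The temp file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_path, path)  # replaces any previous heartbeat in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return path
```

Call it as the last step of a successful run so a crash partway through leaves the previous heartbeat in place and the staleness check still fires.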

Alert Fatigue Prevention

Deduplication

# Don't send same alert within window
DEDUPE_WINDOW = 1800  # 30 minutes
_recent_alerts = {}

def should_alert(key):
    now = time.time()
    if key in _recent_alerts and now - _recent_alerts[key] < DEDUPE_WINDOW:
        return False
    _recent_alerts[key] = now
    return True
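The same deduplication idea can be wrapped in a small class with an injectable clock, which makes the window testable and lets expired keys be pruned so the dictionary does not grow without bound. `AlertDeduper` is a hypothetical name and the pruning step is an addition to the original logic.

```python
import time


class AlertDeduper:
    """Suppress repeats of the same alert key within a time window."""

    def __init__(self, window_seconds=1800, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self._last_sent = {}

    def should_alert(self, key):
        now = self.clock()
        # Drop expired entries so the dict stays small
        self._last_sent = {
            k: sent_at for k, sent_at in self._last_sent.items()
            if now - sent_at < self.window
        }
        if key in self._last_sent:
            return False
        self._last_sent[key] = now
        return True


fake_now = [0.0]  # mutable cell standing in for the wall clock
deduper = AlertDeduper(window_seconds=1800, clock=lambda: fake_now[0])
first = deduper.should_alert("ghl_api_down")
fake_now[0] = 600.0    # 10 minutes later: inside the window
second = deduper.should_alert("ghl_api_down")
fake_now[0] = 2400.0   # 40 minutes after the first alert: window has passed
third = deduper.should_alert("ghl_api_down")
```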

Alert Escalation

def escalate(issue_key, severity="medium"):
    """Escalate if issue persists."""
    state = load_state()
    issue = state.get("open_issues", {}).get(issue_key, {})

    if not issue:
        # First occurrence — log, don't alert
        state.setdefault("open_issues", {})[issue_key] = {
            "first_seen": datetime.now().isoformat(),
            "count": 1,
            "severity": severity,
        }
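    else:
        issue["count"] += 1
        # Check the higher threshold first; otherwise a medium issue
        # always matches the >= 3 branch and never reaches critical.
        if issue["count"] >= 5:
            # 5 or more failures: critical
            send_alert(issue_key, severity="critical")
        elif issue["count"] >= 3 and severity == "medium":
            # 3 or more failures: escalate to high
            send_alert(issue_key, severity="high")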

    save_state(state)
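The escalation function above counts failures but never resets the count, so the failures it counts are not necessarily consecutive. A minimal in-memory sketch of one way to handle that follows: clear the issue on success, and alert only when a threshold is first crossed. The function names and the choice to alert exactly at counts 3 and 5 are assumptions.

```python
def record_failure(open_issues, issue_key):
    """Increment the failure count; return a severity to alert at, or None."""
    issue = open_issues.setdefault(issue_key, {"count": 0})
    issue["count"] += 1
    if issue["count"] == 5:
        return "critical"
    if issue["count"] == 3:
        return "high"
    return None  # below a threshold, or already alerted for this level


def record_success(open_issues, issue_key):
    """A successful run closes the issue, so the next failure starts from 1."""
    open_issues.pop(issue_key, None)


open_issues = {}
alerts = []
for _ in range(5):
    alerts.append(record_failure(open_issues, "ghl_api"))
record_success(open_issues, "ghl_api")
after_recovery = record_failure(open_issues, "ghl_api")
```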

Quiet Hours

def in_quiet_hours():
    """Don't send non-critical alerts 10pm-7am."""
    hour = datetime.now().hour
    return hour >= 22 or hour < 7

def smart_alert(message, severity):
    if severity == "critical":
        send_now(message)  # Always send critical
    elif in_quiet_hours():
        queue_for_morning(message)  # Buffer non-critical
    else:
        send_now(message)

Dashboard Metrics

Key metrics to display:

DASHBOARD_METRICS = {
    # Business metrics
    "mrr": {"type": "gauge", "unit": "$"},
    "active_pilots": {"type": "gauge", "unit": "count"},
    "demo_calls_today": {"type": "counter", "unit": "count"},
    "leads_generated_today": {"type": "counter", "unit": "count"},

    # System metrics
    "api_calls_today": {"type": "counter", "unit": "count"},
    "api_error_rate": {"type": "gauge", "unit": "%"},
    "emails_sent_today": {"type": "counter", "unit": "count"},
    "sms_sent_today": {"type": "counter", "unit": "count"},

    # Health metrics
    "engines_healthy": {"type": "gauge", "unit": "count"},
    "services_running": {"type": "gauge", "unit": "count"},
    "last_crash": {"type": "timestamp", "unit": ""},
    "uptime_pct": {"type": "gauge", "unit": "%"},
}
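The `type` and `unit` fields in a spec like the one above can drive how each value is rendered. The sketch below assumes a hypothetical `format_metric` helper and uses made-up values.

```python
def format_metric(name, value, spec):
    """Render one dashboard value according to its type and unit."""
    if spec["type"] == "timestamp":
        return f"{name}: {value}"
    unit = spec["unit"]
    if unit == "$":
        return f"{name}: ${value:,.2f}"
    if unit == "%":
        return f"{name}: {value:.1f}%"
    return f"{name}: {value:,}"  # plain counts get thousands separators


# Illustrative values only
rows = [
    format_metric("mrr", 12500, {"type": "gauge", "unit": "$"}),
    format_metric("api_error_rate", 2.5, {"type": "gauge", "unit": "%"}),
    format_metric("emails_sent_today", 1200, {"type": "counter", "unit": "count"}),
    format_metric("last_crash", "2026-04-10T03:15:00", {"type": "timestamp", "unit": ""}),
]
```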

Incident Response Checklist

When an alert fires:

1. Acknowledge: Note the time and check whether it is a real issue or a false positive.
2. Assess scope: Is it one service or a cascade? Are customers affected?
3. Mitigate: Restart the service, roll back the change, or disable the feature.
4. Investigate: Check logs, state files, and recent deployments.
5. Fix: Apply the root-cause fix.
6. Verify: Confirm the service is healthy and check downstream systems.
7. Document: Write a postmortem covering what happened, why, and how to prevent it.
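Recording when each checklist step was completed gives the postmortem a ready-made timeline. The sketch below is one way to do that; the `IncidentLog` class, the step identifiers, and the rule that steps must be completed in order are assumptions for illustration.

```python
from datetime import datetime, timezone

STEPS = ("acknowledge", "assess", "mitigate", "investigate", "fix", "verify", "document")


class IncidentLog:
    """Track checklist progress for one incident, in order."""

    def __init__(self, alert_key):
        self.alert_key = alert_key
        self.timeline = []  # (step, UTC timestamp, note)

    def complete(self, step, note=""):
        done = len(self.timeline)
        if done == len(STEPS):
            raise ValueError("all steps already completed")
        if step != STEPS[done]:
            raise ValueError(f"expected step {STEPS[done]!r}, got {step!r}")
        self.timeline.append((step, datetime.now(timezone.utc).isoformat(), note))

    def remaining(self):
        return list(STEPS[len(self.timeline):])


incident = IncidentLog("ghl_api_down")
incident.complete("acknowledge", "real issue, not a false positive")
incident.complete("assess", "single service, no customer impact")
still_to_do = incident.remaining()
```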

Design and implement monitoring, alerting, and observability systems. Use when setting up health checks, log aggregation, metrics dashboards, alert routing, uptime monitoring, or debugging production issues. Covers structured logging, health endpoints, metric collection, alert fatigue prevention, and incident response.

development

Updated Apr 17, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add wallacedobbs428/thecalltaker monitoring-observability

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 17, 2026, 5:12 AM · 19.5s · 1 file scanned

SKILL.md

name: monitoring-observability
description: Design and implement monitoring, alerting, and observability systems. Use when setting up health checks, log aggregation, metrics dashboards, alert routing, uptime monitoring, or debugging production issues. Covers structured logging, health endpoints, metric collection, alert fatigue prevention, and incident response.
category: infrastructure

Install manually

# Clone the repo
git clone https://github.com/wallacedobbs428/thecalltaker.git

# Copy into Claude Code skills folder (global)
cp -r thecalltaker/.claude/skills/monitoring-observability ~/.claude/skills/

Repository: wallacedobbs428/thecalltaker

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT