Alerting & On-Call

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

Setting up alerting rules and thresholds
Configuring on-call rotations and schedules
Implementing alert routing and escalation
Reducing alert fatigue
Managing incident response workflows

Prerequisites

Monitoring system (Prometheus, Datadog, etc.)
On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

# Severity levels
critical:
  - Service completely down
  - Data loss imminent
  - Security breach
  response: Immediate page, wake people up

high:
  - Service degraded significantly
  - Error rate above SLO
  - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  - Performance degradation
  - Non-critical component failure
  - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  - Informational alerts
  - Capacity planning triggers
  - Routine maintenance needed
  response: Email notification, weekly review

Alert Design Principles

# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership

Prometheus Alerting

Alert Rules

# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']

PagerDuty Integration

Service Configuration

# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400  # 4 hours
  acknowledgement_timeout = 600    # 10 minutes

  incident_urgency_rule {
    type    = "use_support_hours"
    
    during_support_hours {
      type    = "constant"
      urgency = "high"
    }
    
    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}

Schedule Configuration

resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0  # Sunday
    }
  }
}

Grafana OnCall

Integration Setup

# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"

Escalation Chain

# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
        
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
        
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
        
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m

Alert Templates

Slack Alert Template

{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}

PagerDuty Details Template

{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}

On-Call Best Practices

Rotation Guidelines

on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"
  
  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Handoff unresolved issues
    
  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents
    
  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift

Runbook Template

# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production

If database related:

kubectl delete pod -l app=postgres -n production

Escalation

If not resolved within 15 minutes, escalate to:

Database team: @db-oncall
Platform team: @platform-oncall


## Alert Fatigue Reduction

### Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation
    
  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows
    
  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling
    
  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data

Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously Solution: Implement proper grouping and inhibition rules

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call Solution: Test escalation policies, verify contact methods

Issue: False Positives

Problem: Alerts firing without actual issues Solution: Tune thresholds, increase evaluation windows

Best Practices

Define clear severity levels
Every alert needs a runbook
Test on-call notifications regularly
Review and tune alerts weekly
Implement proper escalation paths
Use alert grouping and inhibition
Track alert metrics (MTTR, frequency)
Practice incident response regularly

Related Skills

prometheus-grafana - Monitoring setup
incident-response - Incident handling
runbook-creation - Runbook creation

Alerting & On-Call

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

Setting up alerting rules and thresholds
Configuring on-call rotations and schedules
Implementing alert routing and escalation
Reducing alert fatigue
Managing incident response workflows

Prerequisites

Monitoring system (Prometheus, Datadog, etc.)
On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

# Severity levels
critical:
  - Service completely down
  - Data loss imminent
  - Security breach
  response: Immediate page, wake people up

high:
  - Service degraded significantly
  - Error rate above SLO
  - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  - Performance degradation
  - Non-critical component failure
  - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  - Informational alerts
  - Capacity planning triggers
  - Routine maintenance needed
  response: Email notification, weekly review

Alert Design Principles

# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership

Prometheus Alerting

Alert Rules

# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']

PagerDuty Integration

Service Configuration

# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400  # 4 hours
  acknowledgement_timeout = 600    # 10 minutes

  incident_urgency_rule {
    type    = "use_support_hours"
    
    during_support_hours {
      type    = "constant"
      urgency = "high"
    }
    
    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}

Schedule Configuration

resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800  # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0  # Sunday
    }
  }
}

Grafana OnCall

Integration Setup

# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"

Escalation Chain

# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0
        
      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m
        
      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m
        
      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m

Alert Templates

Slack Alert Template

{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}

PagerDuty Details Template

{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}

On-Call Best Practices

Rotation Guidelines

on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"
  
  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Handoff unresolved issues
    
  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents
    
  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift

Runbook Template

# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production

If database related:

kubectl delete pod -l app=postgres -n production

Escalation

If not resolved within 15 minutes, escalate to:

Database team: @db-oncall
Platform team: @platform-oncall


## Alert Fatigue Reduction

### Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation
    
  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows
    
  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling
    
  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data

Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously Solution: Implement proper grouping and inhibition rules

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call Solution: Test escalation policies, verify contact methods

Issue: False Positives

Problem: Alerts firing without actual issues Solution: Tune thresholds, increase evaluation windows

Best Practices

Define clear severity levels
Every alert needs a runbook
Test on-call notifications regularly
Review and tune alerts weekly
Implement proper escalation paths
Use alert grouping and inhibition
Track alert metrics (MTTR, frequency)
Practice incident response regularly

Related Skills

prometheus-grafana - Monitoring setup
incident-response - Incident handling
runbook-creation - Runbook creation

Adoption

bagelhole/alerting-oncall

$ install --global

Security Scan Results

SKILL.md

Alerting & On-Call

When to Use This Skill

Prerequisites

Alerting Best Practices

Alert Categories

Alert Design Principles

Prometheus Alerting

Alert Rules

Alertmanager Configuration

PagerDuty Integration

Service Configuration

Schedule Configuration

Grafana OnCall

Integration Setup

Escalation Chain

Alert Templates

Slack Alert Template

PagerDuty Details Template

On-Call Best Practices

Rotation Guidelines

Runbook Template

If database related:

Escalation

Common Issues

Issue: Alert Storm

Issue: Missed Alerts

Issue: False Positives

Best Practices

Related Skills

Related Skills

bagelhole/sre-dashboards

bagelhole/openclaw-security-hardening

bagelhole/vector-database-ops

bagelhole/model-serving-kubernetes

bagelhole/alerting-oncall

$ install --global

Security Scan Results

SKILL.md

Alerting & On-Call

When to Use This Skill

Prerequisites

Alerting Best Practices

Alert Categories

Alert Design Principles

Prometheus Alerting

Alert Rules

Alertmanager Configuration

PagerDuty Integration

Service Configuration

Schedule Configuration

Grafana OnCall

Integration Setup

Escalation Chain

Alert Templates

Slack Alert Template

PagerDuty Details Template

On-Call Best Practices

Rotation Guidelines

Runbook Template

If database related:

Escalation

Common Issues

Issue: Alert Storm

Issue: Missed Alerts

Issue: False Positives

Best Practices

Related Skills

Related Skills

bagelhole/sre-dashboards

bagelhole/openclaw-security-hardening

bagelhole/vector-database-ops

bagelhole/model-serving-kubernetes