skills/alertguardian-intelligent-alert-life-cycle/SKILL.md
Build intelligent alert lifecycle management systems for cloud infrastructure using graph-based denoising, RAG-powered summarization, and multi-agent rule refinement.
Trigger phrases:
- "reduce alert fatigue in our monitoring system"
- "deduplicate and correlate alerts"
- "summarize alerts for on-call engineers"
- "refine our alerting rules automatically"
- "build an alert denoising pipeline"
- "too many alerts, help me triage"
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills alertguardian-intelligent-alert-life-cycle
This skill enables Claude to design and implement alert lifecycle management systems inspired by the AlertGuardian framework (ASE 2025). The approach combines three complementary techniques — graph-based alert denoising with virtual noise injection, RAG-powered alert summarization, and multi-agent iterative rule refinement — to reduce alert volumes by up to 94.8% while maintaining 90.5% fault diagnosis accuracy. Use this skill when building or improving alerting pipelines for cloud-native infrastructure, Kubernetes clusters, microservice architectures, or any system drowning in monitoring noise.
AlertGuardian treats alert management as a three-phase lifecycle rather than a single-point problem. Phase 1 (Alert Denoise) constructs a heterogeneous alert graph where nodes are individual alert instances and edges encode temporal co-occurrence (alerts firing within a configurable time window) and topological affinity (alerts from the same service dependency chain). A graph neural network (GNN) propagates information across this graph to learn which alerts are correlated and should be grouped. The critical innovation is virtual noise injection: during training, synthetic noise alerts are injected into the graph to force the model to distinguish genuine correlations from spurious co-occurrences, improving generalization without requiring perfectly labeled training data.
Phase 2 (Alert Summary) applies Retrieval-Augmented Generation to produce concise, actionable summaries for on-call engineers. When a cluster of denoised alerts arrives, the system retrieves semantically similar historical incidents from a vector-indexed knowledge base, then prompts an LLM with the current alert context plus retrieved examples to generate a structured summary containing: root cause hypothesis, affected components, recommended actions, and severity justification. This replaces the manual work of reading dozens of individual alerts.
Phase 3 (Alert Rule Refinement) uses a multi-agent loop with four specialized roles (Rule Generator, Validator, Optimizer, and Reviewer) that iterate on alerting rule definitions. The Generator proposes candidate rules from observed alert patterns; the Validator tests them against held-out data, measuring precision and recall; the Optimizer adjusts thresholds and conditions to reduce false positives; the Reviewer audits edge cases and interpretability. The loop repeats until convergence criteria are met, then presents the refined rules for human approval. In production, SREs accepted 32% of the refined rules, so the system produces usable proposals while keeping humans in the loop.
Ingest and normalize alerts. Parse alert data from the monitoring source (Prometheus AlertManager JSON, PagerDuty webhooks, Datadog events) into a uniform schema: {alert_id, timestamp, source_service, severity, labels, description, status}. Store in a time-series-friendly format (e.g., sorted by timestamp with service indexing).
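A minimal sketch of the normalization step, assuming the input is an Alertmanager webhook payload. The `Alert` dataclass and `normalize_alertmanager` function are hypothetical names, and the choice of a `service` label to identify the owning service is an assumption that depends on your labeling conventions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    """Uniform alert schema from the step above."""
    alert_id: str
    timestamp: datetime
    source_service: str
    severity: str
    labels: dict = field(default_factory=dict)
    description: str = ""
    status: str = "firing"


def normalize_alertmanager(payload: dict) -> list[Alert]:
    """Map an Alertmanager webhook payload onto the schema, sorted by time."""
    alerts = []
    for raw in payload.get("alerts", []):
        labels = dict(raw.get("labels", {}))
        annotations = raw.get("annotations", {})
        # startsAt is an RFC 3339 string; rewrite a trailing "Z" so that
        # datetime.fromisoformat accepts it on older Python versions.
        ts = datetime.fromisoformat(raw["startsAt"].replace("Z", "+00:00"))
        alerts.append(Alert(
            alert_id=raw.get("fingerprint", ""),
            timestamp=ts.astimezone(timezone.utc),
            # Assumption: the owning service is carried in a "service" label.
            source_service=labels.get("service", "unknown"),
            severity=labels.get("severity", "unknown"),
            labels=labels,
            description=annotations.get("description", annotations.get("summary", "")),
            status=raw.get("status", "firing"),
        ))
    return sorted(alerts, key=lambda a: a.timestamp)


# Illustrative payload (made-up values).
sample = {"status": "firing", "alerts": [
    {"status": "firing", "fingerprint": "b1", "startsAt": "2025-01-01T12:05:00Z",
     "labels": {"alertname": "HighLatency", "service": "checkout", "severity": "warning"},
     "annotations": {"summary": "p99 latency high"}},
    {"status": "firing", "fingerprint": "a1", "startsAt": "2025-01-01T12:00:00Z",
     "labels": {"alertname": "DBConnErrors", "service": "payments-db", "severity": "critical"},
     "annotations": {"description": "connection errors"}},
]}
normalized = normalize_alertmanager(sample)
```

PagerDuty or Datadog payloads would need their own small adapters that emit the same `Alert` objects, so everything downstream sees one schema.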
Build the alert correlation graph. For each time window (e.g., 5 minutes), create nodes for each alert instance. Add temporal edges between alerts that fire within the window. Add topological edges between alerts whose source services share a dependency (use a service dependency map from Kubernetes labels, Consul, or a static topology file). Encode node features as vectors: [severity_one_hot, alert_type_embedding, source_service_embedding, time_delta_normalized].
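The edge construction can be sketched in plain Python before any GNN library is involved. This is an illustration under simplifying assumptions: `build_alert_edges` is a hypothetical helper, temporal edges use pairwise time distance rather than fixed window buckets, and "share a dependency" is read as same service or a direct dependency in either direction.

```python
from itertools import combinations


def build_alert_edges(alerts, depends_on, window_seconds=300):
    """Return (temporal_edges, topological_edges) as sets of sorted id pairs.

    alerts: list of (alert_id, unix_timestamp, service) tuples.
    depends_on: dict mapping a service to the set of services it calls.
    """
    temporal, topological = set(), set()
    # Pairwise comparison is O(n^2); acceptable for a sketch, not for 5,000 alerts/day.
    for (id_a, ts_a, svc_a), (id_b, ts_b, svc_b) in combinations(alerts, 2):
        pair = tuple(sorted((id_a, id_b)))
        if abs(ts_a - ts_b) <= window_seconds:
            temporal.add(pair)
        related = (
            svc_a == svc_b
            or svc_b in depends_on.get(svc_a, set())
            or svc_a in depends_on.get(svc_b, set())
        )
        if related:
            topological.add(pair)
    return temporal, topological


# Illustrative data (made-up values).
alerts = [
    ("a1", 0, "payments-db"),
    ("a2", 60, "checkout"),
    ("a3", 1000, "checkout"),
    ("a4", 1010, "search"),
]
depends_on = {"checkout": {"payments-db"}}
temporal_edges, topological_edges = build_alert_edges(alerts, depends_on)
```

Keeping the two edge types in separate sets makes it straightforward to feed them to a heterogeneous graph model later, or to merge them into a single `edge_index` with an edge-type feature.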
Train or apply the denoising GNN with virtual noise. Implement a 2-3 layer Graph Attention Network (GAT) or GraphSAGE model. During training, inject virtual noise by randomly adding synthetic alert nodes with shuffled features and random edges (10-20% noise ratio). Train with binary classification: real correlated alerts vs. noise. At inference, the model scores each alert; low-scoring alerts are suppressed, high-scoring alerts are grouped into incident clusters.
Index historical incidents for RAG retrieval. Build a vector store (FAISS, Pinecone, pgvector) of past incident reports, postmortems, and resolved alert clusters. Each entry includes: alert signatures, root cause, resolution steps, and affected services. Use an embedding model to index these documents.
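To show the shape of the index without depending on a particular vector store, here is a toy in-memory version. The bag-of-words `embed` function is only a stand-in for a real embedding model, and `IncidentIndex` is a hypothetical class; in practice FAISS, Pinecone, or pgvector would replace the linear scan.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class IncidentIndex:
    """In-memory index of past incidents with top-k similarity search."""

    def __init__(self):
        self._entries = []

    def add(self, incident_id, alert_signatures, root_cause, resolution, services):
        # Index the fields an on-call query is likely to mention.
        text = " ".join([*alert_signatures, root_cause, *services])
        doc = {
            "incident_id": incident_id,
            "alert_signatures": alert_signatures,
            "root_cause": root_cause,
            "resolution": resolution,
            "services": services,
        }
        self._entries.append((embed(text), doc))

    def search(self, query, k=3):
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


# Illustrative incidents (made-up identifiers and content).
index = IncidentIndex()
index.add("INC-1", ["DBConnPoolExhausted", "CheckoutTimeout"],
          "connection pool exhaustion on payments-db",
          "rolled back pool config", ["payments-db", "checkout-service"])
index.add("INC-2", ["DiskFull"], "log volume filled disk on search nodes",
          "rotated logs", ["search"])
hits = index.search("payments-db connection pool exhaustion", k=1)
```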
Generate structured alert summaries. For each incident cluster produced by the denoising step, retrieve the top-k (k=3-5) most similar historical incidents. Construct a prompt:
Given these active alerts: {cluster_alerts}
Similar past incidents: {retrieved_incidents}
Generate a structured summary with:
- Root cause hypothesis
- Affected components and blast radius
- Recommended immediate actions
- Severity assessment (P1-P4)
Parse the LLM response into a structured format for downstream consumption.
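One way to wire the prompt and the parsing together is sketched below. It assumes the model answers with "Heading: text" lines that echo the four headings from the prompt; `build_prompt`, `parse_summary`, and the `FIELDS` mapping are hypothetical names, and the LLM call itself is left out.

```python
import re

PROMPT_TEMPLATE = """Given these active alerts: {cluster_alerts}
Similar past incidents: {retrieved_incidents}
Generate a structured summary with:
- Root cause hypothesis
- Affected components and blast radius
- Recommended immediate actions
- Severity assessment (P1-P4)"""

# Output key -> heading the model is assumed to echo back.
FIELDS = {
    "root_cause": "root cause hypothesis",
    "affected": "affected components and blast radius",
    "actions": "recommended immediate actions",
    "severity": "severity assessment",
}


def build_prompt(cluster_alerts, retrieved_incidents):
    return PROMPT_TEMPLATE.format(
        cluster_alerts="; ".join(cluster_alerts),
        retrieved_incidents="; ".join(retrieved_incidents),
    )


def parse_summary(response):
    """Split a 'Heading: text' response into a dict; missing fields stay empty."""
    parsed = {key: "" for key in FIELDS}
    current = None
    for line in response.splitlines():
        stripped = line.strip().lstrip("-* ").strip()
        lowered = stripped.lower()
        for key, heading in FIELDS.items():
            if lowered.startswith(heading):
                current = key
                stripped = stripped.split(":", 1)[1].strip() if ":" in stripped else ""
                break
        if current and stripped:
            # Continuation lines are appended to the most recent heading.
            parsed[current] = (parsed[current] + " " + stripped).strip()
    match = re.search(r"\bP[1-4]\b", parsed["severity"])
    parsed["priority"] = match.group(0) if match else None
    return parsed


prompt = build_prompt(["DBConnErrors on payments-db"], ["past pool exhaustion incident"])
# Hand-written stand-in for a model response.
example_response = """Root cause hypothesis: pool exhaustion on payments-db
Affected components and blast radius: checkout-service
Recommended immediate actions:
1. Check pool metrics
2. Restart pods
Severity assessment: P2, customer-facing degradation"""
summary = parse_summary(example_response)
```

If the model supports JSON-constrained output, asking for JSON directly and validating it is likely more robust than heading-based parsing.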
Extract candidate alerting rules from alert patterns. Analyze the denoised alert clusters to identify recurring patterns: which metric thresholds trigger most frequently, which conditions produce false positives. Use a Rule Generator agent to propose new or modified rules in the target query language (PromQL, Datadog monitor JSON, etc.).
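Before any agent proposes rule changes, a simple frequency pass can identify which rules deserve attention. The sketch below assumes each historical alert has already been labeled as belonging to a real incident or not; `rank_noisy_rules` is a hypothetical helper.

```python
from collections import defaultdict


def rank_noisy_rules(history, min_fires=10):
    """Rank rules by false-positive rate.

    history: iterable of (alertname, was_real_incident) pairs.
    Returns a list of (alertname, fire_count, false_positive_rate),
    noisiest first; rules with fewer than min_fires firings are skipped.
    """
    fires = defaultdict(int)
    real = defaultdict(int)
    for name, was_real in history:
        fires[name] += 1
        real[name] += bool(was_real)
    ranked = [
        (name, fires[name], 1 - real[name] / fires[name])
        for name in fires
        if fires[name] >= min_fires
    ]
    return sorted(ranked, key=lambda r: (-r[2], -r[1], r[0]))


# Illustrative history: a rule that fires 200 times with 5 real incidents,
# and a second rule that is always real.
history = [("HighCPUUsage", i < 5) for i in range(200)] + [("DiskFull", True)] * 12
ranked_rules = rank_noisy_rules(history)
```

The top entries of this ranking are natural inputs for the Rule Generator agent.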
Validate rules against historical data. Replay proposed rules against a held-out dataset of historical alerts with known ground-truth labels. Compute precision (what fraction of triggered alerts correspond to real incidents) and recall (what fraction of real incidents are detected). Flag rules below configurable thresholds.
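The validation metrics reduce to set arithmetic once rule firings and ground-truth incidents are keyed by the same identifier. In this sketch that identifier is a time-window id, which is an assumption; `validate_rule` and its default thresholds are hypothetical.

```python
def validate_rule(fired, real_incidents, min_precision=0.5, min_recall=0.9):
    """Score a replayed rule against ground truth.

    fired: set of window ids in which the rule triggered.
    real_incidents: set of window ids that contained a real incident.
    """
    true_positives = len(fired & real_incidents)
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / len(real_incidents) if real_incidents else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "flagged": precision < min_precision or recall < min_recall,
    }


# Illustrative replay: 8 firings, 3 real incidents, 2 of them detected.
result = validate_rule(fired={1, 2, 3, 4, 5, 6, 7, 8}, real_incidents={1, 2, 9})
```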
Optimize rules through multi-agent iteration. Run the Optimizer agent to adjust thresholds, add exclusion conditions, or modify time windows to improve precision without sacrificing recall. Pass back to Validator. Repeat for 3-5 iterations or until metrics plateau.
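The Optimizer/Validator hand-off can be illustrated with a deterministic stand-in where the "Optimizer" only raises a numeric threshold. This is a simplification for illustration: in the described system the proposals come from an LLM agent, and `score`, `refine_threshold`, `step`, and `min_gain` are hypothetical names and parameters.

```python
def score(threshold, samples):
    """Validator stand-in: precision and recall of 'value > threshold'."""
    fired = [is_real for value, is_real in samples if value > threshold]
    total_real = sum(is_real for _, is_real in samples)
    true_positives = sum(fired)
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / total_real if total_real else 1.0
    return precision, recall


def refine_threshold(threshold, samples, step=0.05, max_iters=5, min_gain=0.01):
    """Raise the threshold while precision improves and recall does not drop."""
    best_p, best_r = score(threshold, samples)
    history = [(threshold, best_p, best_r)]
    for _ in range(max_iters):
        candidate = round(threshold + step, 4)  # Optimizer stand-in proposes a change
        p, r = score(candidate, samples)        # Validator re-scores the proposal
        if r < best_r or p - best_p < min_gain:
            break  # recall would be sacrificed, or precision has plateaued
        threshold, best_p, best_r = candidate, p, r
        history.append((threshold, p, r))
    return threshold, history


# Illustrative held-out samples: (cpu_utilization, was_real_incident).
samples = [(0.86, False), (0.88, False), (0.91, False),
           (0.93, True), (0.97, True), (0.99, True)]
final_threshold, history = refine_threshold(0.85, samples)
```

The stopping rule mirrors the step above: a bounded number of iterations, with an early exit when the metrics stop improving.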
Review and surface rules for human approval. The Reviewer agent checks edge cases, evaluates rule interpretability, and generates a human-readable diff showing old rule vs. proposed rule with expected impact metrics. Present to SREs for approval via PR, Slack message, or dashboard.
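For the human-readable diff, Python's standard `difflib` is one option. The sketch below uses the `HighCPUUsage` rule from Example 3 and its expected metrics; `rule_diff` is a hypothetical helper and the trailing impact comment is just one possible presentation.

```python
import difflib


def rule_diff(old_rule, new_rule, metrics):
    """Return a unified diff of two rule definitions plus an impact line."""
    diff = difflib.unified_diff(
        old_rule.splitlines(),
        new_rule.splitlines(),
        fromfile="current",
        tofile="proposed",
        lineterm="",
    )
    impact = ", ".join(f"{name}={value}" for name, value in metrics.items())
    return "\n".join([*diff, f"# expected impact: {impact}"])


old_rule = """- alert: HighCPUUsage
  expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) > 0.85
  for: 5m"""

new_rule = """- alert: HighCPUUsage
  expr: |
    avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) > 0.90
    unless on(instance)
    avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) > 0.30
  for: 10m"""

report = rule_diff(old_rule, new_rule,
                   {"expected_precision": "78%", "expected_recall": "95%"})
print(report)
```

The resulting text can be pasted into a PR description or a Slack message as-is.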
Deploy and establish feedback loops. Ship approved rules to the monitoring system. Log all suppressed alerts for periodic audit. Re-run the denoising model retraining on a weekly/monthly cadence as system behavior evolves. Track alert volume reduction ratio and false-negative rate as ongoing KPIs.
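The two KPIs named above are simple ratios. A sketch, assuming the suppression audit log lets you count real incidents whose alerts were all suppressed (the hypothetical `missed_incidents` argument):

```python
def alert_kpis(total_alerts, surfaced_alerts, missed_incidents, total_incidents):
    """Compute alert volume reduction ratio and false-negative rate."""
    if total_alerts <= 0 or total_incidents <= 0:
        raise ValueError("totals must be positive")
    return {
        # Share of raw alerts that never reached an engineer.
        "reduction_ratio": 1 - surfaced_alerts / total_alerts,
        # Share of real incidents that were suppressed entirely.
        "false_negative_rate": missed_incidents / total_incidents,
    }


# Illustrative numbers (made up, not results from the paper).
kpis = alert_kpis(total_alerts=5000, surfaced_alerts=400,
                  missed_incidents=2, total_incidents=50)
```

Tracking both together matters: a rising reduction ratio is only an improvement if the false-negative rate stays flat.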
Example 1: Building an alert denoising pipeline for Kubernetes
User: "We run 200 microservices on Kubernetes and get 5,000+ alerts per day from Prometheus. Most are noise. Help me build a denoising system."
Approach:
kube-state-metrics labels and Istio service mesh topology
Output structure:
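# alert_graph.py
import torch
from torch_geometric.nn import GATConv
from torch_geometric.data import Data


class AlertDenoiser(torch.nn.Module):
    """Two-layer GAT that classifies each alert node as noise (0) or real (1)."""

    def __init__(self, num_features, hidden_dim=64):
        super().__init__()
        self.conv1 = GATConv(num_features, hidden_dim, heads=4)
        # The 4 attention heads are concatenated, so the input width is hidden_dim * 4.
        self.conv2 = GATConv(hidden_dim * 4, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, 2)  # noise vs. real

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        return self.classifier(x)


def inject_virtual_noise(data: Data, noise_ratio=0.15, edges_per_noise=2):
    """Add synthetic noise nodes with shuffled features and random edges."""
    num_noise = int(data.num_nodes * noise_ratio)
    noise_features = data.x[torch.randperm(data.num_nodes)[:num_noise]]
    noise_labels = torch.zeros(num_noise, dtype=torch.long)  # label 0 = noise
    # Attach each noise node to randomly chosen real nodes, in both directions.
    # edges_per_noise is an illustrative parameter, not taken from the paper.
    noise_ids = torch.arange(data.num_nodes, data.num_nodes + num_noise)
    src = noise_ids.repeat_interleave(edges_per_noise)
    dst = torch.randint(0, data.num_nodes, (src.numel(),))
    noise_edges = torch.cat([torch.stack([src, dst]), torch.stack([dst, src])], dim=1)
    # Concatenate with the original graph; assumes data.y holds node labels (1 = real).
    augmented_data = Data(
        x=torch.cat([data.x, noise_features], dim=0),
        edge_index=torch.cat([data.edge_index, noise_edges], dim=1),
        y=torch.cat([data.y, noise_labels], dim=0),
    )
    return augmented_data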
Example 2: RAG-powered alert summarization for on-call
User: "When an incident fires, our on-call engineer has to read 30+ alerts manually. Build a summarizer that gives them a one-paragraph actionable brief."
Approach:
Output:
Incident Summary (auto-generated):
- Root Cause Hypothesis: Database connection pool exhaustion on payments-db
  triggered cascading timeouts in checkout-service and order-service.
- Affected Components: payments-db, checkout-service, order-service,
  api-gateway (degraded)
- Blast Radius: ~12% of checkout requests failing (estimated from
  error rate alerts)
- Recommended Actions:
  1. Check payments-db connection pool metrics and recent deployment changes
  2. Restart checkout-service pods if connection pool is stuck
  3. Verify no recent schema migrations on payments-db
- Severity: P2 (customer-facing degradation, not full outage)
- Similar Past Incident: INC-2847 (2024-09-14) - resolved by rolling
  back payments-db connection pool config change
Example 3: Multi-agent alert rule refinement
User: "Our HighCPUUsage alert fires 200 times a day but only 5 are real. Help me refine the rule automatically."
Approach:
node_cpu_seconds_total > 0.85 for 5m
unless on(instance) node_cpu_seconds_total{mode="iowait"} > 0.3 to exclude IO-bound spikes
Output:
# Before
- alert: HighCPUUsage
  expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) > 0.85
  for: 5m
# After (proposed)
- alert: HighCPUUsage
  expr: |
    avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) > 0.90
    unless on(instance)
    avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) > 0.30
  for: 10m
  labels:
    refinement_id: "AG-2025-0142"
    expected_precision: "78%"
    expected_recall: "95%"
Paper: AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems (ASE 2025)
Key insight: Treating alert management as a three-phase lifecycle (denoise, summarize, refine rules) with graph learning, RAG, and multi-agent iteration achieves 94.8% alert reduction while maintaining diagnostic accuracy. See Section 4 of the paper for the virtual noise injection technique and Section 6 for the multi-agent convergence criteria.