skills/anomaly-detection/SKILL.md
---
name: anomaly-detection
description: >-
  Anomaly detection methodologies, strategy selection, and application.
  Use when: identifying outliers in data, detecting unusual patterns in
  metrics or logs, choosing between statistical and ML-based detection
  methods, building alerting thresholds, analyzing time series for
  anomalies, investigating suspicious data points, designing anomaly
  detection pipelines, or evaluating whether a detected "anomaly" is real
  or an artifact. Works on time s
Use these when data is well-understood, roughly stationary, and you can define "normal" numerically.
| Method | How It Works | Best For | Assumptions | Limitations |
|--------|-------------|----------|-------------|-------------|
| Z-Score | Flag points > k standard deviations from the mean | Univariate, roughly normal data | Normal distribution, no trend | Fails with skewed data, sensitive to outliers in training set |
| Modified Z-Score (MAD) | Uses median absolute deviation instead of std dev | Same as Z-score but robust to existing outliers | Symmetric distribution | Less sensitive than Z-score for normally distributed data |
| IQR (Interquartile Range) | Flag points below Q1 − 1.5·IQR or above Q3 + 1.5·IQR | Non-normal, skewed distributions | None (non-parametric) | Misses anomalies in heavy-tailed distributions |
| Grubbs' Test | Hypothesis test for a single outlier | Confirming a specific point is an outlier | Normal distribution, one outlier at a time | Only tests one point per run |
| Control Charts (Shewhart) | Mean ± 3σ boundaries on sequential data | Manufacturing, SPC, stable processes | Stationary process | Doesn't handle seasonality or trend |
When to use Tier 1: You have a single metric, the distribution is understandable, and you can eyeball what "normal" looks like. Most operational alerting starts here.
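The first three Tier 1 rules can be sketched with only the Python standard library. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the default cutoffs (3.0, 3.5, 1.5) are common conventions rather than values prescribed by this skill, and `statistics.quantiles` uses its default "exclusive" quartile method, which can differ slightly from other tools.

```python
import statistics


def zscore_flags(values, k=3.0):
    """Flag points more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > k for v in values]


def modified_zscore_flags(values, k=3.5):
    """Flag points using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 rescales MAD so the score is comparable to a Z-score on normal data.
    return [0.6745 * abs(v - med) / mad > k for v in values]


def iqr_flags(values, multiplier=1.5):
    """Flag points below Q1 - m*IQR or above Q3 + m*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v < low or v > high for v in values]
```

Note that a single extreme point inflates both the mean and the standard deviation, so in small samples the plain Z-score can only barely exceed 3 however large the outlier is. That masking effect is the reason the MAD and IQR variants are listed as the robust alternatives.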
Use when your data has trend, seasonality, or autocorrelation that invalidates simple statistical bounds.
| Method | How It Works | Best For | Assumptions | Limitations |
|--------|-------------|----------|-------------|-------------|
| Moving Average ± Band | Rolling mean ± k · rolling std | Slow-moving metrics with noise | Local stationarity | Lags behind sudden shifts |
| Exponential Smoothing (ETS) | Weighted average favoring recent observations | Trended or seasonal univariate series | Consistent trend/seasonality pattern | Struggles with abrupt regime changes |
| STL Decomposition + Residual | Decompose into trend + seasonal + residual; flag large residuals | Strongly seasonal data (daily/weekly patterns) | Additive seasonality (log-transform a multiplicative series first) | Needs ≥ 2 full seasonal cycles |
| ARIMA / SARIMA | Fit autoregressive model; flag points outside prediction interval | Stationary or differenced series | Linear relationships, stationary residuals | Requires parameter tuning (p, d, q) |
| Prophet | Additive model with trend, seasonality, holidays | Business metrics with known calendar effects | Additive components | Can overfit with little data; slow on large series |
| CUSUM / EWMA Charts | Accumulate deviations from target; signal when cumulative sum exceeds threshold | Detecting small, persistent shifts | Stationary baseline | Not designed for sudden spikes (better for drift) |
When to use Tier 2: Your metric has daily, weekly, or seasonal patterns. A flat threshold would fire every Monday morning or every holiday.
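As an illustration of the simplest Tier 2 method, the sketch below implements a moving average ± band detector with a trailing window. The function name, the window size, and the choice to compare each point against the window that precedes it (rather than a centered window) are assumptions made for this example.

```python
import statistics
from collections import deque


def rolling_band_flags(series, window=10, k=3.0):
    """Flag points outside rolling mean +/- k * rolling std of the prior window."""
    history = deque(maxlen=window)
    flags = []
    for value in series:
        if len(history) == window:
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            # A zero-variance window gives no usable band, so nothing is flagged.
            flags.append(std > 0 and abs(value - mean) > k * std)
        else:
            flags.append(False)  # not enough history yet
        history.append(value)
    return flags
```

In this sketch a flagged value still enters the window, so it widens the band for the next `window` points. A stricter variant would leave flagged points out of the history.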
Use when the data is multivariate, the patterns are complex, or statistical methods can't capture what "normal" means.
| Method | How It Works | Best For | Pros | Cons |
|--------|-------------|----------|------|------|
| Isolation Forest | Randomly partitions feature space; anomalies are isolated in fewer splits | Tabular multivariate data | No distribution assumptions, handles high dimensionality | Requires training data, not inherently temporal |
| Local Outlier Factor (LOF) | Compares local density of a point to its neighbors | Clusters of varying density | Detects local anomalies that global methods miss | Expensive on large datasets; sensitive to k |
| One-Class SVM | Learns a boundary around "normal" data | Small datasets with clean normal examples | Works in high dimensions | Sensitive to kernel choice, slow on large data |
| Autoencoder (Neural) | Learns to reconstruct normal data; high reconstruction error = anomaly | Complex, high-dimensional data (images, logs, embeddings) | Captures non-linear patterns | Needs significant normal data; opaque |
| DBSCAN | Density-based clustering; points not in any cluster are anomalies | Spatial or clustered data | No need to specify cluster count | Sensitive to epsilon and min_samples |
When to use Tier 3: Simple methods produce too many false positives, the data is multivariate, or "normal" is a complex region in feature space — not a simple range.
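To make the idea of "normal as a region in feature space" concrete, here is a deliberately simple stand-in: a k-nearest-neighbor distance score in pure Python. It is not LOF or Isolation Forest, only a simpler relative of the density-based methods in the table, and the function name and default `k` are assumptions for illustration. For real workloads, the library implementations listed later in this skill are the better starting point.

```python
import math


def knn_distance_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbors.

    Points far from every dense region get high scores. Cost is O(n^2),
    so this is only suitable for small illustrative datasets.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores
```

The scores are relative, so they still need a threshold, for example one of the Tier 1 rules applied to the score distribution.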
Follow this decision tree to choose a method:
1. Is the data univariate (single metric)?
   ├── Yes → Is there seasonality or trend?
   │   ├── No → Tier 1: Z-Score, IQR, or Control Chart
   │   └── Yes → Tier 2: STL + Residual, SARIMA, or Prophet
   └── No (multivariate) →
       Is labeled anomaly data available?
       ├── Yes → Supervised classifier (beyond this skill's scope)
       └── No → Tier 3: Isolation Forest, LOF, or Autoencoder
2. How much data is available?
   ├── < 100 points → Tier 1 only (not enough for ML)
   ├── 100–10,000 → Tier 1 or Tier 2
   └── > 10,000 → Any tier is viable
3. What's the cost of a false positive vs. false negative?
   ├── FP is expensive (alert fatigue, manual investigation)
   │   → Favor higher thresholds, conservative methods (IQR, Grubbs')
   └── FN is expensive (missed fraud, missed outage)
       → Favor lower thresholds, ensemble methods
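Steps 1 and 2 of the tree can be encoded as a small helper. The function and parameter names are hypothetical. The tree does not say how the two steps interact, so treating the data-volume limit from step 2 as a cap on the tier suggested by step 1 is an assumption of this sketch. Step 3 affects thresholds rather than the tier, so it is left out.

```python
def suggest_tier(univariate, has_seasonality_or_trend, n_points, has_labels=False):
    """Suggest a method tier from the data shape, capped by data volume."""
    # Step 1: shape of the data.
    if univariate:
        by_shape = 2 if has_seasonality_or_trend else 1
    elif has_labels:
        return "Supervised classifier (beyond this skill's scope)"
    else:
        by_shape = 3

    # Step 2: how much data is available.
    if n_points < 100:
        cap = 1
    elif n_points <= 10_000:
        cap = 2
    else:
        cap = 3

    # Assumption: the volume cap overrides the shape-based suggestion.
    return f"Tier {min(by_shape, cap)}"
```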
Before selecting a method, understand what you're working with:
## Data Profile
**Source**: <e.g., "Prometheus metric: api_request_latency_p99">
**Type**: <univariate | multivariate> — <time series | tabular | event stream>
**Volume**: <row/point count, time range>
**Frequency**: <per-second | per-minute | daily | irregular>
**Distribution**: <roughly normal | skewed | heavy-tailed | multimodal | unknown>
**Stationarity**: <stationary | trend | seasonal | both | unknown>
**Known patterns**: <e.g., "higher on weekdays, spikes during deploys">
**Labeled anomalies available?**: <yes (n examples) | no>
An anomaly is domain-specific. Define it before applying any method:
## Anomaly Definition
**Point anomaly**: A single observation that deviates significantly
→ Example: "latency > 500ms when baseline is 50ms"
**Contextual anomaly**: Normal in one context, abnormal in another
→ Example: "1000 rps is normal at 2pm, anomalous at 3am"
**Collective anomaly**: A sequence of observations that together are unusual
→ Example: "5 consecutive requests from the same IP to /admin"
**Detection goal**: <Which type(s) are we looking for?>
**Action on detection**: <Alert | Log | Auto-remediate | Queue for investigation>
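The contextual case ("1000 rps is normal at 2pm, anomalous at 3am") can be sketched by keeping a separate baseline per context. Here the context is the hour of day. The function name, the `(hour, value)` tuple layout, and the 3σ rule are assumptions for illustration.

```python
import statistics
from collections import defaultdict


def contextual_flags(observations, baseline, k=3.0):
    """Flag values that deviate from the baseline for the same hour of day.

    Both arguments are lists of (hour, value) pairs.
    """
    by_hour = defaultdict(list)
    for hour, value in baseline:
        by_hour[hour].append(value)

    flags = []
    for hour, value in observations:
        history = by_hour.get(hour, [])
        if len(history) < 2:
            flags.append(False)  # no usable baseline for this context
            continue
        mean = statistics.fmean(history)
        std = statistics.stdev(history)
        flags.append(std > 0 and abs(value - mean) > k * std)
    return flags
```

With this structure the same value can pass in one context and be flagged in another, which a single global threshold cannot express.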
Build a model of "normal" before looking for deviations:
## Baseline
**Reference period**: <date range used for baseline>
**Exclusions**: <removed known incidents, maintenance windows, deploys>
**Statistics**: mean=<X>, median=<X>, std=<X>, p50=<X>, p95=<X>, p99=<X>
**Seasonality**: <daily cycle? weekly? none?>
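A baseline like the one in this template could be computed as follows. This is a sketch under assumptions: points are `(timestamp, value)` pairs with any comparable timestamp type, exclusion windows are inclusive `(start, end)` pairs, and percentiles use the "inclusive" method of `statistics.quantiles`. The function name is hypothetical.

```python
import statistics


def build_baseline(points, exclusions=()):
    """Summarize 'normal' after dropping points inside excluded windows."""
    kept = [
        value
        for ts, value in points
        if not any(start <= ts <= end for start, end in exclusions)
    ]
    # 99 cut points: index 49 is p50, index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(kept, n=100, method="inclusive")
    return {
        "n": len(kept),
        "mean": statistics.fmean(kept),
        "median": statistics.median(kept),
        "std": statistics.stdev(kept),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Passing known incidents, maintenance windows, and deploys as `exclusions` keeps them from being learned as normal.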
Choose a method from the tiers above using the selection strategy. Document the choice:
## Method Selection
**Method**: <name>
**Tier**: <1 | 2 | 3>
**Why this method**: <reasoning tied to data profile>
**Parameters**: <e.g., "Z-score threshold = 3.0", "IQR multiplier = 1.5", "Isolation Forest contamination = 0.01">
**Alternatives considered**: <what else was viable and why it wasn't chosen>
Every detection method needs calibration. Check these before trusting results:
| Check | What to Look For |
|-------|-----------------|
| False positive rate | Are flagged points actually anomalous? Manually inspect a sample. |
| False negative rate | Are known anomalies being caught? Test against historical incidents. |
| Threshold sensitivity | How much do results change with ±10% threshold adjustment? |
| Temporal stability | Does the method degrade over time as data distribution shifts? |
| Edge behavior | Does it flag every Monday? Every deploy? Every restart? (Contextual blindness) |
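The threshold sensitivity check can be run mechanically: count flags at the chosen threshold and at ±10%. The sketch below does this for a plain Z-score rule. The function names and the result labels are assumptions for illustration, and the same pattern applies to any method with a single numeric threshold.

```python
import statistics


def count_zscore_flags(values, k):
    """Count points more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    if std == 0:
        return 0
    return sum(abs(v - mean) > k * std for v in values)


def threshold_sensitivity(values, k=3.0, delta=0.10):
    """Compare flag counts at k scaled by (1 - delta), 1, and (1 + delta)."""
    return {
        "-10%": count_zscore_flags(values, k * (1 - delta)),
        "base": count_zscore_flags(values, k),
        "+10%": count_zscore_flags(values, k * (1 + delta)),
    }
```

If the counts change sharply across the three thresholds, the detection result depends more on the threshold than on the data and deserves a closer look.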
## Evaluation
**Flagged anomalies**: <count>
**Manually verified (sample of N)**: <true positive rate>
**Known incidents caught**: <X of Y>
**False positive pattern**: <e.g., "flags every deploy; needs deploy-window exclusion">
**Recommended threshold adjustment**: <raise / lower / add context filter>
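The numeric fields of this template can be filled from three inputs: the flagged points, the verdicts from a manually inspected sample, and the list of known incidents. The sketch below assumes each point or incident has a hashable identifier. All names are hypothetical.

```python
def evaluation_summary(flagged, sample_verdicts, known_incidents):
    """Summarize detection quality.

    flagged: identifiers of flagged points.
    sample_verdicts: dict of identifier -> True if manually confirmed anomalous.
    known_incidents: identifiers of historical incidents that should be caught.
    """
    flagged = set(flagged)
    known = set(known_incidents)
    confirmed = sum(1 for is_real in sample_verdicts.values() if is_real)
    return {
        "flagged": len(flagged),
        "sample_size": len(sample_verdicts),
        # Share of the inspected sample that was a true anomaly.
        "sample_precision": confirmed / len(sample_verdicts) if sample_verdicts else None,
        "incidents_caught": len(flagged & known),
        "incidents_total": len(known),
    }
```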
For ongoing detection (not one-time analysis):
## Operational Design
**Detection frequency**: <real-time | every 5 min | hourly | daily>
**Baseline refresh**: <static | rolling window of N days | retrain weekly>
**Alert routing**: <PagerDuty | Slack channel | log only>
**Suppression rules**: <e.g., "suppress during deploy windows", "deduplicate within 30 min">
**Escalation**: <when does a detected anomaly become an incident?>
**Dashboard**: <link to Grafana / CloudWatch / custom dashboard>
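The two example suppression rules ("suppress during deploy windows", "deduplicate within 30 min") could look like this in code. It is a sketch, assuming alerts arrive as time-ordered `(timestamp, key)` pairs and deploy windows as inclusive `(start, end)` pairs. The function name and data layout are hypothetical.

```python
from datetime import datetime, timedelta


def filter_alerts(alerts, deploy_windows, dedup=timedelta(minutes=30)):
    """Drop alerts inside deploy windows and repeats of the same key within `dedup`."""
    last_sent = {}
    kept = []
    for ts, key in alerts:
        if any(start <= ts <= end for start, end in deploy_windows):
            continue  # expected noise during a deploy
        previous = last_sent.get(key)
        if previous is not None and ts - previous < dedup:
            continue  # same alert was already sent recently
        last_sent[key] = ts
        kept.append((ts, key))
    return kept
```

Deduplication here is measured from the last alert that was actually sent, so a continuously firing condition produces one alert per dedup interval rather than none.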
Every anomaly detection analysis should produce:
Real-world pipelines often layer methods for better results:
| Pattern | How It Works | When to Use |
|---------|-------------|-------------|
| Decompose + Threshold | STL/seasonal decomposition → Z-score or IQR on residuals | Seasonal data where raw Z-scores fire on every peak |
| Ensemble voting | Run 2–3 methods; flag only when ≥ 2 agree | Reducing false positives when no single method is reliable |
| Tiered detection | Tier 1 for fast screening → Tier 3 on flagged windows | High-volume data where ML on every point is too expensive |
| Context layering | Statistical method + deploy/event calendar → suppress known causes | Operational alerting where deploys and maintenance cause expected spikes |
Rule of thumb: Don't combine methods to increase sensitivity — combine them to increase precision (fewer false positives).
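Ensemble voting reduces to a per-point vote count. The sketch below assumes each detector has already produced a list of booleans of the same length. The function name and the `min_votes` parameter are hypothetical.

```python
def ensemble_flags(flag_lists, min_votes=2):
    """Flag a point only when at least `min_votes` detectors flag it."""
    return [sum(votes) >= min_votes for votes in zip(*flag_lists)]
```

For example, with three detectors and `min_votes=2`, a point flagged by only one of them is dropped, which trades some sensitivity for precision in line with the rule of thumb above.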
| Library | Language | Best For |
|---------|----------|----------|
| scipy.stats | Python | Z-scores, distribution tests, and the t-distribution quantiles needed to compute Grubbs' critical value by hand |
| statsmodels | Python | STL decomposition, ARIMA, ETS, control charts |
| scikit-learn | Python | Isolation Forest, LOF, One-Class SVM, DBSCAN |
| prophet | Python | Business metrics with calendar seasonality |
| pyod | Python | Unified API for 30+ anomaly detection algorithms |
| adtk | Python | Time series anomaly detection toolkit |
| ruptures | Python | Change point detection |
| Grafana ML | Dashboard | Built-in anomaly detection on Prometheus/InfluxDB metrics |
Use the statistics skill for assumption validation (normality tests, distribution checks) before applying Tier 1 methods.
| Pitfall | Why It Fails |
|---------|-------------|
| Training on dirty data | If the baseline includes anomalies, the model learns them as normal |
| Flat thresholds on seasonal data | A fixed "alert at > 100ms" fires every peak hour and gets ignored |
| Ignoring concept drift | A model trained 6 months ago may flag today's normal traffic as anomalous |
| Over-alerting | 50 alerts per day = zero alerts acted on. Fewer, high-confidence alerts win. |
| Skipping manual inspection | Auto-detected anomalies must be spot-checked. Precision matters. |
| Using ML when Z-scores work | Complexity for its own sake adds maintenance burden and opacity |
| Confusing anomaly with root cause | Detection tells you something is unusual, not why. Investigation is a separate step. |
| No baseline exclusions | Including Black Friday in baseline data makes next week look anomalous |