.claude/skills/data-quality-check/SKILL.md
# Skill: Data Quality Check ## Purpose Validate data completeness, consistency, and coverage before any analysis begins, flagging issues with severity ratings so the analyst knows what blocks analysis vs. what to note as a caveat. ## When to Use Apply this skill at the start of every new analysis, when connecting to a new data source, or when results look suspicious. Run quality checks BEFORE drawing conclusions from data. ## Instructions ### Check Sequence Run these checks in order. Stop a
npx skillsauth add ai-analyst-lab/ai-analyst .claude/skills/data-quality-checkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Validate data completeness, consistency, and coverage before any analysis begins, flagging issues with severity ratings so the analyst knows what blocks analysis vs. what to note as a caveat.
Apply this skill at the start of every new analysis, when connecting to a new data source, or when results look suspicious. Run quality checks BEFORE drawing conclusions from data.
Run these checks in order. Stop and report blockers immediately.
-- Null rate per column
SELECT
column_name,
COUNT(*) AS total_rows,
COUNT(*) - COUNT(column_name) AS null_count,
ROUND(100.0 * (COUNT(*) - COUNT(column_name)) / COUNT(*), 1) AS null_pct
FROM table_name
GROUP BY column_name;
-- Missing date ranges (for time-series data)
WITH date_spine AS (
SELECT generate_series(MIN(date_col), MAX(date_col), INTERVAL '1 day') AS expected_date
FROM table_name
)
SELECT expected_date
FROM date_spine
LEFT JOIN table_name ON date_col = expected_date
WHERE table_name.date_col IS NULL;
-- Unexpected zeros in numeric columns
SELECT column_name, COUNT(*) AS zero_count
FROM table_name
WHERE numeric_column = 0
GROUP BY column_name;
Severity rules:
-- Duplicate detection
SELECT id_column, COUNT(*) AS dupes
FROM table_name
GROUP BY id_column
HAVING COUNT(*) > 1;
-- Referential integrity
SELECT child.fk_column, COUNT(*)
FROM child_table child
LEFT JOIN parent_table parent ON child.fk_column = parent.pk_column
WHERE parent.pk_column IS NULL
GROUP BY child.fk_column;
-- Date format consistency
SELECT DISTINCT LENGTH(date_column), LEFT(date_column, 4)
FROM table_name
WHERE date_column IS NOT NULL;
Severity rules:
Use check_temporal_coverage() for time-series gap detection and
check_value_domain() for categorical completeness:
from helpers.sql_helpers import check_temporal_coverage, check_value_domain
# Temporal coverage — detect missing days/weeks/months
coverage = check_temporal_coverage(df, "order_date", freq="D")
if coverage["status"] == "FAIL":
print(f"BLOCKER: {coverage['message']}")
# Value domain — verify expected categories exist
domain = check_value_domain(df["device_type"], ["desktop", "mobile", "tablet"])
if domain["status"] == "FAIL":
print(f"WARNING: {domain['message']}")
SQL checks for segment coverage:
-- Expected segments present
SELECT segment_column, COUNT(*) AS row_count,
MIN(date_col) AS earliest, MAX(date_col) AS latest
FROM table_name
GROUP BY segment_column
ORDER BY row_count DESC;
-- Missing cohorts
SELECT date_trunc('month', created_at) AS cohort_month, COUNT(DISTINCT user_id)
FROM users
GROUP BY 1
ORDER BY 1;
Severity rules:
Use the helper functions for systematic outlier and null concentration checks:
from helpers.tieout_helpers import check_null_concentration, check_outliers
# Null concentration — flags columns with high null rates
null_results = check_null_concentration(df)
for r in null_results:
if r["status"] == "FAIL":
print(f"BLOCKER: {r['column']} — {r['detail']}")
elif r["status"] == "WARN":
print(f"WARNING: {r['column']} — {r['detail']}")
# Outlier detection — IQR method (default) or z-score
for col in numeric_columns:
iqr_result = check_outliers(df[col], method="iqr")
zscore_result = check_outliers(df[col], method="zscore")
# Use IQR as primary, z-score as cross-check
if iqr_result["status"] in ("WARN", "FAIL"):
print(f"WARNING: {col} — {iqr_result['detail']}")
For domain-specific sanity checks (impossible values, suspicious distributions):
def sanity_check(df, column):
"""Run statistical sanity checks on a numeric column."""
stats = {
"mean": df[column].mean(),
"median": df[column].median(),
"std": df[column].std(),
"min": df[column].min(),
"max": df[column].max(),
"p1": df[column].quantile(0.01),
"p99": df[column].quantile(0.99),
"skew": df[column].skew(),
}
issues = []
if column in ["conversion_rate", "percentage"] and (stats["max"] > 1 or stats["min"] < 0):
issues.append(("BLOCKER", f"{column} has values outside [0,1] range"))
if abs(stats["skew"]) > 3:
issues.append(("WARNING", f"{column} is highly skewed (skew={stats['skew']:.1f})"))
return stats, issues
Severity rules:
For each date-indexed metric column in the dataset:
import pandas as pd
import numpy as np
def anomaly_scan(df, date_col, metric_col, window=14, threshold=2.0):
"""Detect time-series anomalies using rolling mean +/- std bands.
IMPORTANT: Aggregate to daily/weekly granularity FIRST.
Do NOT run on raw event rows.
Args:
df: DataFrame with date and metric columns (pre-aggregated).
date_col: Name of the date column.
metric_col: Name of the metric column.
window: Rolling window size in periods. Default: 14.
threshold: Number of standard deviations for anomaly band. Default: 2.0.
Returns:
Dict with 'anomalies' (list of dicts) and 'summary' (str).
"""
ts = df.sort_values(date_col).copy()
ts["rolling_mean"] = ts[metric_col].rolling(window, min_periods=3).mean()
ts["rolling_std"] = ts[metric_col].rolling(window, min_periods=3).std()
ts["upper"] = ts["rolling_mean"] + threshold * ts["rolling_std"]
ts["lower"] = ts["rolling_mean"] - threshold * ts["rolling_std"]
anomalies = []
for _, row in ts.iterrows():
if pd.notna(row["upper"]) and row[metric_col] > row["upper"]:
pct = ((row[metric_col] - row["rolling_mean"]) / row["rolling_mean"]) * 100
anomalies.append({
"date": row[date_col], "value": row[metric_col],
"direction": "spike", "pct_above_normal": round(pct, 1)
})
elif pd.notna(row["lower"]) and row[metric_col] < row["lower"]:
pct = ((row["rolling_mean"] - row[metric_col]) / row["rolling_mean"]) * 100
anomalies.append({
"date": row[date_col], "value": row[metric_col],
"direction": "drop", "pct_below_normal": round(pct, 1)
})
return {"anomalies": anomalies, "summary": f"{len(anomalies)} anomalies in {metric_col}"}
Sequencing: Run AFTER basic data profiling in the Data Explorer step, not before. Requires aggregated data.
Severity rules:
Output format:
Notable patterns detected:
- [metric] spiked [X]% above normal on [date range]
- [metric] dropped [X]% below normal on [date range]
These are observations, not conclusions — present as starting points for investigation.
For each table with a date/timestamp column:
import pandas as pd
from datetime import datetime, timedelta
def freshness_check(df, date_col, current_date=None):
"""Check data freshness and infer data cadence.
Args:
df: DataFrame with a date column.
date_col: Name of the date/timestamp column.
current_date: Override for current date (for testing). Default: today.
Returns:
Dict with 'max_date', 'days_ago', 'cadence', 'status'.
"""
current_date = current_date or datetime.now().date()
dates = pd.to_datetime(df[date_col]).dt.date
max_date = dates.max()
days_ago = (current_date - max_date).days
# Infer cadence from median gap between consecutive distinct dates
distinct_dates = sorted(dates.dropna().unique())
if len(distinct_dates) >= 2:
gaps = [(distinct_dates[i+1] - distinct_dates[i]).days
for i in range(len(distinct_dates) - 1)]
median_gap = sorted(gaps)[len(gaps) // 2]
if median_gap <= 1.5:
cadence = "daily"
stale_threshold = 2
elif median_gap <= 8:
cadence = "weekly"
stale_threshold = 10
else:
cadence = "static/historical"
stale_threshold = None
else:
cadence = "unknown"
stale_threshold = None
# Determine status
if days_ago > 90:
cadence = "static/historical"
status = "OK"
note = f"Historical dataset, date range ends {max_date}"
elif stale_threshold and days_ago > stale_threshold:
status = "WARNING"
note = f"Data is {days_ago} days old (expected {cadence} refresh)"
else:
status = "OK"
note = f"Data is {days_ago} days old"
return {
"max_date": str(max_date), "days_ago": days_ago,
"cadence": cadence, "status": status, "note": note
}
Output format:
Data freshness:
- events: most recent = [date] ([N] days ago) [OK/WARNING]
- orders: most recent = [date] ([N] days ago) [OK/WARNING]
- users: most recent = [date] ([N] days ago) [OK/WARNING]
Severity rules:
# Data Quality Report: [Dataset Name]
## Date: [YYYY-MM-DD]
## Analyst: AI Product Analyst
### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER | X | [Must fix before analysis] |
| WARNING | X | [Note as caveat in analysis] |
| INFO | X | [For awareness only] |
### BLOCKERS
[List each blocker with: what's wrong, which column/table, how many rows affected, suggested fix]
### WARNINGS
[List each warning with: what's wrong, potential impact on analysis, recommended handling]
### INFO
[List each info item briefly]
### Data Profile
| Table | Rows | Columns | Date Range | Key Columns |
|-------|------|---------|------------|-------------|
| ... | ... | ... | ... | ... |
### Recommendation
[Can analysis proceed? With what caveats?]
- PROCEED: No blockers, warnings noted
- PROCEED WITH CAUTION: No blockers, significant warnings — note in findings
- BLOCKED: Blockers found — fix data before analyzing
### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER | 0 | — |
| WARNING | 1 | 8% null in `referral_source` column |
| INFO | 2 | Weekend gaps in daily data; minor casing inconsistency in `country` |
### Recommendation
PROCEED — the null referral_source values should be noted as "unknown" in any segmentation by acquisition channel. All other columns are complete and consistent.
### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER | 2 | Duplicate order IDs (1,247 rows); revenue column has negative values (-$45K total) |
| WARNING | 3 | March 2025 data missing entirely; `device_type` has 12% nulls; conversion rates >1.0 for 89 rows |
| INFO | 1 | `country` has mixed casing ("US" vs "us") |
### BLOCKERS
1. **Duplicate order_ids**: 1,247 rows have duplicate `order_id` values. This will inflate revenue calculations. Must deduplicate before analysis — keep earliest record per order_id.
2. **Negative revenue**: 342 rows have negative `revenue` values totaling -$45K. These may be refunds. Must classify and handle separately (exclude from revenue analysis or create separate refund analysis).
### Recommendation
BLOCKED — Fix duplicate order_ids and classify negative revenue before proceeding. Estimated fix time: 15 minutes with SQL dedup + refund classification.
testing
# Skill: {{BLANK_1_SKILL_NAME}} ## Purpose {{BLANK_2_WHEN_TO_FIRE}} ## When to Use Fires automatically when the user asks Claude to do something that matches the trigger condition above. ## Instructions 1. Detect the trigger condition 2. Execute your guardrail check 3. If the check matters, print a clear, visible warning with "{{BLANK_3_SIGNATURE_PHRASE}}" as the first line 4. Continue with the analysis, incorporating the warning into the output ## Anti-Patterns - Do not fire when the condit
development
# Skill: Visualization Patterns ## Purpose Ensure every chart Claude Code produces follows high-quality design standards with named themes, consistent styling, and clear data communication. ## When to Use Apply this skill whenever generating a chart, graph, or data visualization. Always apply the active theme unless the user specifies otherwise. Default theme: `minimal`. ## Instructions ### Pre-flight: Load Learnings Before executing, check `.knowledge/learnings/index.md` for relevant entrie
development
# Skill: Triangulation / Sanity Check ## Purpose Cross-reference analytical findings against multiple data sources, external benchmarks, and common sense to catch errors before they become bad decisions. ## When to Use Apply this skill after every analysis, before presenting findings to stakeholders, and whenever a result seems surprising. If a finding would change a decision, it MUST be triangulated first. ## Instructions ### Triangulation Framework Every finding gets checked through four
data-ai
# Skill: Tracking Gap Identification ## Purpose Assess whether the data needed for an analysis actually exists, identify what's missing, and produce prioritized instrumentation requests for engineering when gaps are found. ## When to Use Apply this skill after the Data Explorer agent inventories available data, when an analysis requires data that might not exist, or when initial query results suggest incomplete tracking. Run before committing to an analysis approach. ## Instructions ### Gap