.claude/skills/data-profiling/SKILL.md
# Skill: Data Profiling ## Purpose Deep-profile the active dataset to understand schema structure, value distributions, temporal patterns, correlations, completeness gaps, and anomalies. Produces a comprehensive profile report that serves as the foundation for analysis planning and data quality assessment. ## When to Use - After connecting a new dataset (post-bootstrap, pre-analysis) - Before the first analysis on any dataset - When explicitly invoked by the user - When the existing profile is
npx skillsauth add ai-analyst-lab/ai-analyst .claude/skills/data-profilingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Deep-profile the active dataset to understand schema structure, value distributions, temporal patterns, correlations, completeness gaps, and anomalies. Produces a comprehensive profile report that serves as the foundation for analysis planning and data quality assessment.
last_profiled in manifest.yaml)from helpers.data_helpers import get_connection_for_profiling
from helpers.schema_profiler import profile_source
# Get connection (auto-detects DuckDB vs CSV from active dataset)
conn_info = get_connection_for_profiling()
# Run full schema profile — introspects all tables: column names, types,
# nullability, row counts, sample values, basic statistics, date detection
schema = profile_source(conn_info)
Record the output. schema contains the full table inventory with column-level metadata. Use this to identify:
For each table in the schema, load the data and run the deep profiling functions. Prioritize tables with the most rows and the most date/numeric columns.
from helpers.data_helpers import read_table
from helpers.deep_profiler import (
profile_distributions,
profile_temporal_patterns,
profile_completeness,
)
for table_info in schema["tables"]:
table_name = table_info["name"]
df = read_table(table_name)
# Distribution analysis on all numeric columns
distributions = profile_distributions(df)
# Completeness assessment — null rates, zeros, empty strings, constant cols
completeness = profile_completeness(df)
# Temporal pattern analysis (only if the table has date columns)
temporal = None
if table_info.get("date_columns"):
primary_date = table_info["date_columns"][0]
temporal = profile_temporal_patterns(df, primary_date, freq="D")
Important: For large tables (>50K rows), profile_source() already samples. But read_table() loads the full CSV. If a table has >100K rows, sample before running deep profiling:
if len(df) > 100_000:
df = df.sample(n=100_000, random_state=42)
Run correlation and anomaly detection on tables that contain key business metrics (revenue, counts, rates). Identify these tables by looking for columns with names like revenue, amount, total, count, rate, price, quantity.
from helpers.deep_profiler import profile_correlations, profile_anomalies
# Correlations — find relationships between numeric columns
correlations = profile_correlations(df, threshold=0.5)
# Anomaly detection — requires a date column and pre-aggregated data
# Aggregate to daily granularity first if the table has event-level rows
if table_info.get("date_columns"):
primary_date = table_info["date_columns"][0]
# Only run on tables with a clear date + metric pattern
metric_cols = [c for c in df.select_dtypes(include="number").columns
if c not in ("id", table_name.rstrip("s") + "_id")]
if metric_cols:
# Aggregate to daily for anomaly detection
daily = df.groupby(pd.to_datetime(df[primary_date]).dt.date)[metric_cols].sum().reset_index()
daily.rename(columns={daily.columns[0]: primary_date}, inplace=True)
anomalies = profile_anomalies(daily, date_col=primary_date,
metric_cols=metric_cols, window=14)
Write the full profile report to .knowledge/datasets/{active}/last_profile.md. Use schema_to_markdown() for the schema portion, then append the deep profiling results.
from helpers.data_helpers import schema_to_markdown, detect_active_source
source = detect_active_source()
active_dataset = source["source"]
# Build the schema markdown section
schema_md = schema_to_markdown(schema)
Assemble the full report and write it to:
.knowledge/datasets/{active_dataset}/last_profile.md
# Data Profile: {dataset_name}
**Profiled at:** {ISO timestamp}
**Source:** {connection type} ({path or schema prefix})
**Tables:** {count} | **Total rows:** {sum}
---
## Summary of Findings
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER | X | {brief list} |
| WARNING | X | {brief list} |
| INFO | X | {brief list} |
---
## Schema Overview
{output of schema_to_markdown()}
---
## Distribution Analysis
### {table_name}
| Column | Shape | Skewness | Outliers (IQR) | Recommended Transform |
|--------|-------|----------|----------------|----------------------|
| {col} | {shape} | {skew} | {n_outliers} | {transform or "none"} |
---
## Temporal Patterns
### {table_name} ({date_column})
- **Date range:** {min} to {max}
- **Coverage:** {actual}/{expected} periods ({pct}%)
- **Gaps:** {count} gaps found {list if any}
- **Trend:** {trend direction}
- **Seasonality:** {detected or not}
- **Day-of-week pattern:** {summary}
---
## Completeness
### {table_name}
| Column | Status | Null % | Zeros | Empty Strings | Constant? |
|--------|--------|--------|-------|---------------|-----------|
| {col} | {status} | {pct} | {count} | {count} | {yes/no} |
---
## Correlations
### {table_name}
| Column A | Column B | Correlation | Strength | Direction |
|----------|----------|-------------|----------|-----------|
| {col_a} | {col_b} | {r} | {strength} | {direction} |
---
## Anomalies
### {table_name}
{anomaly summary}
| Metric | Spikes | Drops | Details |
|--------|--------|-------|---------|
| {metric} | {count} | {count} | {top anomalies with dates} |
---
## Recommendations
- **BLOCKER items:** {must fix before analysis}
- **WARNING items:** {note as caveats}
- **Suggested analysis focus:** {tables/columns with most signal}
Apply these rules consistently across all sections:
| Severity | Condition | |----------|-----------| | BLOCKER | >50% nulls in a key metric column; entire date ranges missing (coverage <50%); constant columns that should have variance; very strong correlations (r>0.95) suggesting duplicate columns | | WARNING | 5-50% nulls; heavy-tailed or bimodal distributions in metric columns; date coverage 50-90%; moderate anomalies detected; skewness >3 suggesting data quality issues | | INFO | <5% nulls; normal or mild skew distributions; full date coverage; no anomalies; expected correlations (e.g., quantity and revenue) |
read_table(). The schema profiler handles this internally, but deep profiling should also use the CSV path.profile_source() first (Step 1) so you know which columns are dates, which are numeric, and what the cardinality looks like before deep profiling.last_profile.md so future sessions can reference it without re-profiling.testing
# Skill: {{BLANK_1_SKILL_NAME}} ## Purpose {{BLANK_2_WHEN_TO_FIRE}} ## When to Use Fires automatically when the user asks Claude to do something that matches the trigger condition above. ## Instructions 1. Detect the trigger condition 2. Execute your guardrail check 3. If the check matters, print a clear, visible warning with "{{BLANK_3_SIGNATURE_PHRASE}}" as the first line 4. Continue with the analysis, incorporating the warning into the output ## Anti-Patterns - Do not fire when the condit
development
# Skill: Visualization Patterns ## Purpose Ensure every chart Claude Code produces follows high-quality design standards with named themes, consistent styling, and clear data communication. ## When to Use Apply this skill whenever generating a chart, graph, or data visualization. Always apply the active theme unless the user specifies otherwise. Default theme: `minimal`. ## Instructions ### Pre-flight: Load Learnings Before executing, check `.knowledge/learnings/index.md` for relevant entrie
development
# Skill: Triangulation / Sanity Check ## Purpose Cross-reference analytical findings against multiple data sources, external benchmarks, and common sense to catch errors before they become bad decisions. ## When to Use Apply this skill after every analysis, before presenting findings to stakeholders, and whenever a result seems surprising. If a finding would change a decision, it MUST be triangulated first. ## Instructions ### Triangulation Framework Every finding gets checked through four
data-ai
# Skill: Tracking Gap Identification ## Purpose Assess whether the data needed for an analysis actually exists, identify what's missing, and produce prioritized instrumentation requests for engineering when gaps are found. ## When to Use Apply this skill after the Data Explorer agent inventories available data, when an analysis requires data that might not exist, or when initial query results suggest incomplete tracking. Run before committing to an analysis approach. ## Instructions ### Gap