ai-analyst-lab/.claude/skills/data-profiling

Name: .claude/skills/data-profiling
Author: ai-analyst-lab

.claude/skills/data-profiling/SKILL.md

npx skillsauth add ai-analyst-lab/ai-analyst .claude/skills/data-profiling

Skill: Data Profiling

Purpose

Deep-profile the active dataset to understand schema structure, value distributions, temporal patterns, correlations, completeness gaps, and anomalies. Produces a comprehensive profile report that serves as the foundation for analysis planning and data quality assessment.

When to Use

After connecting a new dataset (post-bootstrap, pre-analysis)
Before the first analysis on any dataset
When explicitly invoked by the user
When the existing profile is stale (check last_profiled in manifest.yaml)
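
A minimal sketch of the staleness check, assuming last_profiled is stored in manifest.yaml as an ISO 8601 timestamp. The helper name is_profile_stale and the 7-day threshold are hypothetical; the skill does not say how old a profile must be to count as stale.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the skill does not define how old "stale" is.
MAX_PROFILE_AGE = timedelta(days=7)


def is_profile_stale(last_profiled, now=None, max_age=MAX_PROFILE_AGE):
    """Return True when the profile is missing or older than max_age.

    last_profiled is assumed to be an ISO 8601 string (or None) read from
    the last_profiled field of manifest.yaml.
    """
    if not last_profiled:
        return True  # never profiled
    profiled_at = datetime.fromisoformat(last_profiled)
    if profiled_at.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        profiled_at = profiled_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - profiled_at > max_age
```

Passing now explicitly keeps the check deterministic when it is tested.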

Instructions

Step 1: Connect and Profile Schema

from helpers.data_helpers import get_connection_for_profiling
from helpers.schema_profiler import profile_source

# Get connection (auto-detects DuckDB vs CSV from active dataset)
conn_info = get_connection_for_profiling()

# Run full schema profile — introspects all tables: column names, types,
# nullability, row counts, sample values, basic statistics, date detection
schema = profile_source(conn_info)

Record the output. The returned schema contains the full table inventory with column-level metadata. Use it to identify:

Which tables exist and their row counts
Which columns are date columns (for temporal analysis in Step 2)
Which columns are numeric (for distribution and correlation analysis)
Which columns have nulls (for completeness deep-dive in Step 2)
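
The exact layout of schema is not shown here, so the following is only a sketch of how the same column roles could be cross-checked directly on a pandas DataFrame. The function name summarize_column_roles and the sample data are hypothetical.

```python
import pandas as pd


def summarize_column_roles(df):
    """Group a DataFrame's columns into date, numeric, and null-bearing sets."""
    date_cols = [c for c in df.columns
                 if pd.api.types.is_datetime64_any_dtype(df[c])]
    numeric_cols = [c for c in df.columns
                    if pd.api.types.is_numeric_dtype(df[c])
                    and not pd.api.types.is_bool_dtype(df[c])]
    null_cols = [c for c in df.columns if df[c].isna().any()]
    return {"date": date_cols, "numeric": numeric_cols, "has_nulls": null_cols}


# Illustrative data only.
sample = pd.DataFrame({
    "order_date": pd.to_datetime(["2026-01-01", "2026-01-02", None]),
    "revenue": [10.0, None, 30.0],
    "region": ["east", "west", "east"],
})
roles = summarize_column_roles(sample)
```

Boolean columns are excluded from the numeric set because pandas otherwise reports them as numeric, which would add noise to distribution and correlation analysis.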

Step 2: Run Deep Profiling per Table

For each table in the schema, load the data and run the deep profiling functions. Prioritize tables with the most rows and the most date/numeric columns.

from helpers.data_helpers import read_table
from helpers.deep_profiler import (
    profile_distributions,
    profile_temporal_patterns,
    profile_completeness,
)

for table_info in schema["tables"]:
    table_name = table_info["name"]
    df = read_table(table_name)

    # Distribution analysis on all numeric columns
    distributions = profile_distributions(df)

    # Completeness assessment — null rates, zeros, empty strings, constant cols
    completeness = profile_completeness(df)

    # Temporal pattern analysis (only if the table has date columns)
    temporal = None
    if table_info.get("date_columns"):
        primary_date = table_info["date_columns"][0]
        temporal = profile_temporal_patterns(df, primary_date, freq="D")

Important: profile_source() already samples large tables (>50K rows), but read_table() loads the full CSV. If a table has more than 100K rows, sample it inside the loop, right after read_table() and before running deep profiling:

if len(df) > 100_000:
    df = df.sample(n=100_000, random_state=42)

Step 3: Correlation and Anomaly Analysis on Key Tables

Run correlation and anomaly detection on tables that contain key business metrics (revenue, counts, rates). Identify these tables by looking for columns with names like revenue, amount, total, count, rate, price, quantity.

import pandas as pd

from helpers.deep_profiler import profile_correlations, profile_anomalies

# Assumes df, table_name, and table_info for a key table are in scope,
# e.g. inside the Step 2 loop.

# Correlations: find relationships between numeric columns
correlations = profile_correlations(df, threshold=0.5)

# Anomaly detection requires a date column and pre-aggregated data.
# Aggregate to daily granularity first if the table has event-level rows.
if table_info.get("date_columns"):
    primary_date = table_info["date_columns"][0]
    # Only run on tables with a clear date + metric pattern; drop ID columns.
    # removesuffix (Python 3.9+) trims a single trailing "s", whereas
    # rstrip("s") would strip every trailing "s" ("address" -> "addre").
    id_cols = ("id", table_name.removesuffix("s") + "_id")
    metric_cols = [c for c in df.select_dtypes(include="number").columns
                   if c not in id_cols]
    if metric_cols:
        # Aggregate to daily for anomaly detection
        daily = df.groupby(pd.to_datetime(df[primary_date]).dt.date)[metric_cols].sum().reset_index()
        daily.rename(columns={daily.columns[0]: primary_date}, inplace=True)
        anomalies = profile_anomalies(daily, date_col=primary_date,
                                       metric_cols=metric_cols, window=14)
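
The name-based heuristic for spotting key tables in Step 3 could be sketched as follows. The helper names find_metric_columns and is_key_table are hypothetical; only the list of name hints comes from the text above.

```python
import re

# Name hints taken from the list in Step 3.
METRIC_HINTS = ("revenue", "amount", "total", "count", "rate", "price", "quantity")


def find_metric_columns(column_names):
    """Return the columns with at least one name token matching a metric hint."""
    matches = []
    for name in column_names:
        # Split on non-alphanumerics so "unit_price" yields ["unit", "price"].
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        if any(token in METRIC_HINTS for token in tokens):
            matches.append(name)
    return matches


def is_key_table(column_names):
    """Treat a table as "key" if any column looks like a business metric."""
    return bool(find_metric_columns(column_names))
```

Matching whole tokens rather than substrings avoids false positives such as country (which contains "count"). The trade-off is that camelCase names like totalRevenue are not split and would be missed.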

Step 4: Generate Profile Report

Write the full profile report to .knowledge/datasets/{active_dataset}/last_profile.md. Use schema_to_markdown() for the schema portion, then append the deep profiling results.

from helpers.data_helpers import schema_to_markdown, detect_active_source

source = detect_active_source()
active_dataset = source["source"]

# Build the schema markdown section
schema_md = schema_to_markdown(schema)

Assemble the full report and write it to:

.knowledge/datasets/{active_dataset}/last_profile.md
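
One way the assemble-and-write step might look, as a sketch only. The function write_profile_report and its root parameter are hypothetical; it assumes each report section has already been rendered to a markdown string.

```python
from pathlib import Path


def write_profile_report(active_dataset, sections, root="."):
    """Join report sections and write them to the dataset's last_profile.md."""
    report_path = (Path(root) / ".knowledge" / "datasets"
                   / active_dataset / "last_profile.md")
    # Create the dataset folder on first run.
    report_path.parent.mkdir(parents=True, exist_ok=True)
    # Sections are separated by a horizontal rule, as in the Output Format.
    body = "\n\n---\n\n".join(section.strip() for section in sections)
    report_path.write_text(body + "\n", encoding="utf-8")
    return report_path
```

For example, write_profile_report(active_dataset, [header_md, schema_md, distributions_md]) would write the three sections in order, where header_md and distributions_md are placeholder names for strings built from the earlier steps.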

Output Format

# Data Profile: {dataset_name}
**Profiled at:** {ISO timestamp}
**Source:** {connection type} ({path or schema prefix})
**Tables:** {count}  |  **Total rows:** {sum}

---

## Summary of Findings

| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER  | X     | {brief list} |
| WARNING  | X     | {brief list} |
| INFO     | X     | {brief list} |

---

## Schema Overview

{output of schema_to_markdown()}

---

## Distribution Analysis

### {table_name}

| Column | Shape | Skewness | Outliers (IQR) | Recommended Transform |
|--------|-------|----------|----------------|----------------------|
| {col}  | {shape} | {skew} | {n_outliers}  | {transform or "none"} |

---

## Temporal Patterns

### {table_name} ({date_column})

- **Date range:** {min} to {max}
- **Coverage:** {actual}/{expected} periods ({pct}%)
- **Gaps:** {count} gaps found {list if any}
- **Trend:** {trend direction}
- **Seasonality:** {detected or not}
- **Day-of-week pattern:** {summary}

---

## Completeness

### {table_name}

| Column | Status | Null % | Zeros | Empty Strings | Constant? |
|--------|--------|--------|-------|---------------|-----------|
| {col}  | {status} | {pct} | {count} | {count}    | {yes/no}  |

---

## Correlations

### {table_name}

| Column A | Column B | Correlation | Strength | Direction |
|----------|----------|-------------|----------|-----------|
| {col_a}  | {col_b}  | {r}         | {strength} | {direction} |

---

## Anomalies

### {table_name}

{anomaly summary}

| Metric | Spikes | Drops | Details |
|--------|--------|-------|---------|
| {metric} | {count} | {count} | {top anomalies with dates} |

---

## Recommendations

- **BLOCKER items:** {must fix before analysis}
- **WARNING items:** {note as caveats}
- **Suggested analysis focus:** {tables/columns with most signal}
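
The tables in this template can be generated rather than hand-written. The sketch below uses a hypothetical helper, render_markdown_table, that is not part of the skill's helpers package.

```python
def render_markdown_table(headers, rows):
    """Render headers and rows as a pipe-delimited markdown table."""
    def fmt(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    # Header row, separator row, then one line per data row.
    lines = [fmt(headers), fmt("---" for _ in headers)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = render_markdown_table(["Column", "Null %"], [["revenue", 2.5]])
```

Cell values are not escaped, so a value containing a pipe character would break the row; escaping is left out to keep the sketch short.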

Severity Classification

Apply these rules consistently across all sections:

| Severity | Condition |
|----------|-----------|
| BLOCKER | >50% nulls in a key metric column; entire date ranges missing (coverage <50%); constant columns that should have variance; very strong correlations (r>0.95) suggesting duplicate columns |
| WARNING | 5-50% nulls; heavy-tailed or bimodal distributions in metric columns; date coverage 50-90%; moderate anomalies detected; skewness >3 suggesting data quality issues |
| INFO | <5% nulls; normal or mild skew distributions; full date coverage; no anomalies; expected correlations (e.g., quantity and revenue) |
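
The null-rate and date-coverage thresholds can be encoded as small functions so they are applied the same way in every section. This is a sketch: the function names are hypothetical, and the handling of cases the rules leave open (non-key columns above 50% nulls, coverage between 90% and 100%) is an assumption marked in the comments.

```python
def classify_null_rate(null_pct, is_key_metric=True):
    """Map a column's null percentage (0-100) to a severity label."""
    if null_pct > 50:
        # The rules name BLOCKER only for key metric columns; downgrading
        # other columns to WARNING is an assumption of this sketch.
        return "BLOCKER" if is_key_metric else "WARNING"
    if null_pct >= 5:
        return "WARNING"
    return "INFO"


def classify_date_coverage(coverage_pct):
    """Map date coverage (0-100) to a severity label."""
    if coverage_pct < 50:
        return "BLOCKER"
    if coverage_pct < 100:
        # The rules list 50-90% as WARNING and full coverage as INFO;
        # treating 90-100% as WARNING too is an assumption.
        return "WARNING"
    return "INFO"
```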

Edge Cases

No date columns in any table: Skip temporal analysis and anomaly detection entirely. Note in the report: "No temporal columns detected -- temporal analysis skipped."
Single-column tables or lookup tables: Run completeness only. Skip distributions, correlations, and anomalies. Flag as "lookup table" in the report.
All columns are non-numeric: Skip distribution and correlation analysis. Focus on completeness and categorical cardinality.
Very wide tables (>50 columns): Profile all columns for completeness, but limit distribution analysis to the top 20 numeric columns by variance. Note which columns were skipped.
Empty tables (0 rows): Log as BLOCKER. Do not attempt profiling -- report the table as empty and move on.
DuckDB connection fails: Fall back to CSV via read_table(). The schema profiler handles this internally, but deep profiling should also use the CSV path.
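
Several of these edge cases amount to deciding which analyses to run for a given table shape. A minimal sketch of that dispatch, using a hypothetical function plan_analyses and hypothetical count parameters; lookup-table detection, wide tables, and the DuckDB fallback are not covered.

```python
def plan_analyses(n_rows, n_columns, n_numeric, n_date):
    """Return the analyses to run for a table, following the edge-case rules."""
    if n_rows == 0:
        return []  # empty table: report as BLOCKER, profile nothing
    if n_columns == 1:
        return ["completeness"]  # single-column table: completeness only
    plan = ["completeness"]
    if n_numeric >= 1:
        plan.append("distributions")
    if n_numeric >= 2:
        plan.append("correlations")  # a correlation needs two numeric columns
    if n_date >= 1:
        plan.append("temporal")
        if n_numeric >= 1:
            plan.append("anomalies")  # needs a date column and a metric
    return plan
```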

Anti-Patterns

Never skip profiling because "the data looks clean." Surprises hide in distributions and temporal patterns that summary stats miss.
Never run anomaly detection on raw event rows. Always aggregate to daily or weekly granularity first. Running on raw rows will flag every row as an "anomaly" relative to rolling stats.
Never profile in isolation from schema context. Always run profile_source() first (Step 1) so you know which columns are dates, which are numeric, and what the cardinality looks like before deep profiling.
Never treat all WARNING items equally. A 10% null rate in a segmentation column is more impactful than 10% nulls in a free-text notes column. Contextualize severity by the column's role in analysis.
Never skip the report write. Even if profiling runs smoothly, always write last_profile.md so future sessions can reference it without re-profiling.
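
To illustrate why anomaly detection expects aggregated input, here is one way a rolling-statistics check might work. The internals of profile_anomalies are not shown in this skill, so flag_rolling_anomalies and its z_threshold default are hypothetical; only the 14-period window comes from Step 3.

```python
import pandas as pd


def flag_rolling_anomalies(daily, metric, window=14, z_threshold=3.0):
    """Flag days whose value deviates strongly from the preceding window.

    `daily` must already hold one row per day; z_threshold is a
    hypothetical default, not a value taken from the skill.
    """
    values = daily[metric]
    # shift(1) keeps the current day out of its own baseline.
    baseline = values.shift(1).rolling(window, min_periods=window)
    z = (values - baseline.mean()) / baseline.std()
    # Days without a full baseline window have NaN z-scores and are not flagged.
    return daily[z.abs() > z_threshold]


# Illustrative data: 19 ordinary days followed by one spike.
days = pd.date_range("2026-01-01", periods=20, freq="D")
vals = [100 + (i % 2) * 2 for i in range(19)] + [500]
daily = pd.DataFrame({"order_date": days, "revenue": vals})
flagged = flag_rolling_anomalies(daily, "revenue")
```

On event-level rows the window would span 14 individual events rather than 14 days, so the baseline would reflect per-event noise instead of the daily trend.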

146 stars

testing

Updated Apr 4, 2026

npx skillsauth add ai-analyst-lab/ai-analyst .claude/skills/data-profiling

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 4, 2026, 3:47 AM (4.1s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/ai-analyst-lab/ai-analyst.git

# Copy into Claude Code skills folder (global)
cp -r ai-analyst/.claude/skills/data-profiling ~/.claude/skills/

Repository

ai-analyst-lab/ai-analyst

146 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT