qa-aman/eda-report

Name: eda-report
Author: qa-aman

skills/by-role/data-scientist/eda-report/SKILL.md

npx skillsauth add qa-aman/claude-skills eda-report


Overview

Based on "Practical Statistics for Data Scientists" by Peter Bruce, Andrew Bruce, and Peter Gedeck. The core principle: before any modeling or analysis, you must understand the distribution, shape, and quality of your data. Skipping EDA leads to models trained on dirty data, misunderstood distributions, and insights that collapse under scrutiny. Structured EDA forces you to ask the right questions before committing to an approach.

Workflow

Step 1: Load and audit the dataset

Start with the basics - shape, types, and completeness.

import pandas as pd

df = pd.read_csv("data.csv")

print(f"Shape: {df.shape}")
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nDuplicate rows: {df.duplicated().sum()}")

Document:

Row count and column count
Percentage of missing values per column (flag any column > 5% missing)
Columns with wrong inferred types (e.g., zip codes as integers)
Duplicate row count
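The audit code above prints raw missing counts, while the list asks for percentages and a 5% flag. A minimal sketch of one way to produce that table follows; the helper name `missing_report` and the toy data are illustrative assumptions, not part of the skill.

```python
import pandas as pd


def missing_report(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Per-column missing count and percentage, flagged above `threshold`."""
    share = df.isna().mean()  # fraction of missing rows per column
    report = pd.DataFrame(
        {
            "n_missing": df.isna().sum(),
            "pct_missing": (share * 100).round(2),
            # True when the missing share exceeds the threshold (5% by default).
            "flag": share > threshold,
        }
    )
    return report.sort_values("pct_missing", ascending=False)


# Toy data for illustration only. Note that the zip code 02139 has already
# lost its leading zero because the column was inferred as an integer.
example = pd.DataFrame(
    {
        "income": [50000.0, None, 62000.0, None],
        "zip_code": [2139, 94105, 10001, 60601],
    }
)
print(missing_report(example))
```

Sorting by `pct_missing` puts the columns that need a decision at the top of the report.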

Step 2: Profile each variable

Separate numeric from categorical. Apply the right summary stats to each.

Numeric columns:

df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])

Check: the gap between mean and median (a skew signal), min/max for outlier flags, and p1 and p99 for tail behavior.
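The mean-versus-median check can be made explicit rather than read off the `describe()` output by eye. The sketch below is one possible way to do it; `skew_report` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median per numeric column and add sample skewness."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame(
        {
            "mean": numeric.mean(),
            "median": numeric.median(),
            "skew": numeric.skew(),
        }
    )
    # A mean well above the median suggests a right tail; well below, a left tail.
    report["mean_minus_median"] = report["mean"] - report["median"]
    return report


# Toy data: one column with a single extreme value, one symmetric column.
example = pd.DataFrame(
    {
        "session_time": [10, 12, 11, 13, 500],
        "age": [30, 31, 32, 33, 34],
    }
)
print(skew_report(example))
```

Here `session_time` shows a large positive gap driven by one extreme value, while `age` shows none.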

Categorical columns:

for col in df.select_dtypes("object").columns:
    print(f"{col}: {df[col].nunique()} unique | top: {df[col].value_counts().head(3).to_dict()}")

Check: cardinality (high cardinality = encoding decision needed), frequency of top values, presence of "unknown" / "other" / blank strings masking nulls.
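Cardinality is easier to judge as a ratio than as a raw count. The following sketch, under the assumption that the categorical column names are passed in explicitly, reports the unique ratio and the share of the most frequent value; `categorical_profile` is a hypothetical name.

```python
import pandas as pd


def categorical_profile(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Cardinality and top-value share for the given categorical columns."""
    rows = []
    for col in columns:
        counts = df[col].value_counts(dropna=True)
        n_non_null = int(counts.sum())
        rows.append(
            {
                "column": col,
                "n_unique": int(counts.size),
                # Distinct values per non-null row; near 1.0 looks like an ID column.
                "unique_ratio": counts.size / n_non_null if n_non_null else float("nan"),
                # Share of non-null rows taken by the most frequent value.
                "top_share": counts.iloc[0] / n_non_null if n_non_null else float("nan"),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Toy data for illustration only.
example = pd.DataFrame(
    {
        "country": ["US", "US", "US", "DE"],
        "user_id": ["a", "b", "c", "d"],
    }
)
print(categorical_profile(example, ["country", "user_id"]))
```

A `unique_ratio` of 1.0, as for `user_id` here, usually means the column is an identifier rather than a feature to encode.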

Step 3: Visualize distributions

One plot per numeric column. Do not skip this step - summary stats hide bimodal distributions, spikes at round numbers, and data entry artifacts.

import matplotlib.pyplot as plt

df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.savefig("distributions.png")

Flag columns that show:

Heavy skew (log-transform candidate)
Bimodal peaks (two subpopulations mixed)
Spikes at 0, -1, or 999 (sentinel values)
Values outside domain range (e.g., age = 200)
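Sentinel spikes and out-of-domain values can also be counted directly, as a complement to the histograms. This is a sketch only: the sentinel list mirrors the values named above, and the valid age range of 0 to 120 is an assumption chosen for the example.

```python
import pandas as pd


def sentinel_shares(df: pd.DataFrame, sentinels=(0, -1, 999)) -> pd.DataFrame:
    """Share of rows in each numeric column that equal each candidate sentinel."""
    numeric = df.select_dtypes("number")
    shares = {value: (numeric == value).mean() for value in sentinels}
    return pd.DataFrame(shares)  # rows: columns of df, columns: sentinel values


def out_of_range(series: pd.Series, low: float, high: float) -> pd.Series:
    """Return the non-null values that fall outside the closed range [low, high]."""
    values = series.dropna()
    return values[(values < low) | (values > high)]


# Toy data for illustration only.
example = pd.DataFrame(
    {
        "age": [25, 200, 999, 40],
        "score": [0, 0, -1, 5],
    }
)
print(sentinel_shares(example))
print(out_of_range(example["age"], low=0, high=120))
```

A high share at 0 is not always a problem, since zero is often a legitimate value; the table only points at columns worth a closer look.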

Step 4: Analyze relationships and correlation

Check pairwise correlations among numeric features. Identify multicollinearity before modeling.

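import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes("number").corr()
# Start a new figure so the heatmap is not drawn onto the last histogram axes.
plt.figure()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")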

Flag:

Any pair with |r| > 0.85 (multicollinearity risk)
Target variable correlations (feature relevance signal)
Surprising zero-correlations where you expected a relationship
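With more than a handful of features, the |r| > 0.85 pairs are hard to spot on a heatmap. A minimal sketch for listing them is shown below; the helper name `high_correlation_pairs` and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """List each numeric feature pair once where |r| exceeds `threshold`."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    # Strict upper triangle: skips the diagonal and mirrored duplicates.
    i_idx, j_idx = np.triu_indices(len(cols), k=1)
    records = [(cols[i], cols[j], corr.iat[i, j]) for i, j in zip(i_idx, j_idx)]
    pairs = pd.DataFrame(records, columns=["feature_a", "feature_b", "r"])
    flagged = pairs[pairs["r"].abs() > threshold]
    return flagged.sort_values("r", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)


# Toy data: y is an exact linear function of x; z is only moderately related.
example = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [3, 5, 7, 9, 11],
        "z": [5, 3, 4, 1, 2],
    }
)
print(high_correlation_pairs(example))
```

In this example only the `x`/`y` pair is flagged; both pairs involving `z` have r = -0.8, which falls just under the threshold.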

Step 5: Identify data quality issues and document decisions

Produce a written audit log. The log, not just the charts, is the deliverable.

For each issue found, record:

Column name
Issue type (missing, outlier, wrong type, duplicate, encoding ambiguity)
Prevalence (count and % of rows affected)
Recommended action (drop, impute, cap, flag, leave)

Example audit log format:

| Column       | Issue            | Rows Affected | Action         |
|--------------|------------------|---------------|----------------|
| income       | 3.2% missing     | 320 / 10,000  | Median impute  |
| zip_code     | Stored as float  | All           | Cast to string |
| session_time | Max = 86,400 sec | 12 rows       | Cap at 3,600   |
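If the audit log is built up in code, the same table can be rendered from structured entries so that count and percentage always agree. The sketch below is one possible shape; `AuditEntry` and `render_audit_log` are hypothetical names, and the 10,000-row total for `session_time` is assumed from the `income` row.

```python
from dataclasses import dataclass


@dataclass
class AuditEntry:
    column: str
    issue: str
    rows_affected: int
    total_rows: int
    action: str

    def prevalence(self) -> str:
        """Count and percentage of rows affected, e.g. '320 / 10,000 (3.2%)'."""
        pct = 100 * self.rows_affected / self.total_rows
        return f"{self.rows_affected:,} / {self.total_rows:,} ({pct:.1f}%)"


def render_audit_log(entries: list) -> str:
    """Render entries as a pipe table in the format shown above."""
    header = "| Column | Issue | Rows Affected | Action |"
    divider = "|---|---|---|---|"
    rows = [
        f"| {e.column} | {e.issue} | {e.prevalence()} | {e.action} |"
        for e in entries
    ]
    return "\n".join([header, divider, *rows])


log = [
    AuditEntry("income", "Missing", 320, 10_000, "Median impute"),
    # Total row count assumed to match the income row above.
    AuditEntry("session_time", "Outlier (max = 86,400 sec)", 12, 10_000, "Cap at 3,600"),
]
print(render_audit_log(log))
```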

Step 6: Write the EDA summary

Structured output for stakeholders or the next analyst.

Sections to include:

Dataset overview (rows, columns, time range if applicable)
Target variable distribution (if supervised task)
Key quality issues and planned treatments
Notable patterns (correlations, subpopulations, outliers)
Recommended next steps (feature engineering, additional data needed, modeling approach)

Keep the summary under 1 page. Charts go in the appendix.
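The dataset overview line of the summary can be generated rather than typed by hand. This is a small sketch, assuming the time column (if any) is named by the caller; `dataset_overview` and the `signup_date` column are illustrative only.

```python
from typing import Optional

import pandas as pd


def dataset_overview(df: pd.DataFrame, time_col: Optional[str] = None) -> str:
    """One-line overview: rows, columns, and time range if a time column is given."""
    line = f"{len(df):,} rows x {df.shape[1]} columns"
    if time_col is not None:
        times = pd.to_datetime(df[time_col])
        start = times.min().strftime("%Y-%m-%d")
        end = times.max().strftime("%Y-%m-%d")
        line += f", {start} to {end}"
    return line


# Toy data for illustration only.
example = pd.DataFrame(
    {
        "signup_date": ["2024-01-05", "2024-03-20", "2024-02-11"],
        "plan": ["free", "pro", "free"],
    }
)
print(dataset_overview(example, time_col="signup_date"))
```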

Anti-Patterns

1. Skipping EDA and going straight to modeling
Bad: Fitting a model on raw data, then debugging why predictions are wrong.
Good: Spending 20% of project time on EDA. Problems found here are 10x cheaper to fix than after a model is built.

2. Reporting only means and standard deviations
Bad: "Average age = 34.2, std = 12.1"
Good: Include percentiles and a histogram. A mean of 34 with a bimodal distribution at 18 and 55 tells a completely different story.

3. Treating "unknown" strings as valid values
Bad: Including "unknown", "N/A", "-", and "" as distinct categories.
Good: Standardize all null representations to NaN before profiling. Write a null-normalization function and run it first.

4. EDA with no documented decisions
Bad: Exploring data in a notebook but not recording what you found or what you plan to do about it.
Good: Every issue gets a row in the audit log with a decision. The audit log is a first-class deliverable.
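Anti-pattern 3 calls for a null-normalization function. A minimal sketch of one is given below; the token list contains only the placeholders named above, matching is case-insensitive after trimming whitespace (an assumption), and `normalize_nulls` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Placeholders named in anti-pattern 3, lowercased. Extend for the dataset at hand.
NULL_TOKENS = frozenset({"unknown", "n/a", "-", ""})


def normalize_nulls(df: pd.DataFrame, tokens=NULL_TOKENS) -> pd.DataFrame:
    """Return a copy of df where placeholder strings are replaced with NaN."""
    out = df.copy()
    for col in out.columns:
        values = out[col]
        if pd.api.types.is_numeric_dtype(values):
            continue  # numeric columns cannot hold string placeholders
        # Only string cells can match; real NaN and other objects are left alone.
        is_placeholder = values.map(
            lambda v: isinstance(v, str) and v.strip().lower() in tokens
        ).astype(bool)
        out[col] = values.mask(is_placeholder, np.nan)
    return out


# Toy data for illustration only.
example = pd.DataFrame(
    {
        "city": ["Paris", "unknown", " N/A ", ""],
        "age": [30, 41, 25, 52],
    }
)
cleaned = normalize_nulls(example)
print(cleaned["city"].isna().sum())
```

Running this before Step 2 means the missing-value percentages and cardinality counts reflect the real nulls rather than their string stand-ins.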

Quality Checklist

[ ] Row count, column count, and missing value percentages documented
[ ] Every numeric column has descriptive stats including p1, p25, p75, p99
[ ] Every categorical column has cardinality and top-value frequency
[ ] Distribution plots generated for all numeric columns
[ ] Correlation heatmap produced and high-correlation pairs flagged
[ ] Audit log written with issue, prevalence, and action for each finding
[ ] EDA summary written with recommended next steps

Run exploratory data analysis on a dataset and produce a structured report. Use when the user says "explore this dataset", "EDA on X", "analyze this data", "what's in this dataset", "summarize this data", "first look at the data", "understand this dataset before modeling", "data quality check", "describe this dataframe", or wants to understand a new dataset before building models or dashboards.

13 stars

development

Updated Apr 23, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add qa-aman/claude-skills eda-report

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner         | Description                                    | Status  | Percentage |
|-----------------|------------------------------------------------|---------|------------|
| Trivy           | Container and dependency vulnerability scanner | Clean   | 95%        |
| Semgrep         | Static code analysis for vulnerabilities       | Clean   | 95%        |
| mcp-scan (Snyk) | Model Context Protocol security validation     | Clean   | 95%        |
| Snyk (dep)      | Open source security scanning                  | Skipped | 50%        |
| Socket.dev      | Supply chain security analysis                 | Skipped | 50%        |
| VirusTotal      | Multi-engine malware detection                 | Skipped | 50%        |
| CrowdStrike     | Advanced threat intelligence                   | Skipped | 50%        |
| OSV-Scanner     | Open Source Vulnerability database check       | Skipped | 50%        |
| OWASP Dep-Check |                                                | Skipped | 50%        |

Last scanned: Apr 23, 2026, 1:56 PM (219.5s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/qa-aman/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/skills/by-role/data-scientist/eda-report ~/.claude/skills/

Repository: qa-aman/claude-skills (13 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT