skills/by-role/data-scientist/eda-report/SKILL.md
Run exploratory data analysis on a dataset and produce a structured report. Use when the user says "explore this dataset", "EDA on X", "analyze this data", "what's in this dataset", "summarize this data", "first look at the data", "understand this dataset before modeling", "data quality check", "describe this dataframe", or wants to understand a new dataset before building models or dashboards.
Based on "Practical Statistics for Data Scientists" by Peter Bruce, Andrew Bruce, and Peter Gedeck. The core principle: before any modeling or analysis, you must understand the distribution, shape, and quality of your data. Skipping EDA leads to models trained on dirty data, misunderstood distributions, and insights that collapse under scrutiny. Structured EDA forces you to ask the right questions before committing to an approach.
Start with the basics - shape, types, and completeness.
import pandas as pd
df = pd.read_csv("data.csv")
print(f"Shape: {df.shape}")
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nDuplicate rows: {df.duplicated().sum()}")
Document:
Separate numeric from categorical. Apply the right summary stats to each.
Numeric columns:
df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
Check: mean vs median spread (skew signal), min/max for outlier flags, p1 and p99 for tail behavior.
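The mean-versus-median check can be put in one table per dataset. The following is a minimal sketch, not part of the original skill: `skew_report` and the `mean_median_gap` column are hypothetical names, and the toy data is made up for illustration.

```python
import pandas as pd


def skew_report(df: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median for each numeric column (hypothetical helper)."""
    num = df.select_dtypes("number")
    report = pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "p01": num.quantile(0.01),
        "p99": num.quantile(0.99),
        "skew": num.skew(),
    })
    # Gap between mean and median in units of standard deviation.
    # A large positive value suggests a long right tail.
    report["mean_median_gap"] = (report["mean"] - report["median"]) / num.std()
    # Most lopsided columns first.
    return report.sort_values("mean_median_gap", key=abs, ascending=False)


# Toy data (made up): one symmetric column, one right-skewed column.
demo = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 100],
})
report = skew_report(demo)
print(report)
```

Columns at the top of this table are the ones to inspect first in the histograms. A constant column has zero standard deviation and will show NaN for the gap.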
Categorical columns:
for col in df.select_dtypes("object").columns:
    print(f"{col}: {df[col].nunique()} unique | top: {df[col].value_counts().head(3).to_dict()}")
Check: cardinality (high cardinality = encoding decision needed), frequency of top values, presence of "unknown" / "other" / blank strings masking nulls.
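The cardinality and top-value checks can also be collected into a table instead of printed line by line. This is a sketch under assumptions: `categorical_profile` is a hypothetical helper, and the 0.5 unique-ratio cutoff is an arbitrary illustration, not a threshold from the source.

```python
import pandas as pd


def categorical_profile(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Summarize cardinality and top-value share per object column (hypothetical helper)."""
    rows = []
    for col in df.select_dtypes("object").columns:
        counts = df[col].value_counts(normalize=True)  # NaN is excluded by default
        n_unique = int(df[col].nunique())
        rows.append({
            "column": col,
            "n_unique": n_unique,
            "unique_ratio": n_unique / max(len(df), 1),
            "top_share": float(counts.iloc[0]) if len(counts) else float("nan"),
        })
    profile = pd.DataFrame(
        rows, columns=["column", "n_unique", "unique_ratio", "top_share"]
    ).set_index("column")
    # Assumed cutoff: more than half the rows are distinct values.
    profile["high_cardinality"] = profile["unique_ratio"] > max_unique_ratio
    return profile


# Toy data (made up): an ID-like column and a low-cardinality column.
demo = pd.DataFrame({
    "user_id": ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"],
    "plan": ["free"] * 6 + ["pro"] * 2,
}, dtype=object)
profile = categorical_profile(demo)
print(profile)
```

Columns flagged `high_cardinality` are the ones that need an encoding decision. A `top_share` near 1.0 points to a near-constant column.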
One plot per numeric column. Do not skip this step - summary stats hide bimodal distributions, spikes at round numbers, and data entry artifacts.
import matplotlib.pyplot as plt
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.savefig("distributions.png")
Flag columns that show:
Check pairwise correlations among numeric features. Identify multicollinearity before modeling.
import seaborn as sns
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.savefig("correlation_heatmap.png")
Flag:
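One way to surface candidate pairs is to list every feature pair whose absolute correlation exceeds a cutoff. The sketch below is illustrative: `high_corr_pairs` is a hypothetical helper, and the 0.8 threshold is an assumption to tune per project, not a value from the source.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List numeric feature pairs with |r| >= threshold (hypothetical helper)."""
    corr = df.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().rename("r").reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    # NaN entries (masked cells, if any survive stacking) fail this comparison.
    strong = pairs[pairs["r"].abs() >= threshold]
    return strong.sort_values("r", key=abs, ascending=False).reset_index(drop=True)


# Toy data (made up): height in two units plus an unrelated column.
demo = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "noise": [3, 1, 4, 1, 5],
})
pairs = high_corr_pairs(demo)
print(pairs)
```

Each row in the result is a pair to review before modeling, for example by dropping one feature or combining the two.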
Produce a written audit log. This is the deliverable, not just the charts.
For each issue found, record the column, the issue, the number of rows affected, and the action you will take.
Example audit log format:
| Column | Issue | Rows Affected | Action |
|--------------|-------------------|---------------|-----------------|
| income | 3.2% missing | 320 / 10,000 | Median impute |
| zip_code | Stored as float | All | Cast to string |
| session_time | Max = 86,400 sec | 12 rows | Cap at 3,600 |
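
The three actions in the example log could be applied as follows. This is a sketch on a made-up toy frame. The column names come from the table above; the values are hypothetical, and zero-padding `zip_code` to five digits assumes US ZIP codes.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) showing the three issues from the example log.
df = pd.DataFrame({
    "income": [40_000.0, np.nan, 60_000.0, 80_000.0],
    "zip_code": [2139.0, 10001.0, 94105.0, 60601.0],
    "session_time": [120, 86_400, 900, 3_000],
})

# income: fill missing values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# zip_code: float -> nullable integer -> string.
# Padding to 5 digits is an assumption (US ZIP codes lose leading zeros as floats).
df["zip_code"] = df["zip_code"].astype("Int64").astype("string").str.zfill(5)

# session_time: cap at 3,600 seconds.
df["session_time"] = df["session_time"].clip(upper=3_600)

print(df)
```

Keeping these transformations in one place, next to the audit log, makes it easy to check that every logged decision was actually carried out.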
Structured output for stakeholders or the next analyst.
Sections to include:
Keep the summary under 1 page. Charts go in the appendix.
1. Skipping EDA and going straight to modeling
   Bad: Fitting a model on raw data, then debugging why predictions are wrong.
   Good: Spending 20% of project time on EDA. Problems found here are 10x cheaper to fix than after a model is built.
2. Reporting only means and standard deviations
   Bad: "Average age = 34.2, std = 12.1".
   Good: Include percentiles and a histogram. A mean of 34 with a bimodal distribution at 18 and 55 tells a completely different story.
3. Treating "unknown" strings as valid values
   Bad: Including "unknown", "N/A", "-", and "" as distinct categories.
   Good: Standardize all null representations to NaN before profiling. Write a null-normalization function and run it first.
4. EDA with no documented decisions
   Bad: Exploring data in a notebook but not recording what you found or what you plan to do about it.
   Good: Every issue gets a row in the audit log with a decision. The audit log is a first-class deliverable.
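The null-normalization function recommended in item 3 might look like the sketch below. `normalize_nulls` and the `NULL_TOKENS` set are hypothetical, and the token list is an assumption that should be extended for the dataset at hand.

```python
import numpy as np
import pandas as pd

# Assumed spellings of "missing"; compared after trimming and lowercasing.
NULL_TOKENS = frozenset({"", "unknown", "n/a", "na", "none", "null", "-"})


def normalize_nulls(df: pd.DataFrame, tokens=NULL_TOKENS) -> pd.DataFrame:
    """Return a copy with placeholder strings in object columns replaced by NaN."""
    out = df.copy()
    for col in out.select_dtypes("object").columns:
        cleaned = out[col].astype(str).str.strip().str.lower()
        is_null = out[col].isna() | cleaned.isin(tokens)
        # Non-placeholder values keep their original spelling and whitespace.
        out[col] = out[col].mask(is_null, np.nan)
    return out


# Toy data (made up) mixing real categories with null look-alikes.
demo = pd.DataFrame(
    {"plan": ["free", "Unknown", "N/A", "-", "", " pro", None]},
    dtype=object,
)
cleaned = normalize_nulls(demo)
print(cleaned["plan"].isna().sum())
```

Running this before the missing-value counts means placeholders are counted as missing instead of showing up as categories. The function returns a copy, so the raw frame stays available for the audit log.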