
brycewang-stanford/stata-data-cleaning

Name: stata-data-cleaning
Author: brycewang-stanford

skills/43-wentorai-research-plugins/skills/analysis/wrangling/stata-data-cleaning/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research stata-data-cleaning


Stata Data Cleaning

Clean, transform, and validate messy research datasets in Stata. This skill covers the complete data preparation pipeline from raw survey or administrative data to analysis-ready datasets, with emphasis on documentation, reproducibility, and handling the common data quality issues encountered in social science, economics, and health research.

Overview

Data cleaning is commonly estimated to consume 60-80% of research time in empirical studies, yet it is often under-documented and hard to reproduce. Stata provides a powerful set of commands for data manipulation, but choosing the right commands and the right order requires experience with common data quality issues: inconsistent coding, duplicate observations, string formatting problems, implausible values, and complex missing data patterns.

This skill provides a systematic, step-by-step data cleaning workflow in Stata. Each step produces a log of changes made, enabling full reproducibility and audit trails. The workflow is organized around the principle that raw data should never be modified in place -- instead, cleaning scripts transform raw data into processed datasets while preserving the original.

The approach follows best practices from the World Bank's DIME Analytics team and the J-PAL research transparency guidelines, making it suitable for projects that require rigorous data documentation for peer review, replication packages, or regulatory compliance.

Initial Data Assessment

Loading and Inspecting Data

* ============================================
* Data Cleaning Script: [Project Name]
* Author: [Name]
* Date: [Date]
* Input: raw/survey_data_raw.dta
* Output: processed/survey_data_clean.dta
* ============================================

clear all
set more off
log using "logs/cleaning_log.smcl", replace

* Load raw data
use "raw/survey_data_raw.dta", clear

* Basic inspection
describe
summarize
codebook, compact

* Check dimensions
display "Observations: " _N
display "Variables: " c(k)

* Check for duplicates on ID variable
duplicates report respondent_id
* duplicates list shows only observations whose ID occurs more than once
duplicates list respondent_id
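
A stricter alternative to reading the duplicates report is isid, which exits with an error when the listed variables do not uniquely identify observations. The sketch below is illustrative only: the three-row dataset is made up, and only respondent_id follows the naming used above.

```stata
* Toy data (hypothetical): respondent 2 appears twice
clear
input respondent_id age
1 34
2 51
2 51
end

* isid returns a nonzero code when the ID is not unique;
* capture stores that code in _rc instead of stopping the script
capture isid respondent_id
display "isid return code: " _rc

* duplicates report leaves the number of distinct IDs in r(unique_value)
quietly duplicates report respondent_id
display "Distinct IDs: " r(unique_value) " of " r(N) " observations"
```

Running isid without capture is a reasonable choice once duplicates have been resolved, because a later regression in the raw data then stops the script.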

Data Quality Report

* Generate a data quality summary
foreach var of varlist _all {
    quietly {
        count if missing(`var')
        local nmiss = r(N)
        local pctmiss = (`nmiss' / _N) * 100
    }
    if `pctmiss' > 0 {
        display "`var': `nmiss' missing (" %4.1f `pctmiss' "%)"
    }
}

* Check value ranges for numeric variables
foreach var of varlist age income education_years {
    summarize `var', detail
    * Flag negative values, which are implausible for all three variables
    count if `var' < 0 & !missing(`var')
}
* Upper bounds are variable-specific; 150 is only meaningful for age
count if age > 150 & !missing(age)

String Cleaning

Standardizing Text Variables

* Trim whitespace
replace name = strtrim(name)
replace name = stritrim(name)  // Remove internal multiple spaces

* Standardize case
replace city = proper(city)        // Title case
replace country = upper(country)   // Upper case
replace email = lower(email)       // Lower case

* Remove special characters
replace phone = ustrregexra(phone, "[^0-9]", "")

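* Replace invalid UTF-8 sequences with the Unicode replacement character
* (ustrfix does not convert between encodings; see help unicode translate)
replace name = ustrfix(name)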

* Standardize common variations
replace department = "Computer Science" if ///
    inlist(department, "CS", "Comp Sci", "Comp. Sci.", "CompSci")

replace gender = "Female" if inlist(gender, "F", "f", "female", "FEMALE")
replace gender = "Male" if inlist(gender, "M", "m", "male", "MALE")
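
Listing every spelling variant in inlist() becomes hard to maintain as variants accumulate. One option, sketched here on made-up values, is to normalize case and whitespace into a temporary working variable first and match against that. The variable gender_norm is a hypothetical helper, not part of the original script.

```stata
* Toy data (hypothetical values)
clear
input str10 gender
"F"
" female "
"FEMALE"
"m"
"Male"
end

* Normalize once: trim whitespace and lower-case into a working copy
gen gender_norm = lower(strtrim(gender))

* Map the normalized variants to the canonical labels
replace gender = "Female" if inlist(gender_norm, "f", "female")
replace gender = "Male"   if inlist(gender_norm, "m", "male")
drop gender_norm

tabulate gender, missing
```

Values that match neither list are left unchanged, so the tabulation doubles as a check for variants that still need a rule.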

Parsing Complex Strings

* Split full name into first and last
gen first_name = word(full_name, 1)
gen last_name = word(full_name, -1)

* Extract year from date string "March 15, 2024"
gen year = real(word(date_string, -1))

* Parse numeric values from strings like "$1,234.56"
gen income_clean = real(subinstr(subinstr(income_str, "$", "", .), ",", "", .))
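
For currency strings, destring with the ignore() option is an alternative to nesting subinstr() calls. The sketch below uses made-up data; the dollar sign is added with char(36) so the toy input does not depend on how "$" is read in input lines, and income_num is a hypothetical variable name.

```stata
* Toy data (hypothetical): amounts with thousands separators, one empty
clear
input id str12 raw
1 "1,234.56"
2 "980"
3 ""
end

* Prefix a dollar sign; char(36) is "$"
gen income_str = char(36) + raw if raw != ""

* ignore() removes the listed characters before converting to numeric
destring income_str, generate(income_num) ignore("$,")

list id income_str income_num
```

destring refuses to convert if any value still contains non-numeric characters after the ignore() step, which makes unexpected formats visible instead of silently turning them into missing values.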

Missing Data Handling

Identifying Missing Data Patterns

* Install the user-written mdesc command (misstable ships with Stata)
ssc install mdesc

* Summary of missing data
mdesc

* Missing data patterns
misstable summarize
misstable patterns

* Create missing indicator variables
foreach var of varlist income education occupation {
    gen mi_`var' = missing(`var')
}

* Informal check of whether missingness is related to observed variables
* (this is not Little's MCAR test): compare means by missing status
foreach var of varlist income education {
    ttest age, by(mi_`var')
    ttest gender_numeric, by(mi_`var')
}

Recoding Missing Values

* Common survey codes for missing
* -99 = refused, -88 = don't know, -77 = not applicable
foreach var of varlist income satisfaction trust_score {
    replace `var' = .r if `var' == -99  // .r = refused
    replace `var' = .d if `var' == -88  // .d = don't know
    replace `var' = .n if `var' == -77  // .n = not applicable
}

* Extended missing values preserve the reason for missingness
* while still being treated as missing in analyses
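
The same recoding can be written in one call with mvdecode, and value labels can be attached to the extended missing codes so the reason for missingness shows up in tabulations. This is a sketch on made-up data; the label name satis_lbl is hypothetical.

```stata
* Toy data (hypothetical): two valid answers and three survey missing codes
clear
input satisfaction
4
-99
-88
2
-77
end

* mvdecode maps several numeric codes to extended missing values in one call
mvdecode satisfaction, mv(-99=.r \ -88=.d \ -77=.n)

* Label the extended missing values so tabulations show the reason
label define satis_lbl .r "Refused" .d "Don't know" .n "Not applicable"
label values satisfaction satis_lbl
tabulate satisfaction, missing

* Extended missing values still count as missing
count if missing(satisfaction)
```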

Variable Construction

Recoding and Categorization

* Create age groups
recode age (18/29 = 1 "18-29") (30/44 = 2 "30-44") ///
           (45/59 = 3 "45-59") (60/max = 4 "60+"), gen(age_group)

* Create binary indicator
gen high_income = (income > 75000) if !missing(income)

* Create composite scale (e.g., Likert items)
alpha item1 item2 item3 item4 item5, gen(scale_score) item
* Cronbach's alpha is reported; scale_score is the mean

* Standardize continuous variables
foreach var of varlist income education_years age {
    egen z_`var' = std(`var')
}

* Winsorize extreme values (winsor2 is user-written: ssc install winsor2)
winsor2 income, cuts(1 99) replace
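
Where installing user-written commands is not an option, winsorizing at the 1st and 99th percentiles can be sketched with built-in commands only. The data below are simulated for illustration, and income_w is a hypothetical variable name; writing to a new variable keeps the original values available for comparison.

```stata
* Toy data (hypothetical): 1,000 draws with a heavy right tail
clear
set seed 12345
set obs 1000
gen income = exp(rnormal(10, 1))

* Find the 1st and 99th percentiles
_pctile income, percentiles(1 99)
local p1  = r(r1)
local p99 = r(r2)

* Winsorize into a new variable so the original stays available
gen income_w = income
replace income_w = `p1'  if income < `p1'
replace income_w = `p99' if income > `p99' & !missing(income)

summarize income income_w
```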

Date Variables

* Parse date strings
gen interview_date = date(date_string, "MDY")
format interview_date %td

* Extract components
gen interview_year = year(interview_date)
gen interview_month = month(interview_date)
gen interview_dow = dow(interview_date)  // 0=Sunday

* Calculate durations
gen days_since_treatment = interview_date - treatment_date
gen months_since = (interview_date - treatment_date) / 30.44
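
Because date() returns missing for strings that do not fit the mask, it can help to list parse failures right after conversion. A minimal sketch on made-up strings, assuming the "MDY" mask used above:

```stata
* Toy data (hypothetical): two month-day-year strings and one in another order
clear
input str20 date_string
"March 15, 2024"
"3/15/2024"
"2024-03-15"
end

gen interview_date = date(date_string, "MDY")
format interview_date %td

* Strings that do not fit the mask come back as missing; list them for review
list date_string if missing(interview_date) & date_string != ""

* The first two rows should equal the date built directly from its components
assert interview_date == mdy(3, 15, 2024) in 1/2
```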

Data Validation

Assertion-Based Validation

* These assertions halt execution if violated
assert _N == 5000  // Expected sample size
assert !missing(respondent_id)  // No missing IDs
assert age >= 18 & age <= 120 if !missing(age)  // Plausible age range
assert inlist(gender, "Male", "Female", "Other", "") | missing(gender)

* Range and cross-variable consistency checks
assert education_years >= 0 if !missing(education_years)
assert income >= 0 if !missing(income)
assert end_date >= start_date if !missing(end_date) & !missing(start_date)
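
During exploratory cleaning it can be more convenient to see every failed check in one run than to stop at the first. One way to sketch that is to wrap each assert in capture and list the offending rows. The data and the counter n_failed are hypothetical.

```stata
* Toy data (hypothetical): one under-age respondent, one negative income
clear
input respondent_id age income
1 34 52000
2 17 41000
3 45 -500
end

local n_failed = 0

* capture keeps the script running; _rc is nonzero when the assertion fails
capture assert age >= 18 & age <= 120 if !missing(age)
if _rc {
    display as error "Check failed: age outside 18-120"
    list respondent_id age if (age < 18 | age > 120) & !missing(age)
    local ++n_failed
}

capture assert income >= 0 if !missing(income)
if _rc {
    display as error "Check failed: negative income"
    list respondent_id income if income < 0
    local ++n_failed
}

display "Failed checks: `n_failed'"
```

In a final, production run the plain assert form is usually preferable, since it guarantees the script cannot finish with a violated assumption.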

Duplicate Detection and Resolution

* Identify duplicates
duplicates tag respondent_id, gen(dup_flag)
list respondent_id survey_date if dup_flag > 0, sepby(respondent_id)

* Keep most recent observation per respondent
bysort respondent_id (survey_date): keep if _n == _N

* Alternative: keep the first observation instead (run one of the two, not both)
bysort respondent_id (survey_date): keep if _n == 1
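
The keep-most-recent rule can be checked on a small made-up dataset. The sketch below also records how many rows each respondent had before anything is dropped; n_records, date_str, and score are hypothetical names.

```stata
* Toy data (hypothetical): respondent 1 was surveyed twice
clear
input respondent_id str10 date_str score
1 "2024-01-10" 3
1 "2024-03-02" 5
2 "2024-02-14" 4
end
gen survey_date = date(date_str, "YMD")
format survey_date %td

* Record how many rows each respondent had before dropping anything
bysort respondent_id: gen n_records = _N

* Keep the most recent row per respondent
bysort respondent_id (survey_date): keep if _n == _N

* After resolution the ID should identify rows uniquely
isid respondent_id
list
```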

Saving and Documentation

* Label all variables
label variable age "Age at time of interview (years)"
label variable income "Annual household income (USD)"
label variable education_years "Total years of formal education"

* Save cleaned dataset
compress  // Reduce file size
save "processed/survey_data_clean.dta", replace

* Write a compact codebook and variable summary to the log
codebook, compact
describe, short

* Close log
log close

Best Practices

Never modify raw data files: Always read raw data and write to a separate processed file.
Log everything: Use log using to capture all output for audit trails.
Use assert statements: Validate assumptions about the data at each stage.
Document decisions: Comment every recode, drop, or imputation with the rationale.
Version your cleaning scripts: Use git to track changes to .do files.
Produce a data dictionary: Give every variable a variable label and attach value labels to every categorical variable in the final dataset.
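
As an illustration of the documentation practices above, the following sketch attaches a variable label, value labels, and a note to one variable, then turns the variable metadata into a small dataset that could be exported as a data dictionary. The data, the label name yesno, and the note text are hypothetical.

```stata
* Toy data (hypothetical)
clear
input respondent_id employed
1 1
2 0
3 .
end

label variable employed "Currently employed (self-reported)"
label define yesno 0 "No" 1 "Yes"
label values employed yesno

* Notes travel with the dataset and can record cleaning decisions
notes employed: Recoded from raw codes 1/2 to 0/1 (hypothetical example note)

* describe, replace turns the variable metadata into a dataset;
* preserve/restore brings the cleaned data back afterwards
preserve
describe, replace clear
list name type vallab varlab
restore
```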

References

Stata Data Management Reference Manual: https://www.stata.com/manuals/d.pdf
DIME Analytics Data Management Wiki: https://dimewiki.worldbank.org/Data_Management
J-PAL Research Resources: https://www.povertyactionlab.org/resource/data-cleaning
Long, J.S. (2009), The Workflow of Data Analysis Using Stata, Stata Press


Clean, transform, and validate messy research data using Stata

1,232 stars

testing

Updated May 26, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research stata-data-cleaning

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: May 26, 2026, 4:49 AM (79.6 s, 1 file scanned)

SKILL.md

name: stata-data-cleaning
description: Clean, transform, and validate messy research data using Stata
emoji: 🧹
category: analysis
subcategory: wrangling
keywords: ["Stata", "data cleaning", "data wrangling", "missing values", "recoding", "validation"]
source: https://www.stata.com/manuals/d.pdf




Install manually

# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/43-wentorai-research-plugins/skills/analysis/wrangling/stata-data-cleaning ~/.claude/skills/

Repository: brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT